Data scientists took a tool originally developed for Uber and built a new prediction model to help make sense of emerging variants.

Until recently, there was no scientific way to predict which COVID-19 variants would be the most transmissible, and therefore no way to guide public policy as new strains emerged. For the past year and a half, public health experts have had to base their planning on simple observation, and to some extent, guesswork: which variants were becoming dominant in other regions or countries, and how soon could we expect to see them here?

Now, though, a collaborative team of data scientists, biologists, and infectious disease experts has taken on this challenge with machine learning advances originally designed, believe it or not, for the ride-sharing industry. The result: a new tool that can predict the transmissibility of variants well ahead of time, accurately forecasting variant transmission patterns for the next one to two months. (Note: this tool and a description of its scientific validation have been posted as a preprint, which is a scientific paper that has not yet undergone the rigors of peer review.)


The tool would not have been possible without an unusual pairing. In the summer of 2020, data scientists who had previously worked for Uber joined one of the world’s leading genomic institutes, teaming up with scientists dedicated to fighting the COVID-19 pandemic. Last year, the Broad Institute (in this case, Broad rhymes with “rode”) in Cambridge, Mass., quickly converted some of its industrial-scale genomics lab capacity into a pandemic testing facility. Beyond determining whether samples were positive or negative for the SARS-CoV-2 virus, the team sequenced tens of thousands of viral genomes.

Around the world, many laboratories are contributing to the database of viral genomes as well; the GISAID repository has had 3.7 million submissions. That’s a wealth of data, but running any kind of comparison across so many genomes is prohibitively costly in computational terms.

At the Broad, scientists wanted to do more with this data, and they had just the team to make it happen: three data scientists recruited from Uber’s AI team. At Uber, they had created a machine learning tool called Pyro to help customize models of traffic patterns and other elements for cities or regions. The tool was particularly good for building new models that contained many uncertain variables. When Uber released it publicly as an open-source platform, it saw surprising uptake in the life science community, where it could be used for probabilistic modeling of biological experiments. “It’s actually more useful for science than it is for a ride-sharing company,” says Fritz Obermeyer, one of the developers who formerly worked at Uber.


At the Broad, Obermeyer and his colleagues quickly took up the challenge of mining the millions of available SARS-CoV-2 genomes to try to forecast the transmissibility of new variants. Rather than comparing every genome to every other genome, they streamlined the process by analyzing clusters of closely related variants. Their preprint describes the analysis of 2.1 million genomes, grouped into nearly 1,300 lineages and collected from more than 1,000 different regions around the world.
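To get a feel for why clustering matters, here is a rough back-of-the-envelope count in Python. It is only an illustration of scale: the function name is made up for this sketch, the lineage count is rounded from "nearly 1,300," and nothing here is meant to describe how the team's pipeline actually compares sequences.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Figures quoted from the preprint description in the article.
n_genomes = 2_100_000
n_lineages = 1_300  # assumption: "nearly 1,300" rounded for illustration

genome_pairs = pairwise_comparisons(n_genomes)
lineage_pairs = pairwise_comparisons(n_lineages)

print(f"genome-vs-genome pairs:   {genome_pairs:,}")
print(f"lineage-vs-lineage pairs: {lineage_pairs:,}")
print(f"reduction factor:         {genome_pairs / lineage_pairs:,.0f}x")
```

An all-against-all comparison of individual genomes runs into the trillions of pairs, while the same count over lineages stays under a million.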

The machine learning tool they built on top of the original Pyro framework is called PyR0, a play on R0, the basic reproduction number used to assess disease transmissibility. It models variant patterns based on specific mutations in the viral genome. “The predictive capability of this model relies on the repeated emergence of the same mutation in different strains independently,” Obermeyer says. “That allows us to predict the growth rate of a particular strain based on the new mutations it has acquired.”

While the model relies on mutations that have been seen before, one of its most important features is that it does not need to know what any given mutation does. Typically, scientists seeking to assess the transmissibility of a variant have to perform a series of lab experiments to tease out the precise function of each new mutation. For Obermeyer’s tool, those time-consuming functional tests aren’t necessary for forecasting. The model sees every mutation present in the genome sequence data and infers which ones are associated with increased transmissibility. That is a huge leap in capability for epidemiological researchers focusing on the COVID-19 pandemic.
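The underlying idea can be sketched in a few lines of Python. This is a toy example, not the PyR0 model: it swaps in a plain ridge regression on made-up data, and every name and number in it (the function names, the 300 simulated lineages, the two "important" mutations) is an assumption chosen for illustration. Each lineage is described only by which mutations it carries, and the regression works out which mutations go along with faster growth without being told what any of them does.

```python
import numpy as np


def fit_mutation_effects(X, y, ridge=1.0):
    """Ridge regression: solve (X^T X + ridge * I) beta = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + ridge * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


def predict_growth(mutations, effects):
    """Predicted relative growth rate: sum of effects of carried mutations."""
    return mutations @ effects


# Simulated data (entirely hypothetical): 300 lineages, 20 possible mutations.
rng = np.random.default_rng(0)
n_lineages, n_mutations = 300, 20
X = rng.integers(0, 2, size=(n_lineages, n_mutations)).astype(float)

# Assume only mutations 2 and 7 truly affect growth; the rest do nothing.
true_effects = np.zeros(n_mutations)
true_effects[[2, 7]] = [0.5, 0.3]
y = X @ true_effects + rng.normal(0.0, 0.05, size=n_lineages)

# Infer per-mutation effects from the data alone.
effects = fit_mutation_effects(X, y, ridge=1.0)
top_two = np.argsort(effects)[::-1][:2]
print("mutations most associated with growth:", sorted(top_two.tolist()))

# Forecast a new lineage that has picked up both of those mutations.
new_lineage = np.zeros(n_mutations)
new_lineage[[2, 7]] = 1.0
predicted = predict_growth(new_lineage, effects)
print("predicted relative growth rate:", round(float(predicted), 2))
```

Because the same mutation shows up in many different lineages, the regression has repeated observations of its effect, which is the property Obermeyer describes. The real tool is built on Pyro and works with real sequence data across regions, so treat this only as a picture of the principle.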

Bronwyn MacInnis, an infectious disease scientist at the Broad, described this work in a presentation at the recent AGBT Precision Health conference. According to MacInnis, the PyR0 tool accurately predicted both the explosive growth of the Delta variant and the relatively minor emergence of the Mu variant (originally detected in Colombia earlier this year), long before conventional scientific approaches could have. Using genomic data for epidemiology and infectious disease has “really come of age” in the pandemic, she said. But genomic tools were not built for this kind of use. “The field really needs some great and quick innovation to keep up with the data,” she added, pointing to the former Uber team’s work as a great example.

Obermeyer points out that the model only works as well as it does because it has access to such an enormous trove of genomic data collected around the world. “It’s really important to be able to share observations [of mutations] across countries and across cities,” he says.


Now that the tool is available, public health experts have one more arrow in the quiver to help guide the pandemic response. Mask mandates, indoor capacity limits, and other measures can all be used in a more targeted manner if we can predict the likelihood of the spread of specific new COVID-19 variants. “As soon as we see that there’s a more highly transmissible strain in a particular region, then we [can] react to that by changing these intervention measures,” Obermeyer says.