UCSC’s Million-COVID-Genome Tree Could be a First
(Source: Genomics Institute)
April 27, 2021 — Santa Cruz, CA
Solving a computational puzzle, a UCSC team created a dynamic evolutionary tree to enable real-time genomic contact tracing
Early in the pandemic, UCSC knew they wanted to help researchers tracking the virus. During the 2014 Ebola outbreak, the seasoned Browser team had used their coding skills to build a virus browser. Since Ebola, a new era of fast, cheap sequencing has created a mountain of genomic data, changing the research landscape. Traditional display code just wouldn’t keep pace with this novel coronavirus.
Moreover, while the UCSC Genome Browser is a reliable research workhorse and has helped scientists around the globe make visual sense of genomic data for more than 20 years, its team was not well-versed in efficient display of phylogenetic — or evolutionary — trees. The virus’s phylogenetic tree shows the relationships between virus samples and the order in which mutations happened along various lineages as the virus has been evolving.
While familiar to virus researchers, phylogenetic trees were not a feature of the UCSC Genome Browser — that is, until very recently.
That makes it all the more remarkable that the Genome Browser is now home to the first phylogenetic or evolutionary tree ever to connect more than one million genetic relatives. “I’m sure this is the first time in history that over a million sequences of members of the same species have been available, let alone arranged in a comprehensive tree,” said Angie Hinrichs, the Browser’s Senior Software Architect. “This is possible only because scientists worldwide have been contributing their SARS-CoV-2 genome sequences to public repositories,” Hinrichs noted.
Virologists are used to clicking around evolutionary trees to review relationships and then separately looking up papers about specific mutations. Now they can upload new SARS-CoV-2 genome sequences, have them placed in near real time in a massive, global phylogenetic tree, and view the local subtrees containing their sequences and the most closely related sequences. Mutations are plotted along the whole SARS-CoV-2 virus genome (or their favorite part, like the Spike protein Receptor Binding Domain) in the UCSC SARS-CoV-2 Genome Browser, cross-referenced with relevant papers alongside.
When a virologist can identify similarities among the strings of nucleotide bases (A, U, C and G) of all available SARS-CoV-2 virus samples, they can see which samples are most closely related to each other. Showing relationships among sequences is a way to conduct real-time, genomic-based contact tracing, an approach seen as strengthening contact tracing, a foundational tool of public health.
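To make the idea of "most closely related" concrete, here is a minimal sketch in Python that counts the positions at which two aligned sequences differ and picks the sample with the fewest differences. The sample names and 12-base fragments are hypothetical, and this toy comparison is only an illustration of sequence similarity, not the method used by the Genome Browser or UShER.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_relative(query: str, samples: dict[str, str]) -> tuple[str, int]:
    """Return (name, distance) for the sample with the fewest differences."""
    name = min(samples, key=lambda n: hamming(query, samples[n]))
    return name, hamming(query, samples[name])


# Hypothetical 12-base fragments; real genomes are about 30,000 bases long.
samples = {
    "sample_A": "AUGGCUACGUAA",
    "sample_B": "AUGGCUACGCAA",
    "sample_C": "AUGACUUCGUAA",
}
query = "AUGGCUUCGCAA"

print(closest_relative(query, samples))  # ('sample_B', 1)
```

Comparing a new sequence against every stored sample this way becomes slow as the collection grows, which is part of why placing sequences on an existing tree is attractive.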
While building phylogenetic trees is a new pursuit for the UCSC Genome Browser team, a phylogenetic or evolutionary tree is actually a fundamental concept of modern biology, going back to sketches in Darwin’s notebooks. (Try searching for “Darwin’s tree” in Google images and you’ll see something that also looks like a family tree.)
Phylogenetic trees have traditionally been used to describe relationships between different species, such as Darwin’s famous finches. Computational tools that display trees from sequence data often deal with highly diverged sequences – sequences of organisms that have been evolving separately for a long time, like hundreds of millions of years in the case of familiar vertebrates. But SARS-CoV-2 sequences are different because they are so much closer in time, and therefore much more similar to each other, genetically speaking.
Scientists estimate that a significant COVID virus mutation happens as often as every 11 days within a chain of infection. Compared with mutation in humans or mice, this frequency and the sheer volume of viral replication generate a tangle of sequences and relationships to document on the virus’s evolutionary tree, with new branches sprouting out with each new mutation.
“Even though the SARS-CoV-2 sequences are so densely sampled, we don’t see each and every individual mutation happen from one sequence to the next – but we see enough sequences to get a good guess of the order in which widely transmitted mutations happened,” added UCSC faculty member and project collaborator Russell Corbett-Detig.
While each coronavirus genome is only about 30,000 nucleotide bases long, quite short compared with other species, displaying the billions of bases from over a million samples, along with their interconnections and the valuable information about those connections, quickly slows things down.
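The scale is easy to check with back-of-the-envelope arithmetic. The short Python sketch below multiplies the two figures given in the text, treating "over a million" as a round one million for illustration.

```python
GENOME_LENGTH = 30_000    # approximate bases per genome, from the text
SAMPLE_COUNT = 1_000_000  # "over a million" samples, rounded down (assumption)

total_bases = GENOME_LENGTH * SAMPLE_COUNT
print(f"{total_bases:,} bases")  # 30,000,000,000 bases
```

That is roughly 30 billion bases before any of the relationships between samples are taken into account.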
This is where UCSC postdoctoral scholar and incoming UCSD faculty member Yatish Turakhia came in. Turakhia wrote UShER, a novel phylogenetics approach designed to help smooth this tangled canopy of branches and make an evolutionary tree more feasible.
“UShER, which stands for Ultrafast Sample placement on Existing tRees, was designed to help the Browser team move rapidly towards a real-time viral phylogenetics solution,” Turakhia said. “My goal in writing UShER was to empower research and enable real-time virus genomic surveillance and contact tracing for laboratories worldwide.”
UShER is not only the engine underlying the Genome Browser’s web interface for placing new sequences in the phylogenetic tree, but also an essential tool for extending the phylogenetic tree with new sequences that are pouring into public repositories.
In addition to building a simple web interface for UShER, Hinrichs still had the hard work of creating an automated pipeline that could handle hundreds of thousands of sequences and metadata contributed by researchers in laboratories all over the world, with the inevitable quirks of data entered by people working heroic hours at full speed to sequence as many virus samples as possible.
Every night, Hinrichs’s automated processes fetch data from the National Center for Biotechnology Information (NCBI) and COVID-19 Genomics UK (COG-UK) consortium; identify correspondences with sequences manually downloaded from GISAID, a genomic data hub headquartered in Munich; and use UShER to add tens of thousands of new sequences to the phylogenetic tree.
“Even in our tree of over a million sequences, UShER takes less than a second on average to find out where a new sequence fits in,” Hinrichs noted.
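As a rough illustration of what placing a sample on an existing tree can look like, here is a toy Python sketch. It assumes each branch of the tree is labeled with the mutations that occurred along it, and it chooses the node whose accumulated mutations differ least from the new sample's. The tree, the node names, and the mutation labels are all hypothetical, and the sketch ignores complications such as back-mutations. It is not UShER's actual implementation, which is far more heavily optimized.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mutations: frozenset[str]  # mutations on the branch leading to this node
    children: list["Node"] = field(default_factory=list)


def best_placement(root: Node, sample: frozenset[str]) -> tuple[str, int]:
    """Return (node name, cost) for the node whose accumulated mutations
    differ least from the sample's mutations."""
    best = (root.name, len(sample ^ root.mutations))
    stack = [(root, root.mutations)]
    while stack:
        node, inherited = stack.pop()
        # Cost = mutations the sample has that the node lacks, plus the reverse.
        cost = len(sample ^ inherited)
        if cost < best[1]:
            best = (node.name, cost)
        for child in node.children:
            stack.append((child, inherited | child.mutations))
    return best


# Hypothetical tree and mutation labels, for illustration only.
root = Node("root", frozenset(), [
    Node("n1", frozenset({"C100U"}), [
        Node("n2", frozenset({"G200A"})),
        Node("n3", frozenset({"A300G"})),
    ]),
    Node("n4", frozenset({"U400C"})),
])
sample = frozenset({"C100U", "G200A", "C500U"})

print(best_placement(root, sample))  # ('n2', 1)
```

Here the new sample shares both mutations on the path to n2 and carries one extra mutation of its own, so it would be attached near n2 as a new branch.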
“Hinrichs’s web interface makes it straightforward to upload sequences and then view their placement with closely related sequences using Nextstrain.org’s awesome interactive tree display” in addition to the Genome Browser, added Corbett-Detig.
Because of restrictions on the use of GISAID’s data, users cannot download the full tree. However, Hinrichs has also built a tree, available for download, of over half a million unrestricted, publicly available SARS-CoV-2 genome sequences from GenBank®, the NIH genetic sequence database (an annotated collection of all publicly available DNA sequences), and from COG-UK, the UK’s COVID-19 Genomics Consortium.
To learn more about the UCSC SARS-CoV-2 Genome Browser, see its SARS-CoV-2 Genome Browser tutorial, a tutorial on real-time phylogenetics with UShER that is part of the CDC’s COVID-19 Genomic Epidemiology Toolkit, and the UCSC Browser team’s SARS-CoV-2 presentation.
Originally published here: https://ucscgenomics.soe.ucsc.edu/ucscs-million-covid-genome-tree-could-be-a-first/