Researchers in the United States have developed a new pipeline for high-throughput automated annotation of genes, proteins and functional domains in the genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the agent responsible for coronavirus disease 2019 (COVID-19).
The team at the IBM Almaden Research Center in San Jose, California, says the tool has the advantage of not relying on a single reference genome, which can be limiting as the virus continues to evolve new variants.
Since the start of the COVID-19 outbreak in Wuhan, China, at the end of December 2019, intense research efforts have been made around the world to sequence SARS-CoV-2 genomes from infected patients with near real-time efficiency.
“In order to take advantage of this large and growing body of data, high-throughput computational methods must be developed for rapid and high-precision analysis to provide the molecular targets that are currently being evaluated for the development of drugs, vaccine specificity and diagnostic tests,” the team says.
Now, Kristen Beck and her colleagues have developed a new annotation pipeline that has generated gene, protein, and domain data on 66,905 publicly available SARS-CoV-2 sequences.
The data provides efficient and precise molecular targets across the entire SARS-CoV-2 proteome and all genomes analyzed.
A preprint version of the research paper is available on the bioRxiv server while the article undergoes peer review.
Tools that rely on a single reference sequence are limited
Commonly referred to as the “Wuhan reference genome,” the first sequenced SARS-CoV-2 genome was published in January 2020 and quickly became the accepted reference standard.
However, since then tens of thousands of SARS-CoV-2 genomes have been released every week.
Several genome annotation methods, such as VAPiD, Prokka and InterProScan, aim to provide autonomous annotation (no reference genome required) of genes and proteins.
“Yet many of these tools do not provide sufficient precision with ‘standard’ use and have not yet been applied on a large scale as the available data on the sequence of SARS-CoV-2 increases,” the researchers say.
In addition, several variants of SARS-CoV-2 have emerged, including the D614G variant, which appeared earlier in the pandemic, and the more recently emerged B.1.1.7 variant, which now accounts for the majority of new cases in the United States. The B.1.1.7 variant contains an N501Y mutation that enhances the binding of the viral spike protein to receptors on host cells.
Mutations that occur in these variants can present challenges when applying autonomous genome annotation methods.
As an alternative, genomes can be aligned to the Wuhan reference genome using tools such as Nextstrain’s Augur or the UCSC SARS-CoV-2 Genome Browser.
This type of “supervised” analysis uses published genetic data to extract sequences from the genome of interest on the basis of position and sequence similarity to a reference genome.
However, a reference-dependent approach has limitations, as SARS-CoV-2 is currently estimated to acquire about two mutations a month.
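As a rough illustration of what extraction “on the basis of position and sequence similarity” can mean, the Python sketch below pulls a region out of a genome by reference coordinates and scores it against the reference. It is a deliberately naive toy and not the method used by any of the tools named above: the sequences, coordinates and function names are all hypothetical, and real tools align each genome to the reference before extracting regions.

```python
def extract_by_position(genome: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive region [start, end] of a genome string."""
    return genome[start - 1:end]


def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy data only (hypothetical, not real SARS-CoV-2 sequence).
REFERENCE = "AAACCCATGGCTTGATTT"
GENE_START, GENE_END = 7, 15  # position of a toy "gene" on the reference
ref_gene = extract_by_position(REFERENCE, GENE_START, GENE_END)

# One substitution inside the gene: the reference coordinates still work.
query_snp = "AAACCCATGGATTGATTT"
snp_gene = extract_by_position(query_snp, GENE_START, GENE_END)

# One inserted base upstream of the gene: the same coordinates are now
# shifted by one, so the extracted region no longer matches the gene.
query_ins = "AAAGCCCATGGCTTGATTT"
ins_gene = extract_by_position(query_ins, GENE_START, GENE_END)

print(ref_gene, snp_gene, round(percent_identity(ref_gene, snp_gene), 1))
print(ref_gene, ins_gene, round(percent_identity(ref_gene, ins_gene), 1))
```

In this toy case the single substitution leaves the extracted gene about 89% identical to the reference, while the unaligned insertion drops the apparent identity to about 22%, which is one simple way a purely positional lookup can drift as genomes diverge from the reference.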
What did the researchers do?
The researchers used a combination of cutting-edge tools and custom calibration tools to develop a semi-supervised genome annotation pipeline. They applied this method to 66,905 SARS-CoV-2 genomes to identify the sequences of genes, proteins and functional domains in each genome.
The team identified a complete set of known proteins with an average set membership accuracy of 98.5%.
“We were able to achieve complete or near complete membership of the protein set for all genomes,” say Beck and her colleagues. “Each protein is a translated gene sequence, and thus the equivalent gene identification precision is also obtained.”
How did the method compare to other tools?
Compared to other published tools such as Prokka and VAPiD, the approach identified 6.4 and 1.8 times more protein annotations, respectively.
The method has produced nearly 13 million new molecular target sequences accessible through the IBM Functional Genomics platform, a tool made available free of charge to the global research community.
Some of the identified sequences have been conserved across time and geographic location, while others represented emerging variants.
In addition, for the spike protein domains, the team achieved greater than 97.9% sequence identity with the references and identified variants of the spike receptor binding domain.
“Our pipeline correctly identified key variants of D614G and N501Y that were previously observed and experimentally validated, further indicating their accuracy,” the team writes.
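Labels such as D614G and N501Y follow a simple convention: the reference residue, the position in the protein, and the residue seen in the variant. The Python sketch below shows how such labels could be derived from two protein sequences that have already been aligned to equal length. The function name and the toy fragments are hypothetical placeholders, not real spike sequence, and this is not the team's pipeline.

```python
from typing import List


def list_substitutions(reference: str, query: str, offset: int = 1) -> List[str]:
    """Name substitutions between two pre-aligned, equal-length protein sequences.

    Labels use the <reference residue><position><query residue> form.
    `offset` is the position of the first residue within the full protein.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to equal length")
    return [
        f"{ref_res}{pos}{query_res}"
        for pos, (ref_res, query_res) in enumerate(zip(reference, query), start=offset)
        if ref_res != query_res
    ]


# Toy fragments (hypothetical): five residues assumed to start at position 612.
ref_fragment = "AADAA"
var_fragment = "AAGAA"

subs = list_substitutions(ref_fragment, var_fragment, offset=612)
print(subs)  # ['D614G']
```

The sketch handles substitutions only. Insertions and deletions would require a proper alignment step before the comparison.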
The method could be used to inform the specificity of the vaccine
“Here, we present a novel semi-supervised pipeline to annotate the molecular targets of genes, proteins and functional domains of SARS-CoV-2 genomes and demonstrate the resulting precision against known benchmarks and other bioinformatics tools,” the researchers explain.
Beck and colleagues say that as the vaccine rollout continues during the current pandemic, this method could be used to effectively monitor and track emerging protein variants to inform vaccine specificity and binding affinity to host proteins.
“Additionally, as future work, confirmation of predicted sequences in silico using a structural model will allow refinement of protein sequences and key domains to broaden our understanding of the interaction with host proteins, antivirals or diagnostics,” the team concludes.
bioRxiv publishes preliminary scientific reports that are not peer reviewed and, therefore, should not be considered conclusive, guide clinical practice / health-related behaviors, or be treated as established information.