Lineage Browse - Resource for Coronavirus 2019

RCoV19

Lineage assignments were calculated with pangolin (Pangolin 4.3.1, PangoLEARN v1.17.1). For details, see lineages.
Reference: A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology.

Sampling date distribution of the samples in selected lineages

Sampling country distribution of the samples in selected lineages

Sampling time and space distribution of the samples in selected lineages

Virus Strain Name	Accession ID	Lineage	Lineages Version	Age	Gender	Sequence Quality	Quality Assessment	Host	Sample Collection Date	Location

Note: Variation sample ratios are computed based on high-quality human sequences.

Genome position	Base change	Gene	Amino acid change	Number of samples	Ratio of samples (%)

WHO Name	Lineage	Sample Count	Earliest Date	Lineage description	Variants (Sample #, Sample %)	Country (Sample #, Sample %)

Genome sequences derived from different resources worldwide (NCBI, GISAID, NGDC, NMDC, CNGB) are integrated and curated based on metadata and sequence alignment results, and this table is updated daily during the COVID-19 outbreak.

Related ID: We curate the data to provide non-redundant genome sequences, especially whole genome sequences, so that users can obtain correct analysis results such as variation frequencies and phylogenetic trees.
1. Redundancies between databases: A genome sequence may be submitted to more than one database. Redundancies are identified based on meta information, sequence alignment, or reports from data submitters. We preferentially provide the genome information (“Accession ID” and “Virus Strain Name”) from databases that are publicly open to all users. Accordingly, the accession IDs from other databases are listed in “Related ID”. We encourage submitters to share their data with different data centers, which would greatly benefit curation, annotation, bioinformatics analysis and experimental studies.
2. Redundancies within a database: Within a database, genome sequences with the same virus name, sequence, sample collection date, patient information, virus passage history, etc., are considered redundant. This may be due to repeated submission or other mistakes. In this case, the accession ID of the latter one is listed in “Related ID”.
3. Whole genome sequence and gene sequence: If a virus has both a whole genome sequence and gene sequences, only the whole genome sequence is included in the table, and the accession IDs of the gene sequences are not listed in “Related ID”. Gene sequences are listed as separate rows in the table only if the whole genome is unavailable. They are removed from the table and replaced by the whole genome sequence once it is obtained.
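The within-database rule in item 2 can be sketched as follows. This is a minimal illustration, not the curation pipeline itself: the `Record` class, its field names, and the choice of comparing only virus name, sequence, and collection date are assumptions made for the example (the text also mentions patient information and passage history). It also assumes that the first record seen is the one kept.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical record layout; field names are illustrative only.
    accession_id: str
    virus_name: str
    sequence: str
    collection_date: str
    related_ids: list = field(default_factory=list)


def collapse_redundant(records):
    """Keep the first record for each (name, sequence, date) key.

    Accession IDs of later records with the same key are appended to the
    kept record's related_ids (assumption: input order is submission order).
    """
    kept = {}
    for rec in records:
        key = (rec.virus_name, rec.sequence.upper(), rec.collection_date)
        if key in kept:
            kept[key].related_ids.append(rec.accession_id)
        else:
            kept[key] = rec
    return list(kept.values())


records = [
    Record("ACC0001", "strain-A", "ACGT", "2020-01-01"),
    Record("ACC0002", "strain-A", "acgt", "2020-01-01"),  # repeated submission
    Record("ACC0003", "strain-B", "ACGT", "2020-01-02"),
]
unique = collapse_redundant(records)
for rec in unique:
    print(rec.accession_id, rec.related_ids)
```

Redundancy between databases would need additional evidence (sequence alignment or submitter reports), which this sketch does not attempt.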

Download: Meta information can be downloaded by all users, but only sequences that were originally public are available in our database. Because of usage rights, sequences from GISAID cannot be downloaded here. Please log in to GISAID's website to retrieve them.

Nuc.Completeness: A “Complete” genome sequence must cover all protein-coding (CDS) regions of SARS-CoV-2 and be longer than 29,000 nt. Otherwise, the sequence is labeled “Partial”.
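As a rough sketch of this rule: the function below takes the sequence length and the set of CDS regions the sequence covers, and returns the label. How coverage of each CDS is determined is not stated in the text, so it is passed in as an input here. The function name and the region names in the example are illustrative assumptions, not the full SARS-CoV-2 gene list.

```python
MIN_COMPLETE_LENGTH = 29_000  # "larger than 29k" in the description above


def nuc_completeness(seq_length, covered_cds, required_cds):
    """Return "Complete" or "Partial" for one sequence.

    covered_cds:  CDS regions fully covered by the sequence (assumed input).
    required_cds: all CDS regions that must be covered (assumed input).
    """
    covers_all = set(required_cds) <= set(covered_cds)
    if covers_all and seq_length > MIN_COMPLETE_LENGTH:
        return "Complete"
    return "Partial"


# Illustrative subset of region names, not the complete annotation.
required = {"ORF1ab", "S", "N"}
print(nuc_completeness(29_903, {"ORF1ab", "S", "N"}, required))
print(nuc_completeness(29_903, {"ORF1ab", "S"}, required))
print(nuc_completeness(28_500, {"ORF1ab", "S", "N"}, required))
```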

Quality Assessment: “Complete” sequences are further analyzed on five aspects of sequence quality: the number of unknown bases (Ns), the number of degenerate bases, the total number of gaps (deletions, insertions, indels) when aligned to the reference sequence MN908947, the number of mutations, and the mutation density (mutation number / length of mutation region; mutation region <= 20 nt). The total number of mutations is calculated across the whole genome, while the remaining four quality items are analyzed within the protein-coding region.
Quality control is performed based on the criteria listed in the table; green represents “pass”, while red represents “fail”. Mouse over to view details.
Sequences tagged with red dot(s) should be used with caution. A large number of Ns or degenerate bases, or multiple gaps, suggests possible quality issues due to low coverage, low depth, or other technical problems. Users should take note of a high number of mutations or a high mutation density and check whether it reflects a quality issue.
Variation analysis is not performed for sequences that fail quality control for Ns or degenerate bases. In addition, because sequences of non-human viruses always exhibit a large number of mutations when aligned to the reference sequence, only the number of Ns and degenerate bases is assessed for their quality control.
In the download table, “Quality Assessment” lists the corresponding number for each item. For the last item, mutation density, YES/NO indicates whether a high mutation density region is present. If no variation analysis was performed, “NA” is provided.
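The mutation density item can be illustrated with a small sketch. The text defines density as mutation number divided by the length of the mutation region, with regions of at most 20 nt and a "fail" cutoff of 0.25. The rest is assumed here: a region runs from one mutated position to another (length = end - start + 1), and a region needs at least two mutations, since a single mutation would otherwise always have density 1. The function name and parameters are hypothetical.

```python
def high_density_regions(positions, max_region=20, threshold=0.25, min_mutations=2):
    """List regions whose mutation density reaches the threshold.

    positions: genome positions of mutations (1-based integers).
    Returns tuples (start, end, region_length, mutation_count, density).
    Assumptions: regions start and end on mutated positions, and at least
    min_mutations mutations are required to form a region.
    """
    pos = sorted(set(positions))
    hits = []
    for i in range(len(pos)):
        for j in range(i + min_mutations - 1, len(pos)):
            length = pos[j] - pos[i] + 1
            if length > max_region:
                break  # positions are sorted, so later ones are farther away
            count = j - i + 1
            density = count / length
            if density >= threshold:
                hits.append((pos[i], pos[j], length, count, density))
    return hits


regions = high_density_regions([100, 102, 105, 5000])
for start, end, length, count, density in regions:
    print(f"{start}~{end} ({length}-{count}-{density:.2f})")
print("YES" if regions else "NO")
```

The printed form mirrors the "starting site~ending site (length of mutation region-total mutations-mutation density)" layout described for the details view. Overlapping regions are all reported here; a real implementation might merge them.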

	Unknown Base(s)	Degenerate Base(s)	Total Gap(s)	Total Mutation(s)	Mutation Density
Green	<=15	<=50	<=2 gaps	<=15	<0.25
Red	>15	>50	>2 gaps	>15	>=0.25
Details	Unknown base(s): number	Degenerate base(s): number	Total gap(s): number	Total Mutation(s): number	High mutation density: starting site~ending site (length of mutation region-total mutations-mutation density); "NO" is displayed if there is no high mutation density region
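The thresholds in the table above can be expressed as a small lookup. This is a sketch: the function name, argument names, and the "green"/"red" return strings are assumptions for illustration, while the cutoffs are taken directly from the table.

```python
# Upper limits for a "green" (pass) result, taken from the table above.
COUNT_LIMITS = {
    "unknown_bases": 15,
    "degenerate_bases": 50,
    "total_gaps": 2,
    "total_mutations": 15,
}
DENSITY_LIMIT = 0.25  # green requires density strictly below this value


def quality_flags(unknown_bases, degenerate_bases, total_gaps,
                  total_mutations, max_mutation_density):
    """Return a green/red flag for each of the five quality items."""
    counts = {
        "unknown_bases": unknown_bases,
        "degenerate_bases": degenerate_bases,
        "total_gaps": total_gaps,
        "total_mutations": total_mutations,
    }
    flags = {
        name: "green" if counts[name] <= limit else "red"
        for name, limit in COUNT_LIMITS.items()
    }
    flags["mutation_density"] = (
        "green" if max_mutation_density < DENSITY_LIMIT else "red"
    )
    return flags


flags = quality_flags(unknown_bases=15, degenerate_bases=51, total_gaps=2,
                      total_mutations=16, max_mutation_density=0.25)
print(flags)
```

Note the asymmetry in the table: the four count items pass at the limit (<=), whereas mutation density fails at exactly 0.25.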

Sequence Quality: A sequence is considered high quality if it passes quality control for both Ns and degenerate bases. Otherwise, it is considered low quality.
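A minimal sketch of this rule, using the Ns and degenerate-base cutoffs from the quality control table (<=15 and <=50). The function name, the sample dictionaries, and the "high"/"low" label strings are illustrative assumptions; the labels shown in the actual table may differ.

```python
MAX_UNKNOWN_BASES = 15     # Ns cutoff from the quality control table
MAX_DEGENERATE_BASES = 50  # degenerate-base cutoff from the same table


def sequence_quality(unknown_bases, degenerate_bases):
    """Return "high" if both counts pass quality control, else "low"."""
    passes = (unknown_bases <= MAX_UNKNOWN_BASES
              and degenerate_bases <= MAX_DEGENERATE_BASES)
    return "high" if passes else "low"


# Hypothetical samples, for illustration only.
samples = [
    {"id": "sample-1", "unknown_bases": 0, "degenerate_bases": 3},
    {"id": "sample-2", "unknown_bases": 120, "degenerate_bases": 0},
    {"id": "sample-3", "unknown_bases": 15, "degenerate_bases": 50},
]
labels = {
    s["id"]: sequence_quality(s["unknown_bases"], s["degenerate_bases"])
    for s in samples
}
print(labels)
```

Gaps, total mutations, and mutation density do not enter this label; they only affect the per-item flags.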

