Genome sequences from different resources worldwide (NCBI, GISAID, NGDC, NMDC, CNGB) are integrated and curated based on metadata and sequence alignment results. This table is updated daily during the COVID-19 outbreak.
1. Redundancies between databases: A genome sequence may be submitted to more than one database. Redundancies are identified based on meta information, sequence alignment, or reports from data submitters. We preferentially provide the genome information (“Accession ID” and “Virus Strain Name”) from databases that are publicly open to all users. Accordingly, the accession IDs from the other databases are listed in “Related ID”. We encourage submitters to share their data with different data centers, which would greatly benefit curation, annotation, bioinformatics analysis, and experimental studies.
2. Redundancies within a database: Within a database, genome sequences with the same virus name, sequence, sample collection date, patient information, virus passage history, etc., are considered redundant. This may be due to repeated submission or other mistakes. In this case, the accession ID of the latter entry is listed in “Related ID”.
3. Whole genome sequence and gene sequence: If a virus has both a whole genome sequence and gene sequences, only the whole genome sequence is included in the table, and the accession IDs of the gene sequences are not listed in “Related ID”. Gene sequences are listed as separate rows only if the whole genome sequence is unavailable; they are removed from the table and replaced by the whole genome sequence once it is obtained.
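The within-database rule in item 2 can be sketched roughly as follows. This is only an illustration, not the curation pipeline itself: the record layout and the field names (`accession`, `virus_name`, `sequence`, `collection_date`, `patient`, `passage`, `related_ids`) are hypothetical, and the comparison key covers only the fields named above, whereas the text ends its list with "etc.".

```python
from typing import Dict, List

# Hypothetical field names; the real comparison may use additional fields.
KEY_FIELDS = ("virus_name", "sequence", "collection_date", "patient", "passage")


def collapse_duplicates(records: List[dict]) -> List[dict]:
    """Keep the first record of each duplicate group.

    `records` is assumed to be in submission order. The accession of each
    later duplicate is appended to the kept record's "related_ids" list.
    """
    kept: Dict[tuple, dict] = {}
    for rec in records:
        key = tuple(rec[field] for field in KEY_FIELDS)
        if key in kept:
            # Same name, sequence, date, patient and passage: treat as redundant.
            kept[key]["related_ids"].append(rec["accession"])
        else:
            kept[key] = {**rec, "related_ids": []}
    # Dicts preserve insertion order, so the output follows submission order.
    return list(kept.values())
```

Keeping the earlier record and attaching the later accession mirrors the statement that the latter entry's accession ID goes into “Related ID”.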
Quality control is performed based on the criteria listed in the table; green represents “pass”, while red represents “fail”. Mouse over an item to view details.
Sequences tagged with red dot(s) should be used with caution. A large number of Ns or degenerate bases, or multiple gaps, suggests possible quality issues due to low coverage or depth, or other technical issues. Users should take note of a high number of mutations or a high mutation density and check whether they reflect quality issues.
Variation analysis is not performed for sequences that fail the quality control for Ns or degenerate bases. In addition, because sequences of non-human viruses always exhibit a large number of mutations when aligned to the reference sequence, only the number of Ns and degenerate bases is assessed for their quality control.
In the download table, “Quality Assessment” lists the corresponding number for each item. For the last item, mutation density, YES/NO indicates whether a high mutation density region is present. If no variation analysis is performed, “NA” is provided.
| | Unknown Base(s) | Degenerate Base(s) | Total Gap(s) | Total Mutation(s) | Mutation Density |
|---|---|---|---|---|---|
| Green | <=15 | <=50 | <=2 | <=15 | <0.25 |
| Red | >15 | >50 | >2 | >15 | >=0.25 |
| Details | Unknown base(s): number | Degenerate base(s): number | Total gap(s): number | Total mutation(s): number | High mutation density: starting site~ending site (length of mutation region-total mutations-mutation density); "NO" is displayed if there is no high mutation density region |
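As a minimal sketch, the thresholds in the table could be applied like this. The function and parameter names are hypothetical, and the gap, mutation, and density values are taken as inputs because the way they are computed is not described here. Treating every IUPAC ambiguity code other than N as a degenerate base, and reporting "NA" for non-human hosts, are both assumptions.

```python
from typing import Dict, Tuple

# Limits copied from the table: values at or below the limit are green.
MAX_UNKNOWN = 15
MAX_DEGENERATE = 50
MAX_GAPS = 2
MAX_MUTATIONS = 15
HIGH_DENSITY = 0.25  # density is red when it is >= 0.25

# Assumption: "degenerate" means any IUPAC ambiguity code other than N.
DEGENERATE_CODES = frozenset("RYSWKMBDHV")


def count_unknown_and_degenerate(seq: str) -> Tuple[int, int]:
    """Return (number of Ns, number of other ambiguity codes)."""
    seq = seq.upper()
    unknown = seq.count("N")
    degenerate = sum(1 for base in seq if base in DEGENERATE_CODES)
    return unknown, degenerate


def flag(value: float, limit: float) -> str:
    return "green" if value <= limit else "red"


def assess(unknown: int, degenerate: int, gaps: int = 0, mutations: int = 0,
           max_density: float = 0.0, human_host: bool = True) -> Dict[str, str]:
    result = {
        "unknown": flag(unknown, MAX_UNKNOWN),
        "degenerate": flag(degenerate, MAX_DEGENERATE),
    }
    base_ok = "red" not in result.values()
    # No variation-based checks if the N/degenerate check fails.
    # Assumption: non-human hosts are also reported as "NA" for these items.
    if not (human_host and base_ok):
        result.update(gaps="NA", mutations="NA", density="NA")
        return result
    result["gaps"] = flag(gaps, MAX_GAPS)
    result["mutations"] = flag(mutations, MAX_MUTATIONS)
    result["density"] = "green" if max_density < HIGH_DENSITY else "red"
    return result
```

For example, `assess(16, 0)` marks the unknown-base item red and reports "NA" for the three variation-based items, matching the rule that variation analysis is skipped for such sequences.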