The genome we provide does not belong to any real human being. It is the structured result of analyzing terabases of sequencing data from hundreds of Chinese individuals, and is therefore able to describe the genomic characteristics and tendencies that a Chinese individual is most likely to have and that differ from those of other populations.
We provide a new form of human reference genome. We examine every individual separately to see nucleotide-level variation between populations and between individuals, filter out the false-negative dynamic positions, and use parameters such as comentropy (information entropy) to evaluate how dynamic each genomic position is. Finally, we transform the static linear genome sequence into a dynamic one and store it in a database for high-performance querying.
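As an illustration of scoring the dynamic level of a position with comentropy, the sketch below computes the Shannon entropy of the alleles observed at each position. It is a minimal example under stated assumptions, not the VCGDB pipeline: the function name `position_entropy`, the coordinates, and the allele strings are all hypothetical.

```python
import math
from collections import Counter


def position_entropy(alleles):
    """Shannon entropy (in bits) of the alleles observed at one genomic position."""
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical observations: one allele per individual, eight individuals per position.
observations = {
    ("chr1", 1000): "AAAAAAAA",  # every individual carries the same allele
    ("chr1", 1001): "AAAAGGGG",  # two alleles at equal frequency
    ("chr1", 1002): "AAAAAAAG",  # one rare allele
}

entropies = {pos: position_entropy(a) for pos, a in observations.items()}
for pos, h in sorted(entropies.items()):
    print(pos, round(h, 3))
```

A position where all individuals agree scores zero, and the score rises as the allele frequencies become more balanced, so a threshold on this value is one simple way to separate static from dynamic positions.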
Big Data Analysis Solution
Currently, about five petabytes of sequencing data are generated each year on average, adding to the data explosion. VCGDB is a highly refined result set distilled from terabases of sequencing data. Having handled data at this scale with customized processing workflows and optimized algorithms, we offer some solutions and thoughts for those who are struggling with limited resources and massive amounts of data.
Genomic Annotation Integration
We integrate genomic annotations such as gene information, genomic duplications, and GWAS traits into the database, linking dynamic variations with individual characteristics and clinical traits. This helps users, whether scientists or medical doctors without a bioinformatics background, to classify individuals or patients into subpopulations that differ in their susceptibility to a particular disease or in their response to a specific treatment.
Highly Interactive Genome Browser
VCGBrowser presents a multi-dimensional view that directly displays and compares dynamic variations along the human genome, using a consensus coordinate system and a canvas that marks all dynamic variations, whether within or between populations. The browser is highly flexible: it supports real-time, seamless zooming to any resolution, from the genome level, which shows the distribution of dynamic positions across a region of interest, down to the nucleotide level, where every residue and its details can be read clearly. Taking advantage of the highly structured and indexed VCGDB, the browser implements a real-time query service: a single click on the glyph of a dynamic genomic position triggers an instant query to the database and returns detailed information to the end user.
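The kind of indexed region lookup that a click or zoom in the browser could trigger can be sketched with an in-memory SQLite database. This is only an illustration: the table name `dynamic_position`, its columns, the sample rows, and the `query_region` helper are hypothetical and do not describe the actual VCGDB schema or storage engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dynamic_position ("
    " chrom TEXT, pos INTEGER, major_allele TEXT, minor_allele TEXT, entropy REAL)"
)
# Composite index so that region lookups do not need a full table scan.
conn.execute("CREATE INDEX idx_chrom_pos ON dynamic_position (chrom, pos)")

# Hypothetical rows, for illustration only.
rows = [
    ("chr1", 1001, "A", "G", 1.00),
    ("chr1", 1002, "A", "G", 0.54),
    ("chr1", 5000, "C", "T", 0.81),
    ("chr2", 1001, "G", "A", 0.72),
]
conn.executemany("INSERT INTO dynamic_position VALUES (?, ?, ?, ?, ?)", rows)


def query_region(conn, chrom, start, end):
    """Return dynamic positions in [start, end] on one chromosome, ordered by position."""
    cur = conn.execute(
        "SELECT pos, major_allele, minor_allele, entropy FROM dynamic_position"
        " WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


hits = query_region(conn, "chr1", 1000, 2000)
for hit in hits:
    print(hit)
```

Indexing on chromosome and position together means the visible window of the browser maps onto a single range scan, which is what keeps such a query fast as the table grows.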
Chinese Consensus Reference Sequence
Using the major allele and major indel information in VCGDB, we generate a consensus genome sequence of the Chinese population in a form that mapping software can use. In this process, we found some conflicts when adjusting the reference genome sequence; some of these were false alarms and were eliminated. The reference sequence we provide maps and locates slightly more reads than the YH reference, another Chinese reference sequence. This indicates the higher accuracy of VCGDB in describing the genome of the Chinese population, and also shows the power of large-scale, population-based whole-genome sequencing in improving the human reference genome, which is our guiding principle.
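The idea of adjusting a reference with major alleles and major indels can be sketched as follows. This is a simplified illustration, not the procedure used for VCGDB: the `build_consensus` function, the 0-based `(pos, ref, alt)` record layout, and the rule that a variant counts as a conflict when its `ref` allele does not match the reference are all assumptions made for the example.

```python
def build_consensus(reference, major_variants):
    """Apply major-allele substitutions and indels to a reference string.

    major_variants: iterable of (pos, ref, alt) with 0-based pos, where ref is
    the reference allele starting at pos and alt is the major allele.
    Returns the adjusted sequence and the list of variants skipped as conflicts.
    """
    seq = reference
    conflicts = []
    # Work from right to left so that indels never shift the coordinates
    # of variants that have not been applied yet.
    for pos, ref, alt in sorted(major_variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            conflicts.append((pos, ref, alt))  # ref allele disagrees with the sequence
            continue
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq, conflicts


# Hypothetical toy data.
reference = "ACGTACGTAC"
variants = [
    (1, "C", "T"),    # substitution
    (4, "AC", "A"),   # deletion of one base
    (7, "T", "TGG"),  # insertion of two bases
    (9, "G", "A"),    # ref allele does not match the reference
]

consensus, conflicts = build_consensus(reference, variants)
print(consensus)
print(conflicts)
```

Collecting the mismatching records instead of failing on them leaves room for a later review step that decides which conflicts are real and which are false alarms.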