Variome Data Standards (V1.0 beta)
2. File format
2.1 VCF
We accept variation data files in VCF or gVCF format. VCF is a text file format, usually stored compressed. It contains meta-information lines, a header line, and data lines, each of which holds information about one position in the genome. The format can also carry genotype information for each sample at each position. A standard example of a VCF file is shown below [1]:
##fileformat=VCFv4.3
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | NA00001 | NA00002 | NA00003 |
---|---|---|---|---|---|---|---|---|---|---|---|
20 | 14370 | rs6054257 | G | A | 29 | PASS | NS=3;DP=14;AF=0.5;DB;H2 | GT:GQ:DP:HQ | 0\|0:48:1:51,51 | 1\|0:48:8:51,51 | 1/1:43:5:.,. |
20 | 17330 | . | T | A | 3 | q10 | NS=3;DP=11;AF=0.017 | GT:GQ:DP:HQ | 0\|0:49:3:58,50 | 0\|1:3:5:65,3 | 0/0:41:3 |
20 | 1110696 | rs6040355 | A | G,T | 67 | PASS | NS=2;DP=10;AF=0.333,0.667;AA=T;DB | GT:GQ:DP:HQ | 1\|2:21:6:23,27 | 2\|1:2:0:18,2 | 2/2:35:4 |
20 | 1230237 | . | T | . | 47 | PASS | NS=3;DP=13;AA=T | GT:GQ:DP:HQ | 0\|0:54:7:56,60 | 0\|0:48:4:51,51 | 0/0:61:2 |
20 | 1234567 | microsat1 | GTC | G,GTCT | 50 | PASS | NS=3;DP=9;AA=G | GT:GQ:DP | 0/1:35:4 | 0/2:17:2 | 1/1:40:3 |
This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error), a site that is called monomorphic reference (i.e. with no alternate alleles), and a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased.
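To make the column layout concrete, here is a minimal Python sketch of how one data line could be split into its fixed fields, INFO entries, and per-sample genotypes. It assumes the tab-delimited layout of an actual VCF file and the genotype separators of the VCF specification example [1], where `|` marks a phased call and `/` an unphased one. The function names (`parse_info`, `parse_sample`, `parse_data_line`) are hypothetical and chosen for illustration only; for real submissions an established VCF library is the safer choice.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result = {}
    if info == ".":  # '.' means no INFO entries
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # flags carry no '=value'
    return result


def parse_sample(format_keys, sample):
    """Pair FORMAT keys with one sample's colon-separated values."""
    # zip() stops at the shorter input, so dropped trailing fields are simply absent.
    values = dict(zip(format_keys, sample.split(":")))
    genotype = values.get("GT", ".")
    values["phased"] = "|" in genotype
    values["alleles"] = genotype.replace("|", "/").split("/")
    return values


def parse_data_line(line, sample_names):
    """Split one tab-delimited VCF data line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt = cols[:9]
    format_keys = fmt.split(":")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": parse_info(info),
        "samples": {
            name: parse_sample(format_keys, sample)
            for name, sample in zip(sample_names, cols[9:])
        },
    }


# First data line of the example, written with tabs as in a real VCF file.
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_data_line(line, ["NA00001", "NA00002", "NA00003"])
```

In this sketch the INFO values are kept as strings, since their types are declared in the `##INFO` meta-information lines and a fuller parser would convert them from there.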
2.2 HapMap
We also accept variation data files in HapMap format. A HapMap file is tab-delimited. The first row is the header, and each following row describes one SNP. The first eleven columns hold the SNP’s attributes, and each remaining column holds the nucleotides observed at that SNP for one individual. Missing genotype data are recorded as “NN”. An example is shown below:
rs | allele (reference/alternate) | chrom | position | strand | assembly | center | protLSID | assayLSID | panel | Qccode | Sample1 | Sample2 | Sample3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1_2111 | C/T | 1 | 2111 | NA | NA | NA | NA | NA | NA | NA | TT | CT | CC |
1_3498 | A/G | 1 | 3498 | NA | NA | NA | NA | NA | NA | NA | AA | GG | AA |
rs12345 | G/A | 22 | 25459492 | + | GRCh38.p7 | 1000G | NA | NA | NA | NA | GA | GG | NN |
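The layout described above (eleven attribute columns, then one genotype column per individual, with “NN” for missing calls) can be read with a few lines of Python. This is a minimal sketch under the assumption that the file is tab-delimited and uses the column order shown in the example; the function name `parse_hapmap` and the dictionary keys are hypothetical.

```python
N_ATTRIBUTE_COLUMNS = 11  # SNP attribute columns that precede the sample columns


def parse_hapmap(lines):
    """Return (sample_names, records) from the lines of a HapMap file."""
    header = lines[0].rstrip("\n").split("\t")
    sample_names = header[N_ATTRIBUTE_COLUMNS:]
    records = []
    for line in lines[1:]:
        cols = line.rstrip("\n").split("\t")
        calls = cols[N_ATTRIBUTE_COLUMNS:]
        records.append({
            "rs": cols[0],
            "alleles": cols[1].split("/"),
            "chrom": cols[2],
            "pos": int(cols[3]),
            # "NN" marks a missing genotype; store it as None.
            "genotypes": {
                name: (None if call == "NN" else call)
                for name, call in zip(sample_names, calls)
            },
        })
    return sample_names, records


# Header and last row of the example table, joined with tabs.
header = "\t".join([
    "rs", "allele", "chrom", "position", "strand", "assembly", "center",
    "protLSID", "assayLSID", "panel", "Qccode",
    "Sample1", "Sample2", "Sample3",
])
row = "\t".join([
    "rs12345", "G/A", "22", "25459492", "+", "GRCh38.p7", "1000G",
    "NA", "NA", "NA", "NA", "GA", "GG", "NN",
])
samples, records = parse_hapmap([header, row])
```

Converting “NN” to `None` at read time keeps missing calls from being mistaken for a real two-letter genotype in later steps.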