2. File format

  • We accept the common file formats used for variation data, including VCF, gVCF, and HapMap.
  • 2.1 VCF
    We accept variation data in VCF or gVCF format. VCF is a tab-delimited text format, usually stored compressed. It contains meta-information lines, a header line, and data lines, each of which holds information about one position in the genome. The format can also carry genotype information for each sample at each position. A standard example of a VCF file is shown below [1]:
    ##fileformat=VCFv4.3
    ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
    ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
    ##phasing=partial
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
    ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
    20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
    20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
    20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
    20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
    This example shows (in order): a good simple SNP; a possible SNP that has been filtered out because its quality is below 10; a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error); a site that is called monomorphic reference (i.e. with no alternate alleles); and a microsatellite with two alternate alleles, one a deletion of 2 bases (TC) and the other an insertion of one base (T). Genotype data are given for three samples, two of which are phased and the third unphased, with per-sample genotype quality, depth, and haplotype qualities (the latter only for the phased samples) given alongside the genotypes. The microsatellite calls are unphased.
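    The structure described above (meta-information lines, one header line, then data lines with per-sample genotype fields) can be sketched in a few lines of Python. This is only an illustration, not the parser used by this service: the function names parse_vcf_text and parse_record are hypothetical, all values are kept as strings except POS, and the input is assumed to be uncompressed, tab-delimited text. In VCF, a genotype written with "|" is phased and one written with "/" is unphased.

```python
def parse_record(line, samples):
    """Parse one tab-delimited VCF data line into a dictionary."""
    fields = line.split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of key=value pairs or bare flags (e.g. DB).
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    # FORMAT names the ':'-separated fields given for every sample.
    keys = fmt.split(":")
    genotypes = {}
    for name, call in zip(samples, fields[9:]):
        # Trailing fields may be omitted, so zip() stops at the shorter list.
        values = dict(zip(keys, call.split(":")))
        values["phased"] = "|" in values.get("GT", "")
        genotypes[name] = values

    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": info_dict, "samples": genotypes,
    }


def parse_vcf_text(text):
    """Split VCF text into meta lines, sample names, and parsed records."""
    meta, samples, records = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):       # meta-information line
            meta.append(line)
        elif line.startswith("#"):      # header: 9 fixed columns, then samples
            samples = line.lstrip("#").split("\t")[9:]
        else:                           # data line
            records.append(parse_record(line, samples))
    return meta, samples, records


# A shortened version of the example file, rebuilt with real tab separators.
example = "\n".join([
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "NA00001", "NA00002", "NA00003"]),
    "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
               "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
               "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."]),
])

meta, samples, records = parse_vcf_text(example)
for name, call in records[0]["samples"].items():
    print(name, call["GT"], "phased" if call["phased"] else "unphased")
```

    For real files, a dedicated VCF library is usually a safer choice than hand-written splitting, since it also handles compression and the type declarations in the meta-information lines.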
  • 2.2 HapMap
    We also accept variation data in HapMap format. A HapMap file is tab-delimited. The first row is the header, and each following row represents one SNP. The first eleven columns hold the SNP’s attributes, and each remaining column holds the nucleotides observed at that SNP for one individual. Missing genotype data is coded as “NN”. An example is shown below:
    rs       allele  chrom  position  strand  assembly   center  protLSID  assayLSID  panel  QCcode  Sample1  Sample2  Sample3
    1_2111   C/T     1      2111      NA      NA         NA      NA        NA         NA     NA      TT       CT       CC
    1_3498   A/G     1      3498      NA      NA         NA      NA        NA         NA     NA      AA       GG       AA
    rs12345  G/A     22     25459492  +       GRCh38.p7  1000G   NA        NA         NA     NA      GA       GG       NN
    The allele column lists the two alleles as reference/alternate.
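    The layout described above (eleven attribute columns, then one genotype column per individual, with "NN" for missing calls) can be read with a short Python sketch. The function name parse_hapmap is hypothetical and the code is illustrative only; it assumes an uncompressed, tab-delimited file whose first row is the header, and it uses the column names shown in the example.

```python
def parse_hapmap(text, missing="NN", n_attributes=11):
    """Parse HapMap text into sample names and one dictionary per SNP."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # The header names the attribute columns, followed by the sample columns.
    header = lines[0].split("\t")
    attr_names = header[:n_attributes]
    sample_names = header[n_attributes:]

    snps = []
    for line in lines[1:]:
        fields = line.split("\t")
        attributes = dict(zip(attr_names, fields[:n_attributes]))
        # Map each sample to its genotype call; missing calls become None.
        genotypes = {
            name: (None if call == missing else call)
            for name, call in zip(sample_names, fields[n_attributes:])
        }
        snps.append({"attributes": attributes, "genotypes": genotypes})
    return sample_names, snps


# Two rows of the example above, rebuilt with real tab separators.
header = ["rs", "allele", "chrom", "position", "strand", "assembly",
          "center", "protLSID", "assayLSID", "panel", "QCcode",
          "Sample1", "Sample2", "Sample3"]
rows = [
    ["1_2111", "C/T", "1", "2111"] + ["NA"] * 7 + ["TT", "CT", "CC"],
    ["rs12345", "G/A", "22", "25459492", "+", "GRCh38.p7", "1000G"]
    + ["NA"] * 4 + ["GA", "GG", "NN"],
]
example = "\n".join("\t".join(row) for row in [header] + rows)

sample_names, snps = parse_hapmap(example)
for snp in snps:
    print(snp["attributes"]["rs"], snp["genotypes"])
```

    Converting "NN" to None at load time keeps missing calls from being counted as a real genotype in later steps.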