3. Data analysis standards

  • 3.1 SNP & small I/D
  • 3.1.1 Sequencing data analysis
    Pipeline
    Sequence reads (in fastq format) were mapped to the species reference genome with the Burrows-Wheeler Aligner (BWA; version 0.7.10-r789) [2]. BWA mem program is used for genome alignment with the –R and –M parameter. The "-R" flag adds the read group header line and "–M" flag to flag extra alignment hits as secondary. Three open-source packages were used for downstream processing and variant calling: Picard Tools, SAMtools [3], and Genome analysis toolkit [4] .
  • Schematic drawing for SNP calling
    1.For Species with known variations

    2.For Species without known variations

  • Software and parameters

    1. Reads mapping to reference genome
    bwa mem program is used for genome alignment with the –R and –M parameter. The "-R" flag adds the read group header line and "–M" flag to flag extra alignment hits as secondary.

    2. Mark Duplicates & Sort
    The software Picrd is employed for marking duplicates and sorting mapping result. MarkDuplicates.jar program of Picard is used for marking duplicate reads and SortSam.jar program is used for sorting the mapping result in coordinate (parameter SORT_ORDER=coordinate).

    3. INDEL Realign
    The program of "RealignerTargetCreator" and "IndelRealigner" of GATK are used for INDEL Realignment. The "RealignerTargetCreator" identifies what regions need to be realigned, with the known indel sites information. And the "IndelRealigner" performs the actual realignment.

    4. Base Recalibration
    In the Base Recalibration step, programs "BaseRecalibrator" and "PrintReads" of GATK are employed. Firstly the "BaseRecalibrator" program models the error modes in virtue of the known variation sites, secondly, the "PrintReads" program applies recalibration and writes to file.

    5. Variation Calling
    Variation calling is performed for individual sample by the "HaplotypeCaller" of GATK. A raw GVCF. format file is created for each sample. In this step, the parameter "-variant_index_type" is assigned LINEAR, the "ERC" is assigned GVCF and the "variant_index_parameter" equals to 12800.

    6. Joint Genotyping
    Mult-sample GCVF files are jointed genotyping to produce a squared-off matrix of genotypes for analysis. This is implementing by the "GenotypeGVCFs" program of GATK, with multiple single sample GVCF. file as input.

    7. Select SNP/INDEL
    "SelectVariants" program of the GATK is used for selecting variant for INDEL or SNP with parameter "selecType" sets to INDEL or SNP.

    8. SNP Filter
    In order to have high quality SNPs, SNP filtering is carried out by the "VariantFiltration" program of the GATK pipeline. The criterion is adopted by the recommending parameters, as "QD<2.0, FS>60.0, SOR>4.0, MQ<40.0, MQRankSum<-12.5, ReadPosRankSum < -8.0".

    9. INDEL Filter
    In order to have high quality INDELs, INDEL filtering is carried out by the "VariantFiltration" program of the GATK pipeline. The criterion is adopted by the recommending parameters, as "QD<2.0, FS>200.0, SOR>10.0, InbreedingCoeff<-8.0, ReadPosRankSum < -8.0".

    10. Prepare reference sites for species without known variation sites
    For species without known variations, we need construct solid computational variations to replace. The following are the commands and parameters:
      (1) Sample variation calling with GATK
      (2) Sample variation calling with samtools and bcftools
      (3) Combining of the GATK calling and samtools calling
      (4)Hard filter
    The criterion is adopted by strict parameters, as "QD<2.0, FS>60.0, SOR>10.0, InbreedingCoeff<-8.0, ReadPosRankSum < -8.0".

  • Input files
    The pipeline takes in Whole genome sequencing/Exome sequencing/Transcriptome sequencing reads in FASTQ format from paired-end or single-end. When a sample consisted of several fastq files of multiple lanes, we align each file independently and then merge them into one BAM file. Attention, if the sequence data is transcriptome sequencing reads, the mapping software is replaced by TopHat [6], a spliced read mapper for RNA-Seq.
    Output files
    Samples’ gvcf files and final vcf file are output with the pipeline.