STRsensor Short Tandem Repeats (STR) Detection

Manual

STRsensor v1.2.0


STR allele typing for both WGS datasets and multiplex deep-sequencing datasets.


PROGRAM: STRsensor
VERSION: 1.2.0
PLATFORM: Linux
COMPILER: gcc-4.8.5
AUTHOR: xiaolong zhang
EMAIL: xiaolongzhang2015@163.com
DATE:   2020-10-28
UPDATE: 2021-01-08
DEPENDENCIES

  • GNU make and gcc

 

Description

  • STRsensor is designed for typing STR loci, including forensic CODIS STR loci and user-specified STR loci.
  • It is developed to work with both low-coverage WGS data and high-coverage target/amplicon sequencing data.
  • Both 'Kmer-based' and 'Cigar-based' matching algorithms are used to extract enough alleles from spanning reads for final allele determination.
  • Maximum Likelihood Estimation (MLE) is used to quantify the effect on STR typing of PCR stutter induced by polymerase slippage.
  • Maximum A Posteriori (MAP) estimation is applied to select the most likely STR allele from the candidates.
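As a rough illustration of how a stutter model and an allele-frequency prior can be combined in a MAP call for a haploid locus, the sketch below scores each candidate allele by its log prior plus the log likelihood of the observed read alleles. This is not STRsensor's actual implementation: the function name `map_allele`, the `eps` floor for offsets outside ±2 repeat units, and the assumption that reads are independent are all illustrative choices. The input values are taken from the DYS19 example in the Output section.

```python
import math

def map_allele(read_counts, prior, stutter, eps=1e-6):
    """Return (best_allele, posterior) for a haploid locus.

    read_counts: {observed_allele: number_of_reads}
    prior:       {candidate_allele: population_frequency}
    stutter:     {offset_in_repeat_units: probability} for offsets -2..2
    eps:         assumed floor probability for any other offset
    """
    log_post = {}
    for cand, freq in prior.items():
        score = math.log(freq)
        for obs, n in read_counts.items():
            diff = obs - cand
            # Only whole-repeat offsets are covered by the stutter model.
            offset = int(diff) if float(diff).is_integer() else None
            score += n * math.log(stutter.get(offset, eps))
        log_post[cand] = score
    # Normalise in log space to avoid numerical underflow.
    top = max(log_post.values())
    total = sum(math.exp(v - top) for v in log_post.values())
    best = max(log_post, key=log_post.get)
    return best, 1.0 / total

# Values from the DYS19 example (Sample_25.bam).
reads = {15.0: 2108, 14.0: 151, 16.0: 21, 13.0: 6, 17.0: 3, 10.0: 1}
prior = {13.0: 0.033794, 14.0: 0.205837, 15.0: 0.491551,
         16.0: 0.205837, 17.0: 0.061444, 18.0: 0.001536}
stutter = {-2: 0.004609, -1: 0.068775, 0: 0.920281, 1: 0.005676, 2: 0.000660}

best, post = map_allele(reads, prior, stutter)
print(best, round(post, 6))
```

With these inputs the sketch selects allele 15.0, matching the call reported for that sample in the example output.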

 

Building

See INSTALL for complete details.

 

Usage

-h | --help                  print help information

Required
-i | --infile     FILE       BAM file list, one sample per line [.txt]
-r | --region     FILE       region file of the STR loci [.txt]
-f | --fasta      FILE       reference genome (must be the same one used for mapping) [.fa]
-o | --outpath    PATH       the path for output files

Optional input parameters
-s | --stutter    FILE       the stutter parameters file for each STR locus [.txt]
-q | --frequency  FILE       the frequency file for each STR locus on each potential allele [.txt]

Optional output parameters
-p | --params_out            output parameters learned from the given samples (including stutter and allele frequency)

Optional filter rules
-c | --min_prob   FLOAT      the minimal probability allowed to call an allele [0.00]
-n | --min_reads  INT        the minimum number of reads needed to genotype an STR locus for an individual [10]
-m | --mis_match  INT        the maximum number of mismatched bases allowed in both the 3' and 5' flanking sequences [2]

Optional
-a | --allow_dup             duplication is allowed in target and amplicon sequencing, but not in WGS
-t | --threads    INT        the number of threads used [1]
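For scripted runs, the options above can be assembled into a command line programmatically. The sketch below is one possible helper, not part of STRsensor: the function name `build_command`, the executable name `STRsensor` (the actual binary path depends on your build), and the input file names in the example are assumptions. Only the long option names listed above are used. `shlex.join` requires Python 3.8 or later.

```python
import shlex

def build_command(bam_list, region, fasta, outpath,
                  stutter=None, frequency=None, params_out=False,
                  min_prob=None, min_reads=None, mis_match=None,
                  allow_dup=False, threads=None, exe="STRsensor"):
    """Build an argument list from the documented long options."""
    cmd = [exe, "--infile", bam_list, "--region", region,
           "--fasta", fasta, "--outpath", outpath]
    # Options that take a value are added only when a value is supplied.
    valued = [("--stutter", stutter), ("--frequency", frequency),
              ("--min_prob", min_prob), ("--min_reads", min_reads),
              ("--mis_match", mis_match), ("--threads", threads)]
    for flag, value in valued:
        if value is not None:
            cmd += [flag, str(value)]
    # Flags without a value.
    if params_out:
        cmd.append("--params_out")
    if allow_dup:
        cmd.append("--allow_dup")
    return cmd

# Hypothetical file names; replace them with your own paths.
cmd = build_command("bam_list.txt", "Locus.txt", "GRCH37_hg19.fa",
                    "/home/xlzh/result", min_reads=30, min_prob=0.99,
                    threads=4)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once the executable is on your PATH.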

 

 

Options

[-h|--help]
      Print the help information

 

[-i|--infile]
      A text file (without header) containing the full path of each sample's BAM file, one sample per line.
      Each BAM file (*.bam) should have a corresponding index file (*.bam.bai).
      e.g. 
            /home/xlzh/data/sample1.bam
            /home/xlzh/data/sample2.bam
            ...
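The list file can be generated rather than written by hand. The sketch below is a hypothetical helper (the name `write_bam_list` and the choice to collect every `*.bam` in one directory are assumptions); it writes absolute paths, one per line and without a header, and refuses to continue if any BAM file lacks its `.bam.bai` index.

```python
from pathlib import Path

def write_bam_list(bam_dir, list_path):
    """Write the absolute path of every *.bam in bam_dir to list_path."""
    bams = sorted(Path(bam_dir).resolve().glob("*.bam"))
    # Each sample.bam is expected to have a sample.bam.bai next to it.
    missing = [b.name for b in bams if not Path(str(b) + ".bai").exists()]
    if missing:
        raise FileNotFoundError("missing index for: " + ", ".join(missing))
    Path(list_path).write_text("".join(f"{b}\n" for b in bams))
    return bams
```

For example, `write_bam_list("/home/xlzh/data", "bam_list.txt")` would produce a file in the format shown above, where `bam_list.txt` is an arbitrary name.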

 

[-r|--region]
      A text file (with header) containing the details of each STR locus.
      e.g.
            STRLocus  Chrom  Start     End       MotifLen  NotCountedBase  AsHaplotype
            D19S433   chr19  30417142  30417205  4         8               No

      [FIELDS]
            1. STRLocus: the name of the STR locus
            2. Chrom: the chromosome of the STR locus
            3. Start: the start position in the reference genome (1-based)
            4. End: the end position in the reference genome (1-based)
            5. MotifLen: the motif length of the locus
            6. NotCountedBase: the number of bases that should be excluded from the STR region
            7. AsHaplotype: whether to treat the STR locus as a haplotype ('Yes' | 'No')

      [NOTE]
            For an STR locus with two fragments, such as DYS385ab, the two fragments must be
            defined on two separate lines, with names ending in the lowercase identifiers "a"
            and "b" (DYS385a and DYS385b).
            e.g.
                DYS385a  chrY  20801599  20801642  4  0  No
                DYS385b  chrY  20842518  20842573  4  0  No
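Before a long run it can help to sanity-check the region file. The sketch below is an illustrative validator, not STRsensor's own parser: the names `StrLocus` and `parse_region_file` are hypothetical, and splitting columns on any whitespace is an assumption. It skips the header line, checks the seven fields described above, and enforces the 64-character limit on locus names stated in the STR Locus section.

```python
from typing import List, NamedTuple

class StrLocus(NamedTuple):
    name: str
    chrom: str
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    motif_len: int
    not_counted: int
    as_haplotype: bool

def parse_region_file(lines) -> List[StrLocus]:
    loci = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: columns are whitespace-separated
        if lineno == 1 or not fields:
            continue           # skip the header line and blank lines
        if len(fields) != 7:
            raise ValueError(f"line {lineno}: expected 7 columns, got {len(fields)}")
        name, chrom, start, end, motif_len, not_counted, hap = fields
        if len(name) > 64:
            raise ValueError(f"line {lineno}: locus name exceeds 64 characters")
        if hap not in ("Yes", "No"):
            raise ValueError(f"line {lineno}: AsHaplotype must be 'Yes' or 'No'")
        locus = StrLocus(name, chrom, int(start), int(end),
                         int(motif_len), int(not_counted), hap == "Yes")
        if locus.start > locus.end:
            raise ValueError(f"line {lineno}: Start is greater than End")
        loci.append(locus)
    return loci

example = [
    "STRLocus  Chrom  Start  End  MotifLen  NotCountedBase  AsHaplotype",
    "DYS385a  chrY  20801599  20801642  4  0  No",
    "DYS385b  chrY  20842518  20842573  4  0  No",
]
for locus in parse_region_file(example):
    print(locus)
```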

 

[-f|--fasta]
      Reference genome (FASTA), which should be the one used in sequence alignment.
      e.g. GRCH37_hg19.fa

 

[-o|--outpath]
      The directory (OUT_PATH) for output files.
      e.g. /home/xlzh/result

 

[-s|--stutter]
      A text file (without header) containing the stutter parameters of each STR locus.
      e.g.
            DYS447  0.964309  0.007595  0.001781  0.002708  0.023607
      
      [FIELDS]
            1. STRLocus:  the name of the STR locus
            2. p_normal:  probability that NO STUTTER occurred
            3. p_add1:    probability that one repeat unit is added
            4. p_add2:    probability that two repeat units are added
            5. p_remove2: probability that two repeat units are removed
            6. p_remove1: probability that one repeat unit is removed
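A line of the stutter file can be read into named fields as sketched below. The helper name `parse_stutter_line` is hypothetical, and the check that the five probabilities sum to 1 is an assumption based on the example line above (which sums to 1.000000), not a documented requirement.

```python
STUTTER_FIELDS = ("p_normal", "p_add1", "p_add2", "p_remove2", "p_remove1")

def parse_stutter_line(line, tol=1e-3):
    """Parse one stutter line into (locus, {field_name: probability})."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    locus = fields[0]
    probs = dict(zip(STUTTER_FIELDS, map(float, fields[1:])))
    # Assumed sanity check: the five outcomes should cover all cases.
    total = sum(probs.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"{locus}: probabilities sum to {total:.6f}, not 1")
    return locus, probs

locus, probs = parse_stutter_line(
    "DYS447  0.964309  0.007595  0.001781  0.002708  0.023607")
print(locus, probs["p_normal"], probs["p_remove1"])
```

Note the column order: the two "add" probabilities come before the two "remove" probabilities, and p_remove2 precedes p_remove1.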

 

[-q|--frequency]
      A text file (without header) containing the allele frequency distribution of each STR locus.
      e.g.
            DYS437  14.0:0.645401  15.0:0.339763  16.0:0.005935  13.0:0.007418  17.0:0.001484

      [FIELDS]
            1. STRLocus: the locus name of the STR
            2. allele1:frequency1
            3. allele2:frequency2
            ...
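The `allele:frequency` pairs can be read into a dictionary as in the sketch below. The function name `parse_frequency_line` is hypothetical, and whitespace-separated columns are an assumption.

```python
def parse_frequency_line(line):
    """Parse one frequency line into (locus, {allele: frequency})."""
    locus, *pairs = line.split()
    freqs = {}
    for pair in pairs:
        allele, freq = pair.split(":")
        freqs[float(allele)] = float(freq)
    return locus, freqs

locus, freqs = parse_frequency_line(
    "DYS437  14.0:0.645401  15.0:0.339763  16.0:0.005935  "
    "13.0:0.007418  17.0:0.001484")
most_common = max(freqs, key=freqs.get)
print(locus, most_common, freqs[most_common])
```

The pairs do not need to be sorted by allele; in the example above the alleles appear in the order 14, 15, 16, 13, 17.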

 

[-p|--params_out]
      Output the parameters (stutter and frequency) learned from the given samples.

      [Output]
          File1: OUT_PATH/Stutter.txt
          File2: OUT_PATH/Frequency.txt

      [Rules]
          1. the parameters of the locus are not defined in the stutter/frequency file
             (set by the '--stutter' and '--frequency' options)
          2. more than 100 alleles have been extracted (for stutter evaluation)
          3. more than 20 samples have been given (for frequency evaluation)


      
[-c|--min_prob]
      The minimal probability allowed to call an allele.
      Default: 0.0

 

[-n|--min_reads]
      The minimum number of spanning reads needed to genotype an STR locus for an individual.
      Default: 10

 

[-m|--mis_match]
      The maximum number of mismatched bases (mismatches/insertions/deletions) allowed in
      both the 3' and 5' flanking sequences.
      Default: 2


      
[-a|--allow_dup]
      Duplication is recommended only for amplicon/target sequencing datasets.
      For WGS datasets, setting this option is NOT recommended.

 

[-t|--threads]
      The number of threads used by STRsensor at run time.
      Default: 1

 


STR Locus



One Fake STR Locus

            5' flanks                         STR Region                        3' flanks
        ++++++++++++++++++TCTATCTATCTATCTATCTA AACC TCTATCTATCTATCTATCTATCTA++++++++++++++++++
                          |                    \  /                        |
                      STR_Start         Not_Counted_Bases               STR_End
                      (149347)                                         (149394)


Locus.txt

    STRLocus<TAB>Chrom<TAB>Start<TAB>End<TAB>MotifLen<TAB>NotCountedBase<TAB>AsHaplotype
    FakeSTR<TAB>chr2<TAB>149347<TAB>149394<TAB>4<TAB>4<TAB>No

Description

  • The name of STRLocus should NOT EXCEED 64 characters!
  • 'STR_Start' is the start position (1-based) of the STR region in the reference genome.
  • 'STR_End' is the end position (1-based) of the STR region in the reference genome.
  • The motif is 'TCTA'; therefore, MotifLen = 4.
  • The sequence 'AACC' should be excluded from allele determination; therefore, NotCountedBase = 4.
  • FakeSTR is located on chromosome 2 (an autosome); therefore, AsHaplotype = 'No'.
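The coordinates of a locus definition can be cross-checked with a little arithmetic. The sketch below assumes that the repeat count of the reference sequence is the region length minus the not-counted bases, divided by the motif length. That formula is consistent with the FakeSTR figure (5 + 6 copies of 'TCTA'), but whether STRsensor computes it exactly this way is an assumption, and the function name `reference_repeats` is hypothetical.

```python
def reference_repeats(start, end, motif_len, not_counted):
    """Repeat units in the reference STR region (1-based, inclusive coords)."""
    region_len = end - start + 1
    return (region_len - not_counted) / motif_len

# FakeSTR: 48 bases in total, 4 of them ('AACC') not counted, motif length 4.
print(reference_repeats(149347, 149394, 4, 4))  # 11.0
```

A non-integer result for a locus you expect to be a whole number of repeats is a hint that Start, End, or the not-counted base count is off by one or more bases.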

 

Output


All files generated by STRsensor are written to the OUT_PATH folder.
e.g.

    --outpath /home/xlzh/result

(1) Stutter.txt & Frequency.txt (if '--params_out' is given)
      The FILE FORMAT is exactly the same as described in the options of '--stutter' and '--frequency'. 

 

(2) LOCUS_NAME.txt
Example: DYS19.txt

      #Locus  DYS19
      #Position       chrY:9521989-9522052
      #MotifLength    4
      #IsHaplotype    Yes
      #MinimalReads   30
      #MinimalProbability     0.99
      #CandidateAllele        13.0:0.033794,14.0:0.205837,15.0:0.491551,16.0:0.205837,17.0:0.061444,18.0:0.001536
      #StutterParams  d_2:0.004609,d_1:0.068775,n:0.920281,u_1:0.005676,u_2:0.000660
      #Sample Allele  Probability     TotalReads      ValidReads      SpanReads       AlleleList
      Sample_25.bam   15.0    1.000000        2548    2506    2290    15.0:2108,14.0:151,16.0:21,13.0:6,17.0:3,10.0:1
      Sample_65.bam   15.0    1.000000        3274    3253    3154    15.0:2912,14.0:212,16.0:18,13.0:12
      Sample_82.bam   17.0    1.000000        2776    2766    2679    17.0:2424,16.0:210,18.0:26,15.0:16,14.0:2,13.0:1

Description and Fields

      [Description]
            Allele information for each individual at the locus, one sample per line

      [StutterParams]
            d_2: probability that two repeat units are removed
            d_1: probability that one repeat unit is removed
              n: probability that NO STUTTER occurred
            u_1: probability that one repeat unit is added
            u_2: probability that two repeat units are added

      [Fields]
            1. Sample: the sample name, extracted from the sample's path
            2. Allele: the allele determined by STRsensor
            3. Probability: the probability of the allele [0.0 ~ 1.0]
            4. TotalReads: the total number of reads that overlap the STR region
            5. ValidReads: the number of reads that passed the filter rules
            6. SpanReads: the number of reads that fully span the entire STR region
            7. AlleleList: each extracted allele and its supporting read count
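For downstream analysis, a LOCUS_NAME.txt file can be loaded as sketched below. This is an illustrative reader based only on the example above: the function name `parse_locus_result` is hypothetical, whitespace-separated columns are assumed, and the Allele field is kept as a string because its format for non-haplotype loci is not shown in the example.

```python
def parse_locus_result(text):
    """Return (metadata, samples) parsed from the text of a LOCUS_NAME.txt file."""
    meta, samples = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#Sample"):
            continue  # skip blank lines and the column header
        if line.startswith("#"):
            key, value = line[1:].split(None, 1)
            meta[key] = value
            continue
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns: {line!r}")
        sample, allele, prob, total, valid, span, allele_list = fields
        counts = {}
        for item in allele_list.split(","):
            a, n = item.split(":")
            counts[float(a)] = int(n)
        samples.append({
            "sample": sample, "allele": allele, "probability": float(prob),
            "total_reads": int(total), "valid_reads": int(valid),
            "span_reads": int(span), "allele_counts": counts,
        })
    return meta, samples

example = """#Locus  DYS19
#MotifLength    4
#Sample Allele  Probability     TotalReads      ValidReads      SpanReads       AlleleList
Sample_25.bam   15.0    1.000000        2548    2506    2290    15.0:2108,14.0:151,16.0:21,13.0:6,17.0:3,10.0:1
Sample_65.bam   15.0    1.000000        3274    3253    3154    15.0:2912,14.0:212,16.0:18,13.0:12
"""
meta, samples = parse_locus_result(example)
for s in samples:
    print(s["sample"], s["allele"], sum(s["allele_counts"].values()), s["span_reads"])
```

In the example data the read counts in AlleleList add up to SpanReads for each sample (2290 and 3154), which makes a convenient consistency check when post-processing results.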