TSomVar: a tumor-only somatic variant detection method
Manual
Background
Somatic variants play key roles in cancer initiation and progression, so an accurate and robust method for identifying them is fundamental to cancer genome research. However, because somatic variants in tumor samples are hard to access and highly individual- and sample-specific, their detection remains challenging, particularly when no paired normal sample is available as a control. To address this, we developed TSomVar, a tumor-only method for identifying somatic and germline variants. It uses a random forest algorithm trained on sample-specific variant datasets derived from genotype imputation, read-mapping-level annotation, and functional annotation.
Installation
Requirements
Application
- Annovar
- Beagle5.1
- MosaicForecast
- GATK Mutect2
- Python 3.6
- Python module: NumPy
- Python module: scikit-learn (imported as sklearn)
Database
(Note: all database files should be stored in ${TSomVar_path}/database/)
- Haplotype reference panel of the 1000 Genomes Project
- Plink format genetic map
- Annovar database: avsnp147, cadd, dbnsfp33a, eigen, icgc21, nci60, snp138NonFlagged
- human.fa: human reference genome in FASTA format (hg19.fa in the commands below)
### database process
### hg19.fa
sed -i 's/^>chr/>/' hg19.fa ## remove the "chr" prefix from sequence names
samtools faidx hg19.fa ## generate index file hg19.fa.fai
picard CreateSequenceDictionary REFERENCE=hg19.fa OUTPUT=hg19.dict ## generate sequence dictionary hg19.dict
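Before a long run, it can help to confirm that the reference preparation worked. The sketch below is a hypothetical helper, not part of TSomVar. It assumes the index lies next to the FASTA as `<fasta>.fai`, that the sequence dictionary uses the FASTA name with its extension replaced by `.dict` (the naming GATK conventionally looks for), and that no sequence name should still start with `chr`.

```shell
# Hypothetical helper (not shipped with TSomVar): sanity-check a prepared reference.
# Usage: check_reference /path/to/hg19.fa
check_reference() {
    fasta="$1"
    status=0
    # The FASTA itself must exist and be non-empty.
    if [ ! -s "${fasta}" ]; then
        echo "missing reference: ${fasta}" >&2
        return 1
    fi
    # samtools faidx writes the index next to the FASTA with ".fai" appended.
    if [ ! -s "${fasta}.fai" ]; then
        echo "missing index: ${fasta}.fai" >&2
        status=1
    fi
    # Assumed dictionary name: FASTA extension replaced by ".dict".
    dict="${fasta%.*}.dict"
    if [ ! -s "${dict}" ]; then
        echo "missing dictionary: ${dict}" >&2
        status=1
    fi
    # After the sed step, no header line should start with ">chr".
    if grep -q '^>chr' "${fasta}"; then
        echo "sequence names still carry the chr prefix: ${fasta}" >&2
        status=1
    fi
    return "${status}"
}
```

The function returns 0 only when all checks pass, so it can gate the main command, for example `check_reference hg19.fa && echo "reference looks ready"`.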
Running
./TSomVar \
/path/to/input_bam \
${sample_id_in_bam} \
/path/to/TSomVar \
/path/to/table_annovar.pl(annovar) \
/path/to/beagle.18May20.d20.jar \
/path/to/ReadLevel_Features_extraction.py(MosaicForecast) \
/path/to/gatk \
/path/to/hg19.fa \
${prefix}
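Because the nine arguments are positional, it is easy to swap two of them. As a minimal sketch, the script below stores each argument in a named variable and prints the resulting command for review before anything runs. All variable names and the values `SAMPLE01` and `SAMPLE01_tsomvar` are hypothetical placeholders; the paths are the same placeholders used above. The comment about the `SM` tag is an assumption about what the sample ID refers to.

```shell
# Placeholder values; replace every one with the real path or ID on your system.
input_bam=/path/to/input_bam
# Assumption: the sample ID is the SM field of the BAM read group, which can be
# listed with: samtools view -H "${input_bam}" | grep '^@RG'
sample_id_in_bam=SAMPLE01
tsomvar_dir=/path/to/TSomVar
annovar_script=/path/to/table_annovar.pl                      # from Annovar
beagle_jar=/path/to/beagle.18May20.d20.jar
readlevel_script=/path/to/ReadLevel_Features_extraction.py    # from MosaicForecast
gatk_bin=/path/to/gatk
reference_fa=/path/to/hg19.fa
prefix=SAMPLE01_tsomvar

# Print the command with the arguments in the documented order (dry run).
print_tsomvar_command() {
    echo ./TSomVar \
        "${input_bam}" \
        "${sample_id_in_bam}" \
        "${tsomvar_dir}" \
        "${annovar_script}" \
        "${beagle_jar}" \
        "${readlevel_script}" \
        "${gatk_bin}" \
        "${reference_fa}" \
        "${prefix}"
}

print_tsomvar_command
```

Once the printed line looks right, the same argument list can be passed to `./TSomVar` directly.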
Output
- ${prefix}.result
  - each variant and its classification: germline, uncertain, or somatic
- ${prefix}.result.prob
  - the classification probability matrix for the variants
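A quick tally of the three classes is a useful first look at a result file. The exact layout of the result file is not specified in this manual, so the sketch below rests on a loudly stated assumption: one variant per line, whitespace-delimited, with the class label (germline, uncertain, or somatic) in the last field and no header line. The function name `count_classes` is hypothetical; adjust the field choice if the real file differs.

```shell
# Hypothetical helper: count variants per class in a result file.
# ASSUMPTION: the class label is the last whitespace-delimited field of each line.
# Usage: count_classes "${prefix}.result"
count_classes() {
    # Skip blank lines, count each distinct last field, print "label<TAB>count".
    awk 'NF > 0 { n[$NF]++ } END { for (c in n) print c "\t" n[c] }' "$1" | sort
}
```

If the file does carry a header line, its last field will show up as an extra "class" with a count of 1, which is an easy signal that the assumption needs adjusting.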
Maintainers
shishuo@big.ac.cn
Citations
To be continued ...