FVC as an adaptive and accurate method for filtering variants from popular NGS analysis pipelines.

Yongyong Ren, Yan Kong, Xiaocheng Zhou, Georgi Z Genchev, Chao Zhou, Hongyu Zhao, Hui Lu
Author Information
  1. Yongyong Ren: State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
  2. Yan Kong: State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
  3. Xiaocheng Zhou: State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
  4. Georgi Z Genchev: Research Affairs, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand.
  5. Chao Zhou: State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
  6. Hongyu Zhao: Department of Biostatistics, Yale University, New Haven, CT, USA. hongyu.zhao@yale.edu.
  7. Hui Lu: State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China. huilu@sjtu.edu.cn.

Abstract

Quality control of variants from whole-genome sequencing data is vital in clinical diagnosis and human genetics research. However, current filtering methods (Frequency, Hard-Filter, VQSR, GARFIELD, and VEF) were each developed for particular variant callers and have limitations. In particular, the number of true variants these methods eliminate far exceeds the number of false variants they remove. Here, we present FVC, an adaptive method for quality control of genetic variants from different analysis pipelines, and validate it on variants generated by four popular variant callers (GATK HaplotypeCaller, Mutect2, VarScan2, and DeepVariant). FVC consistently exhibited the best performance. It removed far more false variants than the current state-of-the-art filtering methods and recovered ~51-99% of the true variants that the other methods filtered out. Once trained, FVC can be conveniently integrated into a user-specific variant calling pipeline.

Associated Data

Dryad | 10.5061/dryad.hdr7sqvkm

Grants

  1. UL1 TR001863/United States

MeSH Terms

Exome
High-Throughput Nucleotide Sequencing
Humans
Polymorphism, Single Nucleotide
Software
Whole Genome Sequencing
