Limitations of current high-throughput sequencing technologies lead to biased expression estimates of endogenous retroviral elements.

Konstantina Kitsou, Aris Katzourakis, Gkikas Magiorkinis
Author Information
  1. Konstantina Kitsou: Department of Hygiene, Epidemiology and Medical Statistics, National and Kapodistrian University of Athens, Athens 11527, Greece.
  2. Aris Katzourakis: Department of Zoology, University of Oxford, Oxford OX1 4BH, UK.
  3. Gkikas Magiorkinis: Department of Hygiene, Epidemiology and Medical Statistics, National and Kapodistrian University of Athens, Athens 11527, Greece.

Abstract

Human endogenous retroviruses (HERVs), the remnants of ancient germline retroviral integrations, comprise almost 8% of the human genome. The elucidation of their biological roles is hampered by our inability to link HERV mRNA and protein production with specific HERV loci. To solve the riddle of the integration-specific RNA expression of HERVs, several bioinformatics approaches have been proposed; however, no single approach seems to yield optimal results, owing to the repetitiveness of HERV integrations. The performance of existing bioinformatics pipelines has been evaluated against real-world datasets whose true expression profile is unknown; thus, the accuracy of widely used approaches remains unclear. Here, we simulated mRNA production from specific HERV integrations to evaluate how accurately second- and third-generation sequencing technologies, together with widely used bioinformatic approaches, describe integration-specific expression. We demonstrate that, while a HERV-family approach offers accurate results, per-integration analyses of HERV expression suffer from substantial expression bias, which is only partially mitigated by algorithms developed for calculating per-integration HERV expression and is more pronounced in recent integrations. Hence, this bias could erroneously lead to inferences that appear biologically meaningful. Finally, we demonstrate the merits of accurate long-read high-throughput sequencing technologies in the resolution of per-locus HERV expression.
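The contrast between family-level and per-integration accuracy can be illustrated with a toy model. The sketch below is not the simulation or any pipeline used in the study; it assumes a hypothetical family of loci in which a fixed fraction of reads from each locus carries distinguishing variants and maps uniquely, while the remaining reads are ambiguous and are split evenly across all loci. The function name `assign_reads`, the locus labels, and the `unique_fraction` parameter are all illustrative assumptions.

```python
from fractions import Fraction


def assign_reads(true_counts, unique_fraction):
    """Toy read assignment for one repetitive family (illustrative only).

    true_counts: dict mapping a locus name to its true number of reads.
    unique_fraction: share of reads from any locus that maps uniquely;
        lower values mimic recent, highly similar integrations.
    Ambiguous reads are split evenly across every locus in the family.
    """
    n_loci = len(true_counts)
    estimates = {locus: Fraction(0) for locus in true_counts}
    for locus, n_reads in true_counts.items():
        unique = n_reads * unique_fraction
        ambiguous = n_reads - unique
        # Uniquely mapping reads are credited to their true source.
        estimates[locus] += unique
        # Ambiguous reads are shared equally, including with silent loci.
        for other in true_counts:
            estimates[other] += ambiguous / n_loci
    return estimates


if __name__ == "__main__":
    # Hypothetical family: only locus_A is truly expressed.
    truth = {"locus_A": 1000, "locus_B": 0}
    for label, frac in [("recent", Fraction(1, 5)), ("ancient", Fraction(9, 10))]:
        est = assign_reads(truth, frac)
        per_locus = {name: int(value) for name, value in est.items()}
        print(label, per_locus, "family total:", int(sum(est.values())))
```

Under these assumptions the family total always equals the true total, whereas the silent locus receives spurious reads, and more of them when fewer reads map uniquely. Real aligners and reassignment algorithms behave in more complex ways, so the sketch only conveys the direction of the effect.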
