Fast Context-Aware Analysis of Genome Annotation Colocalization.

Askar Gafurov, Tomáš Vinař, Paul Medvedev, Broňa Brejová
Author Information
  1. Askar Gafurov: Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia.
  2. Tomáš Vinař: Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia.
  3. Paul Medvedev: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
  4. Broňa Brejová: Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia.

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing random unrelated annotations. To incorporate more background information into such analyses, we propose a new null model based on a Markov chain that differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or assembly gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. Moreover, the use of genomic contexts to correct for GC bias resulted in the reversal of some previously published findings.
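The final step described in the abstract, turning an expectation and variance into a p-value through a normal approximation, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `shared_bases` and `normal_p_values` are hypothetical, the choice of "number of shared bases" as the test statistic is an assumption made for illustration, and the null-model mean and variance are taken as given inputs rather than derived from a Markov chain.

```python
import math


def shared_bases(a, b):
    """Count bases covered by both annotations (an assumed test statistic).

    a, b: sorted lists of non-overlapping half-open intervals (start, end).
    """
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def normal_p_values(observed, mean, variance):
    """Return (z, p_enrichment, p_depletion) under a normal approximation.

    mean and variance are the null-model expectation and variance of the
    statistic; here they are simply supplied by the caller.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    p_enrichment = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    p_depletion = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)
    return z, p_enrichment, p_depletion


if __name__ == "__main__":
    query = [(0, 10), (20, 30)]
    reference = [(5, 25)]
    k = shared_bases(query, reference)
    # The mean and variance below are made-up placeholder values.
    print(k, normal_p_values(k, mean=6.0, variance=4.0))
```

A small p_enrichment suggests that the query overlaps the reference more than the null model predicts, and a small p_depletion suggests the opposite. Using `math.erfc` avoids a dependency on SciPy for the normal tail probability.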

Grants

  1. R01 GM146462/NIGMS NIH HHS

MeSH Terms

Algorithms
Molecular Sequence Annotation
Humans
Markov Chains
Genomics
Genome, Human
Base Composition
Models, Genetic
