Bayesian Correlation Analysis for Sequence Count Data.

Daniel Sánchez-Taltavull, Parameswaran Ramachandran, Nelson Lau, Theodore J Perkins
Author Information
  1. Daniel Sánchez-Taltavull: Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada.
  2. Parameswaran Ramachandran: Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada.
  3. Nelson Lau: Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada.
  4. Theodore J Perkins: Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada.

Abstract

Evaluating the similarity of different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities' measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low, especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities' signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on two RNA-seq datasets and one miRNA-seq dataset.
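
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' published method. It assumes each entity's read fraction in each experiment has an independent Beta posterior (binomial counts with a Beta(alpha, beta) prior), and it lets posterior uncertainty inflate each entity's variance while the covariance is computed from posterior means only. The function name `bayesian_correlation`, its parameters, and the default uniform prior (alpha = beta = 1) are assumptions made for illustration; the abstract itself warns that uniform priors can bias the estimate when sequencing depths vary widely.

```python
import numpy as np

def bayesian_correlation(counts, depths, alpha=1.0, beta=1.0):
    """Sketch of an uncertainty-aware correlation between count profiles.

    counts: array of shape (entities, experiments) with read counts.
    depths: array of shape (experiments,) with total reads per experiment.
    alpha, beta: Beta prior parameters (assumed; 1, 1 is the uniform prior).
    """
    counts = np.asarray(counts, dtype=float)
    depths = np.asarray(depths, dtype=float)
    n_experiments = counts.shape[1]

    # Beta posterior for each entity's read fraction in each experiment.
    a = alpha + counts
    b = beta + depths - counts
    post_mean = a / (a + b)
    post_var = a * b / ((a + b) ** 2 * (a + b + 1.0))

    # Variance per entity: spread of posterior means across experiments
    # plus the average posterior uncertainty within experiments.
    mu = post_mean.mean(axis=1)
    second_moment = (post_var + post_mean ** 2).mean(axis=1)
    total_var = second_moment - mu ** 2

    # Covariance between different entities uses posterior means only,
    # because their posteriors are assumed independent.
    cov = post_mean @ post_mean.T / n_experiments - np.outer(mu, mu)
    np.fill_diagonal(cov, total_var)

    return cov / np.sqrt(np.outer(total_var, total_var))


# Two low-count and two high-count entities with identical relative profiles.
counts = np.array([[1, 2, 3, 4],
                   [2, 4, 6, 8],
                   [1000, 2000, 3000, 4000],
                   [2000, 4000, 6000, 8000]])
depths = np.full(4, 1_000_000)
r = bayesian_correlation(counts, depths)
```

On this toy input a plain Pearson correlation of the raw counts is 1 for every pair, since all rows are proportional. In the sketch, the high-count pair stays close to 1 while the low-count pair is pulled well below it, which mirrors the behavior the abstract describes.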

MeSH Terms

Algorithms
Bayes Theorem
Cluster Analysis
Computational Biology
Erythropoiesis
Gene Expression Profiling
High-Throughput Nucleotide Sequencing
Humans
MicroRNAs
RNA
Sequence Analysis, RNA

Chemicals

MicroRNAs
RNA
