Cautions about the reliability of pairwise gene correlations based on expression data.

Scott Powers, Matt DeJongh, Aaron A Best, Nathan L Tintle
Author Information
  1. Scott Powers: Department of Statistics, Stanford University, Stanford, CA, USA.
  2. Matt DeJongh: Department of Computer Science, Hope College, Holland, MI, USA.
  3. Aaron A Best: Department of Biology, Hope College, Holland, MI, USA.
  4. Nathan L Tintle: Department of Mathematics, Statistics and Computer Science, Dordt College, Sioux Center, IA, USA.

Abstract

BACKGROUND: Rapid growth in the availability of genome-wide transcript abundance levels through gene expression microarrays and RNAseq promises to provide deep biological insights into the complex, genome-wide transcriptional behavior of single-celled organisms. However, this promise has not yet been fully realized.
RESULTS: We find that computing pairwise gene associations (correlation; mutual information) across a total of 2782 genome-wide expression samples from six diverse bacteria produces unexpectedly large variation in estimates of pairwise gene association, regardless of the metric used, the organism under study, or the number and source of the samples. We trace the cause to sampling bias. In particular, in repositories of expression data (e.g., Gene Expression Omnibus, GEO), many individual genes show small differences in absolute gene expression levels across the set of samples. We demonstrate that these small differences are due mainly to "noise" rather than to "signal" attributable to environmental or genetic perturbations. We show that downstream analysis using the expression levels of genes with small differences yields biased estimates of pairwise association.
CONCLUSIONS: We propose flagging genes with small differences in absolute, RMA-normalized expression levels (e.g., standard deviation less than 0.5) as potentially yielding biased pairwise association metrics. This strategy has the potential to substantially improve confidence in genome-wide conclusions about transcriptional behavior in bacterial organisms. Further work is needed to refine strategies for identifying genes with small differences in expression levels prior to computing gene-gene association metrics.
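
The flagging strategy proposed in the conclusions can be illustrated with a minimal sketch. It assumes a genes-by-samples matrix of RMA-normalized (log2-scale) expression values and uses the example threshold of 0.5 from the abstract. The function names, the synthetic data, and the choice to mask any pair involving a flagged gene are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def flag_low_variance_genes(expr, sd_threshold=0.5):
    """Return a boolean array marking genes whose sample standard deviation
    across samples is below sd_threshold.

    expr: 2-D array, rows = genes, columns = samples (assumed RMA-normalized).
    """
    sd = np.std(expr, axis=1, ddof=1)  # sample SD per gene
    return sd < sd_threshold


def masked_pearson(expr, sd_threshold=0.5):
    """Pairwise Pearson correlations, with NaN wherever either gene in the
    pair is flagged as low-variance (an assumed masking policy)."""
    corr = np.corrcoef(expr)  # genes x genes correlation matrix
    flagged = flag_low_variance_genes(expr, sd_threshold)
    unreliable = flagged[:, None] | flagged[None, :]
    return np.where(unreliable, np.nan, corr)


def make_demo_expression(n_samples=50, seed=0):
    """Synthetic data (hypothetical): two genes sharing a strong signal and
    one gene whose variation is measurement noise only."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(0.0, 2.0, n_samples)
    gene_a = 8.0 + signal + rng.normal(0.0, 0.2, n_samples)
    gene_b = 8.0 + signal + rng.normal(0.0, 0.2, n_samples)
    gene_c = 8.0 + rng.normal(0.0, 0.2, n_samples)  # noise-dominated
    return np.vstack([gene_a, gene_b, gene_c])


expr = make_demo_expression()
flags = flag_low_variance_genes(expr)
corr = masked_pearson(expr)
```

In this synthetic case only the noise-dominated gene is flagged, so its correlations are masked while the correlation between the two signal-bearing genes is retained. A real analysis might instead report flagged pairs with a warning rather than discarding them.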
