Principal component analysis-based filtering improves detection for Affymetrix gene expression arrays.

Jun Lu, Robnet T Kerns, Shyamal D Peddada, Pierre R Bushel
Author Information
  1. Jun Lu: Microarray and Genome Informatics Group, National Institute of Environmental Health Sciences, SRA International, Inc. and Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.

Abstract

Gene expression array technology has reached the stage of being routinely used to study clinical samples in search of diagnostic and prognostic biomarkers. Due to the nature of array experiments, which examine the expression of tens of thousands of genes simultaneously, the number of null hypotheses is large. Hence, multiple testing correction is often necessary to control the number of false positives. However, multiple testing correction can lead to low statistical power in detecting genes that are truly differentially expressed. Filtering out non-informative genes allows for reduction in the number of null hypotheses. While several filtering methods have been suggested, the appropriate way to perform filtering is still debatable. We propose a new filtering strategy for Affymetrix GeneChips®, based on principal component analysis of probe-level gene expression data. Using a wholly defined spike-in data set and one from a diabetes study, we show that filtering by the proportion of variation accounted for by the first principal component (PVAC) provides increased sensitivity in detecting truly differentially expressed genes while controlling false discoveries. We demonstrate that PVAC exhibits equal or better performance than several widely used filtering methods. Furthermore, a data-driven approach that guides the selection of the filtering threshold value is also proposed.
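
The PVAC statistic described in the abstract can be illustrated with a minimal sketch. It assumes that each probe set is available as a samples-by-probes matrix of already preprocessed (for example, log-transformed and normalized) intensities, that each probe is centered across samples before the principal component analysis, and that no scaling is applied. The abstract does not state these preprocessing details, so they are assumptions. The function names `pvac` and `filter_probe_sets`, the toy matrices, and the threshold of 0.5 are hypothetical and are not taken from the paper.

```python
import numpy as np


def pvac(probe_matrix):
    """Proportion of variation accounted for by the first principal component.

    probe_matrix: 2-D array, rows = samples (arrays), columns = probes
    belonging to one probe set. Centering per probe is an assumption.
    """
    x = np.asarray(probe_matrix, dtype=float)
    centered = x - x.mean(axis=0)  # center each probe across samples
    # Squared singular values of the centered matrix are proportional
    # to the variances along the principal components.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total == 0:
        return 0.0  # constant probe set: no variation to explain
    return float(variances[0] / total)


def filter_probe_sets(probe_sets, threshold):
    """Keep the names of probe sets whose PVAC meets the threshold."""
    return [name for name, m in probe_sets.items() if pvac(m) >= threshold]


# Toy data (hypothetical): probes that all track one underlying signal.
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
concordant = np.outer(signal, [1.0, 1.5, 0.8, 1.2])

# Toy data (hypothetical): three probes that vary independently of each other.
discordant = np.array([
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0],
])

probe_sets = {"concordant_set": concordant, "discordant_set": discordant}
kept = filter_probe_sets(probe_sets, threshold=0.5)  # threshold is illustrative
print(pvac(concordant), pvac(discordant), kept)
```

In this sketch, a probe set whose probes rise and fall together across samples has a PVAC close to 1 and is retained, while a probe set whose probes vary independently has a low PVAC and is filtered out before hypothesis testing. Fewer probe sets then enter the multiple testing correction.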

Grants

  1. Z01 ES101744-04/NIEHS NIH HHS
  2. Z01 ES102345-04/NIEHS NIH HHS
  3. /Intramural NIH HHS

MeSH Terms

Animals
Diabetic Cardiomyopathies
Gene Expression Profiling
Oligonucleotide Array Sequence Analysis
Principal Component Analysis
Rats
