Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms.

Yu Guo, Armin Graber, Robert N McBurney, Raji Balasubramanian
Author Information
  1. Yu Guo: BG Medicine, Inc., 610 Lincoln St., Waltham, MA 02451, USA.

Abstract

BACKGROUND: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant to the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying the signal-to-noise ratio in the dataset, the imbalance in class distribution and the metric chosen to quantify classifier performance. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay-based techniques.
RESULTS: The analysis of data from seven 'omics' studies revealed that the average effect size observed in human studies was markedly lower than that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper.
CONCLUSION: No single classifier performed optimally under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
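
The idea of estimating power by simulation can be illustrated with a minimal Python sketch. It is not the MVpower library and does not reproduce the paper's procedure. Several choices here are assumptions made only for illustration: a plain nearest-centroid rule stands in for the four classifiers (it is not PAM, which uses shrunken centroids), features are independent unit-variance Gaussians, classes are balanced, and "power" is taken to mean the fraction of simulated studies in which test-set accuracy reaches a chosen target. All function and parameter names are hypothetical.

```python
import random
import statistics


def simulate_dataset(n_per_class, n_features, n_informative, effect_size, rng):
    """Draw a balanced two-class dataset of unit-variance Gaussian features.

    Assumption: the first n_informative features have their class-1 mean
    shifted by effect_size; all remaining features are pure noise.
    """
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(effect_size if (label == 1 and j < n_informative) else 0.0, 1.0)
                for j in range(n_features)
            ]
            X.append(row)
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Return the per-class mean vector (a plain nearest-centroid model)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))


def estimate_power(n_per_class, n_features=50, n_informative=5, effect_size=1.0,
                   target_accuracy=0.7, n_sim=100, n_test_per_class=50, seed=0):
    """Fraction of simulated studies whose test accuracy meets the target.

    This definition of power is an assumption of the sketch.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        X_train, y_train = simulate_dataset(
            n_per_class, n_features, n_informative, effect_size, rng)
        X_test, y_test = simulate_dataset(
            n_test_per_class, n_features, n_informative, effect_size, rng)
        centroids = fit_centroids(X_train, y_train)
        correct = sum(predict(centroids, x) == lab for x, lab in zip(X_test, y_test))
        if correct / len(y_test) >= target_accuracy:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    # Scan candidate sample sizes for one assumed effect size.
    for n in (5, 10, 20, 40):
        print(n, estimate_power(n_per_class=n, effect_size=0.5))
```

Scanning `n_per_class` for an effect size thought plausible for the study population gives a rough power curve. Skewed feature distributions or unbalanced classes would require changing `simulate_dataset`, and a different performance metric would replace the accuracy check in `estimate_power`.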

MeSH Terms

Algorithms
Animals
Classification
Databases, Factual
Gene Expression Profiling
Humans
Models, Statistical
Oligonucleotide Array Sequence Analysis
Pattern Recognition, Automated
Sample Size
