SWISS MADE: Standardized WithIn Class Sum of Squares to evaluate methodologies and dataset elements.

Christopher R Cabanski, Yuan Qi, Xiaoying Yin, Eric Bair, Michele C Hayward, Cheng Fan, Jianying Li, Matthew D Wilkerson, J S Marron, Charles M Perou, D Neil Hayes
Author Information
  1. Christopher R Cabanski: Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina, United States of America.

Abstract

Contemporary high-dimensional biological assays, such as mRNA expression microarrays, regularly involve multiple data processing steps, such as experimental processing, computational processing, sample selection, or feature selection (i.e., gene selection), before any biological conclusions are drawn. These steps can dramatically change the interpretation of an experiment, yet their evaluation has received limited attention in the literature. Comparing processing methods is not straightforward, and investigators are often unsure which method is best. We present a simple statistical tool, Standardized WithIn class Sum of Squares (SWISS), that allows investigators to compare alternative data processing methods, such as different experimental methods, normalizations, or technologies, on a dataset in terms of how well they cluster a priori biological classes. SWISS uses Euclidean distance to determine which method clusters the data elements more tightly according to their a priori classifications. We apply SWISS to three different gene expression applications. The first application uses four different datasets to compare different experimental methods, normalizations, and gene sets. The second application, using data from the MicroArray Quality Control (MAQC) project, compares different microarray platforms. The third application compares different technologies: a single Agilent two-color microarray versus one lane of RNA-Seq. These applications indicate the variety of problems that SWISS can help solve. The SWISS analysis of one-color versus two-color microarrays gives investigators who use two-color arrays the opportunity to review their results in light of a single-channel analysis, with all of the associated benefits offered by this design. Analysis of the MAQC data shows differential intersite reproducibility by array platform. SWISS also shows that one lane of RNA-Seq clusters data by biological phenotypes as well as a single Agilent two-color microarray.
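
The idea of scoring a processing method by its within-class sum of squares can be illustrated with a minimal sketch. The abstract does not spell out the standardization, so this sketch assumes the within-class sum of squared Euclidean distances is divided by the total sum of squared distances to the overall mean; under that assumption a lower score means tighter clustering of the a priori classes. The function names and toy data are hypothetical and are not taken from the authors' software.

```python
from collections import defaultdict


def _sum_sq_to_mean(rows):
    """Sum of squared Euclidean distances from each row to the mean row."""
    n = len(rows)
    mean = [sum(col) / n for col in zip(*rows)]
    return sum((x - m) ** 2 for row in rows for x, m in zip(row, mean))


def swiss_score(data, labels):
    """Within-class sum of squares divided by total sum of squares.

    Assumption: standardization is by the total sum of squares; the
    abstract itself does not state the exact formula.
    data   -- list of samples, each a list of feature values
    labels -- a priori class label for each sample
    """
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same length")
    total_ss = _sum_sq_to_mean(data)
    if total_ss == 0:
        raise ValueError("data has no variation")
    # Group samples by their a priori class.
    by_class = defaultdict(list)
    for row, label in zip(data, labels):
        by_class[label].append(row)
    within_ss = sum(_sum_sq_to_mean(rows) for rows in by_class.values())
    return within_ss / total_ss


# Hypothetical toy data: the same four samples under two processing methods.
labels = ["A", "A", "B", "B"]
method_1 = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
method_2 = [[0.0, 0.0], [0.0, 5.0], [10.0, 0.0], [10.0, 5.0]]

score_1 = swiss_score(method_1, labels)
score_2 = swiss_score(method_2, labels)
print(f"method 1: {score_1:.4f}, method 2: {score_2:.4f}")
```

In this toy comparison the first method keeps each class tighter relative to the overall spread, so it receives the lower score. Because the ratio is scale-free, methods whose outputs are on different measurement scales can still be compared on the same samples and class labels.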

Grants

  1. P50 CA058223/NCI NIH HHS
  2. P30 ES010126/NIEHS NIH HHS
  3. U24 CA126554/NCI NIH HHS
  4. F32 CA142039/NCI NIH HHS
  5. KL2 RR025746/NCRR NIH HHS

MeSH Term

Algorithms
Cluster Analysis
Computational Biology
Computers
Databases, Genetic
Gene Expression Profiling
Humans
Models, Statistical
Oligonucleotide Array Sequence Analysis
Programming Languages
Quality Control
RNA
Reproducibility of Results
Software

Chemicals

RNA
