Robust classification using average correlations as features (ACF).

Yannis Schumann, Julia E Neumann, Philipp Neumann
Author Information
  1. Yannis Schumann: Chair for High Performance Computing, Helmut-Schmidt University, Hamburg, Germany. schumany@hsu-hh.de.
  2. Julia E Neumann: Center for Molecular Neurobiology Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.
  3. Philipp Neumann: Chair for High Performance Computing, Helmut-Schmidt University, Hamburg, Germany.

Abstract

MOTIVATION: Single-cell transcriptomics and other omics technologies commonly produce datasets with large fractions of missing values. Researchers often either restrict their analysis to features that were measured for every instance of the dataset, thereby accepting a severe loss of information, or use imputation, which can lead to erroneous results. Pairwise metrics allow for imputation-free classification with minimal loss of data.
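To illustrate what a pairwise metric means here, the following is a minimal sketch of a Pearson correlation computed only over the features observed in both instances, so that no value is imputed and no feature is discarded globally. The function name `pairwise_pearson`, the `min_overlap` parameter, and the use of NaN to encode missing values are assumptions made for this illustration, not details taken from the paper.

```python
import math

def pairwise_pearson(x, y, min_overlap=3):
    """Pearson correlation over the features observed in both x and y.

    Missing values are encoded as float('nan'). Returns nan when fewer than
    `min_overlap` features are shared or when either vector is constant
    on the shared features.
    """
    # Keep only positions where both instances have a measurement.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    if n < min_overlap:
        return math.nan
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0.0 or var_b == 0.0:
        return math.nan
    return cov / math.sqrt(var_a * var_b)

nan = math.nan
# Two hypothetical expression profiles with different missing features.
cell_a = [1.0, 2.0, nan, 4.0, 5.0]
cell_b = [2.0, 4.0, 6.0, nan, 10.0]
r = pairwise_pearson(cell_a, cell_b)  # uses positions 0, 1 and 4 only
```

Each pair of instances uses its own set of shared features, which is why the loss of data stays small compared with keeping only features that are complete across the whole dataset.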
RESULTS: With pairwise correlations as the metric, state-of-the-art approaches to classification include K-nearest-neighbor (KNN) classifiers and distribution-based classification. Our novel method, termed average correlations as features (ACF), significantly outperforms those approaches by training tunable machine learning models on inter-class and intra-class correlations. We characterize our approach in simulation studies and demonstrate its classification performance on real-world datasets from single-cell RNA sequencing and bottom-up proteomics. Furthermore, we demonstrate that variants of our method offer superior flexibility and performance over KNN classifiers and can be used in conjunction with other machine learning methods. In summary, ACF is a flexible method that enables missing-value-tolerant classification with minimal loss of data.
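The idea of using average correlations as features can be sketched as follows, under stated assumptions: each sample is described by its mean pairwise correlation with the reference samples of every class, and a model is then trained on these few features. The helper names, the synthetic data, the exclusion of self-correlations, and the least-squares classifier (a stand-in for whichever tunable model is actually used) are all hypothetical choices for this illustration; the published method may differ in its details.

```python
import numpy as np

def pairwise_corr(x, y, min_overlap=3):
    """Pearson correlation restricted to features observed in both vectors."""
    mask = ~np.isnan(x) & ~np.isnan(y)
    if mask.sum() < min_overlap:
        return np.nan
    a = x[mask] - x[mask].mean()
    b = y[mask] - y[mask].mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else np.nan

def acf_features(X, X_ref, y_ref, classes, exclude_self=False):
    """Mean correlation of each row of X with the reference samples of each class."""
    feats = np.zeros((len(X), len(classes)))
    for i, x in enumerate(X):
        corr = np.array([pairwise_corr(x, ref) for ref in X_ref])
        if exclude_self:
            corr[i] = np.nan  # X is X_ref: drop the trivial self-correlation
        for k, c in enumerate(classes):
            vals = corr[y_ref == c]
            feats[i, k] = np.nanmean(vals) if np.any(~np.isnan(vals)) else 0.0
    return feats

# Synthetic data (an assumption): two class profiles, noise, 30% missing values.
rng = np.random.default_rng(0)
classes = np.array([0, 1])
profiles = rng.normal(size=(2, 50))

def make_data(n_per_class):
    y = np.repeat(classes, n_per_class)
    X = profiles[y] + 0.5 * rng.normal(size=(len(y), profiles.shape[1]))
    X[rng.random(X.shape) < 0.3] = np.nan
    return X, y

X_train, y_train = make_data(20)
X_test, y_test = make_data(10)

# Intra-class and inter-class average correlations become the model inputs.
F_train = acf_features(X_train, X_train, y_train, classes, exclude_self=True)
F_test = acf_features(X_test, X_train, y_train, classes)

# Stand-in model: least-squares fit to one-hot targets with a bias column.
A_train = np.hstack([F_train, np.ones((len(F_train), 1))])
A_test = np.hstack([F_test, np.ones((len(F_test), 1))])
onehot = (y_train[:, None] == classes[None, :]).astype(float)
W = np.linalg.lstsq(A_train, onehot, rcond=None)[0]
pred = classes[np.argmax(A_test @ W, axis=1)]
accuracy = float(np.mean(pred == y_test))
```

Because the feature matrix has only one column per class, any downstream model can be swapped in for the least-squares step, which is one way to read the flexibility the abstract refers to.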

MeSH Terms

Machine Learning
Computer Simulation
Gene Expression Profiling
Cluster Analysis
Algorithms
