Optimizer's dilemma: optimization strongly influences model selection in transcriptomic prediction.

Jake Crawford, Maria Chikina, Casey S Greene

Author Information

Jake Crawford: Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States.
Maria Chikina: Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, United States.
Casey S Greene: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States.

PMID: 38282973 DOI: 10.1093/bioadv/vbae004

Motivation: Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, optimizers are not often noted. We applied two implementations of LASSO logistic regression from Python's scikit-learn package, each using a different optimization approach (coordinate descent, as implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. For varying levels of regularization, we compared performance and model sparsity between optimizers.
Results: After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best for high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe that the choice of optimizer should be clearly reported as part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated.
Availability and implementation: The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance/regularization strength curves for all genes in the Vogelstein (2013) dataset are available at https://doi.org/10.6084/m9.figshare.22728644.

PLoS Comput Biol. 2022 Mar 11;18(3):e1009926 [PMID: 35275931]
Mol Cell. 2015 Jun 18;58(6):925-34 [PMID: 26091341]
Mol Cancer Ther. 2023 Feb 1;22(2):215-226 [PMID: 36228090]
Genome Biol. 2022 Jun 27;23(1):137 [PMID: 35761387]
Nucleic Acids Res. 2016 May 19;44(9):4189-99 [PMID: 26873923]
J Clin Oncol. 2009 Mar 10;27(8):1160-7 [PMID: 19204204]
Pac Symp Biocomput. 2020;25:343-354 [PMID: 31797609]
Nat Genet. 2013 Oct;45(10):1113-20 [PMID: 24071849]
Proc Natl Acad Sci U S A. 2020 Nov 10;117(45):28068-28079 [PMID: 33097661]
Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30063-30070 [PMID: 32332161]
Cell. 2017 Jul 27;170(3):564-576.e16 [PMID: 28753430]
Nat Commun. 2019 Sep 19;10(1):4286 [PMID: 31537809]
Breast. 2009 Oct;18 Suppl 3:S141-5 [PMID: 19914534]
Bioinformatics. 2020 Jun 1;36(11):3457-3465 [PMID: 32129827]
Cell Syst. 2018 Mar 28;6(3):271-281.e7 [PMID: 29596782]
J Stat Softw. 2010;33(1):1-22 [PMID: 20808728]
Genome Biol. 2020 May 11;21(1):109 [PMID: 32393369]
Genome Biol. 2011;12(4):R41 [PMID: 21527027]
Genomics Proteomics Bioinformatics. 2024 Sep 26 [PMID: 39325855]
Science. 2013 Mar 29;339(6127):1546-58 [PMID: 23539594]
PLoS Comput Biol. 2019 Jun 24;15(6):e1007128 [PMID: 31233491]
NPJ Genom Med. 2022 Oct 17;7(1):58 [PMID: 36253482]
Cell Rep. 2018 Apr 3;23(1):172-180.e3 [PMID: 29617658]
Cancer Rep (Hoboken). 2023 Sep;6(9):e1874 [PMID: 37533331]
PLoS One. 2020 Nov 9;15(11):e0241514 [PMID: 33166334]
Cell Rep. 2018 Apr 3;23(1):239-254.e6 [PMID: 29617664]

R01 CA237170/NCI NIH HHS
R01 HG010067/NHGRI NIH HHS

Journal Article
