Simultaneous feature selection and outlier detection with optimality guarantees.

Luca Insolia, Ana Kenney, Francesca Chiaromonte, Giovanni Felici
Author Information
  1. Luca Insolia: Faculty of Sciences, Scuola Normale Superiore, Pisa, Italy.
  2. Ana Kenney: Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania, USA.
  3. Francesca Chiaromonte: Institute of Economics & EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy.
  4. Giovanni Felici: Istituto di Analisi dei Sistemi ed Informatica, Consiglio Nazionale delle Ricerche, Rome, Italy.

Abstract

Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.
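
One common way to write the kind of problem the abstract describes is a mean-shift outlier model with cardinality constraints. The block below is a sketch under assumptions, not a formulation taken from this record: the symbols $y$, $X$, $\beta$, $\phi$, $\varepsilon$, the sparsity bounds $k_p$ and $k_n$, the big-$M$ constants $M_\beta$ and $M_\phi$, and the binary indicators $z^\beta$ and $z^\phi$ are all hypothetical names introduced here for illustration.

```latex
% Mean-shift outlier model (assumed notation): n observations, p features
y = X\beta + \phi + \varepsilon, \qquad y \in \mathbb{R}^n,\; X \in \mathbb{R}^{n \times p}

% Joint sparse estimation: at most k_p selected features, at most k_n flagged outliers
\min_{\beta \in \mathbb{R}^p,\; \phi \in \mathbb{R}^n} \; \| y - X\beta - \phi \|_2^2
\quad \text{s.t.} \quad \|\beta\|_0 \le k_p, \quad \|\phi\|_0 \le k_n

% One standard mixed-integer encoding of the L0 constraints via big-M bounds
|\beta_j| \le M_\beta \, z^\beta_j, \quad |\phi_i| \le M_\phi \, z^\phi_i, \quad
\sum_{j=1}^{p} z^\beta_j \le k_p, \quad \sum_{i=1}^{n} z^\phi_i \le k_n, \quad
z^\beta_j,\, z^\phi_i \in \{0, 1\}
```

In this sketch a nonzero $\phi_i$ marks observation $i$ as an outlier and a nonzero $\beta_j$ marks feature $j$ as selected, which is how a single optimization can handle both tasks at once.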

Grants

  1. T32 LM012415/NLM NIH HHS
  2. 5T32LM012415-03/NIH HHS

MeSH Term

Child
Humans
Pediatric Obesity
Algorithms
Sample Size
Probability
