Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification.

Jianqing Fan, Yang Feng, Jiancheng Jiang, Xin Tong
Author Information
  1. Jianqing Fan: Frederick L. Moore Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, 08544 (jqfan@princeton.edu).
  2. Yang Feng: Assistant Professor, Department of Statistics, Columbia University, New York, NY, 10027 (yangfeng@stat.columbia.edu).
  3. Jiancheng Jiang: Associate Professor, Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, 28223 (jjiang1@uncc.edu).
  4. Xin Tong: Assistant Professor, Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, 90089 (xint@marshall.usc.edu).

Abstract

We propose a high-dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of the log ratios of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical studies, FANS is compared with competing methods to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing.
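
The two stages described in the abstract (transform each feature by an estimated marginal density ratio, then fit a penalized logistic regression on the transformed features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The following are assumptions made only for illustration: the Gaussian kernel with a Silverman-style bandwidth, the split of the training data into a density-estimation half and a regression half, the proximal-gradient solver, the penalty level `lam`, and the simulated data. All function names are hypothetical.

```python
import numpy as np

def gaussian_kde_1d(train, x, bandwidth=None):
    """Gaussian kernel density estimate built from `train`, evaluated at `x`."""
    train = np.asarray(train, dtype=float)
    x = np.asarray(x, dtype=float)
    if bandwidth is None:
        # Silverman-style rule of thumb (assumed default, not from the abstract)
        bandwidth = 1.06 * train.std(ddof=1) * len(train) ** (-0.2)
    z = (x[:, None] - train[None, :]) / bandwidth
    norm = len(train) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

def log_density_ratio_features(X_fit, y_fit, X_new, eps=1e-10):
    """Replace feature j by an estimate of log f_j(x_j) - log g_j(x_j),
    where f_j and g_j are the class-1 and class-0 marginal densities."""
    Z = np.empty_like(X_new, dtype=float)
    for j in range(X_new.shape[1]):
        f = gaussian_kde_1d(X_fit[y_fit == 1, j], X_new[:, j])
        g = gaussian_kde_1d(X_fit[y_fit == 0, j], X_new[:, j])
        Z[:, j] = np.log(f + eps) - np.log(g + eps)  # eps guards against log(0)
    return Z

def l1_logistic(Z, y, lam=0.02, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient descent.
    Returns [intercept, coefficients]; the intercept is not penalized."""
    n = Z.shape[0]
    A = np.hstack([np.ones((n, 1)), Z])
    step = 4.0 * n / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz bound of the gradient
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-np.clip(A @ beta, -30.0, 30.0)))
        beta = beta - step * (A.T @ (prob - y)) / n
        # soft-thresholding performs the selection step on the augmented features
        beta[1:] = np.sign(beta[1:]) * np.maximum(np.abs(beta[1:]) - step * lam, 0.0)
    return beta

# Simulated data (illustrative only): the classes differ in the variance of the
# first two features, so no linear boundary in the raw features separates them.
rng = np.random.default_rng(0)

def simulate(n, p=10):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, p))
    X[y == 1, :2] *= 3.0
    return X, y

X_train, y_train = simulate(400)
X_test, y_test = simulate(400)

# Assumed split: first half estimates densities, second half fits the regression.
X_dens, y_dens = X_train[:200], y_train[:200]
X_reg, y_reg = X_train[200:], y_train[200:]

beta = l1_logistic(log_density_ratio_features(X_dens, y_dens, X_reg), y_reg)

Z_test = log_density_ratio_features(X_dens, y_dens, X_test)
pred = (beta[0] + Z_test @ beta[1:] > 0).astype(int)
accuracy = (pred == y_test).mean()
print(f"test accuracy: {accuracy:.3f}")
print("coefficients on augmented features:", np.round(beta[1:], 2))
```

Estimating the densities and fitting the regression on separate halves keeps the transformed features from being evaluated on the same points used to build them. The per-feature loop in `log_density_ratio_features` is independent across features, which is the kind of structure that lends itself to parallel computation.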

Grants

  1. R01 GM072611/NIGMS NIH HHS
  2. R01 GM100474/NIGMS NIH HHS
