Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis.

Masao Ueki, Gen Tamiya
Author Information
  1. Masao Ueki: Advanced Molecular Epidemiology Research Institute, Faculty of Medicine, Yamagata University, 2-2-2 Iida-Nishi, Yamagata, Yamagata, Japan. uekimrsd@nifty.com

Abstract

BACKGROUND: Genome-wide gene-gene interaction analysis using single nucleotide polymorphisms (SNPs) is an attractive approach to identifying genetic components that confer susceptibility to complex human diseases. However, testing each SNP-SNP pair individually, as in a common genome-wide association study (GWAS), makes it difficult to set an overall p-value threshold because of the complicated correlation structure; this is the multiple testing problem, which leads to an unacceptable number of false negative results. In addition, the number of SNP-SNP pairs far exceeds the sample size, the so-called large p, small n problem, which precludes simultaneous analysis by multiple regression. A method that overcomes these issues is therefore needed.
RESULTS: We adopt a recent method for ultrahigh-dimensional variable selection, termed sure independence screening (SIS), to handle the very large number of SNP-SNP interactions by including them as predictor variables in logistic regression. We propose a ranking strategy based on promising dummy coding methods, followed by the variable selection procedure of the SIS method, suitably modified for gene-gene interaction analysis. We also implemented the procedures in a software program, EPISIS, using cost-effective GPGPU (general-purpose computing on graphics processing units) technology. EPISIS can complete an exhaustive search for SNP-SNP interactions in a standard GWAS dataset within several hours. The proposed method performs well in simulation experiments and in an application to real WTCCC (Wellcome Trust Case Control Consortium) data.
CONCLUSIONS: Built on machine-learning principles, the proposed method provides a powerful and flexible genome-wide search for various patterns of gene-gene interaction.
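
The screening idea summarized in RESULTS can be illustrated with a minimal Python sketch. It is not the EPISIS implementation. It assumes genotypes coded 0/1/2, a simple product coding for each SNP-SNP pair, absolute marginal correlation with case/control status as the ranking score, and a cut-off of n / log(n) retained pairs, which is a common choice with SIS. The function name `sis_interaction_screen` and its parameters are hypothetical.

```python
import itertools

import numpy as np


def sis_interaction_screen(genotypes, labels, keep=None):
    """Rank SNP-SNP product terms by absolute marginal correlation with labels.

    genotypes: (n, p) array of 0/1/2 allele counts (assumed coding).
    labels:    length-n array of 0/1 case/control status.
    keep:      number of pairs to retain; defaults to n / log(n).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, p = genotypes.shape
    if keep is None:
        keep = int(n / np.log(n))  # assumed SIS-style cut-off
    y = y - y.mean()
    scores = []
    for j, k in itertools.combinations(range(p), 2):
        x = genotypes[:, j] * genotypes[:, k]  # product coding (assumption)
        x = x - x.mean()
        denom = np.sqrt((x @ x) * (y @ y))
        score = abs(x @ y) / denom if denom > 0 else 0.0
        scores.append((score, j, k))
    scores.sort(reverse=True)  # highest marginal score first
    return [(j, k) for _, j, k in scores[:keep]]


# Simulated example: SNPs 3 and 7 interact; all other SNPs are noise.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(400, 20))
logit = -2.0 + 1.0 * G[:, 3] * G[:, 7]
status = (rng.random(400) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
top_pairs = sis_interaction_screen(G, status, keep=10)
```

In this simulation the interacting pair (3, 7) would be expected to rank among the retained pairs. A fuller analysis would then fit a penalized logistic regression on the retained pairs only, which is the step that the screening makes feasible when pairs vastly outnumber samples.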

Grants

  1. G1001799/Medical Research Council

MeSH Terms

Algorithms
Artificial Intelligence
Computer Simulation
Gene Expression
Genetic Predisposition to Disease
Genome, Human
Genome-Wide Association Study
Humans
Logistic Models
Models, Genetic
Polymorphism, Single Nucleotide
Software
