A comprehensive survey on computational learning methods for analysis of gene expression data.

Nikita Bhandari, Rahee Walambe, Ketan Kotecha, Satyajeet P Khare
Author Information
  1. Nikita Bhandari: Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India.
  2. Rahee Walambe: Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India.
  3. Ketan Kotecha: Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India.
  4. Satyajeet P Khare: Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India.

Abstract

Computational analysis methods, including machine learning, have had a significant impact on genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods have been used for comparative analysis of gene expression data. However, more complex analyses, such as classification of sample observations or discovery of feature genes, require sophisticated computational approaches. In this review, we compile various statistical and computational tools used in the analysis of expression microarray data. Although the methods are discussed in the context of expression microarrays, they can also be applied to the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values and the methods and approaches usually employed for their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery, along with their evaluation parameters, are described in detail. We believe that this detailed review will help users select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
