STR-based feature extraction and selection for genetic feature discovery in neurological disease genes.

Jasbir Dhaliwal, John Wagner
Author Information
  1. Jasbir Dhaliwal: Faculty of Information Technology, Monash University, Clayton, VIC, 3800, Australia. jasbir.dhaliwal@monash.edu.
  2. John Wagner: PsychoGenics Inc., Paramus, New Jersey, 07652, United States of America.

Abstract

Gene expression, which is often determined by single nucleotide polymorphisms, short repeated sequences known as short tandem repeats (STRs), structural variants, and environmental factors, provides the means for an organism to produce the gene products it needs to live. Variation in expression levels, sometimes known as enrichment patterns, has been associated with disease progression, and STR enrichment patterns have therefore recently gained interest as potential genetic markers for disease progression. However, to the best of our knowledge, no study has evaluated STRs, particularly trinucleotide sequences, as machine learning features for classifying neurological disease genes with the aim of discovering genetic features. In this paper, we propose a new metric and a novel feature extraction and selection algorithm that uses statistically significant STR-based features and their respective enrichment patterns to create a statistically significant feature set. The proposed metric shows that genes in the neurological disease family have a non-random AA, AT, TA, TG, and TT enrichment pattern. This is an important result, as it supports prior research establishing that certain trinucleotides, such as AAT, ATA, ATT, TAT, and TTA, are favored during protein misfolding, whereas trinucleotides such as TAA, TAG, and TGA, which are stop codons, are favored during premature termination codon mutations. This suggests that the metric has the potential to identify patterns that may be genetic features in a sample of neurological genes. Moreover, the practical performance and high prediction results of the statistically significant STR-based feature set indicate that variations in STR enrichment patterns can distinguish neurological disease genes. In conclusion, the proposed approach may have the potential to discover differential genetic features for other diseases.
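To make the idea of STR-based features concrete, the sketch below counts tandem-repeat tracts for every trinucleotide unit in a DNA sequence and normalizes the counts per kilobase. It is only an illustration and not the metric or algorithm proposed in the paper: the function names, the `min_copies` threshold, the per-kilobase scaling, and the exclusion of homopolymer units are all assumptions made here, and no statistical significance testing is performed.

```python
import re
from collections import Counter
from itertools import product


def tandem_repeat_counts(seq, unit_len=3, min_copies=2):
    """Count tandem-repeat tracts for each repeat unit of length unit_len.

    A tract is a run of at least min_copies back-to-back copies of the unit.
    Homopolymer units (e.g. AAA) are skipped; this is an illustrative choice.
    """
    seq = seq.upper()
    counts = Counter()
    for letters in product("ACGT", repeat=unit_len):
        unit = "".join(letters)
        if len(set(unit)) == 1:
            continue
        pattern = re.compile(r"(?:%s){%d,}" % (unit, min_copies))
        # finditer yields non-overlapping, greedy matches: one per tract
        counts[unit] = sum(1 for _ in pattern.finditer(seq))
    return counts


def str_feature_vector(seq, unit_len=3, min_copies=2):
    """Return (unit names, tract counts per kilobase) in a fixed unit order."""
    counts = tandem_repeat_counts(seq, unit_len, min_copies)
    units = sorted(counts)
    scale = 1000.0 / max(len(seq), 1)
    return units, [counts[u] * scale for u in units]


if __name__ == "__main__":
    # Toy sequence (hypothetical, not real gene data)
    toy = "GG" + "AAT" * 4 + "CC" + "TTA" * 2 + "G"
    units, vector = str_feature_vector(toy)
    for unit, value in zip(units, vector):
        if value > 0:
            print(unit, round(value, 2))
```

Note that a single repeat tract also matches the cyclic rotations of its unit (an AAT tract contains ATA and TAA runs), so a real pipeline would likely need to decide whether to merge rotations. Vectors of this kind, computed per gene, could then be passed to any standard classifier after a separate feature selection step.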

MeSH Term

Mutation
Microsatellite Repeats
Codon
Polymorphism, Single Nucleotide

Chemicals

Codon
