Cracking the genetic code with neural networks.

Marc Joiret, Marine Leclercq, Gaspard Lambrechts, Francesca Rapino, Pierre Close, Gilles Louppe, Liesbet Geris
Author Information
  1. Marc Joiret: Biomechanics Research Unit, GIGA in Silico Medicine, Liège University, Liège, Belgium.
  2. Marine Leclercq: Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium.
  3. Gaspard Lambrechts: Department of Electrical Engineering and Computer Science, Artificial Intelligence and Deep Learning, Montefiore Institute, Liège University, Liège, Belgium.
  4. Francesca Rapino: Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium.
  5. Pierre Close: Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium.
  6. Gilles Louppe: Department of Electrical Engineering and Computer Science, Artificial Intelligence and Deep Learning, Montefiore Institute, Liège University, Liège, Belgium.
  7. Liesbet Geris: Biomechanics Research Unit, GIGA in Silico Medicine, Liège University, Liège, Belgium.

Abstract

The genetic code is textbook scientific knowledge that was soundly established without resorting to Artificial Intelligence (AI). The goal of our study was to check whether a neural network could rediscover, on its own, the mapping between codons and amino acids and build the complete deciphering dictionary when presented with training pairs of transcripts and their proteins. We compared different deep learning architectures and quantitatively estimated the size of the human transcriptomic training set required to achieve the best possible accuracy in the codon-to-amino-acid mapping. We also investigated how a codon embedding layer, which assesses the semantic similarity between codons, affects the rate at which training accuracy increases. We further investigated the benefit of quantifying and exploiting the unbalanced representation of amino acids in real human proteins to decipher the codons of rare amino acids faster. Deep neural networks require huge amounts of training data, and deciphering the genetic code is no exception. A test accuracy of 100% and the unequivocal deciphering of rare codons, such as the tryptophan codon or the stop codons, require a training dataset on the order of 4 to 22 million cumulative codon and amino acid pairs presented to the neural network over roughly 7 to 40 training epochs, depending on the architecture and settings. We confirm that the wide generic capacities and modularity of deep neural networks allow them to be customized easily to learn the task of deciphering the genetic code efficiently.
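To make the learning task concrete, the following Python sketch shows what codon and amino acid training pairs might look like and why rare codons need a lot of data. It is not the authors' method: the neural network is replaced here by a simple counting baseline, and the names `training_pairs`, `learn_by_counting`, and the toy sequence are hypothetical, introduced only for illustration. The reference table is the standard genetic code written in the usual compact form.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def training_pairs(mrna):
    """Split a coding sequence into (codon, amino acid) pairs.

    Reading stops after the first stop codon, which is kept as a pair.
    """
    pairs = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        pairs.append((codon, amino_acid))
        if amino_acid == "*":
            break
    return pairs


def learn_by_counting(pairs):
    """Counting baseline (an assumption, not the paper's neural network):
    map each observed codon to the amino acid seen with it most often."""
    counts = defaultdict(Counter)
    for codon, amino_acid in pairs:
        counts[codon][amino_acid] += 1
    return {codon: c.most_common(1)[0][0] for codon, c in counts.items()}


# Hypothetical toy transcript: Met-Trp-Phe-Gly-stop.
toy_mrna = "AUGUGGUUUGGCUAA"
pairs = training_pairs(toy_mrna)
learned = learn_by_counting(pairs)

print(pairs)
print(f"codons deciphered: {len(learned)} of {len(CODON_TABLE)}")
```

Even this trivial learner can only decipher codons that appear in its data, so a single short transcript recovers a handful of the 64 entries. A learner that sees codons at their natural frequencies will meet the single tryptophan codon and the three stop codons far less often than common codons, which is consistent with the large training sets the abstract reports.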
