Triku: a feature selection method based on nearest neighbors for single-cell data.

Alex M Ascensión, Olga Ibáñez-Solé, Iñaki Inza, Ander Izeta, Marcos J Araúzo-Bravo
Author Information
  1. Alex M Ascensión: Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain.
  2. Olga Ibáñez-Solé: Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain.
  3. Iñaki Inza: Intelligent Systems Group, Computer Science Faculty, University of the Basque Country, Donostia-San Sebastian, 20018, Spain.
  4. Ander Izeta: Biodonostia Health Research Institute, Tissue Engineering Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain.
  5. Marcos J Araúzo-Bravo: Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain.

Abstract

BACKGROUND: Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Most current feature selection methods are based on general univariate descriptors of the data, such as the dispersion or the percentage of zeros. Despite the use of correction methods, the generality of these descriptors biases selection towards highly expressed genes rather than the genes that define the cell populations of the dataset.
RESULTS: Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-nearest neighbor graph. Within these neighborhoods, the expression of the selected genes is higher than would be expected if the k cells were chosen at random. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, as measured by adjusted Rand index, normalized mutual information, supervised classification, and silhouette coefficient. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms and contain fewer ribosomal and mitochondrial genes.
CONCLUSION: Triku is developed in Python 3 and is available at https://github.com/alexmascension/triku.
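
The idea stated in RESULTS, comparing a gene's expression among k-nearest neighbors with its expression among k randomly chosen cells, can be illustrated with a minimal sketch. This is not triku's implementation or API: the function names (knn_indices, local_expression, neighborhood_score), the toy data, and the scoring statistic (own expression times mean neighbor expression, minus the same quantity under random neighbors) are all assumptions made for illustration.

```python
import math
import random


def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in by_distance[:k]])
    return neighbors


def local_expression(expr, neighbors):
    """Mean over cells of (own expression * mean expression of its neighbors)."""
    total = 0.0
    for i, neigh in enumerate(neighbors):
        total += expr[i] * sum(expr[j] for j in neigh) / len(neigh)
    return total / len(expr)


def neighborhood_score(expr, neighbors, k, n_random=100, seed=0):
    """Observed local expression minus its average when the k cells are random.

    Illustrative statistic only (an assumption), not the one used by triku.
    """
    rng = random.Random(seed)
    n = len(expr)
    observed = local_expression(expr, neighbors)
    null_values = []
    for _ in range(n_random):
        # Null model: each cell gets k neighbors drawn at random.
        random_neighbors = [
            rng.sample([j for j in range(n) if j != i], k) for i in range(n)
        ]
        null_values.append(local_expression(expr, random_neighbors))
    return observed - sum(null_values) / n_random


# Hypothetical toy data: two well-separated populations of 30 cells each.
rng = random.Random(42)
cells = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
cells += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]

marker = [5.0] * 30 + [0.0] * 30  # expressed only by the first population
scattered = marker[:]             # same values, shuffled across all cells
rng.shuffle(scattered)

k = 5
neighbors = knn_indices(cells, k)
marker_score = neighborhood_score(marker, neighbors, k)
scattered_score = neighborhood_score(scattered, neighbors, k)
print(marker_score, scattered_score)
```

Both toy genes have identical dispersion and percentage of zeros, so a univariate descriptor cannot tell them apart. Only the neighborhood-based score separates the population marker from the scattered gene.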

MeSH Term

Algorithms
Cluster Analysis
