Accurate feature selection improves single-cell RNA-seq cell clustering.

Kenong Su, Tianwei Yu, Hao Wu
Author Information
  1. Kenong Su: Department of Computer Science, Emory University, Atlanta, GA 30322, USA.
  2. Tianwei Yu: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China.
  3. Hao Wu: Department of Biostatistics and Bioinformatics, Emory University, 201 Dowman Dr, Atlanta, GA 30322, USA.

Abstract

Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as 'features'), whose expression patterns are then used for downstream clustering. A good set of features should include the genes that distinguish different cell types, and the quality of such a set can have a significant impact on clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step that relies on simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we carefully evaluate the impact of feature selection on cell clustering accuracy. In addition, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features. We apply the method to 12 public scRNA-seq datasets and demonstrate that using features selected by FEAST with existing clustering tools significantly improves clustering accuracy.
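
To make the phrase "based on the statistical moments of gene-wise expression distributions" concrete, the following is a minimal sketch of one such simple unsupervised baseline: ranking genes by their variance-to-mean ratio and keeping the top few. It is not the FEAST algorithm, whose details are not given in this abstract. The function name, the toy matrix, and the choice of variance-to-mean ratio as the score are all assumptions made for illustration.

```python
import numpy as np

def select_top_genes_by_dispersion(counts, n_features):
    """Return indices of the n_features genes with the largest variance-to-mean ratio.

    counts: 2-D array with cells in rows and genes in columns.
    This is an illustrative moment-based baseline, not FEAST.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    # Genes with zero mean expression get a score of 0 so they are never preferred.
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    # Sort in descending order of score; a stable sort keeps ties in gene order.
    order = np.argsort(-dispersion, kind="stable")
    return order[:n_features]

# Hypothetical toy data: 6 cells x 4 genes, with two groups of three cells.
# Genes 0 and 3 differ between the groups; gene 1 is constant; gene 2 is never expressed.
toy = np.array([
    [10, 5, 0, 1],
    [12, 5, 0, 0],
    [11, 5, 0, 1],
    [0, 5, 0, 9],
    [1, 5, 0, 8],
    [0, 5, 0, 10],
])
selected = select_top_genes_by_dispersion(toy, n_features=2)
print(selected)
```

In this toy matrix the two genes that separate the cell groups receive the highest scores, while the constant and unexpressed genes score zero. The selected columns would then be passed to whichever clustering tool is in use.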

Grants

  1. P50 AG025688/NIA NIH HHS
  2. R01 GM122083/NIGMS NIH HHS
  3. R01 GM124061/NIGMS NIH HHS

MeSH Terms

Algorithms
Benchmarking
Cluster Analysis
Datasets as Topic
Gene Expression Profiling
High-Throughput Nucleotide Sequencing
Humans
Sequence Analysis, RNA
Single-Cell Analysis