A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data.

Isabella N Grabski, Rafael A Irizarry
Author Information
  1. Isabella N Grabski: Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
  2. Rafael A Irizarry: Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA and Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.

Abstract

Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find that current approaches are limited either by their reliance on known marker genes or by overfitting caused by systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.

Grants

  1. R01 HG005220/NHGRI NIH HHS
  2. R35 GM131802/NIGMS NIH HHS

MeSH Terms

Gene Expression
Gene Expression Profiling
Humans
RNA-Seq
Sequence Analysis, RNA
Single-Cell Analysis
Software
