scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception.

Hui Wan, Liang Chen, Minghua Deng
Author Information
  1. Hui Wan: School of Mathematical Sciences, Peking University, Beijing 100871, China.
  2. Liang Chen: Huawei Technologies Co., Ltd., Beijing 100080, China. Electronic address: chenliang260@huawei.com.
  3. Minghua Deng: School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Statistical Science, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China. Electronic address: dengmh@pku.edu.cn.

Abstract

Current cell-type annotation tools for single-cell RNA sequencing (scRNA-seq) data mainly use well-annotated source data to help identify cell types in target data. However, because of privacy constraints, their requirement for raw source data cannot always be satisfied, and in that case explicit feature alignment between source and target data is impossible. These methods also struggle to discover novel cell types, and users often have to select a subjective threshold to detect novel cells. We propose scEMAIL, a universal annotation framework for scRNA-seq data that automatically detects novel cell types without accessing source data during adaptation. For new cell-type identification, a novel cell-type perception module is designed with three steps. First, an expert ensemble system measures the uncertainty of each cell from three complementary aspects. Second, based on this measurement, bimodality tests are applied to detect the presence of new cell types. Third, once their presence is confirmed, an adaptive threshold obtained via manifold mixup partitions target cells into "known" and "unknown" groups. Model adaptation is then conducted to alleviate the batch effect: we gather multi-order neighborhood messages globally and impose local affinity regularizations on "known" cells. These constraints mitigate misclassifications by the source model using reliable self-supervised information from neighbors. scEMAIL is accurate and robust under various scenarios in both simulated and real data. It can also be applied to challenging single-cell ATAC-seq data without losing its performance advantage. The source code of scEMAIL can be accessed at https://github.com/aster-ww/scEMAIL and https://ngdc.cncb.ac.cn/biocode/tools/BT007335/releases/v1.0.
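
The three-step perception module described above can be illustrated with a minimal sketch. It is not the scEMAIL implementation. The abstract does not specify the three "experts", the bimodality test, or the manifold-mixup threshold, so the code below makes loud assumptions: three generic softmax-based uncertainty scores stand in for the expert ensemble, Sarle's bimodality coefficient stands in for the bimodality test, and Otsu's threshold is swapped in for the adaptive manifold-mixup threshold. All function names and the simulated probabilities are hypothetical.

```python
import numpy as np

def uncertainty_scores(probs):
    """Three per-cell uncertainty scores in [0, 1] (assumed stand-ins for the experts)."""
    probs = np.clip(probs, 1e-12, 1.0)
    n_classes = probs.shape[1]
    entropy = -(probs * np.log(probs)).sum(axis=1) / np.log(n_classes)
    top2 = np.sort(probs, axis=1)[:, -2:]          # columns: second-best, best
    one_minus_max = 1.0 - top2[:, 1]
    one_minus_margin = 1.0 - (top2[:, 1] - top2[:, 0])
    return np.stack([entropy, one_minus_max, one_minus_margin], axis=1)

def bimodality_coefficient(x):
    """Sarle's bimodality coefficient; values above 5/9 hint at bimodality."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    g1 = m3 / m2**1.5                              # biased skewness
    g2 = m4 / m2**2 - 3.0                          # biased excess kurtosis
    skew = np.sqrt(n * (n - 1)) / (n - 2) * g1     # sample-corrected skewness
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return (skew**2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def otsu_threshold(x, bins=64):
    """Otsu's threshold: the cut that maximizes between-group variance."""
    hist, edges = np.histogram(x, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w = hist / hist.sum()
    w0 = np.cumsum(w)                              # weight of the lower group
    w1 = 1.0 - w0                                  # weight of the upper group
    cum_mean = np.cumsum(w * centers)
    total_mean = cum_mean[-1]
    between = np.zeros_like(w0)
    valid = (w0 > 0) & (w1 > 0)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (w0[valid] * w1[valid])
    return edges[1:][np.argmax(between)]           # upper edge of the best lower bin

# Simulated source-model outputs (hypothetical): confident "known" cells
# and near-uniform predictions for cells of an unseen type.
rng = np.random.default_rng(0)
known = rng.dirichlet([20, 1, 1], size=300)
novel = rng.dirichlet([10, 10, 10], size=100)
probs = np.vstack([known, novel])

score = uncertainty_scores(probs).mean(axis=1)     # step 1: combine the three scores
bc = bimodality_coefficient(score)                 # step 2: test for bimodality
if bc > 5 / 9:                                     # step 3: split only if bimodal
    is_unknown = score > otsu_threshold(score)
else:
    is_unknown = np.zeros(score.size, dtype=bool)
```

Averaging the three scores is only one way to combine them, and the bimodality coefficient is known to react to skewed unimodal distributions as well, so a real pipeline would likely need a more careful test than this sketch uses.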

MeSH Terms

Single-Cell Gene Expression Analysis
Single-Cell Analysis
Software
Computer Simulation
Perception
Sequence Analysis, RNA
Gene Expression Profiling
Cluster Analysis
