Integration of single cell data by disentangled representation learning.

Tiantian Guo, Yang Chen, Minglei Shi, Xiangyu Li, Michael Q Zhang
Author Information
  1. Tiantian Guo: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing 100084, China.
  2. Yang Chen: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing 100084, China.
  3. Minglei Shi: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; School of Medicine, Tsinghua University, Beijing 100084, China.
  4. Xiangyu Li: School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China.
  5. Michael Q Zhang: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing 100084, China.

Abstract

Recent developments in single-cell RNA-sequencing technologies have led to exponential growth in single-cell sequencing datasets across different conditions. Combining these datasets helps to better understand cellular identity and function. However, integrating datasets from different laboratories or technologies is challenging because batch effects are interspersed with biological variation. To overcome this problem, we propose Single Cell Integration by Disentangled Representation Learning (SCIDRL), a domain adaptation-based method that learns low-dimensional representations invariant to batch effects. The method efficiently removes batch effects while retaining cell type purity. We applied it to thirteen diverse simulated and real datasets. Benchmark results show that SCIDRL outperforms other methods in most cases and performs particularly well in two common situations: (i) effective integration of batch-shared rare cell types and preservation of batch-specific rare cell types; (ii) reliable integration of datasets with different cell compositions. These results suggest that SCIDRL offers a valuable tool for researchers studying cell heterogeneity.

MeSH Terms

Algorithms
Computational Biology
High-Throughput Nucleotide Sequencing
Humans
Sequence Analysis, RNA
Single-Cell Analysis
Software