Integration of single cell data by disentangled representation learning.
Tiantian Guo, Yang Chen, Minglei Shi, Xiangyu Li, Michael Q Zhang
Author Information
Tiantian Guo: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing 100084, China.
Yang Chen: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing 100084, China.
Minglei Shi: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; School of Medicine, Tsinghua University, Beijing 100084, China.
Xiangyu Li: School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China.
Michael Q Zhang: MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist; Department of Automation, Tsinghua University, Beijing 100084, China.
Recent advances in single-cell RNA-sequencing technologies have led to exponential growth in single-cell sequencing datasets across different conditions. Combining these datasets helps to better understand cellular identity and function. However, integrating datasets from different laboratories or technologies is challenging because batch effects are interspersed with biological variation. To overcome this problem, we propose Single Cell Integration by Disentangled Representation Learning (SCIDRL), a domain adaptation-based method that learns low-dimensional representations invariant to batch effects. The method efficiently removes batch effects while retaining cell type purity. We applied it to thirteen diverse simulated and real datasets. Benchmark results show that SCIDRL outperforms other methods in most cases and exhibits excellent performance in two common situations: (i) effective integration of batch-shared rare cell types and preservation of batch-specific rare cell types; (ii) reliable integration of datasets with different cell compositions. These results demonstrate that SCIDRL offers a valuable tool for researchers to decode the enigma of cell heterogeneity.