Gene Expression Nebulas
A data portal of transcriptomic profiles analyzed by a unified pipeline across multiple species

Documentation

Introduction
Gene Expression Nebulas (GEN) is a data portal of gene expression profiles under various conditions, derived entirely from RNA-Seq data analysis in multiple species. It aims to serve the broad research community exploring functional genomics. As the initial step, GEN-1.0 provides comprehensive transcriptomic and post-transcriptomic landscapes across multiple species through ontology-based systematic integration of RNA sequencing data acquired from NGDC, NCBI and EBI. GEN-1.0 offers user-friendly interfaces to access, visualize and further mine the curated gene expression data through its Browse, Search, Analysis, Visualization and Download functions.

Data Collection
High-quality raw sequencing data from bulk and single-cell RNA-seq datasets are acquired from data repositories such as GSA, SRA and ENA. The primary list of candidate datasets is filtered by the following attributes:
(1) The status of data resource is open-access
(2) 'LibraryStrategy'='RNA-Seq'
(3) 'Sequencing Platform'='ILLUMINA'
(4) The median mapping rate is greater than 50% for bulk RNA-seq datasets and greater than 40% for single-cell RNA-seq datasets.
(5) The datasets are classified by biological context: 'Baseline', 'Genetic', 'Phenotype', 'Environment', 'Spatial' or 'Temporal'.
The final list of datasets integrated in GEN-1.0 was based on further manual curation.
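The automated part of this screening, criteria (1) to (4), can be sketched as a simple predicate. This is only an illustration: the `Candidate` record, its field names and the example accessions are hypothetical and do not describe GEN's actual implementation. Only the thresholds and attribute values come from the list above.

```python
from dataclasses import dataclass

# Thresholds from criterion (4); the keys "bulk"/"single-cell" are illustrative.
MIN_MEDIAN_MAPPING_RATE = {"bulk": 50.0, "single-cell": 40.0}

@dataclass
class Candidate:
    """Hypothetical record describing one candidate dataset."""
    accession: str
    open_access: bool
    library_strategy: str
    platform: str
    assay: str                  # "bulk" or "single-cell"
    median_mapping_rate: float  # percent

def passes_filters(d: Candidate) -> bool:
    """Return True if the dataset meets criteria (1)-(4)."""
    return (
        d.open_access
        and d.library_strategy == "RNA-Seq"
        and d.platform == "ILLUMINA"
        and d.median_mapping_rate > MIN_MEDIAN_MAPPING_RATE[d.assay]
    )

candidates = [
    Candidate("DS1", True, "RNA-Seq", "ILLUMINA", "bulk", 72.5),
    Candidate("DS2", True, "RNA-Seq", "ILLUMINA", "single-cell", 38.0),
    Candidate("DS3", False, "RNA-Seq", "ILLUMINA", "bulk", 90.0),
]
kept = [d.accession for d in candidates if passes_filters(d)]
print(kept)
```

Datasets that pass such a screen would still go through the manual curation step described above.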
Data Processing

RNA-seq data processing pipelines include raw data preprocessing (quality control), read alignments, gene/transcript expression quantification (for both bulk and single cell RNA-seq data), cell clustering and cell annotation (specific for single cell RNA-seq data).

Bulk RNA-seq data analysis – preprocessing and gene/transcript expression quantification
First, low-quality RNA-seq reads are filtered out using fastp v0.20.0 (Chen et al, 2018), and the strandedness of the RNA-seq library is inferred by RSeQC v2.6.4 (Wang et al, 2012). The high-quality RNA-seq reads are then mapped to the Ensembl GRCh38 reference genome by STAR 2.7.1a (Dobin et al, 2013). After read alignment, gene/isoform assembly and quantification are performed using RSEM v1.3.1 (Li & Dewey, 2011) with default parameters. For basic expression profiling, RawCount, RPKM and TPM are all calculated.
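To illustrate how the two normalized measures relate to raw counts, the sketch below computes RPKM and TPM from a list of counts and gene lengths using their textbook definitions. It is a simplification: RSEM works with expected counts and effective lengths, which are not modelled here, and the function names and example numbers are made up for illustration.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]

counts = [100, 200, 700]       # raw read counts for three genes
lengths = [1000, 2000, 1000]   # gene lengths in base pairs

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

Unlike RPKM, TPM values always sum to one million within a sample, which makes them easier to compare across samples.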
Citations:
Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018,34(17):i884-i890. PMID:30423086
Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012,28(16):2184-2185. PMID:22743226
Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013,29(1):15-21. PMID:23104886
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011,12:323. PMID:21816040
Bulk RNA-seq data analysis – identification of RNA editing sites and quantification of RNA editing level
Identification of RNA editing sites and quantification of RNA editing levels are mainly executed by REDItoolDenovo.py in REDItools 2.0 (Picardi et al. 2013). Firstly, all candidate RNA editing sites are identified and quantified based on read coverage and variation frequency at each site using the Parallel Strategy of REDItools 2.0. Secondly, candidate RNA editing sites located in Alu and non-Alu regions are filtered with different parameter configurations and subsequently annotated using a variety of annotation files, such as gene annotation, RepeatMasker, SNP and known RNA editing information. Thirdly, additional filtering criteria are applied to obtain more accurate novel editing sites located in non-Alu regions, because non-Alu regions usually have a narrow range of editing sites. To do this, Pblat (Wang and Kong 2019) is used to detect mismatched and multi-mapping reads, while Samtools (Li et al. 2009) is used to remove duplicated reads. Finally, RNA editing sites are tagged as novel or known sites. In the current version, both A-to-I and C-to-U RNA editing types are included.
Source of annotation files: (1) Gene annotation file: GENCODE V33, (2) RepeatMasker annotation file: UCSC, (3) SNP annotation files: UCSC, (4) Known RNA editing sites: REDIportal database. Genomic coordinates of the RepeatMasker file and the known RNA editing sites file are converted from hg19 to hg38 using UCSC liftOver.
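The core idea of the first and last steps (quantify each candidate site from read coverage and variation frequency, then tag it as known or novel) can be sketched as follows. This is not the REDItools code: the thresholds `min_coverage`, `min_edited` and `min_level` are hypothetical placeholders, strand handling and the Alu/non-Alu distinction are omitted, and the example sites are invented.

```python
# In sequencing reads, A-to-I editing appears as A>G and C-to-U as C>T
# (plus-strand view; strand handling is omitted in this sketch).
EDIT_TYPES = {("A", "G"): "A-to-I", ("C", "T"): "C-to-U"}

def editing_level(edited_reads, coverage):
    """Fraction of reads covering a site that carry the edited base."""
    return edited_reads / coverage if coverage else 0.0

def call_sites(sites, known, min_coverage=10, min_edited=3, min_level=0.1):
    """Filter candidate sites and tag each one as 'known' or 'novel'."""
    calls = []
    for chrom, pos, ref, alt, coverage, edited in sites:
        edit_type = EDIT_TYPES.get((ref, alt))
        level = editing_level(edited, coverage)
        if edit_type is None or coverage < min_coverage:
            continue
        if edited < min_edited or level < min_level:
            continue
        tag = "known" if (chrom, pos) in known else "novel"
        calls.append((chrom, pos, edit_type, round(level, 3), tag))
    return calls

sites = [
    ("chr1", 1000, "A", "G", 40, 10),  # passes, editing level 0.25
    ("chr1", 2000, "C", "T", 50, 20),  # passes, editing level 0.40
    ("chr2", 3000, "A", "G", 5, 4),    # fails the coverage threshold
    ("chr2", 4000, "A", "C", 60, 30),  # not an editing-type substitution
]
known = {("chr1", 1000)}  # stand-in for a table of known editing sites
calls = call_sites(sites, known)
print(calls)
```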
Citations:
Picardi E, Pesole G. REDItools: high-throughput RNA editing detection made easy. Bioinformatics. 2013, 29(14):1813-1814. PMID:23742983
Wang M, Kong L. pblat: a multithread blat algorithm speeding up aligning sequences to genomes. BMC Bioinformatics. 2019, 20(1):28. PMID:30646844
Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009,25(16):2078-2079. PMID:19505943
Single cell RNA-seq data analysis – read alignment and generation of count matrix
For datasets from the 10X Genomics platform, 'CellRanger' is used to generate the count matrix. Building the matrix is slightly more complicated for droplet-based single-cell data because the tools need to keep track of where each read came from (which cell and, if UMIs were used, which transcript molecule). The matrix of read counts per gene is obtained as part of the alignment step; rows usually correspond to genes and columns to cells.
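The bookkeeping described above can be illustrated with a toy example that collapses reads sharing the same cell barcode, UMI and gene into one molecule and then tabulates a genes x cells matrix. This is only a sketch of the idea, not what CellRanger does internally; the barcodes, UMIs and gene names are invented, and no barcode or UMI error correction is attempted.

```python
from collections import defaultdict

def count_matrix(records):
    """Build a genes x cells UMI count matrix.

    records: iterable of (cell_barcode, umi, gene) tuples, one per aligned read.
    """
    molecules = set(records)  # reads with identical (cell, UMI, gene) are one molecule
    counts = defaultdict(lambda: defaultdict(int))
    for cell, umi, gene in molecules:
        counts[gene][cell] += 1
    genes = sorted(counts)
    cells = sorted({cell for cell, _, _ in molecules})
    matrix = [[counts[g][c] for c in cells] for g in genes]
    return genes, cells, matrix

reads = [
    ("AAAC", "U1", "GeneA"),
    ("AAAC", "U1", "GeneA"),  # PCR duplicate of the read above
    ("AAAC", "U2", "GeneA"),
    ("TTTG", "U1", "GeneA"),
    ("TTTG", "U3", "GeneB"),
]
genes, cells, matrix = count_matrix(reads)
print(genes, cells, matrix)
```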

For datasets from Drop-seq and inDrop, 'dropEst' is implemented for data generation.
(1) dropTag: extraction of cell barcodes and UMIs from the library. Result: demultiplexed .fastq.gz files, which should be aligned to the reference.
(2) Alignment of the demultiplexed files to the reference genome. Result: .bam files with the alignment.
(3) dropEst: building the count matrix and estimating some statistics necessary for quality control. Result: .rds file with the count matrix and statistics. Optionally: count matrix in MatrixMarket format.
(4) dropReport: generating a report on library quality.

For datasets from Smart-seq2 and SMARTer (Fluidigm C1), the processing pipeline is the same as for bulk RNA-seq analysis (fastp, RSeQC and RSEM), except that the RSEM step is run with the special parameter '--single-cell-prior', which uses Dirichlet(0.1) as the prior to calculate posterior mean estimates and credibility intervals.
Metadata Curation
Manual curation of the metadata of all included RNA-seq datasets is done on 4 levels ('Project', 'Dataset', 'Profile', and 'Sample') based on the structured curation model listed below.
Curation model on the 'Project' level
Items | Description | Value (Grey letters: prefix of accession numbers)
Data Resource | Controlled vocabulary | NGDC, NCBI, EBI, DDBJ
BioProject ID | Accession number of each BioProject from data resource | PRJCA, PRJNA, PRJEB, or PRJDA
Original Project ID | Accession number of each series or raw data project from data resource | CRA, GSE, ERA, or DRP
Project Name | Title of BioProject from data resource | Conclusion term
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Drosophila melanogaster, etc
Strategy | Controlled vocabulary | Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Cell Type | Controlled vocabulary | T cell, B cell, etc
Cell Line | Controlled vocabulary | CB660, H358, 501 mel, etc
Disease | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Development | The development stage of Samples in BioProject | Conclusion term
Case/Control detail | The detail description of Case/Control condition | Conclusion term
Sample Number | Statistical data | Number of samples included in the project
Summary | Brief description of the project scheme | Conclusion term
Overall Design | Experiment design, mainly including samples grouping | Conclusion term
PMID | Publication in which the interaction is described | PubMed ID or DOI
Release Date | Release date of BioProject in Data Resource | Including year, month and day
Submission Date | Submission date of BioProject in Data Resource | Including year, month and day
Update Date | Update date of BioProject in Data Resource | Including year, month and day
Curation model on the 'Dataset' level
Items | Description | Value (Grey letters: prefix of accession numbers)
Data Resource | Controlled vocabulary | NGDC, NCBI, EBI, DDBJ
GEN DataSet ID | Accession number of each dataset in GEN | GEND
BioProject ID | Accession number of each BioProject in data resource | PRJCA, PRJNA, PRJEB, or PRJDA
Original Project ID | Accession number of each series or raw data project in data resource | CRA, GSE, ERA, or DRP
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Drosophila melanogaster, etc
Strategy | Controlled vocabulary | Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc
Baseline | Controlled vocabulary | Yes or No
Genetic | Controlled vocabulary | Genetic characteristics of samples in Dataset (Mutation, Natural Variation, etc)
Phenotype | Controlled vocabulary | Phenotype characteristics of samples in Dataset (Disease, Gender, Virus Infection, etc)
Environmental | Controlled vocabulary | Abiotic Stress, Biotic Stress or Ecological exposure
Spatial | Controlled vocabulary | Cell type, Cell line, Organism, Organoid and Tissue
Temporal | Controlled vocabulary | Development, Circadian, Time Series
RNA type | Controlled vocabulary | rRNA- RNA, poly(A)+ RNA, poly(A)- RNA, etc
Median Mapping Quality | Statistical data | The median mapping rate of samples in BioProject
Median Coverage | Statistical data | The median coverage of samples in BioProject
Max Sequencing Length | Statistical data | The max sequencing length of samples in BioProject
Max Replicate Number | Statistical data | The max replicate number of samples in BioProject
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Cell Type | Controlled vocabulary | T cell, B cell, etc
Cell Line | Controlled vocabulary | CB660, H358, 501 mel, etc
Disease | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Development | The development stage of Sample in BioProject | Conclusion term
Case/Control detail | The detail description of Case/Control condition | Conclusion term
Sample Number | Statistical data | Number of samples included in the project
Dataset Name | Title of Dataset | Conclusion term
Summary | Brief description of the project scheme | Conclusion term
Overall Design | Experiment design, mainly including samples grouping | Conclusion term
PMID | Publication in which the interaction is described | PubMed ID or DOI
Release Date | Release date of BioProject in Data Resource | Including year, month and day
Submission Date | Submission date of BioProject in Data Resource | Including year, month and day
Update Date | Update date of BioProject in Data Resource | Including year, month and day
Curation model on the 'Profile' level
Items | Description | Value (Grey letters: prefix of accession numbers)
GEN XProfile ID | Accession number of gene expression profile in GEN | GENDX
GEN CProfile ID | Accession number of circRNA expression profile in GEN | GENDC
GEN EProfile ID | Accession number of gene editing profile in GEN | GENDE
GEN SProfile ID | Accession number of gene splicing profile in GEN | GENDS
Data Resource | Controlled vocabulary | NGDC, NCBI, EBI, DDBJ
Original Project ID | Accession number of each series or raw data project in data resource | CRA, GSE, ERA, or DRP
GEN DataSet ID | Accession number of each dataset in GEN | GEND
BioProject ID | Accession number of each BioProject in data resource | PRJCA, PRJNA, PRJEB, or PRJDA
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Drosophila melanogaster, etc
Strategy | Controlled vocabulary | Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc
Reference Genome | Reference genome version, e.g. GRCh38 v99 (including ERCC if needed) | .fa file, or .fasta file, or .fna file
Genome annotation | Genome annotation version, e.g. GRCh38 v99 (including ERCC if needed) | .gff file (or .gff3 file, or .gtf file) and .bed file
Data Processing | Quality control, alignment and generation of the expression matrix, according to the sequencing strategy | e.g. for bulk/scRNA Smart-seq2, fastp v0.20.0 is used for quality control, RSeQC v2.6.4 is used to infer strandedness, and STAR v2.7 and RSEM v1.3.1 are used to align reads and generate expression profiles, respectively
Curation model on the 'Sample' level
Items | Description | Value (Grey letters: prefix of accession numbers)
Basic Information
Data Resource | Controlled vocabulary | NGDC, NCBI, EBI, DDBJ
Original Project ID | Accession number of each series or raw data project in data resource | CRA, GSE, ERA, or DRP
BioProject ID | Accession number of each BioProject from data resource | PRJCA, PRJNA, PRJEB, or PRJDA
BioSample ID | Accession number of each Biosample in data resource | SAMC, SAMN, or SAME
Sample ID | Accession number of each sample from data resource | GSM
Sample Name | Name of each sample in data resource | Conclusion term
Sample Accession | Accession number of each raw data sample in data resource | CRS, SRS, or ERS
Experiment Accession | Accession number of each experiment sample in data resource | CRX, SRX, or ERX
GEN DataSet ID | Accession number of each dataset in GEN | GEND
GEN Sample ID | Accession number of each sample in GEN | GENDS
Sample_Name_GEN | Name of each sample in GEN | Conclusion term
Release Date | Release date of sample data in data resource | Including year, month and day
Submission Date | Submission date of sample data in data resource | Including year, month and day
Update Date | Update date of sample data in data resource | Including year, month and day
Sample Characteristic
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Drosophila melanogaster, etc
Race/Breed/Strain/Cultivar | Controlled vocabulary | Race refers to a person's physical characteristics, such as bone structure and skin, hair, or eye color (for example, American Indian, Asian, Black, Hispanic, White, etc). Breed refers to a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species. Strain refers to variants of plants, viruses or bacteria, or an inbred animal used for experimental purposes. Cultivar is an assemblage of plants selected for desirable characteristics that are maintained during propagation.
Ethnicity/Country | Controlled vocabulary | Ethnicity refers to cultural factors, including nationality, regional culture, ancestry, and language. An example of ethnicity is German or Spanish ancestry or Han Chinese
Age | Statistical data | The age of samples (patients, healthy donors, etc)
Age unit | Controlled vocabulary | The age unit of samples (Year, week, day, etc)
Gender | Controlled vocabulary | Male, female, etc
Source Name | Name of each sample group | Conclusion term
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Cell type | Controlled vocabulary | T cell, B cell, etc
Cell Subtype | Controlled vocabulary | Cell subtype or cell population
Cell Line | Controlled vocabulary | CB660, H358, 501 mel, etc
Biological Condition
Disease | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Disease State | The disease stage of samples | Conclusion term
Development Stage | The development stage of samples | Conclusion term
Mutation | Related gene mutation | Conclusion term
Phenotype | The phenotype characteristics of samples | Conclusion term
Height/Length/Weight | Height, Length and Weight of plant samples | Conclusion term
Isolation Source | The isolation condition of plant samples | Conclusion term
Experimental Variables
Case/Control | Case/Control grouping | Conclusion term
Case detail | Case details to distinguish between case and control | Conclusion term
Control detail | Control details to distinguish between case and control | Conclusion term
Protocol
Growth Protocol | Culture protocols of cells from samples or cell lines | Conclusion term
Treatment Protocol | Protocols of sample treatment | Conclusion term
Treatment | Brief description of sample treatment | Conclusion term
Extract Protocol | The extract protocols of RNA | Conclusion term
Library Construction Protocol | The protocols of RNA sequencing library construction | Conclusion term
Molecule Type | Controlled vocabulary | rRNA- RNA, poly(A)+ RNA, poly(A)- RNA, etc
Library Layout | Controlled vocabulary | PAIRED, SINGLE
Strand-Specific | Controlled vocabulary | Specific, Unspecific
Library Strand | Controlled vocabulary | Reverse means First strand, Forward means Second strand, and dash (-) means strand-unspecific
Spike-in | Controlled vocabulary | ERCC or -
Sequencing Technology
Strategy | Controlled vocabulary | Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc
Platform | Controlled vocabulary | Illumina, BGISEQ, etc
Instrument Model | Controlled vocabulary | Illumina HiSeq 2000, Illumina NextSeq 500, BGISEQ-500, etc
Assessing Quality
#Cell | Statistical data | The estimated number of cells
#Reads | Statistical data | The number of reads in fastq file
GBases | Statistical data | Total bases after filtering
AvgSpotLen1 | Statistical data | Average spot1 length (after filtering if filtered)
AvgSpotLen2 | Statistical data | Average spot2 length (after filtering if filtered)
Unique-Mapping Rate | Calculated data | Percent of uniquely mapped reads
Multi-Mapping Rate | Calculated data | Percent of multi-mapped reads
Coverage Rate | Calculated data | Total number of mapped reads * Average read length / Total bases of reference genome
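The calculated quality fields at the end of the 'Sample' model are simple ratios. The sketch below restates the Coverage Rate formula and a mapping-rate percentage in code. The function names and the example numbers are illustrative only.

```python
def coverage_rate(mapped_reads, avg_read_length, genome_bases):
    """Total mapped reads * average read length / total bases of the reference genome."""
    return mapped_reads * avg_read_length / genome_bases

def mapping_rate(mapped_reads, total_reads):
    """Percentage of reads in a category (e.g. uniquely mapped) among all reads."""
    return 100.0 * mapped_reads / total_reads

# Example: 30 million mapped reads of 100 bp against a 3 Gb reference genome.
print(coverage_rate(30_000_000, 100, 3_000_000_000))
print(mapping_rate(45_000_000, 50_000_000))
```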
Feature-rich Gene Annotation
Gene functional annotation
To better understand the function of genes in multiple species, GEN provides multidimensional information on each gene for users' reference. The basic information of a gene (including Entrez ID, RefSeq ID, Symbol, Position, etc) is obtained from the Genome Annotation Profile. Furthermore, Housekeeping or Tissue-Specific gene status, Gene Ontology, Disease Ontology, and gene structure visualization in the Genome Browser are presented as gene summary items. External information from GeneCards, EDK and ICG is also linked to each gene (if available).
Definition of Tissue-specific (TS) and Housekeeping (HK) gene
Housekeeping genes and tissue-specific genes are defined based on the expression profiles derived from the GTEx portal (the Genotype-Tissue Expression project, covering 53 normal human tissues). Genes whose highest expression value across tissues is lower than 0.5 TPM/FPKM are filtered out. Then, the tissue specificity index τ and the coefficient of variation (CV) are used to determine housekeeping genes (HK, τ <= 0.5 and CV <= 0.5) and tissue-specific genes (TS, τ >= 0.95).
The index τ is defined as: $\tau = \frac{\sum_{i=1}^{N} (1 - x_i)}{N - 1}$, where N is the number of tissues and $x_i$ is the expression profile component normalized by the maximal component value. The CV, which measures the fluctuation of gene expression levels across tissues, is defined as the ratio of the standard deviation to the mean of gene expression levels across tissues.
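As a worked illustration of these definitions, the sketch below computes τ as the sum of (1 - x_i) over all tissues divided by (N - 1), where x_i is each tissue's expression divided by the maximum across tissues (the definition from Yanai et al. 2005), computes the CV, and applies the thresholds stated above. The use of the sample standard deviation for the CV is an assumption, since the text does not say which estimator is used, and the function names and example profiles are invented.

```python
from statistics import mean, stdev

def tau(expr):
    """Tissue specificity index: sum(1 - x_i) / (N - 1), x_i = expr_i / max(expr)."""
    top = max(expr)
    return sum(1 - e / top for e in expr) / (len(expr) - 1)

def cv(expr):
    """Coefficient of variation: standard deviation / mean (sample SD assumed)."""
    return stdev(expr) / mean(expr)

def classify(expr, min_max_expr=0.5):
    """Label a gene as 'HK', 'TS', 'other' or 'filtered' from its tissue profile."""
    if max(expr) < min_max_expr:
        return "filtered"  # never reaches 0.5 TPM/FPKM in any tissue
    t, c = tau(expr), cv(expr)
    if t <= 0.5 and c <= 0.5:
        return "HK"
    if t >= 0.95:
        return "TS"
    return "other"

print(classify([10, 10, 10, 10]))       # uniform expression
print(classify([0, 0, 0, 100]))         # expressed in a single tissue
print(classify([0.1, 0.2, 0.1, 0.3]))   # below the expression floor
```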
Citations:
Yanai I, Benjamin H, Shmoish M, et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005,21(5):650-659. PMID:15388519
Reference of gene expression and RNA editing profiles derived from GTEx and REDIportal
Reference gene expression levels and RNA editing levels across normal human tissues or body sites are obtained from the GTEx portal and REDIportal, respectively. To provide an overview of the general expression and RNA editing pattern, the 'Average', 'Median', 'Maximum', 'Minimum' expression and RNA editing levels across 53 tissues, the 'CV' value, 'τ value' and 'Expression Breadth' of each gene are indicated in the 'Gene Summary' section and can also be used as options for filtering genes.
Gene expression profile across 53 normal human tissues is downloaded from GTEx at: https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz
Main annotation table of all known and predicted editing sites is downloaded from REDIportal at: http://srv00.recas.ba.infn.it/webshare/rediportalDownload/table1_full.txt.gz
RNA editing profile across normal human tissues is downloaded from REDIportal at: http://srv00.recas.ba.infn.it/webshare/rediportalDownload/table2_full.txt.gz
Tools
Differential expression analysis
For a selected project with one control group and one case group, the case group is compared with the control group directly. limma (Ritchie et al. 2015) is an R package that was originally developed for differential expression (DE) analysis of microarray data, and voom (Law et al. 2014) is a function in the limma package that transforms RNA-seq count data for use with limma. Together, limma-voom allows fast, flexible and powerful differential expression analyses of RNA-seq data.
The limma workflow for analysing RNA-seq data takes gene-level counts as its input and moves through pre-processing and exploratory data analysis before obtaining lists of differentially expressed genes and gene signatures. In limma, linear modelling is carried out on the log-CPM values, which are assumed to be normally distributed, and the mean-variance relationship is accommodated using precision weights calculated by the voom function. limma then fits a separate model to the expression values of each gene using the lmFit and contrasts.fit functions. Next, empirical Bayes moderation (eBayes) is carried out by borrowing information across all genes to obtain more precise estimates of gene-wise variability. Finally, the top differentially expressed genes are listed using topTable.
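The log-CPM transformation at the start of this workflow can be written out directly. The sketch below uses the offsets described in the voom paper (0.5 added to each count and 1 added to the library size) and takes the library size as the plain sum of counts, which is a simplifying assumption. Precision weights, model fitting and empirical Bayes moderation are not reproduced here; the function name and example counts are illustrative.

```python
import math

def log_cpm(counts):
    """log2 counts per million for one sample, with voom-style offsets.

    counts: list of gene-level read counts for a single sample.
    """
    lib_size = sum(counts)  # assumption: un-normalized library size
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

sample = [0, 999_999]
values = log_cpm(sample)
print(values)
```

The small offsets keep the transform defined for zero counts and damp the variability of log-CPM values at very low counts.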
Citations:
Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015,43(7):e47. PMID:25605792
Law CW, Chen Y, Shi W, et al. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014,15(2):R29. PMID:24485249
Weighted Gene Co-expression Network Analysis (WGCNA)
Weighted gene co-expression network analysis (WGCNA) is a widely used data mining method developed by Steve Horvath (Zhang and Horvath 2005; Langfelder and Horvath 2008).
The WGCNA package includes functions for network construction, module detection, gene selection, calculation of topological properties, visualization, and interfacing with external software. WGCNA can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network-based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets.
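The network construction step rests on a simple idea: pairwise gene correlations are raised to a soft-thresholding power so that strong correlations are emphasized and weak ones are suppressed. The sketch below builds such an unsigned weighted adjacency, |cor|^β, for a toy expression table. The value β = 6 and all gene names and values are illustrative assumptions; module detection and eigengenes are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def adjacency(expr, beta=6):
    """Unsigned weighted adjacency |cor|^beta for every gene pair."""
    genes = sorted(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],   # perfectly correlated with g1
    "g3": [4, 3, 2, 1],   # perfectly anti-correlated with g1
    "g4": [1, 3, 2, 4],   # correlation 0.8 with g1
}
adj = adjacency(expr)
print(adj[("g1", "g4")])
```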
Citations:
Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005,4:Article17. PMID:16646834
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008,9:559. PMID:19114008
Functional enrichment analysis
Here, we use the clusterProfiler package, a universal enrichment tool for functional and comparative studies, developed by Guangchuang Yu (Yu et al. 2012).
The clusterProfiler package offers a gene classification method to classify genes based on their projection at a specific level of the GO corpus, and provides functions, enrichGO, enrichKEGG and enrichDO, to calculate enrichment test for GO terms, KEGG pathways and DO terms based on hypergeometric distribution. To prevent high false discovery rate (FDR) in multiple testing, q-values are also estimated for FDR control. Furthermore, clusterProfiler supplies a visualization module for displaying analysis results.
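The hypergeometric test behind these enrichment functions asks how likely it is to see at least the observed overlap between a gene list and a term's annotated genes by chance. The sketch below computes that upper-tail probability from first principles. It is an illustration of the statistic only, not of clusterProfiler's code, and the parameter names and example numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(background, annotated, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes from `background` genes,
    of which `annotated` belong to the term."""
    upper = min(annotated, selected)
    hits = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return hits / comb(background, selected)

# 20 background genes, 5 annotated to the term, 5 genes of interest, 3 in common.
p = hypergeom_pvalue(background=20, annotated=5, selected=5, overlap=3)
print(p)
```

In practice one p-value is computed per term, and the resulting set is then corrected for multiple testing.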

1. Gene Ontology (GO)
Gene Ontology defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:
MF: Molecular Function (molecular activities of gene products)
CC: Cellular Component (where gene products are active)
BP: Biological Process (pathways and larger processes made up of the activities of multiple gene products)
GO terms are organized in a directed acyclic graph, where edges between terms represent parent-child relationships.

2. Kyoto Encyclopedia of Genes and Genomes (KEGG)
KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided into 7 broad categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development.

3. Disease Ontology (DO)
The Disease Ontology has been developed as a standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts.
Citations:
Yu G, Wang LG, Han Y, et al. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012,16(5):284-287. PMID:22455463
Gene regulatory networks inference
Here, we use the GENIE3 package (Huynh-Thu et al. 2010) to infer gene regulatory networks (in the form of weighted adjacency matrices) from expression data using ensembles of regression trees. Known regulators from TRRUST (Han et al. 2018) are selected as candidate regulators to predict their target genes. After prediction of the regulatory networks, we further annotate known regulator-target interactions based on the manually curated results from TRRUST.
Citations:
Huynh-Thu VA, Irrthum A, Wehenkel L, et al. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010,5(9):e12776. PMID:20927193
Han H, Cho JW, Lee S, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018,46(D1):D380-D386. PMID:29087512
scRNA-seq data analysis
This tool analyzes single-cell RNA-seq data and infers the cell type of each cluster of cells. The major components of the Seurat clustering workflow are implemented based on the Seurat 3.12 package, including QC and data filtration, calculation of high-variance genes, dimensional reduction, graph-based clustering, and identification of cluster markers (Stuart et al. 2019). Furthermore, unbiased cell type recognition is performed for each cluster of cells with the SingleR package (Aran et al. 2019), which leverages reference transcriptomic datasets of pure cell types.
In the current version, single-cell RNA-seq data from one sample or one project generated with SMARTer (Fluidigm C1), Smart-seq2 or 10X Genomics are supported. Integrated analysis of single-cell datasets generated across different conditions and technologies is not supported because it is too time-consuming. Please feel free to download the data and analyze them on your local computer.
Step1
Data input (including expression profile, meta information). Selection and filtration of cells based on QC metrics

Step2
Data normalization and scaling. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. There are also alternative methods available according to different requirements.
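The "LogNormalize" calculation described in this step can be written out for a single cell as follows. This is a plain-Python restatement of the formula (count divided by the cell's total, multiplied by the scale factor, then log1p), not Seurat code; the function name and example counts are illustrative.

```python
import math

def log_normalize(cell_counts, scale_factor=10_000):
    """Normalize one cell: count / total * scale_factor, then natural log1p."""
    total = sum(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]

cell = [1, 3, 0, 6]  # raw counts of four genes in one cell
normalized = log_normalize(cell)
print(normalized)
```

Because log1p(0) is 0, genes with zero counts stay at zero after normalization, which preserves the sparsity of the matrix.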

Step3
Calculating highly variable genes for further downstream analysis. FindVariableGenes calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression.

Step4
The dimensionality of the dataset is determined based on PCA scores, with each PC essentially representing a 'meta feature' that combines information across a correlated feature set. ElbowPlot can be used to suggest how many top PCs capture the majority of the true signal.

Step5
Clustering trees display how clusters are divided as the resolution increases, which clusters are clearly separate and distinct, which are related to each other, and how samples change groups as more clusters are produced. Cells are clustered at the resolution inferred from the clustering trees, followed by non-linear dimensional reduction (UMAP/tSNE) and identification of differentially expressed features (cluster biomarkers). Dimensional reduction techniques represent the data in two dimensions (x-y coordinates) rather than the extremely high number of dimensions of the original single-cell RNA-seq count matrix (for example, about 30,000 genes x 10,000 cells). The expression level of each gene can be visualized on the tSNE or UMAP plot.

Step6
Find markers for all clusters and conduct gene set enrichment analysis.

Step7
Trajectory inference function is powered by Monocle, which employs a differential expression test to reduce the number of genes then applies independent component analysis for additional dimensionality reduction. To build the trajectory Monocle computes a minimum spanning tree, then finds the longest connected path in that tree.
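The last two operations in this step, building a minimum spanning tree over cells and taking its longest connected path, can be sketched in a few lines. The example assumes that cells have already been reduced to low-dimensional coordinates (the gene selection and independent component analysis steps are skipped), uses Prim's algorithm with Euclidean distances, and finds the longest path with two traversals of the tree. It illustrates the idea only and is not Monocle's implementation; the coordinates are invented.

```python
import math
from collections import defaultdict

def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances; returns an adjacency dict."""
    in_tree = {0}
    tree = defaultdict(list)
    while len(in_tree) < len(points):
        dist, u, v = min(
            (math.dist(points[u], points[v]), u, v)
            for u in in_tree
            for v in range(len(points))
            if v not in in_tree
        )
        tree[u].append((v, dist))
        tree[v].append((u, dist))
        in_tree.add(v)
    return tree

def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, weight in tree[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + weight, path + [nxt]))
    return best

def longest_path(points):
    """Longest path in the MST: farthest node from any start, then farthest from it."""
    tree = minimum_spanning_tree(points)
    end, _, _ = farthest(tree, 0)
    _, _, path = farthest(tree, end)
    return path

# Four cells along a line plus one cell on a short side branch (index 4).
cells = [(0, 0), (1, 0), (2, 0), (3, 0), (2, 0.5)]
backbone = longest_path(cells)
print(backbone)
```

Cells on the longest path form the main trajectory, while cells off the path, such as the one at index 4, sit on side branches.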

Step8
Cell type annotation is usually the main goal of analyzing scRNA-seq datasets. Here, GEN is equipped with SingleR to infer the cell type by assigning labels to cells, based on the rule that certain genes (marker genes) are only expressed in certain clusters of cells. The built-in reference transcriptomic datasets, five for human and two for mouse, are as follows:
(1) Human Primary Cell Atlas (Mabbott et al. 2013), which includes 37 main non-specific cell types and 157 fine cell types from 713 samples of microarray data;
(2) Blueprint (Martens and Stunnenberg 2013) and Encode (The ENCODE Project Consortium 2012) Dataset, which includes 24 main non-specific cell types and 43 fine cell types from 259 samples of RNA-seq data;
(3) Monaco Immune Dataset (Monaco et al. 2019), which includes 11 main immune cell types and 29 fine cell types from 114 samples of RNA-seq data;
(4) Novershtern Hematopoietic Dataset (Novershtern et al. 2011, Monaco et al. 2019), which includes 17 main immune cell types and 38 fine cell types from 211 samples of microarray data;
(5) Database Immune Cell Expression Dataset (Schmiedel et al. 2018), which includes 5 main hematopoietic and immune cell types and 15 fine cell types from 1561 samples of RNA-seq data;
(6) The Immunological Genome Project (ImmGen) (Heng et al. 2008), which includes 20 main hematopoietic and immune cell types and 253 fine cell types from 830 samples of microarray data;
(7) Mouse RNA-seq Dataset (Benayoun et al. 2019), which includes 18 non-specific cell types and 28 fine cell types from 358 samples of RNA-seq data.
Citations:
Stuart T, Butler A, Hoffman P, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019, 177(7): 1888-1902.e21. PMID:31178118
Aran D, Looney AP, Liu L, et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol. 2019, 20(2): 163-172. PMID:30643263
Mabbott NA, Baillie JK, Brown H, et al. An expression atlas of human primary cells: inference of gene function from coexpression networks. BMC Genomics. 2013, 14(1): 1-13. PMID:24053356
Martens JHA, Stunnenberg HG. BLUEPRINT: mapping human blood cell epigenomes. Haematologica. 2013, 98(10): 1487. PMID:24091925
Monaco G, Lee B, Xu W, et al. RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 2019, 26(6): 1627-1640.e7. PMID:30726743
Novershtern N, Subramanian A, Lawton LN, et al. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell. 2011, 144(2): 296-309. PMID:21241896
Schmiedel BJ, Singh D, Madrigal A, et al. Impact of genetic polymorphisms on human immune cell gene expression. Cell. 2018, 175(6): 1701-1715.e16. PMID:30449622
Heng TSP, Painter MW, Elpek K, et al. The Immunological Genome Project: networks of gene expression in immune cells. Nat Immunol. 2008, 9(10): 1091-1094. PMID:18800157
Benayoun BA, Pollina EA, Singh PP, et al. Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses. Genome Res. 2019, 29(4): 697-709. PMID:30858345
Handbook

Please download the handbook for guidance on using GEN.

Home page: overview of data and utilities in GEN

The home page provides an overview of the gene expression data and metadata incorporated in GEN and of its main features. Users can click any item in the navigation bar or enter keyword(s) in the query box to retrieve information of interest.

Browse gene expression datasets archived in GEN

The dataset page displays detailed descriptive metadata for each dataset archived in GEN. Specific datasets can be found using the filter box above. In the left panel, users can quickly narrow the list to related datasets of interest by selecting items under the self-defined categories ‘Biological Context’, ‘Animal Tissue’, ‘Human Disease’, and ‘Transcriptomic Profile’.

Browse the metadata of a specific dataset

Click on the ‘Dataset ID’ to open the detailed information for a specific dataset, including ‘Basic Information’, ‘Samples’, ‘Gene Expression Patterns’, ‘Transcriptomic Profiles’ and ‘Visualization’. Users can click on each item to view its contents.

Browse expression profiles of specific gene(s)

Users can browse gene expression patterns across the samples of datasets in GEN. First, select one species. Second, select one or more genes of interest; by default, all genes are selected. Third, select one dataset of interest to view the gene expression profiles.

Browse value-added information of a specific gene

Click on the ‘Gene ID’ to open the detailed information for a specific gene, including ‘Gene Summary’, ‘Genome Browser’, ‘Expression Level’, and ‘Differential Expression’. Users can click on each item to view its contents.

Browse species

The species page describes the metadata of the species curated in GEN. Users can browse the datasets and samples available for a specific species.

Browse projects

The project page provides detailed metadata for each project in GEN. Project metadata can be further filtered by specifying terms of interest.

Browse samples

The sample page provides detailed metadata for each sample in GEN. Sample metadata can be further filtered by specifying terms of interest.

Browse publications related to datasets in GEN

The publication page provides basic information on publications related to the datasets in GEN.

Tools

The tool page provides online and offline tools for personalized RNA-seq data analysis. Users can click the ‘Online Analysis’ button to analyze the datasets deposited in GEN, or click the ‘GEN Toolkit’ button to download the one-stop pipeline for analyzing their own data.

  1. Bulk RNA-seq Data Analysis


    • Input Data

    • Identification of Differentially Expressed Genes (DEG)

    • GRN (Gene Regulatory Network)

  2. Single-cell RNA-seq Data Analysis


    • Input Data

    • Filter Cells

    • Data Normalization & Detection of Variable Genes

    • PCA Reduction

    • Determining Significant PCs

    • Cell Clustering

    • TSNE/UMAP Reduction

    • Identification of Cluster Marker Genes

    • Visualization of Marker Genes’ Expression

    • Gene Enrichment Analysis

    • Download

  3. Visualization of Pre-processed scRNA-seq Data Analysis Results

    • Input Data

    • Result Summary

    • TSNE/UMAP

    • Cell Clustering

    • Most Expressed Genes

    • Marker Genes

    • Marker Enrichment

    • Marker Expression

    • Trajectory

    • Gene ID Conversion

  4. GENtoolkit (https://ngdc.cncb.ac.cn/gen/toolkit)

    GENtoolkit is a one-stop pipeline for analyzing both bulk and single-cell (10X Genomics, Smart-seq2, Drop-seq and inDrop) RNA-seq data. Its guide covers ‘Prerequisite software and packages’, ‘Download and install’, ‘Usage and option summary’ and ‘Options’.
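The first steps of the single-cell workflow outlined above (Filter Cells, Data Normalization & Detection of Variable Genes) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code behind the online tool or GENtoolkit: the thresholds (`MIN_GENES`, `SCALE_FACTOR`), the function names, and the toy counts are hypothetical, and ranking genes by plain variance is a simplification of the variance-stabilizing methods commonly used in practice.

```python
import math
from statistics import pvariance

# Toy count matrix: cell -> {gene: raw count}. All values are made up.
counts = {
    "cell_1": {"GeneA": 10, "GeneB": 0, "GeneC": 5, "GeneD": 5},
    "cell_2": {"GeneA": 0, "GeneB": 20, "GeneC": 10, "GeneD": 10},
    "cell_3": {"GeneA": 1, "GeneB": 0, "GeneC": 0, "GeneD": 0},  # low quality
    "cell_4": {"GeneA": 8, "GeneB": 2, "GeneC": 5, "GeneD": 5},
}

MIN_GENES = 2          # assumed threshold: minimum detected genes per cell
SCALE_FACTOR = 10_000  # assumed scaling target for per-cell totals


def filter_cells(counts, min_genes):
    """Keep cells in which at least `min_genes` genes have non-zero counts."""
    return {
        cell: genes
        for cell, genes in counts.items()
        if sum(1 for v in genes.values() if v > 0) >= min_genes
    }


def log_normalize(counts, scale_factor):
    """Scale each cell to a common total, then apply log(1 + x)."""
    normalized = {}
    for cell, genes in counts.items():
        total = sum(genes.values())
        if total == 0:
            continue  # skip empty cells to avoid division by zero
        normalized[cell] = {
            gene: math.log1p(value / total * scale_factor)
            for gene, value in genes.items()
        }
    return normalized


def top_variable_genes(normalized, n_top):
    """Rank genes by variance of normalized expression across cells."""
    genes = sorted(next(iter(normalized.values())))
    variance = {
        gene: pvariance([cell[gene] for cell in normalized.values()])
        for gene in genes
    }
    return sorted(genes, key=lambda g: variance[g], reverse=True)[:n_top]


kept = filter_cells(counts, MIN_GENES)
normalized = log_normalize(kept, SCALE_FACTOR)
variable_genes = top_variable_genes(normalized, n_top=2)
print(sorted(kept), variable_genes)
```

In this sketch the selected variable genes would feed the later steps in the outline (PCA Reduction, Cell Clustering, TSNE/UMAP Reduction), which are not shown here.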