Non-human primates (NHPs) provide important biomedical models for many aspects of human health and disease. Despite the growing volume of NHP sequence data, the relevant biological resources remain poorly integrated. To examine complex NHP biological processes holistically, it is essential to take an integrative approach that combines multi-omics data to illustrate the interrelationships of the biomolecules involved and their functions. Because experimental NHPs are scarce and their prices are high and rising fast, reviewing the existing data before using the animals can be an informative option.
Therefore, we present NHP Atlas (Non-Human Primate multi-omics Atlas). NHP Atlas aims to integrate large, diverse and continually arriving NHP biological resources. It is the first non-human primate database to cover species and omics types comprehensively, with NHP model animals at its core. By integrating these resources, it aims to contribute to human health.
2.1 Overview
The bulk transcriptome module systematically integrates 3052 high-quality bulk RNA-seq samples from 73 projects in NCBI, EBI, DDBJ and NGDC (GEN). It provides user-friendly interfaces for accessing, visualizing and further mining the curated gene expression data through its Browse, Search, Analysis, Visualization and Download functions.
2.2 Data Collection
Some of the attributes are listed here:
Species: Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
The status of data resource is open-access
'LibraryStrategy'='RNA-Seq'
'Sequencing Platform'='ILLUMINA'
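The selection criteria above can be sketched as a simple metadata filter. This is only an illustration: the `RunRecord` class and its field names are hypothetical and do not reflect the actual NHP Atlas schema or any archive API.

```python
from dataclasses import dataclass

# Species listed in the collection criteria above.
TARGET_SPECIES = {
    "Macaca mulatta",
    "Macaca fascicularis",
    "Papio anubis",
    "Callithrix jacchus",
    "Pan troglodytes",
}


@dataclass
class RunRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    accession: str
    species: str
    open_access: bool
    library_strategy: str
    platform: str


def is_candidate(rec: RunRecord) -> bool:
    """Return True when a record meets all four listed criteria."""
    return (
        rec.species in TARGET_SPECIES
        and rec.open_access
        and rec.library_strategy == "RNA-Seq"
        and rec.platform == "ILLUMINA"
    )


records = [
    RunRecord("RUN_A", "Macaca mulatta", True, "RNA-Seq", "ILLUMINA"),
    RunRecord("RUN_B", "Homo sapiens", True, "RNA-Seq", "ILLUMINA"),
    RunRecord("RUN_C", "Papio anubis", False, "RNA-Seq", "ILLUMINA"),
]
selected = [r.accession for r in records if is_candidate(r)]
```

Here only the first toy record passes, because the second is not a target species and the third is not open-access.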
2.3 Data Processing
The RNA-seq pipeline is constructed with reference to GEN toolkits. It includes quality control, read alignment and gene/transcript expression quantification.
Filter low-quality reads: Fastp v0.20.0 (Chen et al, 2018)
Library strandedness: RSeQC v2.6.4 (Wang et al, 2012)
Mapping to reference genome: STAR 2.7.1a (Dobin et al, 2013)
Gene/isoform assembly and quantification: RSEM v1.3.1 (Li & Dewey, 2011)
Basic expression profiling: RawCount, RPKM and TPM
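As a rough illustration of how the three expression measures relate, the sketch below derives RPKM and TPM from raw counts using their textbook definitions. It is an assumption-laden simplification: it uses plain gene lengths, whereas RSEM works with estimated counts and effective lengths, so the values it reports will differ. The gene names and numbers are toy data.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    denom = sum(rate.values())
    if denom == 0:
        raise ValueError("no mapped reads")
    return {g: r * 1e6 / denom for g, r in rate.items()}


# Toy data (hypothetical genes): raw counts and gene lengths in bp.
raw_counts = {"geneA": 100, "geneB": 300}
gene_lengths = {"geneA": 1000, "geneB": 3000}

rpkm_values = rpkm(raw_counts, gene_lengths)
tpm_values = tpm(raw_counts, gene_lengths)
```

In this toy case both genes have the same read density, so they receive equal RPKM and equal TPM values, and the TPM values sum to one million by construction.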
References:
Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884-i890. PMID: 30423086
Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28(16):2184-2185. PMID: 22743226
Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. PMID: 23104886
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. PMID: 21816040
2.4 Meta Curation
Manual curation of metadata has 3 levels: Sample, SubProject and Project.
The structure of the curation model is as follows:
Curation model on the 'Project' level
Items | Description | Value
Data Resource | Controlled vocabulary | NGDC(GEN), NCBI, EBI, DDBJ
SubProject | Project with one species | Eg. PRJNA218629_caj
BioProject ID | Accession number of each BioProject from data resource | Eg. PRJNA218629, GEND000248
Project ID | Accession number of each expression project from data resource | Eg. GSE50747
Title | Title of each project from data resource | Eg. Origins and functional evolution of Y chromosome gene repertoires across the class Mammalia
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Strategy | Controlled vocabulary | Bulk RNA-seq
Tissue | Target tissue | Eg. Liver
Cell Type | Target cell type | Eg. T cells
Cell Line | Target cell line | Eg. iPSC line
Healthy Condition | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc.
Development Stage | The development stage of samples in the project | Conclusion term
Sample Number | Statistical data | Number of samples included in the project
Summary | Brief description of the project scheme | Conclusion term
Overall Design | Experiment design, mainly including sample grouping | Conclusion term
PMID | Publication in which the project is described | PubMed ID or DOI
Release Date | Release date of project in data resource | Including year, month and day
Submission Date | Submission date of project in data resource | Including year, month and day
Update Date | Update date of project in data resource | Including year, month and day
Corresponding.Author | Name of corresponding author | Author name
Institution.of.Corresponding.Author | Institution or postal address of corresponding author | Institution of Corresponding Author
Country | Country of corresponding author | Conclusion term
DO Term | Ontology disease term | Eg. COVID-19
DO ID | Ontology disease ID | Eg. DOID:0080600
DO Category | Ontology disease category | Eg. Infectious agent
Tissue-BTO Term | Ontology BTO term | Eg. BALL-1 cell
Tissue-BTO ID | Ontology BTO ID | Eg. BTO:0001148
BTO Category | Ontology BTO category | Eg. hematopoietic system
Cell Type-BTO Term | Ontology Cell Type BTO term | Eg. hematopoietic system
Cell type-BTO ID | Ontology Cell Type BTO ID | Eg. BTO:0001008
Curation model on the 'Sample' level
Items | Description | Value
Data Resource | Controlled vocabulary | NGDC(GEN), NCBI, EBI, DDBJ
SubProject | Project with one species | Eg. PRJNA218629_caj
GSE ID | GSE or other expression source | Eg. GSE50747
BioProject ID | Accession number of each BioProject from data resource | Eg. PRJNA218629, GEND000248
Sample ID | Accession number of each sample from data resource | Eg. GSM4232471
Sample_Name_Main | Name of each sample in NHP Atlas | Conclusion term
Sample Name | Name of each sample in data resource | Conclusion term
BioSample ID | Accession number of each BioSample in data resource | Eg. SAMN13678831
Sample Accession | Accession number of each raw data sample in data resource | CRS, SRS, or ERS
Experiment Accession | Accession number of each raw data experiment in data resource | CRX, SRX, or ERX
Release Date | Release date of sample data in data resource | Including year, month and day
Submission Date | Submission date of sample data in data resource | Including year, month and day
Update Date | Update date of sample data in data resource | Including year, month and day
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Race/Breed/Strain | Controlled vocabulary | Race refers to a person's physical characteristics, such as bone structure and skin, hair, or eye color (for example, American Indian, Asian, Black, Hispanic, White, etc.). Breed refers to a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species. Strain refers to variants of plants, viruses or bacteria, or an inbred animal used for experimental purposes. Cultivar is an assemblage of plants selected for desirable characteristics that are maintained during propagation.
Ethnicity/Country | Controlled vocabulary | Eg. China
Age | Statistical data | The age of samples (patients, healthy donors, etc.)
Age unit | Controlled vocabulary | The age unit of samples (year, week, day, etc.)
Gender | Controlled vocabulary | Male, female, etc.
Source Name | Name of each sample group | Conclusion term
Tissue | Target tissue | Eg. Liver
Cell Type | Target cell type | Eg. T cells
Cell Line | Target cell line | Eg. iPSC line
Disease | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc.
Development Stage | The development stage of samples | Conclusion term
Mutation | Related gene mutation | Conclusion term
Strategy | Controlled vocabulary | Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc.
Library Layout | Controlled vocabulary | Reverse means first strand, Forward means second strand, and dash (-) means strand-unspecific
Platform | Controlled vocabulary | Illumina, BGISEQ, etc.
Instrument Model | Controlled vocabulary | Illumina HiSeq 2000, Illumina NextSeq 500, BGISEQ-500, etc.
#Cells | Statistical data | The estimated number of cells
#Reads | Statistical data | The number of reads in the fastq file
Gbases | Statistical data | Total bases after filtering
AvgSpotLen1 (bp) | Statistical data | Average spot1 length (after filtering if filtered)
AvgSpotLen2 (bp) | Statistical data | Average spot2 length (after filtering if filtered)
Multi_Mapping Rate | Statistical data | Percent of multi-mapped reads
Coverage Rate | Statistical data | Total number of mapped reads * average read length / total bases of reference genome
Reference Genome | Reference genome version, eg. GRCh38 v99 (including ERCC if needed) | Eg. Macaque (Mmul_10)
Genome Annotation | Genome annotation version, eg. GRCh38 v99 (including ERCC if needed) | Eg. Macaque (Mmul_10)
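The Coverage Rate entry in the table above is a simple ratio, sketched below as a small helper. The function name and the example numbers are hypothetical; in particular, the genome size used here is a round illustrative figure, not the size of any specific assembly.

```python
def coverage_rate(mapped_reads: int, avg_read_length: float, genome_size_bp: int) -> float:
    """Coverage rate = mapped reads * average read length / reference genome bases."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return mapped_reads * avg_read_length / genome_size_bp


# Toy numbers: 30 million mapped reads of 100 bp against a 3 Gb reference.
example_coverage = coverage_rate(30_000_000, 100, 3_000_000_000)
```

With these toy inputs the total mapped bases equal the genome size, so the coverage rate is 1.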
3.1 Overview
The single-cell transcriptome module systematically integrates 179,079 high-quality cells from 16 projects and 5 strategies in NCBI, EBI, DDBJ and NGDC (GEN). It also enables Browse, Search, Analysis, Visualization and Download.
3.2 Data Collection
Some of the attributes are listed here:
Same as RNA-seq
'LibraryStrategy'='10x genomics' or
'LibraryStrategy'='Smart-seq2' or
'LibraryStrategy'='SMARTer' or
'LibraryStrategy'='Smart-seq v4' or
'LibraryStrategy'='Drop-seq'
3.3 Data Processing
The scRNA-seq pipeline is constructed with reference to GEN toolkits (https://ngdc.cncb.ac.cn/gen/documentation). It includes quality control, read alignment, gene/transcript expression quantification and cell clustering.
10X:
1. Extract barcode, UMI and RNA read
2. Correct barcode
3. Align reads with STAR
4. Tag reads with gene and transcript hits
5. Count UMIs
6. Select cell barcodes
Drop-seq:
1. Drop tag
2. Extract barcode, UMI and RNA read
3. Align reads with STAR
4. Generate dropEST and dropReport
SMART-seq:
1. Filter low-quality reads: Fastp v0.20.0 (Chen et al, 2018)
2. Library strandedness: RSeQC v2.6.4 (Wang et al, 2012)
3. Mapping to reference genome: STAR 2.7.1a (Dobin et al, 2013)
4. Gene/isoform assembly and quantification: RSEM v1.3.1 (Li & Dewey, 2011)
5. Basic expression profiling
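The UMI counting and cell barcode selection steps of the 10X workflow can be illustrated with a minimal sketch. It assumes reads have already been aligned and tagged, and it deduplicates UMIs by exact match only; production tools also correct UMI sequencing errors and use more elaborate cell-calling methods. The function names, the fixed `min_umis` threshold and the toy records are all hypothetical.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (cell barcode, gene).

    `records` is an iterable of (barcode, umi, gene) tuples; reads without
    a gene assignment (gene is None) are skipped.
    """
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        if gene is not None:
            seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


def select_cells(umi_counts, min_umis):
    """Keep barcodes whose total UMI count reaches a simple threshold.

    A fixed threshold is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    for (barcode, _gene), n in umi_counts.items():
        totals[barcode] += n
    return {b for b, t in totals.items() if t >= min_umis}


# Toy tagged reads: the duplicated (AAAC, UMI1, GeneA) read is counted once.
tagged_reads = [
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI2", "GeneA"),
    ("TTTG", "UMI1", "GeneB"),
    ("AAAC", "UMI3", None),
]
umi_counts = count_umis(tagged_reads)
cells = select_cells(umi_counts, min_umis=2)
```

Counting distinct UMIs rather than reads is what removes PCR duplicates: two reads sharing the same barcode, gene and UMI are treated as one original molecule.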
3.4 Meta Curation
The same as 2.4 Meta Curation.
4.1 Overview
The methylome module systematically integrates 26 high-quality methylation samples from 9 projects. It also enables Browse, Search, Analysis, Visualization and Download. In addition, it provides manually curated knowledge of featured differentially methylated genes (DMGs) across 12 kinds of biological contexts, such as disease, as well as a collection of methylation tools.
4.2 Data Collection
Some of the attributes are listed here:
The status of data resource is open-access
'DataSet Type'= 'methylation profiling by high throughput sequencing'
[All Fields] = "WGBS" or "BS-Seq" or "Whole Genome Bisulfite Sequencing"
The predicted sequencing depth of WGBS sample should be greater than 10
4.3 Data Processing
The WGBS pipeline is constructed with reference to MethBank toolkits.
Quality control: FastQC v0.11.7, Fastq-dump (sratoolkit.2.8.2-1)
Mapping to the reference genome: Bismark-0.22.3
Visualization: mapping rate, unique mapping rate, genome coverage, C coverage, conversion rate and depth.
4.4 Meta Curation
Manual curation of metadata has 2 levels: Sample and Project.
The structure of the curation model is as follows:
Curation model on the 'Project' level
Items | Description | Value
BioProject ID | BioProject from data resource | Eg. PRJNA668521
Project ID | Series project from data resource | Eg. GSE159347
Data Resource | Controlled vocabulary | NGDC, NCBI
PMID | Publication in which the project is described | PubMed ID or DOI
Title | Title of each project from data resource | Conclusion term
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis and Pan troglodytes
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc.
Overall Design | Experiment design, mainly including sample grouping | Conclusion term
Healthy Condition | Controlled vocabulary | Lymphocytic Leukemia (CLL), Healthy Control, etc.
Development Stage | The development stage of samples | Conclusion term
Sample Number | Statistical data | Number of samples included in the project
Submission Date | Submission date of project in data resource | Including year, month and day
Curation model on the 'Sample' level
Items | Description | Value
Sample ID | Accession number of each sample from data resource | Eg. SRX10614838
Project ID | Series project from data resource | Eg. GSE159347
Sample Name | Name of each sample in data resource | Conclusion term
Source | Name of each sample group | Conclusion term
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc.
Disease | Controlled vocabulary | Lymphocytic Leukemia (CLL), Healthy Control, etc.
Gender | Controlled vocabulary | Male, female, etc.
Development Stage | The development stage of samples | Conclusion term
Age | Statistical data | The age of samples (patients, healthy donors, etc.)
Cell Type | Controlled vocabulary | T cell, B cell, etc.
Genotype | Genotype of sample | Eg. Wide_Type
Uniquely mapping rate | Calculated data | Percent of uniquely mapped reads
Coverage Rate | Calculated data | Total number of mapped reads * average read length / total bases of reference genome
5.1 Overview
To further explore human disease and health, we manually curated 1229 articles on NHP disease models, encompassing 308 diseases from 21 disease ontology systems, and recorded them in the Disease Module. Users in any biomedical field can use the Disease Module to browse by disease name or DOID of interest and to access omics data, detailed literature information and the research status in other model animal databases.
5.2 How to use
In the “Disease” part, we present all the diseases related to human health and link disease nodes directly to the curated literature. Users can conveniently browse a specific disease and find the corresponding research articles.
The network in the top-left corner was built with Cytoscape. It shows MCL-clustered subnetworks of the Disease Ontology (DO) tree, which include all the disease nodes from our curation. Each color block represents one type of systemic disease, which is labeled in bold on the map. The small colored points inside each subnetwork indicate the number of curations: white means zero, blue means 1-5, orange means 6-14 and purple means over 15. After a subnetwork (i.e. a color block) is clicked, its details appear in the lower region of the webpage. Users can click the nodes in the details to explore the curations, which are shown in the right part of the webpage. Basic information is provided for browsing, and more can be obtained by clicking the PubMed ID.
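The color legend described above amounts to a small binning rule, sketched here for clarity. The function name is hypothetical, and because the text leaves a count of exactly 15 unspecified, this sketch assumes that 15 and above map to purple.

```python
def curation_color(n_curations: int) -> str:
    """Map a node's number of curations to its legend color."""
    if n_curations < 0:
        raise ValueError("number of curations cannot be negative")
    if n_curations == 0:
        return "white"
    if n_curations <= 5:
        return "blue"
    if n_curations <= 14:
        return "orange"
    # Assumption: a count of exactly 15 is grouped with "over 15".
    return "purple"


legend = {n: curation_color(n) for n in (0, 1, 5, 6, 14, 20)}
```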
5.3 Methods
First, all Disease Ontology (DO) IDs and their relationships were downloaded from the DO website (https://disease-ontology.org). We then mapped our curated DO terms to the downloaded DO tree and extracted the necessary nodes and relations (“is a”) with custom Python and R scripts. The resulting DO structure was loaded into the Cytoscape (v3.9.1) network analysis tool and annotated with the curated information. The network was simplified by removing some high-level terms such as Disease (DOID:4), Nervous system disease (DOID:863) and Disease of anatomical entity (DOID:7), and was then clustered with the MCL clustering algorithm from the clusterMaker2 (v1.2.1) Cytoscape app. Finally, the layout of the network was manually edited to arrange the clusters in an orderly way.
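The node extraction and simplification steps can be sketched as follows. This is not the authors' script: it assumes that the "necessary" nodes are the curated terms plus all of their “is a” ancestors, and it uses a toy hierarchy with made-up `TOY:` identifiers. Only DOID:4 is taken from the text above. The MCL clustering itself is performed in Cytoscape and is not reproduced here.

```python
def extract_subtree(is_a, curated_ids, drop_ids):
    """Collect curated terms and their ancestors, then drop high-level terms.

    `is_a` maps each term ID to the set of its direct parents.
    Returns the kept node set and the "is a" edges among kept nodes.
    """
    keep = set()
    stack = list(curated_ids)
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(is_a.get(node, ()))  # walk up to the parents
    keep -= set(drop_ids)
    edges = {
        (child, parent)
        for child in keep
        for parent in is_a.get(child, ())
        if parent in keep
    }
    return keep, edges


# Toy hierarchy: TOY:leaf -> TOY:mid -> DOID:4, and an uncurated TOY:other.
toy_is_a = {
    "TOY:leaf": {"TOY:mid"},
    "TOY:mid": {"DOID:4"},
    "TOY:other": {"DOID:4"},
}
nodes, edges = extract_subtree(toy_is_a, curated_ids={"TOY:leaf"}, drop_ids={"DOID:4"})
```

Removing root-like terms such as DOID:4 before clustering keeps every disease from being connected through a single hub, which would otherwise blur the cluster boundaries.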
The structure of the curation model is as follows:
Curation model of disease
Items | Description | Value
PubMed ID | Publication in which the disease model is described | PubMed ID or DOI
Title | Title of each publication | Conclusion term
Journal | Journal name | Conclusion term
Publication year | Release date of publication | Including year
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Strain | The origin of target species | Eg. Chinese-origin
DOID | Ontology disease ID | Eg. DOID:0080600
Disease name | Ontology disease term | Eg. COVID-19
Tag | Ontology disease category | Eg. Infectious agent
Gene & Molecular | Gene or molecule studied or mentioned in the literature | Eg. FUT2
Gene & Molecular Description | More detail about the gene or molecule studied or mentioned in the literature | Conclusion term
Tissue/sample | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc.
Dataset | Accession number or access route for the dataset of each publication | Conclusion term
Drug | Drug name | Conclusion term
Drug ID | Accession number or access route for each drug | Conclusion term
First, users can find detailed gene information on this page. The gene card shows the gene symbol, gene description, species, gene location, gene type and HGNC ID, which were collected from Ensembl BioMart. Users can also query a gene by choosing the target species and target gene. Multi-species and multi-omics information is brought together on the Gene page, where users can query transcripts, Gene Ontology, homologous genes, transcriptome and methylome expression information, and JBrowse visualization.
The Tools page displays manually curated NHP-related tools and software for users to access and query.
nhp_atlas@big.ac.cn
Postal Address:
National Genomics Data Center
China National Center for Bioinformation / Beijing Institute of Genomics
Chinese Academy of Sciences
No.1 Beichen West Road
Chaoyang District, Beijing 100101
China