1.Introduction

Non-human primates (NHPs) are important biomedical models for many aspects of human health and disease. Although a large volume of NHP sequence data is now available, the relevant biological resources remain poorly integrated. To examine complex NHP biological processes holistically, it is essential to take an integrative approach that combines multi-omics data to reveal the interrelationships among the biomolecules involved and their functions. Because experimental NHPs are scarce and their prices are high and rising fast, exploring the existing data before using the animals can be a valuable option.
Therefore, we present NHP Atlas (Non-Human Primate multi-omics Atlas). NHP Atlas aims to integrate large, diverse and continually arriving NHP biological resources. It is the first non-human primate database that covers a comprehensive range of species and omics types, with NHP model animals as its core. By integrating these NHP resources, it aims to contribute to human health.

2.Bulk Transcriptome

2.1 Overview

The Bulk Transcriptome module systematically integrates 3052 high-quality bulk RNA-seq samples from 73 projects in NCBI, EBI, DDBJ and NGDC (GEN). It offers user-friendly interfaces for accessing, visualizing and further mining the curated gene expression data through its Browse, Search, Analysis, Visualization and Download functions.

2.2 Data Collection

Some of the attributes are listed here:
Species: Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
The status of data resource is open-access
'LibraryStrategy'='RNA-Seq'
'Sequencing Platform'='ILLUMINA'
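
The criteria above can be read as a simple record filter. The following is a minimal sketch, not the curators' actual tooling; the record layout and the "Status" field name are assumptions made for illustration.

```python
# Species accepted by the bulk transcriptome module (from the list above).
TARGET_SPECIES = {
    "Macaca mulatta",
    "Macaca fascicularis",
    "Papio anubis",
    "Callithrix jacchus",
    "Pan troglodytes",
}


def is_bulk_candidate(record):
    """Return True if a metadata record meets all four listed criteria."""
    return (
        record.get("Species") in TARGET_SPECIES
        and record.get("Status") == "open-access"  # hypothetical field name
        and record.get("LibraryStrategy") == "RNA-Seq"
        and record.get("Sequencing Platform") == "ILLUMINA"
    )


# Illustrative record, not taken from the database.
example = {
    "Species": "Macaca mulatta",
    "Status": "open-access",
    "LibraryStrategy": "RNA-Seq",
    "Sequencing Platform": "ILLUMINA",
}
print(is_bulk_candidate(example))
```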

2.3 Data Processing

The RNA-seq pipeline is constructed with reference to the GEN toolkits. It includes quality control, read alignment and gene/transcript expression quantification.
Filter low-quality reads: fastp v0.20.0 (Chen et al, 2018)
Infer library strandedness: RSeQC v2.6.4 (Wang et al, 2012)
Mapping to reference genome: STAR 2.7.1a (Dobin et al, 2013)
Gene/isoform assembly and quantification: RSEM v1.3.1 (Li & Dewey, 2011)
Basic expression profiling: RawCount, RPKM and TPM
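
For readers unfamiliar with the three expression measures, the sketch below shows the textbook definitions of RPKM and TPM computed from raw counts. It is only an illustration: the function names are hypothetical, plain gene length is used in place of the effective length that RSEM applies, and the input numbers are made up.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    # count / (length in kb) / (total reads in millions)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        return [0.0 for _ in counts]
    return [r * 1e6 / total_rate for r in rates]


# Illustrative values: three genes with raw counts and lengths in bp.
raw_counts = [10, 20, 30]
gene_lengths = [1000, 2000, 3000]
print(rpkm(raw_counts, gene_lengths))
print(tpm(raw_counts, gene_lengths))
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM.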

Reference:
Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018,34(17):i884-i890. PMID:30423086
Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012,28(16):2184-2185. PMID:22743226
Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013,29(1):15-21. PMID:23104886
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011,12:323. PMID:21816040

2.4 Meta Curation

Manual curation of metadata has three levels: Sample, SubProject and Project.
The structure of curation model is as follows:
Curation model on the 'Project' level
Items | Description | Value
Data Resource | Controlled vocabulary | NGDC(GEN), NCBI, EBI, DDBJ
SubProject | Project with one species | Eg. PRJNA218629_caj
BioProject ID | Accession number of each BioProject from data resource | Eg. PRJNA218629, GEND000248
Project ID | Accession number of each expression project from data resource | Eg. GSE50747
Title | Publication in which the interaction is described | Eg. Origins and functional evolution of Y chromosome gene repertoires across the class Mammalia
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Strategy | Controlled vocabulary | Bulk RNA-seq
Tissue | Target tissue | Eg. Liver
Cell Type | Target cell type | Eg. T cells
Cell Line | Target cell line | Eg. iPSC line
Healthy Condition | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Development Stage | The development stage of Samples in Project | Conclusion term
Sample Number | Statistical data | Number of samples included in the project
Summary | Brief description of the project scheme | Conclusion term
Overall Design | Experiment design, mainly including samples grouping | Conclusion term
PMID | Publication in which the interaction is described | PubMed ID or DOI
Release Date | Release date of Project in Data Resource | Including year, month and day
Submission Date | Submission date of Project in Data Resource | Including year, month and day
Update Date | Update date of Project in Data Resource | Including year, month and day
Corresponding.Author | Name of corresponding author | Author name
Institution.of.Corresponding.Author | Institution or Postal address of corresponding author | Institution of Corresponding Author
Country | Country of corresponding author | Conclusion term
DO Term | Ontology disease term | Eg. COVID-19
DO ID | Ontology disease id | Eg. DOID:0080600
DO Category | Ontology disease category | Eg. Infectious agent
Tissue-BTO Term | Ontology BTO term | Eg. BALL-1 cell
Tissue-BTO ID | Ontology BTO id | Eg. BTO:0001148
BTO Category | Ontology BTO category | Eg. hematopoietic system
Cell Type-BTO Term | Ontology Cell Type BTO term | Eg. hematopoietic system
Cell type-BTO ID | Ontology Cell Type BTO ID | Eg. BTO:0001008
Curation model on the 'Sample' level
Items | Description | Value
Data Resource | Controlled vocabulary | NGDC(GEN), NCBI, EBI, DDBJ
SubProject | Project with one species | Eg. PRJNA218629_caj
GSE ID | GSE or other expression source | Eg. GSE50747
BioProject ID | Accession number of each BioProject from data resource | Eg. PRJNA218629, GEND000248
Sample ID | Accession number of each sample from data resource | Eg. GSM4232471
Sample_Name_Main | Name of each sample in NHP Atlas | Conclusion term
Sample Name | Name of each sample in data resource | Conclusion term
BioSample ID | Accession number of each Biosample in data resource | Eg. SAMN13678831
Sample Accession | Accession number of each raw data sample in data resource | CRS, SRS, or ERS
Experiment Accession | Accession number of each raw data Experiment in data resource | CRX, SRX, or ERX
Release Date | Release date of sample data in data resource | Including year, month and day
Submission Date | Submission date of sample data in data resource | Including year, month and day
Update Date | Update date of sample data in data resource | Including year, month and day
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Race/Breed/Strain | Controlled vocabulary | Race refers to a person's physical characteristics, such as bone structure and skin, hair, or eye color (for example, American Indian, Asian, Black, Hispanic, White, etc). Breed refers to a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species. Strain refers to variants of plants, viruses or bacteria, or an inbred animal used for experimental purposes. Cultivar is an assemblage of plants selected for desirable characteristics that are maintained during propagation
Ethnicity/Country | Controlled vocabulary | Eg. China
Age | Statistical data | The age of samples (patients, healthy donors, etc)
Age unit | Controlled vocabulary | The age unit of samples (Year, week, day, etc)
Gender | Controlled vocabulary | Male, female, etc
Source Name | Name of each sample group | Conclusion term
Tissue | Target tissue | Eg. Liver
Cell Type | Target cell type | Eg. T cells
Cell Line | Target cell line | Eg. iPSC line
Disease | Controlled vocabulary | Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Development Stage | The development stage of samples | Conclusion term
Mutation | Related gene mutation | Conclusion term
Strategy | Controlled vocabulary | Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc
Library Layout | Controlled vocabulary | Reverse means First strand, Forward means Second strand, and dash (-) means strand-unspecific
Platform | Controlled vocabulary | Illumina, BGISEQ, etc
Instrument Model | Controlled vocabulary | Illumina HiSeq 2000, Illumina NextSeq 500, BGISEQ-500, etc
#Cells | Statistical data | The estimated number of cells
#Reads | Statistical data | The number of reads in fastq file
Gbases | Statistical data | Total bases after filtering
AvgSpotLen1 (bp) | Statistical data | Average spot1 length (after filtering if filtered)
AvgSpotLen2 (bp) | Statistical data | Average spot2 length (after filtering if filtered)
Multi_Mapping Rate | Statistical data | Percent of multi-mapped reads
Coverage Rate | Statistical data | Total mapped reads number * average read length / total bases of reference genome
Reference Genome | Reference genome version, eg. GRCh38 v99 (including ERCC if needed) | Eg. Macaque (Mmul_10)
Genome Annotation | Genome annotation version, eg. GRCh38 v99 (including ERCC if needed) | Eg. Macaque (Mmul_10)
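
The two mapping statistics in the sample-level table can be worked through as follows. This is a hedged sketch: the function names are hypothetical, the input numbers are illustrative, and using the total number of reads as the denominator of the multi-mapping rate is an assumption, since the table does not state it.

```python
def multi_mapping_rate(multi_mapped_reads, total_reads):
    """Percent of multi-mapped reads (denominator assumed to be all reads)."""
    return 100.0 * multi_mapped_reads / total_reads


def coverage_rate(total_mapped_reads, avg_read_length_bp, genome_size_bp):
    """Total mapped reads * average read length / total bases of the reference genome."""
    return total_mapped_reads * avg_read_length_bp / genome_size_bp


# Illustrative numbers only, not taken from any curated sample.
print(multi_mapping_rate(5, 100))
print(coverage_rate(30_000_000, 100, 3_000_000_000))
```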

3.Single cell Transcriptome

3.1 Overview

The Single-cell Transcriptome module systematically integrates 179,079 high-quality cells from 16 projects and 5 library strategies in NCBI, EBI, DDBJ and NGDC (GEN). It also provides the Browse, Search, Analysis, Visualization and Download functions.

3.2 Data Collection

Some of the attributes are listed here:

Same as bulk RNA-seq, except for the library strategy:
'LibraryStrategy'='10x genomics' or
'LibraryStrategy'='Smart-seq2' or
'LibraryStrategy'='SMARTer' or
'LibraryStrategy'='Smart-seq v4' or
'LibraryStrategy'='Drop-seq'

3.3 Data Processing

The scRNA-seq pipeline is constructed with reference to the GEN toolkits (https://ngdc.cncb.ac.cn/gen/documentation). It includes quality control, read alignment, gene/transcript expression quantification and cell clustering.

10X:
1.Extract barcode, UMI, RNA read
2.Correct barcode
3.Align reads with STAR
4.Tag reads with genes, transcript hits
5.Count UMIs
6.Select cell barcodes
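
The "Count UMIs" step can be illustrated with a short sketch: reads that share the same cell barcode, gene and UMI are treated as copies of one molecule and counted once. This is a simplified illustration with a hypothetical function name and made-up reads; it omits the UMI error correction that production pipelines usually apply.

```python
from collections import defaultdict


def count_umis(tagged_reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    tagged_reads: iterable of (cell_barcode, gene, umi) tuples.
    """
    seen = defaultdict(set)
    for barcode, gene, umi in tagged_reads:
        seen[(barcode, gene)].add(umi)  # duplicate UMIs collapse in the set
    return {key: len(umis) for key, umis in seen.items()}


# Illustrative reads: the first two are PCR duplicates of one molecule.
reads = [
    ("AAAC", "GENE1", "TTGA"),
    ("AAAC", "GENE1", "TTGA"),
    ("AAAC", "GENE1", "CCGT"),
    ("GGTA", "GENE2", "TTGA"),
]
print(count_umis(reads))
```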
Drop-seq:
1.Drop tag
2.Extract barcode, UMI, RNA read
3.Align reads with STAR
4.Generate dropEST and dropReport
SMART-seq:
1.Filter low-quality reads: fastp v0.20.0 (Chen et al, 2018)
2.Infer library strandedness: RSeQC v2.6.4 (Wang et al, 2012)
3.Mapping to reference genome: STAR 2.7.1a (Dobin et al, 2013)
4.Gene/isoform assembly and quantification: RSEM v1.3.1 (Li & Dewey, 2011)
5.Basic expression profiling
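
One way to picture how a sample reaches one of the three workflows above is a lookup from its library strategy. The sketch below is illustrative only: routing Smart-seq2, SMARTer and Smart-seq v4 to the SMART-seq workflow is an assumption based on the names, and the dictionary and function names are hypothetical.

```python
# Assumed mapping from library strategy (lower-cased) to workflow name.
STRATEGY_TO_PIPELINE = {
    "10x genomics": "10X",
    "drop-seq": "Drop-seq",
    "smart-seq2": "SMART-seq",
    "smarter": "SMART-seq",
    "smart-seq v4": "SMART-seq",
}

# Step names paraphrased from the three lists above.
PIPELINE_STEPS = {
    "10X": [
        "Extract barcode, UMI, RNA read",
        "Correct barcode",
        "Align reads with STAR",
        "Tag reads with genes, transcript hits",
        "Count UMIs",
        "Select cell barcodes",
    ],
    "Drop-seq": [
        "Drop tag",
        "Extract barcode, UMI, RNA read",
        "Align reads with STAR",
        "Generate dropEST and dropReport",
    ],
    "SMART-seq": [
        "Filter low-quality reads",
        "Infer library strandedness",
        "Map to reference genome",
        "Gene/isoform assembly and quantification",
        "Basic expression profiling",
    ],
}


def pipeline_for(strategy):
    """Return the workflow name for a library strategy label."""
    key = strategy.strip().lower()  # tolerate stray spaces and case differences
    if key not in STRATEGY_TO_PIPELINE:
        raise ValueError(f"Unsupported library strategy: {strategy!r}")
    return STRATEGY_TO_PIPELINE[key]


for step in PIPELINE_STEPS[pipeline_for("Smart-seq2")]:
    print(step)
```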

3.4 Meta Curation

The same as 2.4 Meta Curation

4.Methylome

4.1 Overview

The Methylome module systematically integrates 26 high-quality methylation samples from 9 projects. It also provides the Browse, Search, Analysis, Visualization and Download functions. In addition, it offers manually curated knowledge of featured differentially methylated genes (DMGs) across 12 kinds of biological contexts, such as disease, as well as a collection of methylation tools.

4.2 Data Collection

Some of the attributes are listed here:
The status of data resource is open-access
'DataSet Type'= 'methylation profiling by high throughput sequencing'
[All Fields] = "WGBS" or "BS-Seq" or "Whole Genome Bisulfite Sequencing"
The predicted sequencing depth of WGBS sample should be greater than 10
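
The depth criterion can be sketched as a pre-download check. How the depth is actually predicted is not stated here, so the estimate below (total sequenced bases divided by genome size) is an assumption, as are the function names and the example numbers.

```python
def predicted_depth(n_reads, read_length_bp, genome_size_bp):
    """Rough depth estimate: total sequenced bases / genome size (assumed method)."""
    return n_reads * read_length_bp / genome_size_bp


def passes_depth_filter(n_reads, read_length_bp, genome_size_bp, min_depth=10.0):
    """Keep a WGBS sample only if its predicted depth is greater than min_depth."""
    return predicted_depth(n_reads, read_length_bp, genome_size_bp) > min_depth


# Illustrative sample: 400 million reads of 100 bp on a 3 Gb genome.
print(predicted_depth(400_000_000, 100, 3_000_000_000))
print(passes_depth_filter(400_000_000, 100, 3_000_000_000))
```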

4.3 Data Processing

The WGBS pipeline is constructed with reference to the MethBank toolkits.
Quality control: FastQC v0.11.7, Fastq-dump (sratoolkit.2.8.2-1)
Mapping to the reference genome: Bismark v0.22.3
Visualizing: Mapping rate, Unique mapping rate, Genome coverage, C coverage, Conversion rate and Depth.

4.4 Meta Curation

Manual curation of metadata has two levels: Sample and Project.
The structure of curation model is as follows:
Curation model on the 'Project' level
Items | Description | Value
BioProject ID | BioProject from data resource | Eg. PRJNA668521
Project ID | Series project from data resource | Eg. GSE159347
Data Resource | Controlled vocabulary | NGDC, NCBI
PMID | Publication in which the interaction is described | PubMed ID or DOI
Title | Title of each project from data resource | Conclusion term
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis and Pan troglodytes
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Overall Design | Experiment design, mainly including samples grouping | Conclusion term
Healthy Condition | Controlled vocabulary | Lymphocytic Leukemia (CLL), Healthy Control, etc
Development Stage | The development stage of samples | Conclusion term
Sample Number | Statistical data | Number of samples included in the project
Submission Date | Release date of sample data in data resource | Including year, month and day
Curation model on the 'Sample' level
Items | Description | Value
Sample ID | Accession number of each sample from data resource | Eg. SRX10614838
Project ID | Series project from data resource | Eg. GSE159347
Sample Name | Name of each sample in data resource | Conclusion term
Source | Name of each sample group | Conclusion term
Tissue | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Disease | Controlled vocabulary | Lymphocytic Leukemia (CLL), Healthy Control, etc
Gender | Controlled vocabulary | Male, female, etc
Development Stage | The development stage of samples | Conclusion term
Age | Statistical data | The age of samples (patients, healthy donors, etc)
Cell Type | Controlled vocabulary | T cell, B cell, etc
Genotype | Genotype of sample | Eg. Wide_Type
Uniquely mapping rate | Calculated data | Percent of uniquely mapped reads
Coverage Rate | Calculated data | Total mapped reads number * average read length / total bases of reference genome

5.Disease Model

5.1 Overview

To further explore human disease and health, we manually curated 1229 articles on NHP disease models, covering 308 diseases from 21 disease ontology systems, and recorded them in the Disease Model module. Users in any biomedical field can use the module to browse by disease name or DOID of interest and to access the related omics data, detailed literature information and research status in other model animal databases.

5.2 How to use

In the “Disease” part, we present all the diseases related to human health and link disease nodes directly to the curated literature. Users can conveniently browse a specific disease and find the corresponding research articles.
The network presented in the top-left corner was built with Cytoscape. It shows MCL-clustered subnetworks from the Disease Ontology (DO) tree, which includes all the disease nodes from our curation. Each color block represents one system-level disease category, which is labeled in bold on the map. The small colored points inside each subnetwork indicate the number of curations: white means zero curations, blue means 1-5, orange means 6-14 and purple means over 15. After a subnetwork (i.e. a color block) is clicked, its details appear in the lower region of the webpage. Users can then click the nodes in the details to explore the curations, which are shown in the right part of the webpage. We provide some basic information for browsing; more can be obtained by clicking the PubMed ID.
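
The color legend amounts to binning each node's curation count. A minimal sketch follows, with a hypothetical function name. The legend does not say which color a node with exactly 15 curations receives, so assigning it to purple is an assumption.

```python
def curation_color(n_curations):
    """Map a node's curation count to its legend color."""
    if n_curations < 0:
        raise ValueError("curation count cannot be negative")
    if n_curations == 0:
        return "white"
    if n_curations <= 5:
        return "blue"
    if n_curations <= 14:
        return "orange"
    return "purple"  # assumption: exactly 15 is grouped with the top bin


for count in (0, 3, 10, 20):
    print(count, curation_color(count))
```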

5.3 Methods

First, all the Disease Ontology (DO) IDs and their relationships were downloaded from the DO website (https://disease-ontology.org). Then, we mapped our curated DO terms to the downloaded DO tree and extracted the necessary nodes and relations (“Is a”) with custom Python and R scripts. The basic DO structures were loaded into the Cytoscape (v3.9.1) network analysis tool and annotated with the curated information. A new network was generated and simplified by removing some high-level terms such as Disease (DOID:4), Nervous system disease (DOID:863) and Disease of anatomical entity (DOID:7). The network was then clustered with the MCL clustering algorithm from the clusterMaker2 (v1.2.1) Cytoscape app. Finally, the layout was manually edited to arrange the clusters in an orderly way.
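
The extraction and simplification steps can be sketched in a few lines of Python. This is not the custom script used for the database, only an illustration of the idea: walk the "Is a" relations upward from the curated terms, then drop the high-level terms. The function name is hypothetical and the "TOY:" identifiers are invented placeholders, not real DO terms.

```python
def extract_subtree(is_a, curated, excluded):
    """Collect curated terms plus their ancestors, then drop excluded terms.

    is_a: dict mapping a child term to a list of its parent terms.
    Returns the kept nodes and the "Is a" edges between them.
    """
    keep = set()
    stack = list(curated)
    while stack:
        term = stack.pop()
        if term in keep:
            continue
        keep.add(term)
        stack.extend(is_a.get(term, []))  # climb to the parents
    keep -= set(excluded)
    edges = sorted(
        (child, parent)
        for child in keep
        for parent in is_a.get(child, [])
        if parent in keep
    )
    return keep, edges


# Toy hierarchy: TOY:2 -> TOY:1 -> DOID:863 -> DOID:7 -> DOID:4
IS_A = {
    "TOY:2": ["TOY:1"],
    "TOY:1": ["DOID:863"],
    "DOID:863": ["DOID:7"],
    "DOID:7": ["DOID:4"],
}
HIGH_LEVEL = {"DOID:4", "DOID:863", "DOID:7"}

nodes, edges = extract_subtree(IS_A, {"TOY:2"}, HIGH_LEVEL)
print(sorted(nodes))
print(edges)
```

The resulting node and edge lists are the kind of input that could then be loaded into Cytoscape for annotation and MCL clustering.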

5.4 Meta Curation

The structure of curation model is as follows:
Curation model of disease


Items | Description | Value
PubMed ID | Publication in which the interaction is described | PubMed ID or DOI
Title | Title of each publication | Conclusion term
Journal | Journal name | Conclusion term
Publication year | Release date of publications | Including year
Species | Controlled vocabulary | Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Strain | The origin of target species | Eg. Chinese-origin
DOID | Ontology disease ID | Eg. DOID:0080600
Disease name | Ontology disease term | Eg. COVID-19
Tag | Ontology disease category | Eg. Infectious agent
Gene & Molecular | Gene or molecule researched or mentioned in the literature | Eg. FUT2
Gene & Molecular Description | More detail about the gene or molecule researched or mentioned in the literature | Conclusion term
Tissue/sample | Controlled vocabulary | Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Dataset | Accession number or Accession way of each publication | Conclusion term
Drug | Drug name | Conclusion term
Drug ID | Accession number or Accession way of each drug | Conclusion term

6.Gene

Users can find detailed gene information on this page. The gene card shows the gene symbol, gene description, species, gene location, gene type and HGNC ID, which were collected from Ensembl BioMart. Users can also query a gene by choosing the target species and target gene. Multi-species and multi-omics information is brought together on the Gene page, where users can query Transcripts, Gene Ontology, homologous genes, transcriptome and methylome expression information and the JBrowse visualization.

7.Tools

The Tools page displays manually curated NHP-related tools and software for users to browse and query.

8.Contact us

nhp_atlas@big.ac.cn

Postal Address:
National Genomics Data Center
China National Center for Bioinformation / Beijing Institute of Genomics
Chinese Academy of Sciences
No.1 Beichen West Road
Chaoyang District, Beijing 100101
China