Non-human primates (NHPs) provide important biomedical models for many aspects of human health and disease. Despite the large amount of NHP sequence data now available, the relevant biological resources remain poorly integrated. To examine complex NHP biological processes holistically, it is essential to take an integrative approach that combines multi-omics data to reveal the interrelationships of the biomolecules involved and their functions. Because experimental NHPs are scarce and their prices are high and rising fast, examining existing data before using the animals can be a valuable option.

Therefore, we present NHP Atlas (Non-Human Primate multi-omics Atlas). NHP Atlas aims to integrate large, diverse and continually growing NHP biological resources. It is the first non-human primate database to cover multiple species and multiple omics types, with NHP model animals at its core. By integrating these resources, it aims to support research on human health.

2.1 Overview

The Bulk Transcriptome module systematically integrates 3052 high-quality bulk RNA-seq samples from 73 projects in NCBI, EBI, DDBJ and NGDC (GEN). It provides user-friendly interfaces for accessing, visualizing and further mining the curated gene expression data through its Browse, Search, Analysis, Visualization and Download functions.

2.2 Data Collection

Some of the attributes are listed here:

Species: Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes

The status of data resource is open-access

'LibraryStrategy'='RNA-Seq'

'Sequencing Platform'='ILLUMINA'
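
The criteria above can be read as a filter over run-level metadata. The following is a minimal sketch, assuming each record is a dictionary. The keys `LibraryStrategy` and `Sequencing Platform` are the attributes quoted above, while `Species`, `Access` and the function name `passes_bulk_criteria` are hypothetical names used only for illustration.

```python
# Species covered by NHP Atlas, as listed above.
NHP_SPECIES = {
    "Macaca mulatta",
    "Macaca fascicularis",
    "Papio anubis",
    "Callithrix jacchus",
    "Pan troglodytes",
}


def passes_bulk_criteria(record: dict) -> bool:
    """Return True if a metadata record meets the bulk RNA-seq criteria.

    'Species' and 'Access' are hypothetical field names; the other two
    keys are the attributes quoted in the documentation.
    """
    return (
        record.get("Species") in NHP_SPECIES
        and record.get("Access") == "open-access"
        and record.get("LibraryStrategy") == "RNA-Seq"
        and record.get("Sequencing Platform") == "ILLUMINA"
    )


# Two made-up records: only the first one is an NHP species.
records = [
    {"Species": "Macaca mulatta", "Access": "open-access",
     "LibraryStrategy": "RNA-Seq", "Sequencing Platform": "ILLUMINA"},
    {"Species": "Homo sapiens", "Access": "open-access",
     "LibraryStrategy": "RNA-Seq", "Sequencing Platform": "ILLUMINA"},
]
selected = [r for r in records if passes_bulk_criteria(r)]
print(len(selected))
```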

2.3 Data Processing

The RNA-seq pipeline is constructed with reference to the GEN toolkits. It includes quality control, read alignment, and gene/transcript expression quantification.

Filter low quality reads: Fastp v0.20.0 (Chen et al, 2018)

Library strandedness inference: RSeQC v2.6.4 (Wang et al, 2012)

Mapping to reference genome: STAR 2.7.1a (Dobin et al, 2013)

Gene/isoform assembly and quantification: RSEM v1.3.1 (Li & Dewey, 2011)

Basic expression profiling: RawCount, RPKM and TPM
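
For readers unfamiliar with the normalized units, the textbook definitions of RPKM and TPM can be sketched as below. This is an illustration only, not the pipeline's own code: a quantification tool normally reports these values itself and may use an effective length rather than the plain feature length assumed here. The function names and the toy numbers are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million mapped reads."""
    total = sum(counts)  # assumed to be non-zero
    return [c * 1e9 / (total * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)  # assumed to be non-zero
    return [r * 1e6 / rate_sum for r in rates]


# Toy raw counts and feature lengths (bp) for three genes.
raw_counts = [100, 200, 700]
gene_lengths = [1000, 2000, 7000]

rpkm_values = rpkm(raw_counts, gene_lengths)
tpm_values = tpm(raw_counts, gene_lengths)
print(rpkm_values)
print(tpm_values)
```

TPM values always sum to one million within a sample, which makes proportions comparable across samples; RPKM values do not share that property.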

Reference:

Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018,34(17):i884-i890. PMID:30423086

Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012,28(16):2184-2185. PMID:22743226

Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013,29(1):15-21. PMID:23104886

Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011,12:323. PMID:21816040

2.4 Meta Curation

Manual curation of metadata has 3 levels: Sample, SubProject and Project.

The structure of curation model is as follows:

Curation model on the 'Project' level

Items	Description	Value
Data Resource	Controlled vocabulary	NGDC(GEN), NCBI, EBI, DDBJ
SubProject	Project with one species	Eg. PRJNA218629_caj
BioProject ID	Accession number of each BioProject from data resource	Eg. PRJNA218629, GEND000248
Project ID	Accession number of each expression project from data resource	Eg. GSE50747
Title	Title of each project from data resource	Eg. Origins and functional evolution of Y chromosome gene repertoires across the class Mammalia
Species	Controlled vocabulary	Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Strategy	Controlled vocabulary	Bulk RNA-seq
Tissue	Target tissue	Eg. Liver
Cell Type	Target cell type	Eg. T cells
Cell Line	Target cell line	Eg. iPSC line
Healthy Condition	Controlled vocabulary	Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Development Stage	The development stage of Samples in Project	Conclusion term
Sample Number	Statistical data	Number of samples included in the project
Summary	Brief description of the project scheme	Conclusion term
Overall Design	Experiment design, mainly including samples grouping	Conclusion term
PMID	Publication in which the interaction is described	PubMed ID or DOI
Release Date	Release date of Project in Data Resource	Including year, month and day
Submission Date	Submission date of Project in Data Resource	Including year, month and day
Update Date	Update date of Project in Data Resource	Including year, month and day
Corresponding.Author	Name of corresponding author	Author name
Institution.of.Corresponding.Author	Institution or Postal address of corresponding author	Institution of Corresponding Author
Country	Country of corresponding author	Conclusion term
DO Term	Ontology disease term	Eg. COVID-19
DO ID	Ontology disease id	Eg. DOID:0080600
DO Category	Ontology disease category	Eg. Infectious agent
Tissue-BTO Term	Ontology BTO term	Eg. BALL-1 cell
Tissue-BTO ID	Ontology BTO id	Eg. BTO:0001148
BTO Category	Ontology BTO category	Eg. hematopoietic system
Cell Type-BTO Term	Ontology Cell Type BTO term	Eg. hematopoietic system
Cell type-BTO ID	Ontology Cell Type BTO ID	Eg. BTO:0001008

Curation model on the 'Sample' level

Items	Description	Value
Data Resource	Controlled vocabulary	NGDC(GEN), NCBI, EBI, DDBJ
SubProject	Project with one species	Eg. PRJNA218629_caj
GSE ID	GSE or other expression source	Eg. GSE50747
BioProject ID	Accession number of each BioProject from data resource	Eg. PRJNA218629, GEND000248
Sample ID	Accession number of each sample from data resource	Eg. GSM4232471
Sample_Name_Main	Name of each sample in NHP Atlas	Conclusion term
Sample Name	Name of each sample in data resource	Conclusion term
BioSample ID	Accession number of each Biosample in data resource	Eg. SAMN13678831
Sample Accession	Accession number of each raw data sample in data resource	CRS, SRS, or ERS
Experiment Accession	Accession number of each raw data Experiment in data resource	CRX, SRX, or ERX
Release Date	Release date of sample data in data resource	Including year, month and day
Submission Date	Submission date of sample data in data resource	Including year, month and day
Update Date	Update date of sample data in data resource	Including year, month and day
Species	Controlled vocabulary	Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Race/Breed/Strain	Controlled vocabulary	Race refers to a person's physical characteristics, such as bone structure and skin, hair or eye color (for example, American Indian, Asian, Black, Hispanic, White, etc). Breed refers to a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species. Strain refers to variants of plants, viruses or bacteria, or to an inbred animal used for experimental purposes. Cultivar is an assemblage of plants selected for desirable characteristics that are maintained during propagation.
Ethnicity/Country	Controlled vocabulary	Eg. China
Age	Statistical data	The age of samples (patients, healthy donors, etc)
Age unit	Controlled vocabulary	The age unit of samples (Year, week, day, etc)
Gender	Controlled vocabulary	Male, female, etc
Source Name	Name of each sample group	Conclusion term
Tissue	Target tissue	Eg. Liver
Cell Type	Target cell type	Eg. T cells
Cell Line	Target cell line	Eg. iPSC line
Disease	Controlled vocabulary	Asthma, Chronic Lymphocytic Leukemia (CLL), Healthy Control, etc
Development Stage	The development stage of samples	Conclusion term
Mutation	Related gene mutation	Conclusion term
Strategy	Controlled vocabulary	Bulk RNA-seq, scRNA 10X Genomics, scRNA Smart-seq2, etc
Library Layout	Controlled vocabulary	Reverse means First strand, Forward means Second strand, and dash (-) means strand-unspecific
Platform	Controlled vocabulary	Illumina, BGISEQ, etc
Instrument Model	Controlled vocabulary	Illumina HiSeq 2000, Illumina NextSeq 500, BGISEQ-500, etc
#Cells	Statistical data	The estimated number of cells
#Reads	Statistical data	The number of reads in fastq file
Gbases	Statistical data	Total bases after filtering
AvgSpotLen1 (bp)	Statistical data	Average spot1 length (after filtering if filtered)
AvgSpotLen2 (bp)	Statistical data	Average spot2 length (after filtering if filtered)
Multi_Mapping Rate	Statistical data	Percent of multi-mapped reads
Coverage Rate	Statistical data	total mapped reads number*Average read length/total bases of reference genome
Reference Genome	Reference genome version, eg. GRCh38 v99 (including ERCC if needed)	Eg. Macaque (Mmul_10)
Genome Annotation	Genome annotation version, eg. GRCh38 v99 (including ERCC if needed)	Eg. Macaque (Mmul_10)
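
The Coverage Rate entry in the table above is a simple ratio. As a minimal sketch, it can be written as the function below; the function name, argument names and the numbers in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def coverage_rate(mapped_reads: int, avg_read_length: float,
                  genome_bases: int) -> float:
    """Total mapped reads * average read length / total bases of the genome."""
    return mapped_reads * avg_read_length / genome_bases


# Made-up example: 30 million mapped reads of 100 bp on a 3 Gb genome.
example = coverage_rate(30_000_000, 100, 3_000_000_000)
print(example)
```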

3.1 Overview

The Single cell Transcriptome module systematically integrates 179,079 high-quality cells from 16 projects and 5 strategies, drawn from NCBI, EBI, DDBJ and NGDC (GEN). It also enables Browse, Search, Analysis, Visualization and Download.

3.2 Data Collection

Some of the attributes are listed here:

Same as RNA-seq
'LibraryStrategy'='10x genomics' or
'LibraryStrategy'='Smart-seq2' or
'LibraryStrategy'='SMARTer' or
'LibraryStrategy'='Smart-seq v4' or
'LibraryStrategy'='Drop-seq'

3.3 Data Processing

The scRNA-seq pipeline is constructed with reference to the GEN toolkits (https://ngdc.cncb.ac.cn/gen/documentation). It includes quality control, read alignment, gene/transcript expression quantification and cell clustering.

10X:

1.Extract barcode, UMI, RNA read

2.Correct barcode

3.Align reads with STAR

4.Tag reads with genes, transcript hits

5.Count UMIs

6.Select cell barcodes
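
Steps 5 and 6 of the 10X list can be illustrated with a simplified sketch. It assumes reads have already been tagged with a corrected cell barcode, a UMI and a gene, and it uses a fixed UMI threshold to select cells. That threshold is a stand-in for illustration; the actual pipeline may select barcodes differently. All function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def count_umis(tagged_reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    tagged_reads: iterable of (cell_barcode, umi, gene) tuples.
    Duplicate reads carrying the same UMI are counted once.
    """
    umis = defaultdict(set)
    for barcode, umi, gene in tagged_reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(umi_set) for key, umi_set in umis.items()}


def select_cells(umi_counts, min_umis):
    """Keep barcodes whose total UMI count reaches min_umis."""
    totals = defaultdict(int)
    for (barcode, _gene), n in umi_counts.items():
        totals[barcode] += n
    return {barcode for barcode, total in totals.items() if total >= min_umis}


# Toy input: the second read is a PCR duplicate of the first.
reads = [
    ("AAAC", "UMI1", "GAPDH"),
    ("AAAC", "UMI1", "GAPDH"),
    ("AAAC", "UMI2", "GAPDH"),
    ("AAAC", "UMI3", "ACTB"),
    ("TTTG", "UMI1", "GAPDH"),
]
counts = count_umis(reads)
cells = select_cells(counts, min_umis=2)
print(counts)
print(cells)
```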

Drop-seq:

1.Drop tag

2.Extract barcode, UMI, RNA read

3.Align reads with STAR

4.Generate dropEST and dropReport

SMART-seq:

1.Filter low quality reads: Fastp v0.20.0 (Chen et al, 2018)

2.Library strandedness inference: RSeQC v2.6.4 (Wang et al, 2012)

3.Mapping to reference genome: STAR 2.7.1a (Dobin et al, 2013)

4.Gene/isoform assembly and quantification: RSEM v1.3.1 (Li & Dewey, 2011)

5.Basic expression profiling

3.4 Meta Curation

The same as 2.4 Meta Curation

4.1 Overview

The Methylome module systematically integrates 26 high-quality methylation samples from 9 projects. It also enables Browse, Search, Analysis, Visualization and Download. In addition, it provides manually curated knowledge of featured differentially methylated genes (DMGs) across 12 kinds of biological contexts, such as disease, as well as a collection of methylation tools.

4.2 Data Collection

Some of the attributes are listed here:
The status of data resource is open-access

'DataSet Type'= 'methylation profiling by high throughput sequencing'

[All Fields] = "WGBS" or "BS-Seq" or "Whole Genome Bisulfite Sequencing"

The predicted sequencing depth of WGBS sample should be greater than 10

4.3 Data Processing

WGBS-seq pipeline is constructed with reference to MethBank toolkits

Quality control: FastQC v0.11.7, Fastq-dump (sratoolkit.2.8.2-1)

Mapping to the reference genome: Bismark-0.22.3

Visualizing: Mapping rate, Unique mapping rate, Genome coverage, C coverage, Conversion rate and Depth.

4.4 Meta Curation

Manual curation of metadata has 2 levels: Sample and Project.

The structure of curation model is as follows:

Curation model on the 'Project' level

Items	Description	Value
BioProject ID	BioProject from data resource	PRJNA668521
Project ID	series project from data resource	Eg. GSE159347
Data Resource	Controlled vocabulary	NGDC, NCBI
PMID	Publication in which the interaction is described	PubMed ID or DOI
Title	Title of each project from data resource	Conclusion term
Species	Controlled vocabulary	Macaca mulatta, Macaca fascicularis and Pan troglodytes
Tissue	Controlled vocabulary	Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Overall Design	Experiment design, mainly including samples grouping	Conclusion term
Healthy Condition	Controlled vocabulary	Lymphocytic Leukemia (CLL), Healthy Control, etc
Development Stage	The development stage of samples	Conclusion term
Sample Number	Statistical data	Number of samples included in the project
Submission Date	Submission date of project in data resource	Including year, month and day

Curation model on the 'Sample' level

Items	Description	Value
Sample ID	Accession number of each sample from data resource	Eg. SRX10614838
Project ID	series project from data resource	Eg. GSE159347
Sample Name	Name of each sample in data resource	Conclusion term
Source	Name of each sample group	Conclusion term
Tissue	Controlled vocabulary	Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc
Disease	Controlled vocabulary	Lymphocytic Leukemia (CLL), Healthy Control, etc
Gender	Controlled vocabulary	Male, female, etc
Development Stage	The development stage of samples	Conclusion term
Age	Statistical data	The age of samples (patients, healthy donors, etc)
Cell Type	Controlled vocabulary	T cell, B cell, etc
Genotype	Genotype of sample	Eg. Wild_Type
Uniquely mapping rate	Calculated data	Percent of uniquely mapped reads
Coverage Rate	Calculated data	total mapped reads number*Average read length/total bases of reference genome

5.1 Overview

To further explore human disease and health, we manually curated 1229 articles on NHP disease models, covering 308 diseases from 21 disease ontology systems; these are recorded in the Disease Model module. Users in any biomedical field can use the module to browse a disease name or DOID of interest and access the associated omics data, detailed literature information, and the status of research in other model animal databases.

5.2 How to use

In the “Disease” part, we present all the diseases related to human health and link disease nodes directly to the curated literature. Users can conveniently browse a specific disease and find the corresponding research articles.
The network in the top-left corner was built with Cytoscape. It shows MCL-clustered subnetworks of the Disease Ontology (DO) tree, which includes all disease nodes from our curation. Each color block represents one category of system-level disease, whose name is labeled in bold on the map. The small colored points inside each subnetwork indicate the number of curations: white means zero, blue means 1-5, orange means 6-14, and purple means over 15. After a subnetwork (i.e. a color block) is clicked, its details appear in the lower region of the webpage. Users can click the nodes in the details to explore the curations, which are shown on the right side of the webpage. We provide some basic information for browsing; more can be obtained by clicking the PubMed ID.

5.3 Methods

First, all Disease Ontology (DO) IDs and their relationships were downloaded from the DO website (https://disease-ontology.org). We then mapped our curated DO terms to the downloaded DO tree and extracted the necessary nodes and relations (“Is a”) with custom Python and R scripts. The resulting DO structure was loaded into the Cytoscape (v3.9.1) network analysis tool and annotated with the curated information. The network was simplified by removing some high-level terms, such as Disease (DOID:4), Nervous system disease (DOID:863) and Disease of anatomical entity (DOID:7), and then clustered with the MCL algorithm from the clusterMaker2 (v1.2.1) Cytoscape app. Finally, the layout was edited manually to arrange the clusters in an orderly way.
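
The node-extraction and simplification steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used for NHP Atlas: it assumes the "Is a" relations are available as a child-to-parents dictionary, and the `TOY:` identifiers, the toy hierarchy and both function names are hypothetical. Only the three removed high-level DOIDs are taken from the text.

```python
def ancestors_closure(curated, is_a):
    """Return the curated terms plus every ancestor reachable via is_a."""
    keep = set()
    stack = list(curated)
    while stack:
        term = stack.pop()
        if term in keep:
            continue
        keep.add(term)
        stack.extend(is_a.get(term, ()))
    return keep


def build_edges(nodes, is_a, removed):
    """List (child, parent) edges among kept nodes, minus removed terms."""
    nodes = nodes - removed
    return sorted(
        (child, parent)
        for child in nodes
        for parent in is_a.get(child, ())
        if parent in nodes
    )


# Toy child -> parents map; the TOY:* terms are made up for illustration.
is_a = {
    "DOID:7": ["DOID:4"],
    "DOID:863": ["DOID:7"],
    "TOY:1": ["DOID:863"],
    "TOY:2": ["TOY:1"],
    "TOY:3": ["DOID:4"],  # not curated, so it is never pulled in
}
curated_terms = {"TOY:2"}
high_level = {"DOID:4", "DOID:863", "DOID:7"}

kept = ancestors_closure(curated_terms, is_a)
edges = build_edges(kept, is_a, high_level)
print(sorted(kept))
print(edges)
```

An edge list of this kind is the sort of table that can then be loaded into a network tool for clustering and layout.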

5.4 Meta Curation

The structure of curation model is as follows:

Curation model of disease

Items	Description	Value
PubMed ID	Publication in which the interaction is described	PubMed ID or DOI
Title	Title of each publication	Conclusion term
Journal	Journal name	Conclusion term
Publication year	Release date of publications	Including year
Species	Controlled vocabulary	Macaca mulatta, Macaca fascicularis, Papio anubis, Callithrix jacchus and Pan troglodytes
Strain	The origin of target species	Eg. Chinese-origin
DOID	Ontology disease ID	Eg. DOID:0080600
Disease name	Ontology disease term	Eg. COVID-19
Tag	Ontology disease category	Eg. Infectious agent
Gene & Molecular	Gene or molecule studied or mentioned in the literature	Eg. FUT2
Gene & Molecular Description	More detail about the gene or molecule studied or mentioned in the literature	Conclusion term
Tissue/sample	Controlled vocabulary	Brain, Liver, Skin, Kidney, Leaf, Root, Seed, etc.
Dataset	Accession number or Accession way of each publication	Conclusion term
Drug	Drug name	Conclusion term
Drug ID	Accession number or Accession way of each drug	Conclusion term

Users can find detailed gene information on this page. The gene card shows the gene symbol, gene description, species, gene location, gene type and HGNC ID, which were collected from Ensembl BioMart. Users can also query a gene by choosing the target species and target gene. Multi-species and multi-omics information is brought together on the Gene page, where users can query transcripts, Gene Ontology terms, homologous genes, transcriptome and methylome expression information, and JBrowse visualizations.

The Tools page displays manually curated NHP-related tools and software that users can browse and query.

nhp_atlas@big.ac.cn

Postal Address:
National Genomics Data Center
China National Center for Bioinformation / Beijing Institute of Genomics
Chinese Academy of Sciences
No.1 Beichen West Road
Chaoyang District, Beijing 100101
China

NHP Atlas

Non-human primate database

1.Introduction

2.Bulk Transcriptome

2.1 Overview

2.2 Data Collection

2.3 Data Processing

2.4 Meta Curation

3.Single cell Transcriptome

3.1 Overview

3.2 Data Collection

3.3 Data Processing

3.4 Meta Curation

4.Methylome

4.1 Overview

4.2 Data Collection

4.3 Data Processing

4.4 Meta Curation

5.Disease Model

5.1 Overview

5.2 How to use

5.3 Methods

5.4 Meta Curation

6.Gene

7.Tools

8.Contact us