Documentation - MethBank

MethBank, a comprehensive database of DNA single-base resolution methylation profiles across a variety of species, not only integrates 1363 whole genome single-base methylome for 23 species, covering 208 tissues/cell lines and 15 biological contexts, but also provides manually curate knowledge of both featured differentially methylated genes (DMGs) across 11 kinds of biological contexts like disease and methylation tools collection. Besides, MethBank also provides analysis tools including Age Predictor, IDMP and DMR toolkit to help related research of biologists.

1. Data Collection

High quality raw sequencing data of WGBS samples are acquired from accessible data repositories (SRA/GSA).

The data included in the database were searched according to the following filtering rules:

(1) The status of data resource is open-access

(2) [DataSet Type] = "methylation profiling by high throughput sequencing"

(3) [All Fields] = "WGBS" or "BS-Seq" or "Whole Genome Bisulfite Sequencing"

(4) The predicted sequencing depth of WGBS sample should be greater than 10.

Knowledge is curated from publications retrieved in PubMed.
For Featured DMGs, publication search followed the following rules:

(1) The keyword matching "WGBS", "whole-genome bisulfite sequencing", "RRBS" and "whole-genome DNA methylation"

(2) Publicated in the past twelve years (2010-present)

(3) Publications associated with featured DMGs.

The data after the initial screening are manually curated again.

2. Data Analysis

WGBS data analysis pipeline includes quality control, trim adapters and low quality bases, align to reference genome, extract CpG methylation, quantify Gene/CpG island average methylation level, high methylated CpG islands analysis (genomic location enrichment, GO & KEGG enrichment), genes related to methylated CpG islands analysis, differentially methylated regions (DMRs) analysis (genomic location enrichment, GO & KEGG enrichment).

First, all bisulfite sequence were subjected to quality control by FastQC v0.11.7 and trimmed to remove adaptors and low quality bases using Fastq-dump (sratoolkit.2.8.2-1). Next, the reads that passed quality control were mapped to the reference genome of the corresponding species using Bismark-0.22.3. Detailed species reference genome information is shown in the table below. We used the Bismark methylation extractor to extract methylation data from aligned, filtered reads. To visualize the quality of the data, we computed 6 indicators: Mapping rate, Unique mapping rate, Genome coverage, C coverage, Conversion rate and Depth. The corresponding calculations are shown below.

Species	Genome Version for Mapping	Genome Version for Annotation
Ailuropoda melanoleuca	ASM200744v2	ASM200744v2
Arabidopsis_thaliana	TAIR10.1	TAIR10.1
Bos taurus	ARS-UCD1.2.105	ARS-UCD1.2.105
Brassica napus	GCF_000686985.2_Bra_napus_v2.0	GCF_000686985.2_Bra_napus_v2.0
Canis lupus familiaris	CanFam3.1.104	CanFam3.1.104
Danio rerio	Zv9	Zv9
Macaca fascicularis	Macaca_fascicularis_5.0	Macaca_fascicularis_5.0.102
Gallus gallus	GRCg6a	GRCg6a.105
Glycine max	Gmax_275_v2.0	Gmax_275_v2.0
Gorilla gorilla	gorGor4	gorGor4.105
Homo sapiens	GRCh38.101	GRCh38.101
Manihot esculenta	Mesculenta_305_v6	Mesculenta_305_v6
Macaca mulatta	Mmul_10.105	Mmul_10.105
Mus musculus	GRCm38.75	GRCm38.75
Oryza sativa	IRGSP-1.0	IRGSP-1.0
Ovis aries	Oar_v3.1.101	Oar_v3.1.101
Pan troglodytes	Pan_tro_3.0	Pan_tro_3.0.105
Phaseolus vulgaris	Pvulgaris_218_v1	Pvulgaris_218_v1
Populus_trichocarpa	P.trichocarpa_v4.1	P.trichocarpa_v4.1
Rattus norvegicus	Rnor_6.0.101	Rnor_6.0.101
Salmo salar	Ssal_v3.1	Ssal_v3.1
Solanum lycopersicum	GCF_000188115.3_SL2.50	GCF_000188115.3_SL2.50
Sus scrofa	Sscrofa11.1	Sscrofa11.1.105
Xenopus laevis	xenlae2	xenlae2
Zea mays	GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0	GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0

Mapping Rate: calculated from the *_PE_report.txt file generated by Bismark (Sequence pairs did not map uniquely + Number of paired/Single-end alignments with a unique best hit) / Sequence pairs analysed in total

Unique Mapping Rate: calculated from the *_PE_report.txt file generated by Bismark
Number of paired/Single-end alignments with a unique best hit / Sequence pairs analysed in total

Depth: calculated by Samtools1.9 using sort module

Conversion Rate: calculated from the *.bismark.cov file generated by Bismark

Genome Coverage: all base size number of sample / all size number of genome reference

C Coverage: C base size number of sample / C base size number of genome reference

Then, bedtools v2.17.0 and python3 were used to analyze gene methylation profiles of promoter, body and downstream regions in the C/CG/CH context. Subsequently, the relationship between CpG island and gene were studied. From the perspective of CpG island, we calculated the DNA methylation level corresponding to CpG island, and selected highly methylated CpG island (average methylation level >= 0.6) as the research object for downstream analysis including genome enrichment, GO & KEGG enrichment. From the perspective of genes, we provide all CpG islands that overlapped with genes and counted the corresponding location information.

Finally, we identify the differential methylation regions (DMRs) by using DSS R-package for the typical biological contexts in single project, and analyze genomic location enrichment, GO & KEGG enrichment.

3. Meta Curation

The manual curation of metadata are done on 2 levels ('Project' and Sample'). The corresponding review contents and standards are as follows:

Curation model on the 'Project' level

Items	Description	Value
Data Resource	Controlled vocabulary	NGDC, NCBI
Project ID	Accession number of the project from data resource	CRA, GSE, HRA, SRP
Sample Number	Number of samples included in the project of MethBank	1, 2, 3……
BioProject ID	Accession number of the BioProject from data resource	PRJCA, PRJNA
Title	Title of each project from data resource	Conclusion term
Summary	A summary description of the project	Conclusion term
Overall Design	Experiment design of the project	Conclusion term
Related Biological Process	Controlled vocabulary	Age, Health, Disease etc
Species	Controlled vocabulary	Homo sapiens, Mus musculus, Brassica napus etc
Tissue/Cell Line	Controlled vocabulary	Liver, Pancreas, Brain etc
Cell Type	Controlled vocabulary	HeLa-S3 Cell, K-562 Cell, Hep-G2 Cell etc
Healthy Condition	Controlled vocabulary	Normal, Healthy, Head and neck squamous cell carcinoma etc
Development Stage	Controlled vocabulary	Embryo, Gamete, Adult etc
Disease State	Controlled vocabulary	T2bN0M0, T3bN0M0, T2cN0M0 etc
Submitter	Submitter of the project	Lab info, Submitter name
Year	Submit year of the project	Feb 15, 2019 etc
Publication	Publication information related to the project	Author, article name, journal name, date, doi etc
PMID	Pubmed id of Publication related to the project	PubMed ID
Status	Public date of the project	Public on Jul 03, 2013 etc
Submission Date	Submission date of the project in data resource	Apr 29, 2020 etc
Last Update Date	Last update date of the project in data resource	Apr 30, 2020 etc

Curation model on the 'Sample' level

Items	Description	Value
Basic Information
Sample Name	Name of the sample from data resource	Conclusion term
Data Resource	Controlled vocabulary	NGDC, NCBI
Description	Description of the sample from data resource	Conclusion term
BioProject ID	Accession number of the BioProject from data resource	PRJNA, PRJCA
Project ID	Accession number of the project from data resource	GSE, SRP, HRA, CRA
Experiment ID	Accession number of the experiment sample in data resource	SRX, HRX, CRX
Status	Public date of the sample	Public on Jul 03, 2013 etc
Submission Date	Submission date of the sample in data resource	Apr 29, 2020 etc
Last Update Date	Last update date of the sample in data resource	Apr 30, 2020 etc
Donor ID	Donor ID of the sample in data resource	ENCBS
Sample Characteristic
Species	Controlled vocabulary	Homo sapiens, Mus musculus, Brassica napus etc
Tissue/Cell Line	Controlled vocabulary	Liver, Pancreas, Brain etc
Cell Type	Controlled vocabulary	HeLa-S3 Cell, K-562 Cell, Hep-G2 Cell etc
Source Name	Name of the sample group	Conclusion term
Strain	Controlled vocabulary	Strain means variants of plants, viruses or bacteria; or an inbred animal used for experimental purposes
Breed	Controlled vocabulary	Breed means a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species.
Cultivar	Controlled vocabulary	Cultivar means a type of plant that people have bred for desired traits, which are reproduced in each new generation by a method such as grafting, tissue culture, or carefully controlled seed production.
Sex	Controlled vocabulary	Male, Female, Pooled male and female
Age	The age/stage of samples	1, 2, 3…… days/weeks/years etc
Biological Condition
Healthy Condition	Controlled vocabulary	Normal, Healthy, Head and neck squamous cell carcinoma etc
Disease State	Controlled vocabulary	T2bN0M0, T3bN0M0, T2cN0M0 etc
Development Stage	Controlled vocabulary	Embryo, Gamete, Adult etc
Genotype	Controlled vocabulary	Genotype of the sample means its complete set of genetic material
Treatment	Brief description of the sample treatment state	Conclusion term
Knockout	Brief description of the sample knockout state	Conclusion term
Resistance Phynotype	Brief description of the sample resistance phynotype state	Conclusion term
Protocol
Growth Protocol	Culture protocols of cells from samples or cell lines	Conclusion term
Treatment Protocol	Protocols of sample treatment	Conclusion term
Extraction Protocol	Protocols of WGBS extraction	Conclusion term
Construction Protocol	Protocols of WGBS library construction	Conclusion term
Library Strategy	Controlled vocabulary	Bisulfite-Seq
Library Source	Controlled vocabulary	GENOMIC
Library Selection	Controlled vocabulary	RANDOM, Other etc
Layout	Controlled vocabulary	PAIRED, SINGLE
Platform	Controlled vocabulary	Illumina HiSeq 2500, HiSeq X Ten etc
Assessing Quality
Mapping Rate	Calculated data	(Sequence pairs did not map uniquely + Number of paired/Single-end alignments with a unique best hit) / Sequence pairs analysed in total
Uniquely Mapping Rate	Calculated data	Number of paired/Single-end alignments with a unique best hit / Sequence pairs analysed in total
Genome Coverage	Calculated data	all base size number of sample / all size number of genome reference
C Coverage	Calculated data	C base size number of sample / C base size number of genome reference
Conversion Rate	Calculated data	Proportion of Cs converted to Ts
Depth	Calculated data	Total mapped reads number*Average read length/total bases of reference genome
Analysis
Genome Version for Mapping	Reference genome version	.fa/.fasta/.fna format file
Genome Version for Annotation	Genome annotation version	.gff/.gtf/.gff3 format file

1. Home Page

You can fill in the entry of interest in the input box to search and view the corresponding data and information that you need.
For example, you can search for a list of projects, samples, publications, and featured DMGs related to human WGBS data by entering "Homo sapiens" and selecting item. You can enter a gene name, such as TP53, to obtain the specific information of the gene in multiple species and its related publications and featured DMGs at the WGBS level.

You can also directly click the specific areas in the four sections of "Data Resource", "Knowledge Curations", "Tools" and "Methylation Snapshots" below to jump to the overall page of the corresponding section.

2. Methylome Browser Page

You can view the reference genome sequence, gene annotation, and distribution of CpG islands for specific species and specific genes on the Methylome Browser. You can also explore the differentially methylated regions in specific experiments in the select Track.

3. Data Resources

3.1 Projects Page

On the "Projects" page, you can click "Display column" to customize the columns of displayed information of projects in the table. On the left side of the page, you can click on “Species”, “Animal Tissue” and “Human Disease”, not only to view the distribution of the internal tree structure, but also to further jump to the table of contents of Projects under a specific category. All forms in the MethBank can be downloaded.

3.2 Samples Page

On the "Samples" page, you can click “Show Columns” and options under “Show Columns” to customize the displayed sample information and filter samples with specific criteria, as shown in the following figure below. A detailed explanation of Quality Assessment can be found in https://ngdc.cncb.ac.cn/methbank/faq#curation

3.3 Genes Page

On the "Genes" page, clicking define gene ID hyperlinks will jump to the page of methylation overview on gene. Take human gene ENSG00000000003 for example, you can not only see the relevant information of this gene, but also the table and line graph of the methylation profiles of CG and CH (H= A, T or C) of this gene in specific samples.

3.4 CpG Islands Page

You can select a specific species and a specific sample to view the results of chromosome distribution, GO, KEGG and Genomics Location of high methylated CpG islands and the result of the catalog of genes related to methylated CpG islands for the corresponding sample.

3.5 DMRs Page

You can select a specific species, project, group, and P-value and length on the left panel to view the corresponding DMR distribution on chromosome, GO&KEGG enrichment, genomics location results.

3.6 Publications Page

The page contains the information on the corresponding publications in Data resource, Tool Collections and Featured DMGs modules.

4 Knowledge

4.1 Featured DMGs Page

The Featured DMGs module summarized biological context-associated featured DMGs via full-scale manual curation to raise the potential availability of retrieving epigenetic marker genes and shared properties for kinds of the biological scene. This page presents 266 DNA methylation-related publications that we have curated by establishing a standardized curation process. You can view detail information in the table in the lower part of the page, including species, tissues, diseases, conditions, enrichment functions, featured DMGs, etc.

You can also explore the relationship between genes and diseases at the DNA methylation level through the interactive graphs in the Disease Network and Gene Network sections. Tissue Sunburst panel illustrated the tissue distribution of every biological conditions.

4.2 Tool Collections Page

Tool Collections page provides 501 methylation related tools collected by predefined keywords from the original literature and web sources. It is characterized by diverse categories, types, operating systems and other indicators. These tools are grouped into five main types: application/script, framework/library, package/module, and toolkit/suite. You can enter specific conditions on the left side of the Tools Collection page to view the corresponding tools.

5. Tools

5.1 Age predictor

Introduction

Age predictor is a predictive tool that uses methylation chip data of human blood to predict the age. You can input idat files, processed data or NCBI GEO Sample ID to get age prediction results of the three methods.

Usage

There are three kinds of input designed for different situations:

Idat Files: You can upload compressed raw data files (.gz format), which contain two files (one for green signal intensity and another for red signal intensity). Age Predictor will process the raw data using standard Illumina pipeline and return a predicted age.
Processed Data: The processed data file should be a tab (\t) delimited text file. The first column of processed data must be CpG probe identifiers (cg numbers), such as cg00000165. The second column of processed data must be beta values (range from 0 to 1). The file must contain 52 probes to meet our model need (list in probe_list.txt).
NCBI GEO Sample ID: You can paste sample ID list to the text box directly.

Result Interpretation

There are three different methods of age prediction: Support Vector Machine (SVM), Random Forest and Elastic Net.

5.2 IDMP

Introduction

IDMP is a tool to identify the differentially methylated promoters between two samples via Fisher's exact test with FDR correction. It is written in Perl and is executed from the command line in LINUX system.

Usage

For details on how to use IDMP, see the link below: https://ngdc.cncb.ac.cn/methbank/tools/idmp#usage

5.3 DMR Toolkit

Introduction

DMR toolkit is a pipeline package for DMRs identification, annotation and enrichment for multiple species.

Usage

Follow these steps for analysis:

(1) Two input files for methylation levels obtained by WGBS to be analyzed as case group and control group need to be prepared. The following are examples of two input file formats. The meaning of these properties are listed in https://ngdc.cncb.ac.cn/methbank/tools/dmr/toolkit#3.4

(2) Install the following software and packages.

-- bedtools 2.28.0

-- R 4.0.5

-- DSS 2.42.0 -- getopt 1.20.3 -- bsseq 1.30.0 -- clusterProfiler 3.18.1 -- tidyverse 1.3.1 -- data.table 1.14.2 -- enrichplot 1.10.2 -- topGO 2.42.0
-- KEGG.db 1.0 -- DO.db 2.9 -- DOSE 3.16.0 -- Rgraphviz 2.34.0 -- ChIPseeker 1.26.2

(3) Run DMR Toolkit as follows.

 # Download DMR Toolkit and extract
                                    it   

                                    wget -c https://download.cncb.ac.cn/methbank/Tool/DMRtoolkit_v1.0.tar.gz 

                                    tar -xzf DMRtoolkit_v1.0.tar.gz 

                                    

                                     # Make your directory and input files 
 
                                    
                                    cd ./DMRtoolkit_v1.0 

                                    mkdir process           
                                            # You can name the directory any way you want
 
                                    
                                    cd ./process

                                    mkdir Homo_sapiens           
                                        # You can name the directory any way you want
  
                                    cd ./Homo_sapiens

                                    vim case.txt          
                                            # Methylation data of the case group
  
                                    vim control.txt          
                                          # Methylation data of the control group
  
                                    

                                    # Run callDMR.sh
  
                                    sh ../../DMRtoolkit_v1.0/callDMR.sh Homo_sapiens ./case.txt ./control.txt
                                    DMR_result_name CG 0.05 0.1 bismark

Note: Make the directory in the same format as the example above. The parameters for "callDMR.sh" are described below. Controlled vocabulary of "species_name" is shown in https://ngdc.cncb.ac.cn/methbank/tools/dmr/toolkit#3.3

(4) Result of the analysis.

The names of final output file:
         "DMR_result_name"_DMR_0.01.txt
         "DMR_result_name"_DMR_0.01_Anno.tsv
         "DMR_result_name"_DMR_0.01_GO.tsv
         "DMR_result_name"_DMR_0.01_KEGG.tsv

The example of output files is shown below. The meaning of these properties is listed in https://ngdc.cncb.ac.cn/methbank/tools/dmr/toolkit#result

For more information about DMR toolkit, see the link below: https://ngdc.cncb.ac.cn/methbank/tools/dmr/toolkit

6. Download

From this page, you can download all the single base precision methylation data and their annotation information provided in the database. We also provide the sex-specific 450K and 850K methylation data of 111 healthy human tissues. Data can be downloaded by clicking the icon on the page or using an FTP tool (such as FileZilla Client).

7. Contact Us

If you have any questions, comments or suggestions, please send us an email at methbank(AT)big.ac.cn, and, we will give corresponding at the first time.

Methbank is free for academic use only. For any commercial use, please contact us for commercial licensing terms.

MethBank

Introduction

Data Integration and Analysis

1. Data Collection

2. Data Analysis

3. Meta Curation

Database Usage

1. Home Page

2. Methylome Browser Page

3. Data Resources

3.1 Projects Page

3.2 Samples Page

3.3 Genes Page

3.4 CpG Islands Page

3.5 DMRs Page

3.6 Publications Page

4 Knowledge

4.1 Featured DMGs Page

4.2 Tool Collections Page

5. Tools

5.1 Age predictor

Introduction

Usage

Result Interpretation

5.2 IDMP

Introduction

Usage

5.3 DMR Toolkit

Introduction

Usage

6. Download

7. Contact Us

Licenses