MethBank
a comprehensive database of DNA methylation across a variety of species

Introduction

1. Whole genome bisulfite sequencing (WGBS)

DNA methylation is a stable epigenetic modification that plays an important role in many biological processes, such as embryonic development, genomic imprinting, and cell differentiation. Detecting and quantifying methylation is critical for understanding gene expression and other processes subject to epigenetic regulation. Aberrant DNA methylation may be associated with a variety of human diseases, notably cancer, which opens new possibilities for the diagnosis and therapy of cancers and other severe diseases. Whole genome bisulfite sequencing (WGBS), regarded as the "gold standard" for a comprehensive view of the methylome, provides single-base resolution of methylated cytosines across the genome. With improvements in library preparation methods and next-generation sequencing technology over the past decade, WGBS has become an increasingly widespread and informative method for analyzing DNA methylation in epigenome-wide studies.

2. Challenge

With the number of studies based on whole-genome bisulfite sequencing (WGBS) increasing exponentially, a large amount of WGBS data and knowledge related to methylation has accumulated. In contrast to array-based techniques such as Infinium 27K, 450K and EPIC, processing large WGBS datasets remains cumbersome because of its resource usage. As a result, most single-base-resolution methylation databases are no longer available or have not been updated for years. A dedicated database or platform that deeply integrates these WGBS data is therefore essential.

3. Our Mission

Here we present MethBank, a comprehensive database of single-base resolution DNA methylation profiles across a variety of species. As one of the core resources of the National Genomics Data Center (NGDC), Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS) & China National Center for Bioinformation (CNCB), MethBank integrates 1271 high-quality (>20X) whole-genome single-base methylomes for 19 species, covering 179 tissues/cell lines and 13 biological contexts. It also provides manually curated knowledge on featured differentially methylated genes (DMGs) across 11 kinds of biological contexts, such as disease, together with a collection of methylation tools. In addition, MethBank provides analysis tools, including Age Predictor, IDMP and a DMR toolkit, to support biologists in related research.

Data Collection:

High quality raw sequencing data of WGBS samples are acquired from accessible data repositories (SRA/GSA).
The data included in the database were searched according to the following filtering rules:
(1) The status of data resource is open-access
(2) [DataSet Type] = "methylation profiling by high throughput sequencing"
(3) [All Fields] = "WGBS" or "BS-Seq" or "Whole Genome Bisulfite Sequencing"
(4) The predicted sequencing depth of the WGBS sample is greater than 10
Data that pass this initial screening are then manually curated.
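
The four filtering rules above can be sketched as a simple predicate. This is only an illustration: the `CandidateRecord` class and its field names are hypothetical and do not correspond to any SRA/GSA API, and the case-insensitive matching is an assumption.

```python
from dataclasses import dataclass

# Values taken from filtering rules (2), (3) and (4).
DATASET_TYPE = "methylation profiling by high throughput sequencing"
SEARCH_TERMS = ("WGBS", "BS-Seq", "Whole Genome Bisulfite Sequencing")
MIN_PREDICTED_DEPTH = 10.0


@dataclass
class CandidateRecord:
    """Hypothetical, simplified view of one repository search hit."""
    accession: str
    open_access: bool
    dataset_type: str
    all_fields: str          # free text searched by rule (3)
    predicted_depth: float   # predicted sequencing depth of the sample


def passes_initial_screen(rec: CandidateRecord) -> bool:
    """Return True if the record satisfies rules (1) to (4)."""
    if not rec.open_access:                                   # rule (1)
        return False
    if rec.dataset_type.strip().lower() != DATASET_TYPE:      # rule (2)
        return False
    text = rec.all_fields.lower()
    if not any(term.lower() in text for term in SEARCH_TERMS):  # rule (3)
        return False
    return rec.predicted_depth > MIN_PREDICTED_DEPTH          # rule (4)
```

Records that pass this screen would still go through the manual curation step.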

Data Analysis:

The WGBS data analysis pipeline includes quality control; trimming of adapters and low-quality bases; alignment to the reference genome; extraction of CpG methylation; quantification of average gene/CpG island methylation levels; analysis of highly methylated CpG islands (genomic location enrichment, GO & KEGG enrichment); analysis of genes related to methylated CpG islands; and analysis of differentially methylated regions (DMRs) (genomic location enrichment, GO & KEGG enrichment).
First, all bisulfite sequencing reads were subjected to quality control with FastQC v0.11.7 and trimmed to remove adapters and low-quality bases using Fastq-dump (sratoolkit.2.8.2-1). Next, the reads that passed quality control were mapped to the reference genome of the corresponding species using Bismark-0.22.3. Detailed reference genome information for each species can be found and downloaded from the Downloads interface. We used the Bismark methylation extractor to extract methylation data from the aligned, filtered reads. To assess data quality, we compute six indicators: Mapping rate, Unique mapping rate, Genome coverage, C coverage, Conversion rate and Depth. The corresponding calculations are shown below.
Mapping rate: calculated from the *_PE_report.txt file generated by Bismark as (Sequence pairs did not map uniquely + Number of paired/Single-end alignments with a unique best hit) / Sequence pairs analysed in total
Unique mapping rate: calculated from the *_PE_report.txt file generated by Bismark as Number of paired/Single-end alignments with a unique best hit / Sequence pairs analysed in total
Depth: calculated with Samtools 1.9 (sort module)
Conversion rate: calculated from the *.bismark.cov file generated by Bismark
Genome coverage: total bases covered in the sample / total bases in the reference genome
C coverage: C bases covered in the sample / C bases in the reference genome
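
As a rough illustration of these ratios, the sketch below reads count lines from the text of a Bismark report and applies the formulas. The `label: count` line layout assumed by `parse_report_counts` and all function names are assumptions made for this example; the depth formula is the one given in the 'Assessing Quality' rows of the sample curation model.

```python
import re

# Assumed line layout: "<label>:<whitespace><integer count>"
COUNT_LINE = re.compile(r"^(.+?):\s+(\d+)\s*$")


def parse_report_counts(report_text: str) -> dict:
    """Collect every 'label: integer' line of a report into a dict."""
    counts = {}
    for line in report_text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def mapping_rate(unique_hits: int, not_unique: int, total: int) -> float:
    """(pairs not mapped uniquely + unique best hits) / pairs analysed."""
    return (not_unique + unique_hits) / total


def unique_mapping_rate(unique_hits: int, total: int) -> float:
    """Unique best hits / pairs analysed."""
    return unique_hits / total


def depth(mapped_reads: int, avg_read_length: float, genome_size: int) -> float:
    """Total mapped reads * average read length / reference genome size."""
    return mapped_reads * avg_read_length / genome_size


def coverage(covered_bases: int, reference_bases: int) -> float:
    """Covered bases / reference bases (all bases, or C bases only)."""
    return covered_bases / reference_bases


# Made-up numbers, for illustration only.
example_report = (
    "Sequence pairs analysed in total:\t1000\n"
    "Number of paired-end alignments with a unique best hit:\t800\n"
    "Sequence pairs did not map uniquely:\t50\n"
)
counts = parse_report_counts(example_report)
total = counts["Sequence pairs analysed in total"]
unique = counts["Number of paired-end alignments with a unique best hit"]
ambiguous = counts["Sequence pairs did not map uniquely"]
example_mapping_rate = mapping_rate(unique, ambiguous, total)
example_unique_rate = unique_mapping_rate(unique, total)
```

The same `coverage` helper serves both Genome coverage and C coverage; only the counted bases differ.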
Then, bedtools v2.17.0 and python3 were used to analyze gene methylation profiles of promoter, body and downstream regions in the C/CG/CH contexts. Subsequently, the relationship between CpG islands and genes was studied. From the perspective of CpG islands, we calculated the DNA methylation level of each CpG island and selected highly methylated CpG islands (average methylation level >= 0.6) for downstream analysis, including genomic location enrichment and GO & KEGG enrichment. From the perspective of genes, we provide all CpG islands that overlap with genes, together with the corresponding location information.
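
The CpG island selection step can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the average methylation level of an island is the mean of the per-site levels of the covered cytosines inside it, that island coordinates are inclusive, and that sites and islands are supplied as simple tuples (all hypothetical choices).

```python
from collections import defaultdict

HIGH_METHYLATION_THRESHOLD = 0.6  # threshold stated in the text


def island_methylation(sites, islands):
    """Mean per-site methylation level for each CpG island.

    sites:   iterable of (chrom, position, methylated_count, unmethylated_count)
    islands: iterable of (name, chrom, start, end), coordinates inclusive
    Islands without any covered site are omitted from the result.
    """
    levels_by_chrom = defaultdict(list)
    for chrom, pos, meth, unmeth in sites:
        reads = meth + unmeth
        if reads > 0:  # skip sites with no coverage
            levels_by_chrom[chrom].append((pos, meth / reads))

    result = {}
    for name, chrom, start, end in islands:
        inside = [level for pos, level in levels_by_chrom.get(chrom, [])
                  if start <= pos <= end]
        if inside:
            result[name] = sum(inside) / len(inside)
    return result


def highly_methylated(levels, threshold=HIGH_METHYLATION_THRESHOLD):
    """Keep only islands whose average level is >= threshold."""
    return {name: lvl for name, lvl in levels.items() if lvl >= threshold}
```

For genome-scale data the linear scan per island would be too slow; an interval-based tool such as bedtools, as used in the pipeline, is the practical choice.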

Finally, we identify differentially methylated regions (DMRs) with the DSS R package for the typical biological contexts within a single project, and analyze their genomic location enrichment and GO & KEGG enrichment.

Meta Curation:

The manual curation of metadata is done on two levels ('Project' and 'Sample'). The corresponding review contents and standards are as follows:
Curation model on the 'Project' level
Items | Description | Value
Data Resource | Controlled vocabulary | NGDC, NCBI
Project ID | Accession number of the project from data resource | CRA, GSE, HRA, SRP
Sample Number | Number of samples included in the project of MethBank | 1, 2, 3……
BioProject ID | Accession number of the BioProject from data resource | PRJCA, PRJNA
Title | Title of each project from data resource | Conclusion term
Summary | A summary description of the project | Conclusion term
Overall Design | Experiment design of the project | Conclusion term
Related Biological Process | Controlled vocabulary | Age, Health, Disease etc.
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Brassica napus etc.
Tissue/Cell Line | Controlled vocabulary | Liver, Pancreas, Brain etc.
Cell Type | Controlled vocabulary | HeLa-S3 Cell, K-562 Cell, Hep-G2 Cell etc.
Healthy Condition | Controlled vocabulary | Normal, Healthy, Head and neck squamous cell carcinoma etc.
Development Stage | Controlled vocabulary | Embryo, Gamete, Adult etc.
Disease State | Controlled vocabulary | T2bN0M0, T3bN0M0, T2cN0M0 etc.
Submitter | Submitter of the project | Lab info, Submitter name
Year | Submission year of the project | Feb 15, 2019 etc.
Publication | Publication information related to the project | Author, article name, journal name, date, doi etc.
PMID | PubMed ID of the publication related to the project | PubMed ID
Status | Public release date of the project | Public on Jul 03, 2013 etc.
Submission Date | Submission date of the project in data resource | Apr 29, 2020 etc.
Last Update Date | Last update date of the project in data resource | Apr 30, 2020 etc.
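
A curated project record can be checked against the accession prefixes listed for 'Project ID' and 'BioProject ID'. The sketch below is illustrative only: the function names and the dictionary layout of a record are hypothetical, and it checks prefixes rather than full accession syntax.

```python
# Prefixes taken from the 'Value' column of the 'Project' curation model.
PROJECT_ID_PREFIXES = ("CRA", "GSE", "HRA", "SRP")
BIOPROJECT_ID_PREFIXES = ("PRJCA", "PRJNA")


def has_known_prefix(accession: str, prefixes: tuple) -> bool:
    """True if the accession starts with one of the expected prefixes."""
    return accession.startswith(prefixes)


def check_project_accessions(record: dict) -> list:
    """Return a list of problems found in a project record (empty if none)."""
    problems = []
    project_id = record.get("Project ID", "")
    bioproject_id = record.get("BioProject ID", "")
    if not has_known_prefix(project_id, PROJECT_ID_PREFIXES):
        problems.append(f"Unexpected Project ID: {project_id!r}")
    if not has_known_prefix(bioproject_id, BIOPROJECT_ID_PREFIXES):
        problems.append(f"Unexpected BioProject ID: {bioproject_id!r}")
    return problems
```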
Curation model on the 'Sample' level
Items | Description | Value
Basic Information
Sample Name | Name of the sample from data resource | Conclusion term
Data Resource | Controlled vocabulary | NGDC, NCBI
Description | Description of the sample from data resource | Conclusion term
BioProject ID | Accession number of the BioProject from data resource | PRJNA, PRJCA
Project ID | Accession number of the project from data resource | GSE, SRP, HRA, CRA
Experiment ID | Accession number of the experiment sample in data resource | SRX, HRX, CRX
Status | Public release date of the sample | Public on Jul 03, 2013 etc.
Submission Date | Submission date of the sample in data resource | Apr 29, 2020 etc.
Last Update Date | Last update date of the sample in data resource | Apr 30, 2020 etc.
Donor ID | Donor ID of the sample in data resource | ENCBS
Sample Characteristic
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Brassica napus etc.
Tissue/Cell Line | Controlled vocabulary | Liver, Pancreas, Brain etc.
Cell Type | Controlled vocabulary | HeLa-S3 Cell, K-562 Cell, Hep-G2 Cell etc.
Source Name | Name of the sample group | Conclusion term
Strain | Controlled vocabulary | Strain means variants of plants, viruses or bacteria, or an inbred animal used for experimental purposes
Breed | Controlled vocabulary | Breed means a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species
Cultivar | Controlled vocabulary | Cultivar means a type of plant that people have bred for desired traits, which are reproduced in each new generation by a method such as grafting, tissue culture, or carefully controlled seed production
Sex | Controlled vocabulary | Male, Female, Pooled male and female
Age | The age/stage of samples | 1, 2, 3…… days/weeks/years etc.
Biological Condition
Healthy Condition | Controlled vocabulary | Normal, Healthy, Head and neck squamous cell carcinoma etc.
Disease State | Controlled vocabulary | T2bN0M0, T3bN0M0, T2cN0M0 etc.
Development Stage | Controlled vocabulary | Embryo, Gamete, Adult etc.
Genotype | Controlled vocabulary | Genotype of the sample means its complete set of genetic material
Treatment | Brief description of the sample treatment state | Conclusion term
Knockout | Brief description of the sample knockout state | Conclusion term
Resistance Phenotype | Brief description of the sample resistance phenotype state | Conclusion term
Protocol
Growth Protocol | Culture protocols of cells from samples or cell lines | Conclusion term
Treatment Protocol | Protocols of sample treatment | Conclusion term
Extraction Protocol | Protocols of WGBS extraction | Conclusion term
Construction Protocol | Protocols of WGBS library construction | Conclusion term
Library Strategy | Controlled vocabulary | Bisulfite-Seq
Library Source | Controlled vocabulary | GENOMIC
Library Selection | Controlled vocabulary | RANDOM, Other etc.
Layout | Controlled vocabulary | PAIRED, SINGLE
Platform | Controlled vocabulary | Illumina HiSeq 2500, HiSeq X Ten etc.
Assessing Quality
Mapping rate | Calculated data | (Sequence pairs did not map uniquely + Number of paired/Single-end alignments with a unique best hit) / Sequence pairs analysed in total
Unique mapping rate | Calculated data | Number of paired/Single-end alignments with a unique best hit / Sequence pairs analysed in total
Genome Coverage | Calculated data | Total bases covered in the sample / total bases in the reference genome
C Coverage | Calculated data | C bases covered in the sample / C bases in the reference genome
Conversion Rate | Calculated data | Proportion of Cs converted to Ts
Depth | Calculated data | Total number of mapped reads * average read length / total bases of the reference genome
Analysis
Reference Genome | Reference genome version | .fa/.fasta/.fna format file
Genome Annotation | Genome annotation version | .gff/.gtf/.gff3 format file
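
For fields whose permitted values are listed in the 'Sample' curation model without an "etc.", a controlled-vocabulary check could look like the sketch below. It is a hypothetical illustration: the function name and the dictionary representation of a sample are assumptions, and treating the listed values as exhaustive is itself an assumption.

```python
# Values copied from the 'Sample' curation model; open-ended fields
# (those ending in "etc.") are deliberately left out.
CONTROLLED_VOCABULARY = {
    "Data Resource": {"NGDC", "NCBI"},
    "Library Strategy": {"Bisulfite-Seq"},
    "Library Source": {"GENOMIC"},
    "Layout": {"PAIRED", "SINGLE"},
    "Sex": {"Male", "Female", "Pooled male and female"},
}


def validate_sample(sample: dict) -> list:
    """Return messages for fields whose value is outside the vocabulary.

    Fields absent from the sample are skipped rather than reported.
    """
    problems = []
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = sample.get(field)
        if value is None:
            continue
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not in the controlled vocabulary")
    return problems
```

A curator could run such a check before the manual review to catch spelling or case variants, for example 'paired' instead of 'PAIRED'.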