1. Whole genome bisulfite sequencing (WGBS)
DNA methylation is a stable epigenetic modification that plays an important role in many biological
processes, such as embryonic development, genomic imprinting, and cell differentiation. Detecting and
quantifying methylation is critical to understanding gene expression and other processes
subject to epigenetic regulation. Aberrant DNA methylation may be associated with a variety of
human diseases, most notably cancer, which opens new possibilities for the diagnosis and therapy of
cancers and other severe diseases. Whole genome bisulfite sequencing (WGBS), regarded as the "gold
standard" for a comprehensive view of the methylome, provides single-base resolution of methylated
cytosines across the genome. With improvements in library preparation methods and next-generation
sequencing technology over the past decade, WGBS has become an increasingly widespread and
informative method for analyzing DNA methylation in epigenome-wide studies.
2. Challenge
With the exponentially increasing number of studies based on whole-genome bisulfite
sequencing (WGBS), a large amount of WGBS data and methylation-related knowledge has
accumulated. In contrast to data from array-based techniques such as Infinium 27K, 450K and EPIC,
large WGBS datasets remain cumbersome to process because of their resource requirements. As a result,
most single-base resolution methylation databases are no longer available or have not been updated
for years. A dedicated database or platform is therefore essential for deeply integrating these WGBS
data.
3. Our Mission
Here we present MethBank, a comprehensive database of single-base resolution DNA methylation
profiles across a variety of species. As one of the core resources of the National Genomics Data
Center (NGDC), Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS) & China National
Center for Bioinformation (CNCB), MethBank integrates 1271 high-quality (>20X) whole-genome
single-base methylomes for 19 species, covering 179 tissues/cell lines and 13 biological contexts.
It also provides manually curated knowledge, comprising featured differentially methylated genes
(DMGs) across 11 kinds of biological contexts, such as disease, and a collection of methylation tools.
In addition, MethBank offers analysis tools, including Age Predictor, IDMP and the DMR toolkit, to
support biologists' research.
High-quality raw sequencing data of WGBS samples are acquired from publicly accessible data
repositories (SRA/GSA).
Candidate datasets were retrieved according to the following filtering rules:
(1) The status of data resource is open-access
(2) [DataSet Type] = "methylation profiling by high throughput sequencing"
(3) [All Fields] = "WGBS" or "BS-Seq" or "Whole Genome Bisulfite Sequencing"
(4) The predicted sequencing depth of the WGBS sample is greater than 10X
Datasets that pass this initial screening are then manually curated.
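The four screening rules above can be expressed as a single predicate. The following is a minimal sketch only: the `Candidate` record, its field names and the example accessions are hypothetical and do not reflect MethBank's actual schema or tooling.

```python
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative assumptions,
# not the schema used by MethBank, SRA or GSA.
@dataclass
class Candidate:
    accession: str
    open_access: bool       # rule (1)
    dataset_type: str       # rule (2)
    all_fields: str         # free text of the repository record, rule (3)
    predicted_depth: float  # rule (4), in X

DATASET_TYPE = "methylation profiling by high throughput sequencing"
KEYWORDS = ("WGBS", "BS-Seq", "Whole Genome Bisulfite Sequencing")

def passes_initial_screen(c: Candidate) -> bool:
    """Return True if a record satisfies all four filtering rules."""
    text = c.all_fields.lower()
    return (
        c.open_access
        and c.dataset_type.lower() == DATASET_TYPE
        and any(k.lower() in text for k in KEYWORDS)
        and c.predicted_depth > 10
    )

# Made-up records, for illustration only.
candidates = [
    Candidate("EXAMPLE_1", True, DATASET_TYPE, "WGBS of liver tissue", 25.0),
    Candidate("EXAMPLE_2", True, DATASET_TYPE, "WGBS of liver tissue", 8.0),
    Candidate("EXAMPLE_3", False, DATASET_TYPE, "BS-Seq of brain", 30.0),
]
selected = [c.accession for c in candidates if passes_initial_screen(c)]
print(selected)
```

Records that pass this automatic screen would still go through the manual curation step described above.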
The WGBS data analysis pipeline includes quality control, trimming of adapters and low-quality bases,
alignment to the reference genome, extraction of CpG methylation, quantification of average gene and
CpG island methylation levels, analysis of highly methylated CpG islands (genomic location
enrichment, GO & KEGG enrichment), analysis of genes related to methylated CpG islands, and analysis
of differentially methylated regions (DMRs; genomic location enrichment, GO & KEGG enrichment).
First, all bisulfite sequencing reads were subjected to quality control with FastQC v0.11.7 and
trimmed to remove adapters and low-quality bases using Fastq-dump (sratoolkit.2.8.2-1). Next, the
reads that passed quality control were mapped to the reference genome of the corresponding species
using Bismark-0.22.3. Detailed reference genome information for each species can be found and
downloaded in the Downloads interface. We used the Bismark methylation extractor to extract
methylation data from the aligned, filtered reads. To assess the quality of the data, we computed six
indicators: Mapping rate, Unique mapping rate, Genome coverage, C coverage, Conversion rate and
Depth. The corresponding calculations are shown below.
Mapping rate: calculated from the *_PE_report.txt file generated by Bismark as
(Sequence pairs did not map uniquely + Number of paired/single-end alignments with a unique best hit) / Sequence pairs analysed in total
Unique mapping rate: calculated from the *_PE_report.txt file generated by Bismark as
Number of paired/single-end alignments with a unique best hit / Sequence pairs analysed in total
Depth: calculated by Samtools 1.9 using the sort module
Conversion rate: calculated from the *.bismark.cov file generated by Bismark
Genome coverage: number of bases covered by the sample / total number of bases in the reference genome
C coverage: number of C bases covered by the sample / number of C bases in the reference genome
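As an illustration of the two mapping-rate definitions above, the sketch below reads the counters from the text of a Bismark paired-end report. It is not the MethBank pipeline code: the function names are hypothetical, the report excerpt is made up, and the counter labels are assumed to match those written by the Bismark version in use, so they should be checked against an actual *_PE_report.txt file.

```python
def parse_bismark_report(text: str) -> dict:
    """Collect integer counters of the form 'label:<tab>value' from a report."""
    counts = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips percentages and header lines
            counts[label.strip()] = int(value)
    return counts

def mapping_rates(counts: dict) -> tuple:
    """Return (mapping rate, unique mapping rate) for a paired-end report."""
    total = counts["Sequence pairs analysed in total"]
    unique = counts["Number of paired-end alignments with a unique best hit"]
    ambiguous = counts["Sequence pairs did not map uniquely"]
    return (unique + ambiguous) / total, unique / total

# Made-up excerpt in the layout assumed above (tab-separated values).
example_report = (
    "Sequence pairs analysed in total:\t1000\n"
    "Number of paired-end alignments with a unique best hit:\t700\n"
    "Mapping efficiency:\t70.0%\n"
    "Sequence pairs with no alignments under any condition:\t200\n"
    "Sequence pairs did not map uniquely:\t100\n"
)

mapping_rate, unique_rate = mapping_rates(parse_bismark_report(example_report))
print(f"Mapping rate: {mapping_rate:.3f}, unique mapping rate: {unique_rate:.3f}")
```

Single-end reports use different counter labels, so this sketch covers the paired-end case only.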
Then, bedtools v2.17.0 and python3 were used to analyze gene methylation profiles in promoter, gene body and downstream regions in the C/CG/CH contexts. Subsequently, the relationship between CpG islands and genes was studied. From the perspective of CpG islands, we calculated the DNA methylation level of each CpG island and selected highly methylated CpG islands (average methylation level >= 0.6) for downstream analysis, including genomic location enrichment and GO & KEGG enrichment. From the perspective of genes, we provide all CpG islands that overlap with genes, together with the corresponding location information.
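The selection of highly methylated CpG islands can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline's implementation: it assumes that per-CpG read counts have already been assigned to islands (for example by intersecting coverage output with a CpG island annotation), and it takes the island level to be the mean of the per-CpG methylation levels, which is only one possible definition of "average methylation level". All names and values are hypothetical.

```python
from collections import defaultdict

THRESHOLD = 0.6  # cutoff for "highly methylated" stated in the text

def island_methylation(sites):
    """Average per-CpG methylation level for each island.

    sites: iterable of (island_id, methylated_reads, unmethylated_reads).
    CpGs without any coverage are ignored.
    """
    levels = defaultdict(list)
    for island, meth, unmeth in sites:
        covered = meth + unmeth
        if covered > 0:
            levels[island].append(meth / covered)
    return {island: sum(v) / len(v) for island, v in levels.items()}

def highly_methylated(avg_levels, threshold=THRESHOLD):
    """Return the sorted IDs of islands at or above the threshold."""
    return sorted(i for i, lvl in avg_levels.items() if lvl >= threshold)

# Made-up counts for three hypothetical islands.
sites = [
    ("CGI_1", 9, 1), ("CGI_1", 7, 3),
    ("CGI_2", 1, 9), ("CGI_2", 2, 8),
    ("CGI_3", 6, 4), ("CGI_3", 0, 0),
]
avg_levels = island_methylation(sites)
print(highly_methylated(avg_levels))
```

Pooling read counts across all CpGs of an island before dividing is an equally plausible definition and can give different values when coverage is uneven.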
Finally, we identified differentially methylated regions (DMRs) with the DSS R package for the typical biological contexts within a single project, and analyzed their genomic location enrichment and GO & KEGG enrichment.
The manual curation of metadata is done on two levels ('Project' and
'Sample'). The corresponding review contents and standards are as follows:
Curation model on the 'Project' level
Items | Description | Value
Data Resource | Controlled vocabulary | NGDC, NCBI
Project ID | Accession number of the project from data resource | CRA, GSE, HRA, SRP
Sample Number | Number of samples included in the project of MethBank | 1, 2, 3, ...
BioProject ID | Accession number of the BioProject from data resource | PRJCA, PRJNA
Title | Title of each project from data resource | Conclusion term
Summary | A summary description of the project | Conclusion term
Overall Design | Experiment design of the project | Conclusion term
Related Biological Process | Controlled vocabulary | Age, Health, Disease etc.
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Brassica napus etc.
Tissue/Cell Line | Controlled vocabulary | Liver, Pancreas, Brain etc.
Cell Type | Controlled vocabulary | HeLa-S3 Cell, K-562 Cell, Hep-G2 Cell etc.
Healthy Condition | Controlled vocabulary | Normal, Healthy, Head and neck squamous cell carcinoma etc.
Development Stage | Controlled vocabulary | Embryo, Gamete, Adult etc.
Disease State | Controlled vocabulary | T2bN0M0, T3bN0M0, T2cN0M0 etc.
Submitter | Submitter of the project | Lab info, Submitter name
Year | Submit year of the project | Feb 15, 2019 etc.
Publication | Publication information related to the project | Author, article name, journal name, date, doi etc.
PMID | PubMed ID of the publication related to the project | PubMed ID
Status | Public date of the project | Public on Jul 03, 2013 etc.
Submission Date | Submission date of the project in data resource | Apr 29, 2020 etc.
Last Update Date | Last update date of the project in data resource | Apr 30, 2020 etc.
Curation model on the 'Sample' level
Items | Description | Value
Basic Information
Sample Name | Name of the sample from data resource | Conclusion term
Data Resource | Controlled vocabulary | NGDC, NCBI
Description | Description of the sample from data resource | Conclusion term
BioProject ID | Accession number of the BioProject from data resource | PRJNA, PRJCA
Project ID | Accession number of the project from data resource | GSE, SRP, HRA, CRA
Experiment ID | Accession number of the experiment sample in data resource | SRX, HRX, CRX
Status | Public date of the sample | Public on Jul 03, 2013 etc.
Submission Date | Submission date of the sample in data resource | Apr 29, 2020 etc.
Last Update Date | Last update date of the sample in data resource | Apr 30, 2020 etc.
Donor ID | Donor ID of the sample in data resource | ENCBS
Sample Characteristic
Species | Controlled vocabulary | Homo sapiens, Mus musculus, Brassica napus etc.
Tissue/Cell Line | Controlled vocabulary | Liver, Pancreas, Brain etc.
Cell Type | Controlled vocabulary | HeLa-S3 Cell, K-562 Cell, Hep-G2 Cell etc.
Source Name | Name of the sample group | Conclusion term
Strain | Controlled vocabulary | Strain means variants of plants, viruses or bacteria; or an inbred animal used for experimental purposes
Breed | Controlled vocabulary | Breed means a specific group of domestic animals having homogeneous appearance (phenotype), homogeneous behavior, and/or other characteristics that distinguish it from other organisms of the same species.
Cultivar | Controlled vocabulary | Cultivar means a type of plant that people have bred for desired traits, which are reproduced in each new generation by a method such as grafting, tissue culture, or carefully controlled seed production.
Sex | Controlled vocabulary | Male, Female, Pooled male and female
Age | The age/stage of samples | 1, 2, 3, ... days/weeks/years etc.
Biological Condition
Healthy Condition | Controlled vocabulary | Normal, Healthy, Head and neck squamous cell carcinoma etc.
Disease State | Controlled vocabulary | T2bN0M0, T3bN0M0, T2cN0M0 etc.
Development Stage | Controlled vocabulary | Embryo, Gamete, Adult etc.
Genotype | Controlled vocabulary | Genotype of the sample means its complete set of genetic material
Treatment | Brief description of the sample treatment state | Conclusion term
Knockout | Brief description of the sample knockout state | Conclusion term
Resistance Phenotype | Brief description of the sample resistance phenotype state | Conclusion term
Protocol
Growth Protocol | Culture protocols of cells from samples or cell lines | Conclusion term
Treatment Protocol | Protocols of sample treatment | Conclusion term
Extraction Protocol | Protocols of WGBS extraction | Conclusion term
Construction Protocol | Protocols of WGBS library construction | Conclusion term
Library Strategy | Controlled vocabulary | Bisulfite-Seq
Library Source | Controlled vocabulary | GENOMIC
Library Selection | Controlled vocabulary | RANDOM, Other etc.
Layout | Controlled vocabulary | PAIRED, SINGLE
Platform | Controlled vocabulary | Illumina HiSeq 2500, HiSeq X Ten etc.
Assessing Quality
Mapping rate | Calculated data | (Sequence pairs did not map uniquely + Number of paired/single-end alignments with a unique best hit) / Sequence pairs analysed in total
Unique mapping rate | Calculated data | Number of paired/single-end alignments with a unique best hit / Sequence pairs analysed in total
Genome Coverage | Calculated data | Number of bases covered by the sample / total number of bases in the reference genome
C Coverage | Calculated data | Number of C bases covered by the sample / number of C bases in the reference genome
Conversion Rate | Calculated data | Proportion of Cs converted to Ts
Depth | Calculated data | Total number of mapped reads * average read length / total bases of the reference genome
Analysis
Reference Genome | Reference genome version | .fa/.fasta/.fna format file
Genome Annotation | Genome annotation version | .gff/.gtf/.gff3 format file
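The Depth, coverage and Conversion Rate entries in the 'Assessing Quality' rows above are simple ratios. The sketch below shows the arithmetic only; the function names and all input numbers are hypothetical, and how the input counts are obtained (for example, which cytosines are counted for the conversion rate) is left open because the text does not specify it.

```python
def depth(mapped_reads: int, avg_read_length: float, genome_size: int) -> float:
    """Total mapped reads * average read length / total bases of the reference."""
    return mapped_reads * avg_read_length / genome_size

def coverage(covered_bases: int, reference_bases: int) -> float:
    """Fraction of reference bases (or reference C bases) covered by the sample."""
    return covered_bases / reference_bases

def conversion_rate(converted_c: int, unconverted_c: int) -> float:
    """Proportion of Cs converted to Ts among the Cs that were counted."""
    return converted_c / (converted_c + unconverted_c)

# Made-up inputs for illustration only.
print(depth(200_000_000, 150, 3_000_000_000))  # sequencing depth in X
print(coverage(2_700_000_000, 3_000_000_000))  # genome coverage
print(conversion_rate(995, 5))                 # bisulfite conversion rate
```

The same `coverage` function serves both Genome Coverage and C Coverage; only the counted bases differ.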