MethBank
A database of DNA methylation across a variety of species

MethBank

A database of DNA methylation across a variety of species
DMR Toolkit
DMR toolkit is a pipline package for DMRs identification, annotation and enrichment for multiple species. Step-by-step example of its use is provided in Documentation.

1. Prerequisite software and packages

DMR toolkit requires the installation of R (version 4.0.5) and bedtools (version 2.28) to run locally. DMR toolkit will work properly only when both R and bedtools software are set up correctly and added to the working path.
Identification of differentially methylated regions (DMRs)
Downstream Analysis (Annotation, GO & KEGG enrichment)
When using clusterprofiler package to do KEGG analysis for some samples, the package may report errors. You can generate your KEGG.db package with the following steps:
# download the package named createKEGGdb
remotes::install_github("YuLab-SMU/createKEGGdb")

# download and package multiple KEGG data of multiple species
library(createKEGGdb)
species <-c("aml","cfa","bna","zma","mmu","mcc","mcf","ptr","ggo","gga","bta","ssc","dre","gmx","mesc","osa","pvu","sly","hsa")
create_kegg_db(species)

# install the package
install.packages("./KEGG.db_1.0.tar.gz", repos=NULL)
library(KEGG.db)

2.Download and install

DMRtoolkit_v1.0.tar.gz
tar -xzf DMRtoolkit_v1.0.tar.gz

3. Usage and options summary

3.1 Recommended file structure

DMR toolkit requires that the user creates the folder named process and species_name with the structure described below. Case group file and control group file should be moved to this path by user. Subsequently, running the program under the species_name folder will generate the corresponding output files. The following directory structure is recommended.

DMR_toolkit

|----callDMR.sh
|----example
|----script
|----txdb
|----process   (made by users)
|----Homo_sapiens    (made by users)
|----case_file    (made by users)
|----control_file   (made by users)
|----result (made by DMRtoolkit)
|----0.01
|----DMR_result_name_DMR_0.01.txt
|----DMR_result_name_DMR_0.01_Anno.txt
|----DMR_result_name_DMR_0.01_GO.txt
|----DMR_result_name_DMR_0.01_KEGG.txt
|----0.05
|----DMR_result_name_DMR_0.05.txt
|----DMR_result_name_DMR_0.05_Anno.txt
|----DMR_result_name_DMR_0.05_GO.txt
|----DMR_result_name_DMR_0.05_KEGG.txt

3.2 Typical command

sh ../../callDMR.sh [options] species_name case_file control_file DMR_result_name type pvalue_cutoff padj_cutoff input_file_type

3.3 Command parameters

Parameter Type Description Example
--species_name String Species of processed sample data
(see the controlled vocabulary below)
Homo_sapiens
--case_file String Input file of case sample Hom_Tumor_Liver_CG_result_chr22.txt
--control_file String Input file of control sample Hom_Normal_Liver_CG_result_chr22.txt
--DMR_result_name String Name of output file TumorLiverVSNormalLiver
--type String Methylation site type CG or non-CG
--pvalue_cutoff Float Enrichment analysis cutoff of p value 0.05
--padj_cutoff Float Enrichment analysis cutoff of q value (p_adjust value) 0.1
input_file_type String Bismark output file format or DSS input file format bismark or dss
Note: Controlled vocabulary of species name is shown below.
Controlled vocabulary Common name Taxonomy ID Version
Ailuropoda_melanoleuca Giant panda 9646 ASM200744v2
Arabidopsis_thaliana thale cress 3702 TAIR10.1
Bos_taurus Cattle 9913 ARS-UCD1.2.105
Brassica_napus Rape 3708 GCF_000686985.2_Bra_napus_v2.0
Canis_lupus_familiaris Dog 9615 CanFam3.1
Danio_rerio Zebrafish 7955 Zv9
Gallus_gallus Chicken 9031 GRCg6a
Glycine_max Soybean 3847 Gmax_275_v2.0
Gorilla_gorilla Western gorilla 9593 gorGor4
Homo_sapiens Human 9606 GRCh38 (hg38)
Macaca_mulatta Rhesus monkey 9544 Mmul_10.105
Mus_musculus House mouse 10090 GRCm38 (mm10)
Oryza_sativa Asian cultivated rice 9940 Oar_v3.1.101
Ovis_aries Sheep 4530 IRGSP-1.0
Pan_troglodytes Chimpanzee 9598 Pan_tro_3.0
Populus_trichocarpa black cottonwood 3694 P.trichocarpa_v4.1
Rattus_norvegicus Norway rat 10116 Rnor_6.0.101
Salmo_salar Atlantic salmon 8030 Ssal_v3.1
Solanum_lycopersicum Tomato 4081 GCF_000188115.3_SL2.50
Sus_scrofa Pig 9823 Sscrofa11.1
Xenopus_laevis African clawed frog 8355 xenlae2
Zea_mays Maize, Corn 4577 GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0

3.4 Input file format

The input files ("case_file" & " control_file") contain whole genome cytosine methylation patterns. Two formats of input files are supported by DMR toolkit, one is the format of *.CX_report files which is the output by bismark (BED format), and the other is the format required by DSS. The program can perform the corresponding downstream analysis by judging the "input_file_type" parameter, and decide which format the input file is in.

The format of *.CX_report files output by Bismark (BED format) should be exactly same like following:
chr pos strand methy_num unmethy_num type base
1 4 + 0 0 CHH CAC
The meaning of these properties is listed below:
chr Name of the chromosome (support "1" format or "chr1" format)
pos Cytosine chromosomal coordinates (0-based)
strand "+" means forward strand and "-" means reverse strand
methy_num Total number of reads that support a methylated cytosine at this position
unmethy_num Total number of reads that support a unmethylated cytosine at this position
type Methylation site type, one of the following [CG, CHG, CHH]
base Trinucleotide context
The format required by DSS should be exactly same like following:
chr pos N X
chr22 10510235 5 3
chr22 10510275 4 1
chr22 10510284 4 0
The meanings of these properties are listed below:
chr Name of the chromosome
pos Cytosine chromosomal coordinates (0-based)
N Total number of reads mapped to the cytosine position
X Total number of reads that support a methylated cytosine at this position
Note: The header line should be saved in both the case and control group file, otherwise DMR toolkit will report an error.

4. Result Analysis

The names of final output file are
  • "DMR_result_name"_DMR_0.01.txt
  • "DMR_result_name"_DMR_0.01_Anno.tsv
  • "DMR_result_name"_DMR_0.01_GO.tsv
  • "DMR_result_name"_DMR_0.01_KEGG.tsv

  • The example of output files is shown below:

    4.1 Output of DMRs identification

    chr start end length nCG meanMethy1 meanMethy2 diff.Methy
    chr22 48575046 48577135 2090 633 0.395725065064858 0.869503399366761 -0.473778334301903
    The meaning of these properties is listed below:
    chr Name of the chromosome
    start Start position of the differentially methylated region (DMR)
    end End position of the differentially methylated region (DMR)
    length Length of the differentially methylated region (DMR)
    mCG mCG number contained in the differentially methylated region (DMR)
    meanMethy1 Average methylation level of the differentially methylated region (DMR) in case sample
    meanMethy2 Average methylation level of the differentially methylated region (DMR) in control sample
    diff.Methy Differentially methylated value of the differentially methylated region (DMR)
    diff.Methy = meanMethy1 - meanMethy2

    4.2 Output of annotation

    chr start end genomic_location gene_id gene_name length mCG diff.Methy meanMethy1 meanMethy2
    chr22 15890181 15890416 Intron ENSG00000232775 BMS1P22 235 6 0.6074933 0.8716893 0.2641960
    All columns of DMRs identification output file are contained in annotation output file. Only the non-repeating properties are listed in the following table.
    genomic_location Genomic feature of the differentially methylated region (DMR)
    gene_id Gene symbol of the nearest gene
    gene_name Gene symbol of the nearest gene (If the gene does not have a corresponding name then NA is dispalyed)

    4.3 Output of GO enrichment

    The output file of GO enrichment shows in the format of the table below. If the corresponding term is not enriched, the content of the file is displayed as "no term is enriched".
    ONTOLOGY ID Description GeneRatio BgRatio pvalue padjust qvalue geneID Count
    BP GO:1901748 metabolic process 1/346 10/18866 7.194e-09 1.24e-05 1.22e-05 GGT1 1
    The meaning of these properties is listed below:
    ONTOLOGY "BP" subontologies
    ID Unique tagging ID in the Gene Ontology knowledgebase
    Description Descriptive information
    GeneRatio Ratio of the number of genes associated with this term to the total number of all genes
    BgRatio Ratio of the number of genes associated with this term to the total number of all genes (Bg type)
    pvalue p value
    qvalue q value
    geneID Gene id
    Count Number of genes associated with this term

    4.4 Output of KEGG enrichment

    The output file of KEGG enrichment shows like the table below. If the corresponding term is not enriched, the content of the file is displayed as "no KEGG term enriched".
    ID Description GeneRatio BgRatio pvalue padjust qvalue geneID Count
    hsa05170 Human immunodeficiency virus 1 infection 6/346 10/18866 7.194e-09 1.24e-05 1.22e-05 GGT1/GGT5/GGT2/GGTLC2/GGT3P/GGTLC3 16
    The meaning of these properties is listed below:
    ID Unique tagging ID in the KEGG Ontology knowledgebase
    Description Descriptive information
    GeneRatio Ratio of the number of genes associated with this term to the total number of all genes
    BgRatio Ratio of the number of genes associated with this term to the total number of all genes (Bg type)
    pvalue p value
    qvalue q value
    geneID Gene id
    Count Number of genes associated with this term