DMR Toolkit
DMR toolkit is a pipline package for DMRs identification, annotation and enrichment for multiple species. Step-by-step example of its use is provided in Documentation.
1. Prerequisite software and packages
DMR toolkit requires the installation of R (version 4.0.5) and bedtools (version 2.28) to run locally. DMR toolkit will work properly only when both R and bedtools software are set up correctly and added to the working path.
Identification of differentially methylated regions (DMRs)
Downstream Analysis (Annotation, GO & KEGG enrichment)
# download the package named createKEGGdb
remotes::install_github("YuLab-SMU/createKEGGdb")
# download and package multiple KEGG data of multiple species
library(createKEGGdb)
species <-c("aml","cfa","bna","zma","mmu","mcc","mcf","ptr","ggo","gga","bta","ssc","dre","gmx","mesc","osa","pvu","sly","hsa")
create_kegg_db(species)
# install the package
install.packages("./KEGG.db_1.0.tar.gz", repos=NULL)
library(KEGG.db)
3. Usage and options summary
3.1 Recommended file structure
DMR toolkit requires that the user creates the folder named process and species_name with the structure described below. Case group file and control group file should be moved to this path by user. Subsequently, running the program under the species_name folder will generate the corresponding output files. The following directory structure is recommended.
DMR_toolkit
|----callDMR.sh|----example
|----script
|----txdb
|----process (made by users)
|----Homo_sapiens (made by users)
|----case_file (made by users)
|----control_file (made by users)
|----result (made by DMRtoolkit)
|----0.01
|----DMR_result_name_DMR_0.01.txt
|----DMR_result_name_DMR_0.01_Anno.txt
|----DMR_result_name_DMR_0.01_GO.txt
|----DMR_result_name_DMR_0.01_KEGG.txt
|----0.05
|----DMR_result_name_DMR_0.05.txt
|----DMR_result_name_DMR_0.05_Anno.txt
|----DMR_result_name_DMR_0.05_GO.txt
|----DMR_result_name_DMR_0.05_KEGG.txt
3.2 Typical command
sh ../../callDMR.sh [options] species_name case_file control_file DMR_result_name type pvalue_cutoff padj_cutoff input_file_type
3.3 Command parameters
Parameter | Type | Description | Example |
---|---|---|---|
--species_name | String | Species of processed sample data (see the controlled vocabulary below) |
Homo_sapiens |
--case_file | String | Input file of case sample | Hom_Tumor_Liver_CG_result_chr22.txt |
--control_file | String | Input file of control sample | Hom_Normal_Liver_CG_result_chr22.txt |
--DMR_result_name | String | Name of output file | TumorLiverVSNormalLiver |
--type | String | Methylation site type | CG or non-CG |
--pvalue_cutoff | Float | Enrichment analysis cutoff of p value | 0.05 |
--padj_cutoff | Float | Enrichment analysis cutoff of q value (p_adjust value) | 0.1 |
input_file_type | String | Bismark output file format or DSS input file format | bismark or dss |
Controlled vocabulary | Common name | Taxonomy ID | Version |
---|---|---|---|
Ailuropoda_melanoleuca | Giant panda | 9646 | ASM200744v2 |
Arabidopsis_thaliana | thale cress | 3702 | TAIR10.1 |
Bos_taurus | Cattle | 9913 | ARS-UCD1.2.105 |
Brassica_napus | Rape | 3708 | GCF_000686985.2_Bra_napus_v2.0 |
Canis_lupus_familiaris | Dog | 9615 | CanFam3.1 |
Danio_rerio | Zebrafish | 7955 | Zv9 |
Gallus_gallus | Chicken | 9031 | GRCg6a |
Glycine_max | Soybean | 3847 | Gmax_275_v2.0 |
Gorilla_gorilla | Western gorilla | 9593 | gorGor4 |
Homo_sapiens | Human | 9606 | GRCh38 (hg38) |
Macaca_mulatta | Rhesus monkey | 9544 | Mmul_10.105 |
Mus_musculus | House mouse | 10090 | GRCm38 (mm10) |
Oryza_sativa | Asian cultivated rice | 9940 | Oar_v3.1.101 |
Ovis_aries | Sheep | 4530 | IRGSP-1.0 |
Pan_troglodytes | Chimpanzee | 9598 | Pan_tro_3.0 |
Populus_trichocarpa | black cottonwood | 3694 | P.trichocarpa_v4.1 |
Rattus_norvegicus | Norway rat | 10116 | Rnor_6.0.101 |
Salmo_salar | Atlantic salmon | 8030 | Ssal_v3.1 |
Solanum_lycopersicum | Tomato | 4081 | GCF_000188115.3_SL2.50 |
Sus_scrofa | Pig | 9823 | Sscrofa11.1 |
Xenopus_laevis | African clawed frog | 8355 | xenlae2 |
Zea_mays | Maize, Corn | 4577 | GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0 |
3.4 Input file format
The input files ("case_file" & " control_file") contain whole genome cytosine methylation patterns. Two formats of input files are supported by DMR toolkit, one is the format of *.CX_report files which is the output by bismark (BED format), and the other is the format required by DSS.
The program can perform the corresponding downstream analysis by judging the "input_file_type" parameter, and decide which format the input file is in.
The format of *.CX_report files output by Bismark (BED format) should be exactly same like following:
chr | pos | strand | methy_num | unmethy_num | type | base |
---|---|---|---|---|---|---|
1 | 4 | + | 0 | 0 | CHH | CAC |
The meaning of these properties is listed below:
chr | Name of the chromosome (support "1" format or "chr1" format) |
pos | Cytosine chromosomal coordinates (0-based) |
strand | "+" means forward strand and "-" means reverse strand |
methy_num | Total number of reads that support a methylated cytosine at this position |
unmethy_num | Total number of reads that support a unmethylated cytosine at this position |
type | Methylation site type, one of the following [CG, CHG, CHH] |
base | Trinucleotide context |
The format required by DSS should be exactly same like following:
chr | pos | N | X |
---|---|---|---|
chr22 | 10510235 | 5 | 3 |
chr22 | 10510275 | 4 | 1 |
chr22 | 10510284 | 4 | 0 |
The meanings of these properties are listed below:
chr | Name of the chromosome |
pos | Cytosine chromosomal coordinates (0-based) |
N | Total number of reads mapped to the cytosine position |
X | Total number of reads that support a methylated cytosine at this position |
4. Result Analysis
The names of final output file are
"DMR_result_name"_DMR_0.01.txt
"DMR_result_name"_DMR_0.01_Anno.tsv
"DMR_result_name"_DMR_0.01_GO.tsv
"DMR_result_name"_DMR_0.01_KEGG.tsv
The example of output files is shown below:
The example of output files is shown below:
4.1 Output of DMRs identification
chr | start | end | length | nCG | meanMethy1 | meanMethy2 | diff.Methy |
---|---|---|---|---|---|---|---|
chr22 | 48575046 | 48577135 | 2090 | 633 | 0.395725065064858 | 0.869503399366761 | -0.473778334301903 |
The meaning of these properties is listed below:
chr | Name of the chromosome |
start | Start position of the differentially methylated region (DMR) |
end | End position of the differentially methylated region (DMR) |
length | Length of the differentially methylated region (DMR) |
mCG | mCG number contained in the differentially methylated region (DMR) |
meanMethy1 | Average methylation level of the differentially methylated region (DMR) in case sample |
meanMethy2 | Average methylation level of the differentially methylated region (DMR) in control sample |
diff.Methy | Differentially methylated value of the differentially methylated region (DMR) diff.Methy = meanMethy1 - meanMethy2 |
4.2 Output of annotation
chr | start | end | genomic_location | gene_id | gene_name | length | mCG | diff.Methy | meanMethy1 | meanMethy2 |
---|---|---|---|---|---|---|---|---|---|---|
chr22 | 15890181 | 15890416 | Intron | ENSG00000232775 | BMS1P22 | 235 | 6 | 0.6074933 | 0.8716893 | 0.2641960 |
All columns of DMRs identification output file are contained in annotation output file. Only the non-repeating properties are listed in the following table.
genomic_location | Genomic feature of the differentially methylated region (DMR) |
gene_id | Gene symbol of the nearest gene |
gene_name | Gene symbol of the nearest gene (If the gene does not have a corresponding name then NA is dispalyed) |
4.3 Output of GO enrichment
The output file of GO enrichment shows in the format of the table below. If the corresponding term is not enriched, the content of the file is displayed as "no term is enriched".
ONTOLOGY | ID | Description | GeneRatio | BgRatio | pvalue | padjust | qvalue | geneID | Count |
---|---|---|---|---|---|---|---|---|---|
BP | GO:1901748 | metabolic process | 1/346 | 10/18866 | 7.194e-09 | 1.24e-05 | 1.22e-05 | GGT1 | 1 |
The meaning of these properties is listed below:
ONTOLOGY | "BP" subontologies |
ID | Unique tagging ID in the Gene Ontology knowledgebase |
Description | Descriptive information |
GeneRatio | Ratio of the number of genes associated with this term to the total number of all genes |
BgRatio | Ratio of the number of genes associated with this term to the total number of all genes (Bg type) |
pvalue | p value |
qvalue | q value |
geneID | Gene id |
Count | Number of genes associated with this term |
4.4 Output of KEGG enrichment
The output file of KEGG enrichment shows like the table below. If the corresponding term is not enriched, the content of the file is displayed as "no KEGG term enriched".
ID | Description | GeneRatio | BgRatio | pvalue | padjust | qvalue | geneID | Count |
---|---|---|---|---|---|---|---|---|
hsa05170 | Human immunodeficiency virus 1 infection | 6/346 | 10/18866 | 7.194e-09 | 1.24e-05 | 1.22e-05 | GGT1/GGT5/GGT2/GGTLC2/GGT3P/GGTLC3 | 16 |
The meaning of these properties is listed below:
ID | Unique tagging ID in the KEGG Ontology knowledgebase |
Description | Descriptive information |
GeneRatio | Ratio of the number of genes associated with this term to the total number of all genes |
BgRatio | Ratio of the number of genes associated with this term to the total number of all genes (Bg type) |
pvalue | p value |
qvalue | q value |
geneID | Gene id |
Count | Number of genes associated with this term |