DMR Toolkit - MethBank

1. Prerequisite software and packages

DMR toolkit requires the installation of R (version 4.0.5) and bedtools (version 2.28) to run locally. DMR toolkit will work properly only when both R and bedtools software are set up correctly and added to the working path.

Identification of differentially methylated regions (DMRs)

-- DSS 2.42.0

-- getopt 1.20.3

-- bsseq 1.30.0

Downstream Analysis (Annotation, GO & KEGG enrichment)

-- clusterProfiler 3.18.1

-- tidyverse 1.3.1

-- data.table 1.14.2

-- enrichplot 1.10.2

-- topGO 2.42.0

-- KEGG.db 1.0

-- DO.db 2.9

-- DOSE 3.16.0

-- Rgraphviz 2.34.0

-- ChIPseeker 1.26.2

When using clusterprofiler package to do KEGG analysis for some samples, the package may report errors. You can generate your KEGG.db package with the following steps:


                            # download the package named createKEGGdb

                            remotes::install_github("YuLab-SMU/createKEGGdb")

                            

                            # download and package multiple KEGG data of multiple species

                            library(createKEGGdb)

                            species <-c("aml","cfa","bna","zma","mmu","mcc","mcf","ptr","ggo","gga","bta","ssc","dre","gmx","mesc","osa","pvu","sly","hsa")

                            create_kegg_db(species)

                            

                            # install the package

                            install.packages("./KEGG.db_1.0.tar.gz", repos=NULL)

                            library(KEGG.db)

2.Download and install

Download: DMRtoolkit_v1.0.tar.gz

Install:


                                tar -xzf DMRtoolkit_v1.0.tar.gz

3. Usage and options summary

3.1 Recommended file structure

DMR toolkit requires that the user creates the folder named process and species_name with the structure described below. Case group file and control group file should be moved to this path by user. Subsequently, running the program under the species_name folder will generate the corresponding output files. The following directory structure is recommended.

DMR_toolkit

3.2 Typical command


                                        sh ../../callDMR.sh [options] species_name case_file control_file DMR_result_name type pvalue_cutoff padj_cutoff input_file_type

3.3 Command parameters

Parameter	Type	Description	Example
--species_name	String	Species of processed sample data (see the controlled vocabulary below)	Homo_sapiens
--case_file	String	Input file of case sample	Hom_Tumor_Liver_CG_result_chr22.txt
--control_file	String	Input file of control sample	Hom_Normal_Liver_CG_result_chr22.txt
--DMR_result_name	String	Name of output file	TumorLiverVSNormalLiver
--type	String	Methylation site type	CG or non-CG
--pvalue_cutoff	Float	Enrichment analysis cutoff of p value	0.05
--padj_cutoff	Float	Enrichment analysis cutoff of q value (p_adjust value)	0.1
input_file_type	String	Bismark output file format or DSS input file format	bismark or dss

Note: Controlled vocabulary of species name is shown below.

Controlled vocabulary	Common name	Taxonomy ID	Version
Ailuropoda_melanoleuca	Giant panda	9646	ASM200744v2
Arabidopsis_thaliana	thale cress	3702	TAIR10.1
Bos_taurus	Cattle	9913	ARS-UCD1.2.105
Brassica_napus	Rape	3708	GCF_000686985.2_Bra_napus_v2.0
Canis_lupus_familiaris	Dog	9615	CanFam3.1
Danio_rerio	Zebrafish	7955	Zv9
Gallus_gallus	Chicken	9031	GRCg6a
Glycine_max	Soybean	3847	Gmax_275_v2.0
Gorilla_gorilla	Western gorilla	9593	gorGor4
Homo_sapiens	Human	9606	GRCh38 (hg38)
Macaca_mulatta	Rhesus monkey	9544	Mmul_10.105
Mus_musculus	House mouse	10090	GRCm38 (mm10)
Oryza_sativa	Asian cultivated rice	9940	Oar_v3.1.101
Ovis_aries	Sheep	4530	IRGSP-1.0
Pan_troglodytes	Chimpanzee	9598	Pan_tro_3.0
Populus_trichocarpa	black cottonwood	3694	P.trichocarpa_v4.1
Rattus_norvegicus	Norway rat	10116	Rnor_6.0.101
Salmo_salar	Atlantic salmon	8030	Ssal_v3.1
Solanum_lycopersicum	Tomato	4081	GCF_000188115.3_SL2.50
Sus_scrofa	Pig	9823	Sscrofa11.1
Xenopus_laevis	African clawed frog	8355	xenlae2
Zea_mays	Maize, Corn	4577	GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0

3.4 Input file format

The input files ("case_file" & " control_file") contain whole genome cytosine methylation patterns. Two formats of input files are supported by DMR toolkit, one is the format of *.CX_report files which is the output by bismark (BED format), and the other is the format required by DSS. The program can perform the corresponding downstream analysis by judging the "input_file_type" parameter, and decide which format the input file is in.

The format of *.CX_report files output by Bismark (BED format) should be exactly same like following:

chr	pos	strand	methy_num	unmethy_num	type	base
1	4	+	0	0	CHH	CAC

The meaning of these properties is listed below:

chr	Name of the chromosome (support "1" format or "chr1" format)
pos	Cytosine chromosomal coordinates (0-based)
strand	"+" means forward strand and "-" means reverse strand
methy_num	Total number of reads that support a methylated cytosine at this position
unmethy_num	Total number of reads that support a unmethylated cytosine at this position
type	Methylation site type, one of the following [CG, CHG, CHH]
base	Trinucleotide context

The format required by DSS should be exactly same like following:

chr	pos	N	X
chr22	10510235	5	3
chr22	10510275	4	1
chr22	10510284	4	0

The meanings of these properties are listed below:

chr	Name of the chromosome
pos	Cytosine chromosomal coordinates (0-based)
N	Total number of reads mapped to the cytosine position
X	Total number of reads that support a methylated cytosine at this position

Note: The header line should be saved in both the case and control group file, otherwise DMR toolkit will report an error.

4. Result Analysis

The names of final output file are

"DMR_result_name"_DMR_0.01.txt

"DMR_result_name"_DMR_0.01_Anno.tsv

"DMR_result_name"_DMR_0.01_GO.tsv

"DMR_result_name"_DMR_0.01_KEGG.tsv

The example of output files is shown below:

4.1 Output of DMRs identification

chr	start	end	length	nCG	meanMethy1	meanMethy2	diff.Methy
chr22	48575046	48577135	2090	633	0.395725065064858	0.869503399366761	-0.473778334301903

The meaning of these properties is listed below:

chr	Name of the chromosome
start	Start position of the differentially methylated region (DMR)
end	End position of the differentially methylated region (DMR)
length	Length of the differentially methylated region (DMR)
mCG	mCG number contained in the differentially methylated region (DMR)
meanMethy1	Average methylation level of the differentially methylated region (DMR) in case sample
meanMethy2	Average methylation level of the differentially methylated region (DMR) in control sample
diff.Methy	Differentially methylated value of the differentially methylated region (DMR) diff.Methy = meanMethy1 - meanMethy2

4.2 Output of annotation

chr	start	end	genomic_location	gene_id	gene_name	length	mCG	diff.Methy	meanMethy1	meanMethy2
chr22	15890181	15890416	Intron	ENSG00000232775	BMS1P22	235	6	0.6074933	0.8716893	0.2641960

All columns of DMRs identification output file are contained in annotation output file. Only the non-repeating properties are listed in the following table.

genomic_location	Genomic feature of the differentially methylated region (DMR)
gene_id	Gene symbol of the nearest gene
gene_name	Gene symbol of the nearest gene (If the gene does not have a corresponding name then NA is dispalyed)

4.3 Output of GO enrichment

The output file of GO enrichment shows in the format of the table below. If the corresponding term is not enriched, the content of the file is displayed as "no term is enriched".

ONTOLOGY	ID	Description	GeneRatio	BgRatio	pvalue	padjust	qvalue	geneID	Count
BP	GO:1901748	metabolic process	1/346	10/18866	7.194e-09	1.24e-05	1.22e-05	GGT1	1

The meaning of these properties is listed below:

ONTOLOGY	"BP" subontologies
ID	Unique tagging ID in the Gene Ontology knowledgebase
Description	Descriptive information
GeneRatio	Ratio of the number of genes associated with this term to the total number of all genes
BgRatio	Ratio of the number of genes associated with this term to the total number of all genes (Bg type)
pvalue	p value
qvalue	q value
geneID	Gene id
Count	Number of genes associated with this term

4.4 Output of KEGG enrichment

The output file of KEGG enrichment shows like the table below. If the corresponding term is not enriched, the content of the file is displayed as "no KEGG term enriched".

ID	Description	GeneRatio	BgRatio	pvalue	padjust	qvalue	geneID	Count
hsa05170	Human immunodeficiency virus 1 infection	6/346	10/18866	7.194e-09	1.24e-05	1.22e-05	GGT1/GGT5/GGT2/GGTLC2/GGT3P/GGTLC3	16

The meaning of these properties is listed below:

ID	Unique tagging ID in the KEGG Ontology knowledgebase
Description	Descriptive information
GeneRatio	Ratio of the number of genes associated with this term to the total number of all genes
BgRatio	Ratio of the number of genes associated with this term to the total number of all genes (Bg type)
pvalue	p value
qvalue	q value
geneID	Gene id
Count	Number of genes associated with this term

MethBank

1. Prerequisite software and packages

2. Download and install

3. Usage and options summary

4. Result Analysis

1. Prerequisite software and packages

Identification of differentially methylated regions (DMRs)

Downstream Analysis (Annotation, GO & KEGG enrichment)

2.Download and install

3. Usage and options summary

3.1 Recommended file structure

3.2 Typical command

3.3 Command parameters

3.4 Input file format

4. Result Analysis

4.1 Output of DMRs identification

4.2 Output of annotation

4.3 Output of GO enrichment

4.4 Output of KEGG enrichment