Workflow to Mine Frequent DNA Co-methylation Clusters in DNA Methylome Data.

Jie Zhang, Kun Huang
Author Information
  1. Jie Zhang: Department of Medical & Molecular Genetics, School of Medicine, Indiana University, Indianapolis, IN, USA. jizhan@iu.edu.
  2. Kun Huang: Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, Indianapolis, IN, USA.

Abstract

Advances in high-throughput nucleotide sequencing technology have revolutionized biomedical research. Vast amounts of genomic data accumulate daily, which in turn calls for powerful bioinformatics tools and efficient workflows to analyze them. One approach to this "big data" problem is to mine highly correlated clusters or networks of biological molecules, which may provide rich yet hidden information about the underlying functional, regulatory, or structural relationships among genes, proteins, genomic loci, or other types of biological molecules or events. A recently developed network mining algorithm, lmQCM, mines tightly connected correlation clusters (networks) in large biological datasets with large sample sizes and guarantees a lower bound on cluster density. The algorithm has been used on a variety of cancer transcriptomic datasets to mine gene co-expression networks (GCNs), but it can be applied to any correlation matrix. lmQCM is available through the R package lmQCM as well as the online tool suite TSUNAMI (https://biolearns.medicine.iu.edu). The purpose of this study is to establish an analytical workflow that applies lmQCM to frequent (consensus) cluster mining across multiple DNA methylation datasets from different cancers and extracts the underlying common co-methylation networks for genes. Specifically, the workflow uses lmQCM to analyze DNA methylome data across different cancer types. It mines frequent DNA methylation clusters from the clustering results of the individual datasets, identifying common as well as distinctive DNA methylation patterns among cancer types. This workflow has successfully identified frequent GCNs in 33 types of cancers and has thus proven to be a powerful tool for analyzing large biological data. It helps to identify common features as well as distinctions among different diseases, disease subtypes, or biological processes.
The resulting frequent clusters may provide new insights into pathway and function networks. In disease studies, the results can lead to new directions for biomarker and drug target discovery. The advantages of this workflow include highly efficient processing of large biological data generated from high-throughput experiments, quick identification of highly correlated interaction networks, substantial reduction of the data dimensionality to a manageable number of variables for downstream comparative analysis, and consequently increased statistical power for detecting differences between conditions.
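
The frequent (consensus) cluster step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not call lmQCM. It assumes the per-cancer clusters have already been mined (for example with the lmQCM R package) and are supplied as plain lists of gene names. The function name `frequent_clusters`, the `min_support` parameter, the gene labels, and the choice to merge frequently co-clustered gene pairs by connected components are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def frequent_clusters(clusterings, min_support):
    """Return consensus clusters from per-dataset clustering results.

    clusterings: dict mapping a dataset name to a list of clusters,
                 each cluster being an iterable of gene names.
    min_support: minimum number of datasets in which two genes must share
                 a cluster for the pair to count as "frequent".
    """
    # Record, for every gene pair, the datasets in which it co-clusters.
    pair_support = defaultdict(set)
    for dataset, clusters in clusterings.items():
        for cluster in clusters:
            for a, b in combinations(sorted(set(cluster)), 2):
                pair_support[(a, b)].add(dataset)

    # Keep only frequent pairs and store them as an undirected graph.
    adjacency = defaultdict(set)
    for (a, b), datasets in pair_support.items():
        if len(datasets) >= min_support:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Each connected component of the frequent-pair graph is one cluster.
    # This is a simplification: it gives no density guarantee.
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))

    # Largest clusters first, ties broken alphabetically.
    return sorted(components, key=lambda c: (-len(c), c))


# Hypothetical toy input: three cancer types, placeholder gene labels.
toy = {
    "cancer_A": [["G1", "G2", "G3"], ["G7", "G8"]],
    "cancer_B": [["G1", "G2", "G3", "G4"], ["G8", "G9"]],
    "cancer_C": [["G2", "G3", "G5"], ["G7", "G8"]],
}

print(frequent_clusters(toy, min_support=2))
# prints [['G1', 'G2', 'G3'], ['G7', 'G8']]
```

Raising `min_support` yields fewer, more conserved clusters, while lowering it admits patterns shared by only a few cancer types. A density-aware merge would be closer in spirit to lmQCM, but that is beyond this sketch.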

MeSH Term

DNA
DNA Methylation
Epigenome
Humans
Neoplasms
Workflow

Chemicals

DNA
