MethParquet: an R package for rapid and efficient DNA methylation association analysis adopting Apache Parquet.

Ziqing Wang, Michael Cassidy, Danielle A Wallace, Tamar Sofer
Author Information
  1. Ziqing Wang: Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, 02215, United States.
  2. Michael Cassidy: Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, 02215, United States.
  3. Danielle A Wallace: Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, 02215, United States.
  4. Tamar Sofer: Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, 02215, United States.

Abstract

SUMMARY: Genome-wide DNA methylation (DNAm) profiling is indispensable for revealing how DNAm regulates biological pathways and individual phenotypes. However, managing and analyzing the extensive DNAm data generated by large cohort studies presents computational obstacles. Apache Parquet is a columnar data file format that allows efficient data storage, retrieval, and manipulation, alleviating the computational hurdles associated with conventional row-based formats. Here we introduce MethParquet, the first R package to leverage the Parquet format for efficient DNAm data analysis. It supports data extraction, methylation risk score calculation, epigenome-wide association analyses, and other standard post-quality-control tasks, and it flexibly implements diverse regression models. Using a public methylation dataset, we show that the package reduces running time and RAM usage in large-scale epigenome-wide association studies (EWAS).
AVAILABILITY AND IMPLEMENTATION: The MethParquet R package is publicly available on GitHub at https://github.com/ZWangTen/MethParquet. The repository includes a vignette and a toy dataset derived from a public resource.
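
To illustrate why a columnar layout suits this kind of analysis, the following minimal Python sketch contrasts row-based and column-based access on a toy methylation table and then computes a methylation risk score as a weighted sum of methylation values. It is not MethParquet's implementation or API: the sample IDs, CpG names, values, weights, and all function names are hypothetical, and the weighted-sum form of the score is an assumption about a common definition rather than something stated in the abstract.

```python
# Toy methylation table in a row-based layout: one record per sample.
# All IDs and beta values are made up for illustration.
row_store = [
    {"sample": "S1", "cg0001": 0.10, "cg0002": 0.80, "cg0003": 0.45},
    {"sample": "S2", "cg0001": 0.20, "cg0002": 0.70, "cg0003": 0.50},
    {"sample": "S3", "cg0001": 0.30, "cg0002": 0.60, "cg0003": 0.55},
]


def to_columnar(rows):
    """Pivot row records into one contiguous list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def read_cpg_from_rows(rows, cpg):
    """Row layout: every record must be visited to collect one CpG."""
    return [row[cpg] for row in rows]


def read_cpg_from_columns(columns, cpg):
    """Columnar layout: the values for one CpG are already stored together."""
    return columns[cpg]


def methylation_risk_score(columns, weights):
    """Weighted sum of methylation values per sample (assumed definition).

    Only the CpG columns named in `weights` are read.
    """
    n_samples = len(columns["sample"])
    return [
        sum(w * columns[cpg][i] for cpg, w in weights.items())
        for i in range(n_samples)
    ]


column_store = to_columnar(row_store)
weights = {"cg0001": 1.0, "cg0003": -2.0}  # hypothetical CpG weights
scores = methylation_risk_score(column_store, weights)
```

In a real Parquet file the same idea applies at the storage level: a reader can load only the columns a query names instead of scanning every full record, which is where savings in running time and memory would be expected to come from.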

Grants

  1. K99 HL166700/NHLBI NIH HHS
  2. R01 HL161012/NHLBI NIH HHS

MeSH Term

DNA Methylation
Software
Humans
Genome-Wide Association Study
