A Zero-Inflated Latent Dirichlet Allocation Model for Microbiome Studies.

Rebecca A Deek, Hongzhe Li
Author Information
  1. Rebecca A Deek: Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
  2. Hongzhe Li: Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Abstract

The human microbiome consists of a community of microbes in varying abundances and has been shown to be associated with many diseases. An important first step in many microbiome studies is to identify distinct microbial communities in a given data set and the bacterial taxa that characterize them. Data from typical microbiome studies are high-dimensional counts with excessive zeros, which arise from both the true absence of a species (structural zeros) and low sequencing depth or dropout (sampling zeros). Although methods based on mixture models of counts have been developed for identifying microbial communities, they do not account for the excessive zeros observed in the data and do not differentiate structural zeros from sampling zeros. In this paper, we introduce a zero-inflated Latent Dirichlet Allocation model (zinLDA) for the sparse count data observed in microbiome studies. zinLDA builds on the flexible Latent Dirichlet Allocation model and allows for zero inflation in observed counts. We develop an efficient Markov chain Monte Carlo (MCMC) sampling procedure to fit the model. Results from our simulations show that zinLDA provides better fits to the data and is able to separate structural zeros from sampling zeros. We apply zinLDA to the data set from the American Gut Project and identify microbial communities characterized by different bacterial genera.
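
The distinction between structural and sampling zeros can be illustrated with a small simulation. The sketch below is not the zinLDA specification from the paper, which the abstract does not spell out. It assumes one plausible generative form: each taxon is structurally absent from a community with probability pi, the remaining taxa receive Dirichlet proportions, and each sample's counts are drawn from a multinomial over a Dirichlet mixture of communities. The function name and all parameters (pi, alpha, beta, depth) are hypothetical.

```python
import numpy as np

def simulate_zero_inflated_lda(n_samples=50, n_taxa=30, n_communities=3,
                               depth=200, pi=0.4, alpha=0.5, beta=0.5, seed=0):
    """Simulate counts from an assumed zero-inflated LDA-style model.

    Illustrative assumption only; not the authors' exact model.
    """
    rng = np.random.default_rng(seed)

    # structural[k, j] is True when taxon j is absent from community k.
    structural = rng.random((n_communities, n_taxa)) < pi
    for k in range(n_communities):
        if structural[k].all():
            # Keep at least one taxon present in every community.
            structural[k, rng.integers(n_taxa)] = False

    # Community-level taxon proportions: Dirichlet over the present taxa,
    # exactly zero for the structurally absent ones.
    phi = np.zeros((n_communities, n_taxa))
    for k in range(n_communities):
        present = ~structural[k]
        phi[k, present] = rng.dirichlet(np.full(present.sum(), beta))

    # Sample-level community proportions.
    theta = rng.dirichlet(np.full(n_communities, alpha), size=n_samples)

    # Observed counts: multinomial draw from each sample's mixed proportions.
    counts = np.zeros((n_samples, n_taxa), dtype=int)
    for i in range(n_samples):
        p = theta[i] @ phi
        counts[i] = rng.multinomial(depth, p / p.sum())
    return counts, theta, phi, structural

counts, theta, phi, structural = simulate_zero_inflated_lda()
mixed = theta @ phi
# A zero count is "sampling" if the taxon had positive probability in that sample.
sampling_zeros = (counts == 0) & (mixed > 0)
structural_zeros = (counts == 0) & (mixed == 0)
print("sampling zeros:", sampling_zeros.sum(),
      "structural zeros:", structural_zeros.sum())
```

In this sketch the two kinds of zeros are labeled using the simulated ground truth. With real data only the counts are observed, which is why a model-based fitting procedure such as the MCMC sampler described above is needed to separate them.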

Grants

  1. R01 GM123056/NIGMS NIH HHS
  2. R01 GM129781/NIGMS NIH HHS
