Maximum-likelihood model averaging to profile clustering of site types across discrete linear sequences.

Zhang Zhang, Jeffrey P Townsend
Author Information
  1. Zhang Zhang: Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America.

Abstract

A major analytical challenge in computational biology is the detection and description of clusters of specified site types, such as polymorphic or substituted sites within DNA or protein sequences. Progress has been stymied by a lack of suitable methods to detect clusters and to estimate the extent of clustering in discrete linear sequences, particularly when there is no a priori specification of cluster size or cluster count. Here we derive and demonstrate a maximum likelihood method of hierarchical clustering. Our method incorporates a tripartite divide-and-conquer strategy that models sequence heterogeneity, delineates clusters, and yields a profile of the level of clustering associated with each site. The clustering model may be evaluated via model selection using the Akaike Information Criterion, the corrected Akaike Information Criterion, and the Bayesian Information Criterion. Furthermore, model averaging using weighted model likelihoods may be applied to incorporate model uncertainty into the profile of heterogeneity across sites. We evaluated our method by examining its performance on a number of simulated datasets as well as on empirical polymorphism data from diverse natural alleles of the Drosophila alcohol dehydrogenase gene. Our method yielded greater power for the detection of clustered sites across a breadth of parameter ranges, and achieved better accuracy and precision of estimation of clusters, than did the existing empirical cumulative distribution function statistics.
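
The model-selection and model-averaging steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' hierarchical algorithm: it assumes a binary sequence (1 marks a site of the specified type), compares a one-rate null model against every single-segment two-rate model, and counts four free parameters for a segment model (two rates plus two boundaries). The function names `bernoulli_loglik`, `criteria`, `fit_models`, `akaike_weights`, and `averaged_profile` are hypothetical and chosen only for this example.

```python
import math

def bernoulli_loglik(k, n):
    """Maximized log-likelihood of k hits among n sites under a single rate."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def criteria(loglik, n_params, n_obs):
    """Return (AIC, AICc, BIC); assumes n_obs > n_params + 1."""
    aic = -2 * loglik + 2 * n_params
    aicc = aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    bic = -2 * loglik + n_params * math.log(n_obs)
    return aic, aicc, bic

def fit_models(seq):
    """Fit the null model and every model with one segment [start, end)."""
    n, total = len(seq), sum(seq)
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)
    # Null model: one rate shared by all sites, one free parameter.
    models = [{"start": 0, "end": n, "n_params": 1,
               "loglik": bernoulli_loglik(total, n),
               "rates": [total / n] * n}]
    for s in range(n):
        for e in range(s + 1, n + 1):
            if s == 0 and e == n:
                continue  # identical to the null model
            k_in, n_in = prefix[e] - prefix[s], e - s
            k_out, n_out = total - k_in, n - n_in
            loglik = bernoulli_loglik(k_in, n_in) + bernoulli_loglik(k_out, n_out)
            p_in, p_out = k_in / n_in, k_out / n_out
            rates = [p_in if s <= i < e else p_out for i in range(n)]
            # Assumption: two rates plus two boundaries = four parameters.
            models.append({"start": s, "end": e, "n_params": 4,
                           "loglik": loglik, "rates": rates})
    return models

def akaike_weights(scores):
    """Convert criterion scores (lower is better) into normalized weights."""
    best = min(scores)
    raw = [math.exp(-0.5 * (s - best)) for s in scores]
    norm = sum(raw)
    return [r / norm for r in raw]

def averaged_profile(models, weights):
    """Weighted average of the per-site rates across all candidate models."""
    n = len(models[0]["rates"])
    return [sum(w * m["rates"][i] for m, w in zip(models, weights))
            for i in range(n)]

if __name__ == "__main__":
    seq = [0] * 15 + [1] * 8 + [0] * 15  # toy data with one dense segment
    models = fit_models(seq)
    aic_scores = [criteria(m["loglik"], m["n_params"], len(seq))[0] for m in models]
    weights = akaike_weights(aic_scores)
    best = models[aic_scores.index(min(aic_scores))]
    print("best segment:", best["start"], best["end"])
    print("profile:", [round(p, 2) for p in averaged_profile(models, weights)])
```

Replacing index `[0]` with `[1]` or `[2]` in the score computation switches the weights from AIC to AICc or BIC. The averaged profile reflects uncertainty about segment boundaries because every candidate model contributes in proportion to its weight rather than only the single best model.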

Grants

  1. S10 RR019895/NCRR NIH HHS
  2. RR19895/NCRR NIH HHS

MeSH Terms

Alcohol Dehydrogenase
Algorithms
Amino Acid Sequence
Animals
Base Sequence
Cluster Analysis
Computational Biology
Computer Simulation
Drosophila
Drosophila Proteins
Likelihood Functions
Models, Genetic
Sequence Analysis

Chemicals

Drosophila Proteins
Alcohol Dehydrogenase
