Duplication count distributions in DNA sequences.

Suzanne S Sindi, Brian R Hunt, James A Yorke
Author Information
  1. Suzanne S Sindi: Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA. suzanne_sindi@brown.edu

Abstract

We study quantitative features of complex repetitive DNA in several genomes by examining sequences long enough that they are unlikely to have repeated by chance. For each genome we study, we determine the number of identical copies, the "duplication count," of each sequence of length 40, that is, of each "40-mer." We say a 40-mer is "repeated" if its duplication count is at least 2. We focus mainly on "complex" 40-mers, those without short internal repetitions. We find that we can classify most of the complex repeated 40-mers into two categories: one category has its copies clustered closely together on one chromosome; the other has its copies distributed widely across multiple chromosomes. For each genome and each of these two categories, we compute N(c), the number of 40-mers that have duplication count c, for each integer c. In each case, we observe a power-law-like decay in N(c) as c increases from 3 to 50 or higher. In particular, we find that N(c) decays much more slowly than would be predicted by evolutionary models in which each 40-mer is equally likely to be duplicated. We also analyze an evolutionary model that does reflect the slow decay of N(c).
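
The definitions of duplication count and N(c) given in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `duplication_counts` and `count_distribution` are hypothetical, the toy sequence is invented, and the sketch assumes exact forward-strand matching only, with no reverse-complement handling and no filter for "complex" 40-mers.

```python
from collections import Counter


def duplication_counts(sequence, k=40):
    """Map each k-mer in `sequence` to its number of exact occurrences.

    Assumption: only the forward strand is scanned, and overlapping
    occurrences are counted separately.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def count_distribution(dup_counts):
    """Return N, where N[c] is the number of distinct k-mers with duplication count c."""
    return Counter(dup_counts.values())


# Toy example with k = 3 (hypothetical data, chosen only to keep it readable).
toy = "ACGTACGTTT"
counts = duplication_counts(toy, k=3)
N = count_distribution(counts)

# A k-mer is "repeated" if its duplication count is at least 2.
repeated = sorted(kmer for kmer, c in counts.items() if c >= 2)
print(repeated)        # the repeated 3-mers in the toy sequence
print(dict(N))         # N(c) for each observed duplication count c
```

For a real genome, an in-memory `Counter` over every 40-mer would likely be too large; a dedicated k-mer counting tool or a sorted suffix structure would be a more practical choice, but that is outside the scope of this sketch.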

Grants

  1. R01 HG002945-01/NHGRI NIH HHS
  2. 1R01HG0294501/NHGRI NIH HHS

MeSH Terms

Animals
Base Sequence
Biophysical Phenomena
Chromosomes
DNA
Gene Duplication
Genomics
Humans
Markov Chains
Models, Chemical
Models, Genetic
Multigene Family
Repetitive Sequences, Nucleic Acid

Chemicals

DNA
