Duplication count distributions in DNA sequences.

Suzanne S Sindi, Brian R Hunt, James A Yorke

Author Information

Suzanne S Sindi: Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA. suzanne_sindi@brown.edu

PMID: 19256873 DOI: 10.1103/PhysRevE.78.061912

We study quantitative features of complex repetitive DNA in several genomes by examining sequences that are long enough that they are unlikely to have repeated by chance. For each genome we study, we determine the number of identical copies, the "duplication count," of each sequence of length 40, that is, of each "40-mer." We say a 40-mer is "repeated" if its duplication count is at least 2. We focus mainly on "complex" 40-mers, those without short internal repetitions. We find that we can classify most of the complex repeated 40-mers into two categories: one category has its copies clustered closely together on one chromosome; the other has its copies distributed widely across multiple chromosomes. For each genome and each of these two categories, we compute N(c), the number of 40-mers that have duplication count c, for each integer c. In each case, we observe a power-law-like decay in N(c) as c increases from 3 to 50 or higher. In particular, we find that N(c) decays much more slowly than would be predicted by evolutionary models in which each 40-mer is equally likely to be duplicated. We also analyze an evolutionary model that does reflect the slow decay of N(c).
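
The definitions of duplication count and N(c) given in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names are hypothetical, and the sketch assumes a single in-memory sequence, exact matching on one strand only (no reverse complements), overlapping occurrences counted separately, and no filtering for "complex" k-mers. A toy value of k = 4 stands in for 40.

```python
from collections import Counter


def duplication_counts(sequence, k=40):
    """Map each k-mer in `sequence` to its number of identical copies.

    Assumption: one strand only, overlapping occurrences counted.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(dup_counts):
    """Keep only k-mers whose duplication count is at least 2."""
    return {kmer: c for kmer, c in dup_counts.items() if c >= 2}


def count_distribution(dup_counts):
    """Return N, where N[c] is the number of distinct k-mers with count c."""
    return Counter(dup_counts.values())


# Toy example with k = 4 instead of 40.
toy_sequence = "ACGTACGTTT"
counts = duplication_counts(toy_sequence, k=4)
repeats = repeated_kmers(counts)
N = count_distribution(counts)
print(repeats)
print(sorted(N.items()))
```

In the toy sequence only ACGT occurs twice, so N(2) = 1 and N(1) = 5. Holding every 40-mer of a real genome in a dictionary like this would likely need far more memory than a sorted or disk-based k-mer counter, so the sketch is meant only to pin down the definitions.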

R01 HG002945-01/NHGRI NIH HHS
1R01HG0294501/NHGRI NIH HHS

Animals

Base Sequence

Biophysical Phenomena

Chromosomes

DNA

Gene Duplication

Genomics

Humans

Markov Chains

Models, Chemical

Models, Genetic

Multigene Family

Repetitive Sequences, Nucleic Acid

DNA

Journal Article; Research Support, N.I.H., Extramural; Research Support, U.S. Gov't, Non-P.H.S.
