Detection of Regional Variation in Selection Intensity within Protein-Coding Genes Using DNA Sequence Polymorphism and Divergence.

Zi-Ming Zhao, Michael C Campbell, Ning Li, Daniel S W Lee, Zhang Zhang, Jeffrey P Townsend
Author Information
  1. Zi-Ming Zhao: Department of Biostatistics, Yale University, New Haven, CT.
  2. Michael C Campbell: Department of Biostatistics, Yale University, New Haven, CT.
  3. Ning Li: Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT.
  4. Daniel S W Lee: Department of Biostatistics, Yale University, New Haven, CT.
  5. Zhang Zhang: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China.
  6. Jeffrey P Townsend: Department of Biostatistics, Yale University, New Haven, CT.

Abstract

Numerous approaches have been developed to infer natural selection based on the comparison of polymorphism within species and divergence between species. These methods are especially powerful for the detection of uniform selection operating across a gene. However, empirical analyses have demonstrated that regions of protein-coding genes exhibiting clusters of amino acid substitutions are subject to different levels of selection relative to other regions of the same gene. To quantify this heterogeneity of selection within coding sequences, we developed Model Averaged Site Selection via Poisson Random Field (MASS-PRF). MASS-PRF identifies an ensemble of intragenic clustering models for polymorphic and divergent sites. This ensemble of models is used within the Poisson Random Field framework to estimate selection intensity on a site-by-site basis. Using simulations, we demonstrate that MASS-PRF has high power to detect clusters of amino acid variants in small genic regions, can reliably estimate the probability of a variant occurring at each nucleotide site in sequence data, and is robust to historical demographic trends and recombination. We applied MASS-PRF to human gene polymorphism data derived from the 1000 Genomes Project and divergence data from the common chimpanzee. On the basis of this analysis, we discovered striking regional variation in selection intensity, indicative of positive or negative selection, in well-defined domains of genes that have previously been associated with neurological processing, immunity, and reproduction. We suggest that amino acid-altering substitutions within these regions likely are or have been selectively advantageous in the human lineage, playing important roles in protein function.
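
The idea of averaging over an ensemble of clustering models to obtain a per-site variant probability can be illustrated with a minimal sketch. This is not the MASS-PRF implementation: it assumes a single 0/1 indicator per site, considers only "no cluster" and "one contiguous cluster" models, weights them by AIC (Akaike weights), and omits the Poisson Random Field step that converts these probabilities into selection intensities. The function names and the parameter counts (1 for the uniform model, 4 for a cluster model) are assumptions made for illustration.

```python
import math


def bernoulli_loglik(k, n):
    """Maximized log-likelihood of k variant sites among n sites (rate p = k/n)."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def model_averaged_site_probs(sites):
    """Return the AIC-weighted average variant probability for each site.

    sites: list of 0/1 flags (1 = variant observed at that site).
    Hypothetical simplification: every candidate model has at most one cluster.
    """
    n = len(sites)
    prefix = [0]
    for x in sites:
        prefix.append(prefix[-1] + x)
    total = prefix[-1]

    # Each model is (aic, start, end, rate_inside, rate_outside).
    models = []

    # Uniform model: one rate for the whole sequence (assumed 1 parameter).
    ll = bernoulli_loglik(total, n)
    models.append((2 * 1 - 2 * ll, 0, n, total / n, total / n))

    # One-cluster models: a separate rate inside the half-open window [start, end).
    for start in range(n):
        for end in range(start + 1, n + 1):
            if start == 0 and end == n:
                continue  # identical to the uniform model
            n_in = end - start
            k_in = prefix[end] - prefix[start]
            n_out = n - n_in
            k_out = total - k_in
            ll = bernoulli_loglik(k_in, n_in) + bernoulli_loglik(k_out, n_out)
            # Assumed 4 parameters: two rates plus two cluster boundaries.
            models.append((2 * 4 - 2 * ll, start, end, k_in / n_in, k_out / n_out))

    # Akaike weights: exp(-delta_AIC / 2), normalized over all models.
    best = min(m[0] for m in models)
    weights = [math.exp(-0.5 * (m[0] - best)) for m in models]
    z = sum(weights)

    probs = [0.0] * n
    for w, (_, start, end, p_in, p_out) in zip(weights, models):
        for i in range(n):
            probs[i] += (w / z) * (p_in if start <= i < end else p_out)
    return probs
```

Calling `model_averaged_site_probs` on a sequence with a dense run of variant sites flanked by invariant sites should assign clearly higher probabilities inside the run than far outside it. The exhaustive search over windows is cubic in sequence length, so this sketch is only practical for short sequences.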

MeSH Terms

Algorithms
Amino Acid Substitution
Animals
Cluster Analysis
Evolution, Molecular
Exons
Genetic Variation
Humans
Models, Genetic
Open Reading Frames
Polymorphism, Genetic
Polymorphism, Single Nucleotide
Selection, Genetic
Sequence Analysis, DNA
