IC4R003-Epigenomic-2011-21984925

From RiceWiki
Jump to: navigation, search

Project Title

  • Comparison of Four ChIP-Seq Analytical Algorithms Using Rice Endosperm H3K27 Trimethylation Profiling Data

The Background of This Project

  • Chromatin immunoprecipitation (ChIP) coupled with high throughput sequencing (ChIP-Seq) has emerged as one of the most promising tools for profiling protein-DNA binding sites and chromatin modifications on a genome-wide scale [1]. The goal of ChIP-Seq studies is to find those genomic DNA fragments that are enriched in immunoprecipitation fractions using antibodies specific for DNA associated proteins of interest. Enriched regions, those with a high density of short DNA reads after immunopre- cipitation and DNA sequencing, are referred to as peaks. Many programs for identification of peaks with ChIP-Seq data have been developed in recent years [2,3,4,5,6,7,8,9,10,11]. The reported algorithms differ in their approaches for identifying potential enriched regions of the genome. Some algorithms, for example MACS [10] and PeakSeq [8], use a simple sliding window and group all reads within each window together. Others use a finer resolution method, either considering each base pair singly as in FindPeaks [4] or defining the windows based on the read locations as represented by USeq [7]. After identifying windows, the algorithms must then determine which windows are the true enriched regions. Methods without a control (FindPeaks) either simply report the number of reads in the windows or make an assumption about the background distribution, such as assuming the reads follow a Poisson distribution (FindPeaks), and calculate significance based on the assumed distribution.
  • Those including a control sample (MACS, PeakSeq) use the control to more accurately model the background distribution of the reads and calculate an empirical False Discovery Rate (FDR) via, for example, a sample swap technique. Distinguishing between multiple small peaks or a single large peak is also challenging. While some algorithms merge overlapping peaks (MACS) or peaks within a user-supplied threshold (USeq, PeakSeq), others (Find-Peaks) compare the height of peaks to the depth of the separating valley to differentiate multiple small peaks from one large peak. Pepke et al. [12] discussed a number of additional peak identification algorithms in a review article. They made distinc- tions among the algorithms, including how the algorithms aggregated the reads, the criteria for significant peak identification, read shifting to account for reading the end of the reads, use of control, and input parameters. Similarly, Barski and Zhao [13] also reviewed a number of algorithms for peak identification. Thus far, however, no program has emerged as the consensus best approach for identifying peaks in histone modification and DNA binding studies. Therefore, it is important to compare these available algorithms and to suggest essential parameters to assist molecular biology laboratories in selecting the best program for their data analysis.
  • In this project, the researchers identified H3K27me3 modification sites within rice (Oryza sativa) young endosperm using the ChIP-Seq approach. Four different peak identification algorithms (PeakSeq, USeq, MACS, and FindPeaks) were used to locate H3K27me3 enrichment sites. ChIP-PCR was used to evaluate the quality of the peaks identified by these algorithms. We also analyzed the relative location of the peaks with respect to gene expression. Finally, we examined the Gene Ontology (GO) annotations [27] of the ChIP enriched genes.

Plant Materials & Treatment

  • All plants used in this study were rice strain Oryza sativa ssp japonica cv Nipponbare. The immature rice seeds were harvested 6–7 days after pollination. The cross-linking of the chromatin was achieved by vacuum infiltrating PBS (pH7.4) with 1% formaldehyde for 15 min at room temperature. The cross-linking reaction was stopped by adding glycine to a final concentration of 0.125 M and incubating for 5 min under vacuum. The tissues were rinsed 3 times with PBS.

Research Findings

  • Four peak calling programs (MACS [10], PeakSeq [8], FindPeaks version 3.1.9 [4], and USeq [7]) were used to identify peaks of the mapped reads, respectively. Table 1 summarizes characteristics of the peaks identified by each of the programs and Figure 1 shows an example of some of the peaks identified by the different programs displayed in GBrowse [30]. This image displays a 200 kb region of rice chromosome 1. The top track indicates the positions of all genes identified by TIGR, v6. The next two tracks display the distribution of ChIP and Input (control) reads respectively. The following four tracks display the predicted peaks by each of the four examined peak calling programs. The sequence read numbers were normalized to ensure that the ChIP and the Input had identical read numbers over the total genome. Therefore, the height of the graph in this figure directly correlates with the read number in the region to visually display the DNA enrichment.

IC4R003-Epigenomic-2011-21984925-1.png


  • The results show that the average peak bandwidth called by PeakSeq ranges from 2,393 bp to 11,284 bp depending on the max_gap parameter. Smaller values of the max_gap parameter result in shorter peak bandwidths. In the remainder of this manuscript, all PeakSeq results use max_gap = 200 and the program is referred to as PeakSeq(200). The peaks produced by FindPeaks have an average bandwidth of 846 bp while those identified by Useq average 2,313 bp—similar to those identified by PeakSeq(200). The peak bandwidth identified by MACS is the shortest at 778 bp on average.

Figure 1. GBrowse visualization of identified peaks.

  • FindPeaks identified 41,516 peaks covering 9.0% of the genome, USeq identified 9,094 peaks covering 5.4% of the genome, and MACS identified 15,738 peaks covering 3.1% of the genome. In contrast, peaks identified by PeakSeq with different max_gap values covers from 44% to 68.9% of the genome and identifies from 23,760 to 71,269 peaks.
  • Figure 2 displays the distribution of peaks identified by each of the peak identification programs. It is clear that the peaks identified by different programs displayed substantial differences. MACS, USeq, FindPeaks and PeakSeq(200) identified 97%, 86%, 65% and 20% of the total genes with no peak, respectively. For H3K27me3 peaks identified to be within a gene, the ratio was 2% for MACS, 5% for USeq, 22% for FindPeaks, and 71% for PeakSeq. For H3K27me3 peaks in the promoter region, the ratio was about 0.6% for MACS, 5% for USeq, 7% for FindPeaks, and 5% for PeakSeq. For the H3K27me3 peaks downstream of a gene, the ratio was about 0.4% for MACS, 4% for USeq, 6% for FindPeaks and 4% for PeakSeq, respectively. These results demonstrate that different programs will identify different peaks although the same dataset is used.

Labs working on this Project

  • Department of Computer Science and Engineering, Mississippi State University, Mississippi, United States of America,
  • Institute for Genomics, Biocomputing, and Biotechnology, Mississippi State University, Mississippi, United States of America
  • Department of Biochemistry and Molecular Biology, Mississippi State University, Mississippi, United States of America

Corresponding Author

  • Zhaohua Peng (E-mail: zp7@BCH.msstate.edu) & Susan M. Bridges (E-mail: bridges@cse.msstate.edu)