Discovering unknown human and mouse transcription factor binding sites and their characteristics from ChIP-seq data.

Chun-Ping Yu, Chen-Hao Kuo, Chase W Nelson, Chi-An Chen, Zhi Thong Soh, Jinn-Jy Lin, Ru-Xiu Hsiao, Chih-Yao Chang, Wen-Hsiung Li
Author Information
  1. Chun-Ping Yu: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  2. Chen-Hao Kuo: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  3. Chase W Nelson: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  4. Chi-An Chen: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  5. Zhi Thong Soh: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  6. Jinn-Jy Lin: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  7. Ru-Xiu Hsiao: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  8. Chih-Yao Chang: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan.
  9. Wen-Hsiung Li: Biodiversity Research Center, Academia Sinica, 115 Taipei, Taiwan; whli@uchicago.edu.

Abstract

Transcription factor binding sites (TFBSs) are essential for gene regulation, but the number of known TFBSs remains limited. We aimed to discover and characterize unknown TFBSs by developing a computational pipeline for analyzing ChIP-seq (chromatin immunoprecipitation followed by sequencing) data. Applying the pipeline to the latest ENCODE ChIP-seq data for human and mouse, we found that using the irreproducible discovery rate as a quality-control criterion caused many experiments to be discarded unnecessarily. By contrast, the number of motif occurrences in ChIP-seq peak regions is a highly effective criterion that remains reliable even when supported by only one experimental replicate. In total, we obtained 2,058 motifs from 1,089 experiments for 354 human TFs and 163 motifs from 101 experiments for 34 mouse TFs. Among these motifs, 487 have not previously been reported. Mapping the canonical motifs to the human genome reveals a high TFBS density within ±2 kb of transcription start sites (TSSs), with a peak at -50 bp. On average, a promoter contains 5.7 TFBSs. However, 70% of TFBSs are in introns (41%) or intergenic regions (29%), whereas only 12% are in promoters (-1 kb to +100 bp from TSSs). Notably, some TFs (e.g., CTCF, JUN, JUNB, and NFE2) have motifs enriched in intergenic regions, including enhancers. We inferred 142 cobinding TF pairs and 186 tethered-binding TF pairs (115 of them completely tethered), indicating frequent interactions between TFs and a higher frequency of tethered binding than of cobinding. This study provides a large number of previously undocumented motifs and insights into the biological and genomic features of TFBSs.
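
The promoter definition used above (-1 kb to +100 bp from a TSS) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `in_promoter`, the use of a single coordinate per site, the "+"/"-" strand encoding, and the inclusive window bounds are all assumptions made for illustration.

```python
def in_promoter(site_pos, tss_pos, strand, upstream=1000, downstream=100):
    """Return True if a site lies in the promoter window of a TSS.

    Hypothetical helper: the window (-1 kb to +100 bp) comes from the
    abstract; treating both bounds as inclusive is an assumption.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Signed offset from the TSS; positive values are downstream
    # in the direction of transcription, so the sign flips on "-".
    offset = site_pos - tss_pos if strand == "+" else tss_pos - site_pos
    return -upstream <= offset <= downstream


# A site 50 bp upstream of a plus-strand TSS falls inside the window.
print(in_promoter(950, 1000, "+"))
# The same genomic offset of +200 is downstream on "+" but upstream on "-".
print(in_promoter(1200, 1000, "+"), in_promoter(1200, 1000, "-"))
```

Making the offset strand-aware matters because "upstream" is defined relative to the direction of transcription, not to increasing genomic coordinate.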

MeSH Term

Animals
Binding Sites
Chromatin Immunoprecipitation Sequencing
Humans
Mice
Nucleotide Motifs
Promoter Regions, Genetic
Transcription Factors

Chemicals

Transcription Factors
