Introduction

Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting. Model classes that dynamically adapt the model complexity to the data have been developed, but effective model selection is to date only possible for fully observable data and not, e.g., within de novo motif discovery.

To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides; second-order models appear to be the significantly better choice.

The traditional PWM model indeed appears insufficient for inferring realistic sequence motifs, as it is on average outperformed by more complex models that take intra-motif dependencies into account. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we recommend that any modern motif discovery algorithm attempt to take intra-motif dependencies into account.
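
The independence assumption of the PWM model, and what a dependency among adjacent nucleotides adds, can be illustrated with a minimal sketch. This is not the tool's implementation: it uses a plain first-order inhomogeneous Markov model rather than the parsimonious Markov models and model selection described above, and the function names, the pseudocount of 1, and the toy binding sites are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_pwm(sites, pseudocount=1.0):
    """Estimate one nucleotide distribution per position from aligned sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def pwm_log_likelihood(pwm, seq):
    """Independence assumption: the log-probability is a sum over positions."""
    return sum(math.log(pwm[pos][base]) for pos, base in enumerate(seq))

def fit_first_order(sites, pseudocount=1.0):
    """Position-specific P(base | previous base); position 0 is unconditional."""
    width = len(sites[0])
    first = fit_pwm(sites, pseudocount)[0]
    transitions = []
    for pos in range(1, width):
        pairs = Counter((site[pos - 1], site[pos]) for site in sites)
        table = {}
        for prev in ALPHABET:
            total = sum(pairs[(prev, b)] for b in ALPHABET)
            total += pseudocount * len(ALPHABET)
            table[prev] = {
                b: (pairs[(prev, b)] + pseudocount) / total for b in ALPHABET
            }
        transitions.append(table)
    return first, transitions

def first_order_log_likelihood(model, seq):
    """Each position after the first is conditioned on its left neighbour."""
    first, transitions = model
    ll = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        ll += math.log(transitions[pos - 1][seq[pos - 1]][seq[pos]])
    return ll

# Hypothetical toy sites: positions 1 and 2 are perfectly correlated
# (A is always followed by C, T is always followed by G).
sites = ["ACGT", "ACGT", "TGGT", "TGGT"]
pwm = fit_pwm(sites)
markov = fit_first_order(sites)

for seq in ("ACGT", "AGGT"):
    print(seq,
          round(pwm_log_likelihood(pwm, seq), 3),
          round(first_order_log_likelihood(markov, seq), 3))
```

On this toy input the PWM cannot distinguish the observed site `ACGT` from the never-observed combination `AGGT`, because it only sees the per-position frequencies. The first-order model assigns `AGGT` a lower score, since G was never seen directly after A.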

Publications

  1. Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data.
    Eggeling R, Roos T, Myllymäki P, Grosse I, 2015-11-01 - BMC Bioinformatics

Credits

  1. Ralf Eggeling
    Developer

    Department of Computer Science, Helsinki Institute for Information Technology HIIT, Finland

  2. Teemu Roos
    Developer

    Department of Computer Science, Helsinki Institute for Information Technology HIIT, Finland

  3. Petri Myllymäki
    Developer

    Department of Computer Science, Helsinki Institute for Information Technology HIIT, Finland

  4. Ivo Grosse
    Investigator

    German Center for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig

Summary

  Accession: BT006981
  Tool Type: Application
  Category:
  Platforms: Linux/Unix
  Technologies:
  User Interface: Terminal Command Line
  Download Count: 0
  Submitted By: Ivo Grosse