START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries.

Xinjie Zhu, Qiang Zhang, Eric Dun Ho, Ken Hung-On Yu, Chris Liu, Tim H Huang, Alfred Sze-Lok Cheng, Ben Kao, Eric Lo, Kevin Y Yip
Author Information
  1. Xinjie Zhu: Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, Hong Kong.
  2. Qiang Zhang: School of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
  3. Eric Dun Ho: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
  4. Ken Hung-On Yu: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
  5. Chris Liu: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
  6. Tim H Huang: Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, San Antonio, Texas, USA.
  7. Alfred Sze-Lok Cheng: School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
  8. Ben Kao: Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, Hong Kong. kao@cs.hku.hk.
  9. Eric Lo: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. ericlo@cse.cuhk.edu.hk.
  10. Kevin Y Yip: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. kevinyip@cse.cuhk.edu.hk.

Abstract

BACKGROUND: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often force analysts to focus on the detailed computational steps rather than on their biological questions.
RESULTS: Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/ ), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples.
CONCLUSIONS: Overall, STQL and START provide a generic and easy way to analyze a large number of genomic signal tracks in parallel.
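
To illustrate the kind of low-level interval manipulation that a declarative query is meant to hide, the following is a minimal Python sketch, not STQL and not the START implementation. It models a signal track as a list of intervals with values and computes, for each region, the overlap-weighted mean of a signal track. The `Interval` class, the function names, the half-open coordinate convention and the naive pairwise scan are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One entry of a signal track (hypothetical representation)."""
    chrom: str
    start: int    # assumed 0-based, inclusive
    end: int      # assumed exclusive
    value: float


def overlap_pairs(track_a, track_b):
    """Return all (a, b) pairs that overlap on the same chromosome.

    Naive scan per chromosome; a production system would use sorting
    or an index instead.
    """
    by_chrom = defaultdict(list)
    for b in track_b:
        by_chrom[b.chrom].append(b)
    pairs = []
    for a in track_a:
        for b in by_chrom.get(a.chrom, []):
            if a.start < b.end and b.start < a.end:
                pairs.append((a, b))
    return pairs


def mean_value_over(regions, signal):
    """Average signal value within each region, weighted by overlap length.

    Regions with no overlapping signal interval are omitted from the result.
    """
    totals = {}
    for region, sig in overlap_pairs(regions, signal):
        length = min(region.end, sig.end) - max(region.start, sig.start)
        weighted, covered = totals.get(region, (0.0, 0))
        totals[region] = (weighted + sig.value * length, covered + length)
    return {r: w / c for r, (w, c) in totals.items()}


# Toy data, invented for this example only.
regions = [Interval("chr1", 100, 200, 0.0)]
signal = [
    Interval("chr1", 50, 150, 2.0),
    Interval("chr1", 150, 250, 4.0),
    Interval("chr2", 100, 200, 9.0),
]
result = mean_value_over(regions, signal)
```

Even this small task requires explicit decisions about grouping by chromosome, overlap tests and weighting, which is the procedural detail a declarative language leaves to the execution system.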

Grants

  1. U54 CA217297/NCI NIH HHS

MeSH Term

Carcinoma, Hepatocellular
Genomics
Humans
Liver Neoplasms
Programming Languages
