CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.

Wei-Chun Chung, Chien-Chih Chen, Jan-Ming Ho, Chung-Yen Lin, Wen-Lian Hsu, Yu-Chun Wang, D T Lee, Feipei Lai, Chih-Wei Huang, Yu-Jung Chang
Author Information
  1. Wei-Chun Chung: Institute of Information Science, Academia Sinica, Taipei, Taiwan; Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan; Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan.
  2. Chien-Chih Chen: Institute of Information Science, Academia Sinica, Taipei, Taiwan; Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
  3. Jan-Ming Ho: Institute of Information Science, Academia Sinica, Taipei, Taiwan; Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan.
  4. Chung-Yen Lin: Institute of Information Science, Academia Sinica, Taipei, Taiwan.
  5. Wen-Lian Hsu: Institute of Information Science, Academia Sinica, Taipei, Taiwan.
  6. Yu-Chun Wang: Institute of Information Science, Academia Sinica, Taipei, Taiwan.
  7. D T Lee: Institute of Information Science, Academia Sinica, Taipei, Taiwan; Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan; Department of Computer Science and Information Engineering, National Chung Hsing University, Taichung, Taiwan.
  8. Feipei Lai: Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
  9. Chih-Wei Huang: Institute of Information Science, Academia Sinica, Taipei, Taiwan.
  10. Yu-Jung Chang: Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Abstract

BACKGROUND: Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce.
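The MapReduce model mentioned above can be illustrated with a minimal single-process sketch. It is not code from CloudDOE, CloudBurst, CloudBrush, or CloudRS, and it does not use the Hadoop API; the function names (`map_read`, `shuffle`, `reduce_counts`, `count_kmers`) and the k-mer counting task are assumptions chosen purely for illustration. In a real Hadoop cloud, the map and reduce steps would run on different machines and the framework would carry out the shuffle over the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_read(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run map, shuffle, and reduce in a single process (illustration only)."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy reads; real sequencing data sets are far larger.
    print(count_kmers(["ACGT", "CGTA"], 2))
```

Because each read is mapped independently and each key is reduced independently, both steps can be spread across many computers, which is the property that lets a Hadoop cloud reduce computational latency on large data sets.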
RESULTS: We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, freeing scientists from complicated operational procedures. The interface guides users through deploying a Hadoop cloud within in-house computing environments and running applications specifically targeted at bioinformatics, including CloudBurst, CloudBrush, and CloudRS. CloudDOE may also be used on top of a public cloud. CloudDOE consists of three wizards: Deploy, Operate, and Extend. The Deploy wizard helps the system administrator deploy a Hadoop cloud; it installs Java Runtime Environment version 1.6 and Hadoop version 0.20.203 and starts the service automatically. The Operate wizard lets the user run a MapReduce application from the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using the Extend wizard.
CONCLUSIONS: CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the source code of CloudDOE, to incorporate more MapReduce bioinformatics tools, and to support next-generation big data open source tools, e.g., Hadoop BigTop and Spark.
AVAILABILITY: CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/.

MeSH Term

Algorithms
Computational Biology
High-Throughput Nucleotide Sequencing
Sequence Analysis, DNA
Software
