TopHat: discovering splice junctions with RNA-Seq.

Cole Trapnell, Lior Pachter, Steven L Salzberg
Author Information
  1. Cole Trapnell: Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA. cole@cs.umd.edu

Abstract

MOTIVATION: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
RESULTS: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20,000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.
AVAILABILITY: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Grants

  1. R01 GM083873/NIGMS NIH HHS
  2. R01 LM006845-10/NLM NIH HHS
  3. R01 LM006845-09/NLM NIH HHS
  4. R01 LM006845/NLM NIH HHS
  5. R01-LM06845/NLM NIH HHS
  6. R01-GM083873/NIGMS NIH HHS
  7. R01 GM083873-06/NIGMS NIH HHS

MeSH Terms

Algorithms
Gene Expression Profiling
Models, Genetic
RNA Splicing
RNA, Messenger
Sequence Alignment
Sequence Analysis, RNA
Software

Chemicals

RNA, Messenger
