CAARS: comparative assembly and annotation of RNA-Seq data.

Carine Rey, Philippe Veber, Bastien Boussau, Marie Sémon
Author Information
  1. Carine Rey: UnivLyon, Université Claude Bernard Lyon 1, ENS de Lyon, CNRS UMR, INSERM U1210, LBMC, F-69007, Lyon, France.
  2. Philippe Veber: UnivLyon, Université Claude Bernard Lyon 1, CNRS, UMR, LBBE, F-69100, Villeurbanne, France.
  3. Bastien Boussau: UnivLyon, Université Claude Bernard Lyon 1, CNRS, UMR, LBBE, F-69100, Villeurbanne, France.
  4. Marie Sémon: UnivLyon, Université Claude Bernard Lyon 1, ENS de Lyon, CNRS UMR, INSERM U1210, LBMC, F-69007, Lyon, France.

Abstract

MOTIVATION: RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction.
RESULTS: We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assemblies. Then, CAARS incorporates transcripts into gene families, builds gene alignments and trees, and uses phylogenetic information to classify the genes as orthologs or paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as reference, a difficult case for this kind of analysis. We showed that CAARS assemblies are more complete and accurate than those produced by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity on a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, directly usable for downstream comparative analyses.
AVAILABILITY AND IMPLEMENTATION: CAARS is implemented in Python and OCaml and is freely available at https://github.com/carinerey/caars.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
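
ILLUSTRATIVE SKETCH: The classification step described in RESULTS can be illustrated with the standard tree-based definition of orthology: two genes are orthologs if their last common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The Python sketch below is not taken from the CAARS source code. The `Node` class, the function names, the gene names and the event labels are all hypothetical, and the sketch assumes the internal nodes of the gene tree have already been labeled as speciations or duplications (for example by reconciliation with a species tree).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Gene tree node (hypothetical structure, not the CAARS data model)."""
    name: str = ""
    event: Optional[str] = None  # "speciation" or "duplication" on internal nodes
    children: List["Node"] = field(default_factory=list)


def leaf_names(node: Node) -> List[str]:
    """Return the names of all leaves below (or at) this node."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


def lca_event(root: Node, a: str, b: str) -> Optional[str]:
    """Return the event label of the last common ancestor of leaves a and b."""
    names = leaf_names(root)
    if a == b or a not in names or b not in names:
        raise ValueError("expected two distinct leaves present in the tree")
    node = root
    while True:
        # Descend as long as a single child still contains both leaves.
        for child in node.children:
            below = leaf_names(child)
            if a in below and b in below:
                node = child
                break
        else:
            return node.event


def classify(root: Node, a: str, b: str) -> str:
    """Orthologs if the LCA is a speciation node, paralogs otherwise."""
    return "orthologs" if lca_event(root, a, b) == "speciation" else "paralogs"


# Toy gene family: an ancestral duplication produced copies A and B,
# each of which later diverged between a reference species and a new species.
tree = Node(event="duplication", children=[
    Node(event="speciation", children=[
        Node(name="ref_geneA"), Node(name="new_transcript1")]),
    Node(event="speciation", children=[
        Node(name="ref_geneB"), Node(name="new_transcript2")]),
])

print(classify(tree, "new_transcript1", "ref_geneA"))
print(classify(tree, "new_transcript1", "ref_geneB"))
```

In this toy tree, a newly assembled transcript is annotated relative to each reference gene by the event at their last common ancestor, which is the kind of relationship a gene family tree makes available for downstream comparative analyses.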

MeSH Terms

Genome
Molecular Sequence Annotation
Phylogeny
RNA
Sequence Analysis, RNA
Software
Transcriptome

Chemicals

RNA
