Exploiting orthology and de novo transcriptome assembly to refine target sequence information.

Julia F Söllner, Germán Leparc, Matthias Zwick, Tanja Schönberger, Tobias Hildebrandt, Kay Nieselt, Eric Simon
Author Information
  1. Julia F Söllner: Computational Biology & Genomics, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Strasse 65, 88397, Biberach an der Riss, Germany.
  2. Germán Leparc: Transl. Medicine + Clin. Pharmacology, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Strasse 65, 88397, Biberach an der Riss, Germany.
  3. Matthias Zwick: Computational Biology & Genomics, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Strasse 65, 88397, Biberach an der Riss, Germany.
  4. Tanja Schönberger: Drug Discovery Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Strasse 65, 88397, Biberach an der Riss, Germany.
  5. Tobias Hildebrandt: Computational Biology & Genomics, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Strasse 65, 88397, Biberach an der Riss, Germany.
  6. Kay Nieselt: Integrative Transcriptomics, Center for Bioinformatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany.
  7. Eric Simon: Computational Biology & Genomics, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Strasse 65, 88397, Biberach an der Riss, Germany. eric.simon@boehringer-ingelheim.com.

Abstract

BACKGROUND: The ability to generate recombinant drug target proteins is important for drug discovery research as it facilitates the investigation of drug-target interactions in vitro. To accomplish this, the target's exact protein sequence is required. Public databases, such as Ensembl, UniProt and RefSeq, are extensive protein and nucleotide sequence repositories. However, many sequences for non-human organisms are predicted by computational pipelines and may thus be incomplete or incorrect. This could lead to misinterpreted experimental outcomes due to gaps or errors in orthologous drug target sequences. Transcriptome analysis by RNA-Seq has been established as a standard method for gene expression analysis. Apart from this common application, paired-end RNA-Seq data can also be used to obtain full-coverage cDNA sequences via de novo transcriptome assembly.
METHODS: To assess whether de novo transcriptome assemblies can be used to determine a protein's sequence by searching the assembly for a known orthologous sequence, we generated 3 × 6 = 18 tissue-specific assemblies (three organs: brain, kidney and liver; six species: human, mouse, rat, dog, pig and cynomolgus monkey). These assemblies and the manually curated human protein sequences from UniProtKB/Swiss-Prot were used in a reciprocal BLAST search to identify best-matching hits. We automated and generalised our approach and present the a&o-tool, a workflow which exploits de novo assemblies of paired-end RNA-Seq data and orthology information for target sequence validation and refinement across related species. Furthermore, the a&o-tool extracts best hits' sequences from a reciprocal BLAST search, translates them into protein sequences, computes a multiple sequence alignment and quantifies the refinement.
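The reciprocal BLAST step described above can be illustrated with a minimal sketch. It is not the a&o-tool's implementation: it assumes both searches were written in BLAST's default 12-column tabular format (where the bit score is the last column), that "best hit" means the highest bit score per query, and the function names `best_hits` and `reciprocal_best_hits` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Map each query ID to its highest-bit-score subject.

    Assumes default BLAST tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def reciprocal_best_hits(
    forward_rows: Iterable[str], reverse_rows: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (protein, contig) pairs that are each other's best hit."""
    forward = best_hits(forward_rows)  # e.g. reference proteins vs. assembly
    reverse = best_hits(reverse_rows)  # e.g. assembly contigs vs. reference proteins
    return sorted(
        (query, subject)
        for query, (subject, _) in forward.items()
        if subject in reverse and reverse[subject][0] == query
    )


# Toy input with made-up identifiers (not data from the study).
forward = [
    "P1\tc1\t100.0\t50\t0\t0\t1\t50\t1\t150\t1e-30\t100.0",
    "P1\tc2\t90.0\t50\t5\t0\t1\t50\t1\t150\t1e-20\t80.0",
    "P2\tc2\t95.0\t40\t2\t0\t1\t40\t1\t120\t1e-15\t70.0",
]
reverse = [
    "c1\tP1\t100.0\t50\t0\t0\t1\t150\t1\t50\t1e-30\t100.0",
    "c2\tP1\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t80.0",
]
pairs = reciprocal_best_hits(forward, reverse)
```

In this toy input only P1 and c1 select each other; P2's best contig (c2) points back to P1, so that pair is not reciprocal and is dropped.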
RESULTS: For the three human assemblies we observed a hit rate greater than 60% with 100% sequence coverage and identity. For assemblies from the other species we observed similar hit rates and coverage with highest identities for cynomolgus monkey.
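As a rough illustration of how a hit rate at 100% coverage and identity could be computed, the sketch below counts hits whose alignment spans the whole query with no mismatches and divides by the number of query proteins. The abstract does not give the exact definition used by the authors, so the coverage formula, the `Hit` fields and the function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hit:
    query_length: int        # length of the reference protein
    query_start: int         # 1-based, inclusive alignment start on the query
    query_end: int           # 1-based, inclusive alignment end on the query
    percent_identity: float  # identity over the aligned region


def is_perfect(hit: Hit) -> bool:
    """True if the alignment covers the full query at 100% identity (assumed definition)."""
    coverage = (hit.query_end - hit.query_start + 1) / hit.query_length
    return coverage == 1.0 and hit.percent_identity == 100.0


def perfect_hit_rate(hits: Iterable[Hit], total_queries: int) -> float:
    """Fraction of all query proteins that have a perfect hit."""
    if total_queries <= 0:
        raise ValueError("total_queries must be positive")
    return sum(is_perfect(h) for h in hits) / total_queries


# Made-up example: four query proteins, three of which produced a best hit.
hits = [
    Hit(query_length=100, query_start=1, query_end=100, percent_identity=100.0),
    Hit(query_length=100, query_start=1, query_end=90, percent_identity=100.0),
    Hit(query_length=50, query_start=1, query_end=50, percent_identity=99.0),
]
rate = perfect_hit_rate(hits, total_queries=4)
```

Only the first hit meets both criteria, so the rate here is one in four.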
CONCLUSIONS: In summary, we show how to refine protein sequences using RNA-Seq data and sequence information from closely related species. With the a&o-tool we provide a fully automated pipeline to perform refinement including cDNA translation and multiple sequence alignment for visual inspection. The major prerequisite for applying the a&o-tool is high-quality sequencing data.

MeSH Term

Animals
Gene Expression Profiling
Genomics
Humans
Sequence Analysis, RNA
Sequence Homology, Nucleic Acid