A novel algorithm for computational identification of contaminated EST libraries.

Rotem Sorek, Hershel M Safer
Author Information
  1. Rotem Sorek: Compugen Ltd, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel. rotem@compugen.co.il

Abstract

A key goal of the Human Genome Project was to understand the complete set of human proteins, the proteome. Because the genome sequence alone is not sufficient for predicting new genes and the alternative splicing events that lead to new proteins, expressed sequence tags (ESTs) are used as the primary tool for these purposes. The high prevalence of artifacts in dbEST, however, often leads to invalid predictions. Here we describe a novel method for recognizing genomic DNA contamination and other artifacts that cannot be identified using current EST cleaning techniques. Our method uses the alignment of the entire set of ESTs to the human genome to identify highly contaminated EST libraries. We discovered 53 highly contaminated libraries and, within them, a subset of 24 766 ESTs that probably represent contamination with genomic DNA, pre-mRNA, and ESTs that span non-canonical introns. Although this is only a small fraction of the entire EST dataset, each contaminating sequence could create a spurious transcript prediction. Indeed, in the clustering and assembly tool that we used, these sequences would have caused incorrect inference of 9575 new splice variants and 6370 new genes. Conclusions based on EST analysis, including prediction of alternative splicing, should be re-evaluated in light of these results. Our method, along with the identified set of contaminated sequences, will be essential for applications that depend on large EST datasets.
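
The abstract does not give the details of the algorithm, so the following Python sketch only illustrates the general idea of scoring libraries from genome alignments; it is not the authors' method. It assumes each EST has already been aligned to the genome and annotated with its library. The record type `AlignedEst`, the "suspicious" criterion (unspliced, or spliced across a non-canonical intron), and both thresholds are hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedEst:
    """Hypothetical summary of one EST-to-genome alignment."""
    est_id: str
    library_id: str
    is_spliced: bool          # alignment contains at least one intron
    canonical_introns: bool   # every intron uses canonical splice sites


def is_suspicious(est):
    # Assumed criterion, not taken from the paper: an EST is counted as
    # suspicious if it is unspliced or spans a non-canonical intron.
    return (not est.is_spliced) or (not est.canonical_introns)


def flag_libraries(ests, min_ests=100, max_suspicious_fraction=0.5):
    """Return {library_id: suspicious fraction} for libraries over the cutoff.

    min_ests and max_suspicious_fraction are illustrative defaults.
    """
    total = defaultdict(int)
    suspicious = defaultdict(int)
    for est in ests:
        total[est.library_id] += 1
        if is_suspicious(est):
            suspicious[est.library_id] += 1

    flagged = {}
    for library_id, count in total.items():
        if count < min_ests:
            continue  # too few ESTs to judge the library reliably
        fraction = suspicious[library_id] / count
        if fraction > max_suspicious_fraction:
            flagged[library_id] = fraction
    return flagged
```

Scoring whole libraries rather than single ESTs reflects the abstract's point that some artifacts cannot be recognized sequence by sequence: an individual unspliced EST may be legitimate, whereas a library dominated by such ESTs is more likely to be contaminated.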

MeSH Terms

Algorithms
Artifacts
Computational Biology
DNA
Expressed Sequence Tags
Genome, Human
Genomic Library
Humans
Introns
RNA Precursors

Chemicals

RNA Precursors
DNA
