Modeling transcriptome based on transcript-sampling data.

Jiang Zhu, Fuhong He, Jing Wang, Jun Yu
Author Information
  1. Jiang Zhu: Chinese Academy of Sciences (CAS) Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China.

Abstract

BACKGROUND: Newly developed multiplex sequencing technology has brought transcriptome sequencing to an unprecedented depth. Millions of transcript tags can now be acquired in a single experiment through parallelization. The significant increase in throughput and reduction in cost require us to address some fundamental questions. How many transcript tags must be sequenced for a given transcriptome? How can we estimate the total number of unique transcripts for different cell types (transcriptome diversity) and the distribution of their copy numbers (transcriptome dynamics)? What is the probability that a transcript with a given expression level will be detected at a certain sampling depth?
METHODOLOGY/PRINCIPAL FINDINGS: We developed a statistical model to evaluate these parameters based on transcriptome-sampling data. Three mixture models were explored for their potential to model the sampling frequencies. We demonstrated that the relative abundances of all transcripts in a transcriptome follow the generalized inverse Gaussian distribution. The widely known beta and gamma distributions failed to capture the distinctive characteristics of the relative abundance distribution, i.e., highly skewed toward zero and with a long tail. An estimator of transcriptome diversity and an analytical form of the sampling growth curve were proposed in a coherent framework. Experimental data fitted this model very well, and Monte Carlo simulations based on this model replicated sampling experiments with remarkable precision.
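
The Monte Carlo idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes tags are drawn independently with replacement, and it draws relative abundances from the inverse Gaussian distribution, which is only the special case of the generalized inverse Gaussian family with index -1/2. The transcript count (5000), the parameters `mu` and `lam`, and all function names are hypothetical choices made for illustration. The sketch compares a simulated sampling growth curve (distinct transcripts seen versus tags sequenced) with its expectation under the same assumptions.

```python
import math
import random


def inverse_gaussian(rng, mu, lam):
    """Draw one inverse Gaussian variate (Michael-Schucany-Haas method)."""
    y = rng.gauss(0.0, 1.0) ** 2
    root = math.sqrt(4.0 * mu * lam * y + (mu * y) ** 2)
    x = mu + (mu * mu * y) / (2.0 * lam) - (mu / (2.0 * lam)) * root
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x


def simulate_growth(abundances, depths, rng):
    """Sample tags with replacement; count distinct transcripts at each depth."""
    ids = list(range(len(abundances)))
    draws = rng.choices(ids, weights=abundances, k=max(depths))
    checkpoints = set(depths)
    seen = set()
    curve = {}
    for n, transcript in enumerate(draws, start=1):
        seen.add(transcript)
        if n in checkpoints:
            curve[n] = len(seen)
    return curve


def expected_growth(abundances, depth):
    """Expected number of distinct transcripts after `depth` independent draws."""
    total = sum(abundances)
    return sum(1.0 - (1.0 - a / total) ** depth for a in abundances)


# Hypothetical transcriptome: 5000 transcripts with skewed, long-tailed abundances.
rng = random.Random(0)
abundances = [inverse_gaussian(rng, mu=1.0, lam=0.2) for _ in range(5000)]
depths = [1000, 10000, 100000]
curve = simulate_growth(abundances, depths, rng)
for n in depths:
    print(n, curve[n], round(expected_growth(abundances, n)))
```

Each printed row gives the sampling depth, the simulated number of distinct transcripts, and the expected number. Fitting the abundance distribution to real tag counts, which is the substance of the model, is outside the scope of this sketch.
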
CONCLUSIONS: Taking the human embryonic stem cell as a prototype, we demonstrated that sequencing tens of thousands of transcript tags in an ordinary EST/SAGE experiment was far from sufficient. To fully characterize a human transcriptome, millions of transcript tags had to be sequenced. This model lays a statistical basis for transcriptome-sampling experiments and, in essence, can be applied to all sampling-based data.
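
A short calculation gives a feel for why depth matters so much. This is a sketch under a simplifying assumption, not the paper's estimator: if tags are drawn independently and a transcript has relative abundance p, the chance of seeing it at least once in n tags is 1 - (1 - p)^n. The function names and the example abundances below are hypothetical.

```python
import math


def detection_probability(rel_abundance, depth):
    """Probability of at least one tag from the transcript in `depth` draws."""
    # 1 - (1 - p)^n, computed in a numerically stable way for small p.
    return -math.expm1(depth * math.log1p(-rel_abundance))


def required_depth(rel_abundance, target_prob):
    """Smallest depth whose detection probability reaches `target_prob`."""
    return math.ceil(math.log1p(-target_prob) / math.log1p(-rel_abundance))


# Hypothetical abundances: 1 in 10,000 tags versus 1 in 1,000,000 tags.
for p in (1e-4, 1e-6):
    print(p, required_depth(p, 0.95))
```

Under these assumptions, a transcript contributing one tag in ten thousand needs about thirty thousand tags for a 95% chance of detection, while one contributing one tag in a million needs about three million.
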

MeSH Term

Embryonic Stem Cells
Expressed Sequence Tags
Gene Expression Profiling
Humans
Models, Genetic
Monte Carlo Method
Normal Distribution
RNA, Messenger

Chemicals

RNA, Messenger
