A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets.

Mikhail M Savitski, Mathias Wilhelm, Hannes Hahne, Bernhard Kuster, Marcus Bantscheff
Author Information
  1. Mikhail M Savitski: ‡Cellzome GmbH, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
  2. Mathias Wilhelm: §Chair for Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany; ¶SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany.
  3. Hannes Hahne: §Chair for Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany.
  4. Bernhard Kuster: §Chair for Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany; ‖Center for Integrated Protein Science Munich, Emil Erlenmeyer Forum 5, 85354 Freising, Germany; kuster@tum.de.
  5. Marcus Bantscheff: ‡Cellzome GmbH, Meyerhofstrasse 1, 69117 Heidelberg, Germany; marcus.x.bantscheff@gsk.com.

Abstract

Calculating the number of confidently identified proteins and estimating the false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further adds to the challenge. There are strong differences in opinion regarding the conceptual validity of a protein FDR, and there is no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target-decoy strategy that show particularly when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic as well as a novel target-decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprising ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The "picked" protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the higher score. We investigated the performance of this approach in combination with q-value-based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The "picked" target-decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein, yielding a stable number of true-positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used "classic" protein FDR approach that causes overprediction of false-positive protein identifications in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications, and is readily implemented in proteomics analysis software.
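
The pairing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, purely for illustration, that each protein entry already carries its best peptide q-value (lower is better), that a decoy shares its target's accession with a hypothetical "REV_" prefix, that ties are resolved by keeping the first entry seen, and that FDR is estimated as decoys divided by targets at a given threshold. The accessions and q-values are made up.

```python
from typing import Dict, List, Tuple

# Assumption: decoy entries reuse the target accession with this prefix.
DECOY_PREFIX = "REV_"

Entry = Tuple[str, float, bool]  # (protein id, best peptide q-value, is_decoy)


def base_accession(protein_id: str) -> str:
    """Strip the decoy prefix so a target and its decoy share one key."""
    if protein_id.startswith(DECOY_PREFIX):
        return protein_id[len(DECOY_PREFIX):]
    return protein_id


def classic_entries(best_q: Dict[str, float]) -> List[Entry]:
    """Classic strategy: every target and decoy is kept as its own entry."""
    entries = [(pid, q, pid.startswith(DECOY_PREFIX)) for pid, q in best_q.items()]
    return sorted(entries, key=lambda e: e[1])


def picked_entries(best_q: Dict[str, float]) -> List[Entry]:
    """Picked strategy: per target/decoy pair, keep only the better-scoring member."""
    winners: Dict[str, Entry] = {}
    for pid, q in best_q.items():
        key = base_accession(pid)
        current = winners.get(key)
        # Lower q-value wins; on a tie the first entry seen stays (assumption).
        if current is None or q < current[1]:
            winners[key] = (pid, q, pid.startswith(DECOY_PREFIX))
    return sorted(winners.values(), key=lambda e: e[1])


def fdr_at_threshold(entries: List[Entry], threshold: float) -> Tuple[int, int, float]:
    """Count targets and decoys passing the threshold; FDR estimate = decoys / targets."""
    targets = sum(1 for _, q, is_decoy in entries if q <= threshold and not is_decoy)
    decoys = sum(1 for _, q, is_decoy in entries if q <= threshold and is_decoy)
    fdr = decoys / targets if targets else 0.0
    return targets, decoys, fdr


# Made-up example scores (best peptide q-value per protein entry).
best_q = {
    "P1": 0.001, "REV_P1": 0.02,
    "P2": 0.005, "REV_P2": 0.03,
    "P3": 0.04,  "REV_P3": 0.01,
    "P4": 0.008,
}

print("classic:", fdr_at_threshold(classic_entries(best_q), 0.05))
print("picked: ", fdr_at_threshold(picked_entries(best_q), 0.05))
```

In this toy input, the classic ranking counts the decoys of P1 and P2 even though their targets score better, whereas the picked ranking drops them and retains only the decoy that outscored its own target (REV_P3).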


MeSH Terms

Chromatography, Liquid
Databases, Protein
False Positive Reactions
Humans
Proteomics
Reproducibility of Results
Software
Tandem Mass Spectrometry

