Abstract
The Human Proteome Organization (HUPO) recently completed the first large-scale collaborative study to characterize the human serum and plasma proteomes. The study was carried out at multiple sites and used diverse methods and instruments to compare and integrate tandem mass spectrometry (MS/MS) data on aliquots of pooled serum and plasma from healthy subjects. Liquid chromatography (LC)-MS/MS data sets from 18 laboratories were matched to the International Protein Index database, and an initial integration exercise yielded 9,504 proteins identified with one or more peptides and 3,020 proteins identified with two or more peptides. This article applies a rigorous statistical approach that takes into account the length of coding regions in genes and uses multiple hypothesis-testing techniques. On this basis, we now present a reduced set of 889 proteins identified with a confidence level of at least 95%. We also discuss the importance of such an integrated analysis in providing an accurate representation of a proteome, as well as the value of such data sets for the high-confidence identification of protein matches to novel exons, some of which may lie in alternatively spliced forms of known plasma proteins and some in previously nonannotated gene sequences.
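The abstract names two ingredients, a correction for coding-region length and a multiple hypothesis-testing adjustment, but does not spell out the procedure. The sketch below is only an illustration of how those two ideas can combine, not the authors' method. It assumes that random peptide matches fall on a protein at a rate proportional to its length (a Poisson model) and applies a Bonferroni adjustment. The function names, the `random_hits_total` parameter, and the example proteins are all hypothetical.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, k):          # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def length_adjusted_pvalues(proteins, random_hits_total):
    """Assumed model: random hits land on a protein in proportion to its length.

    proteins: list of (name, length, observed_peptide_count) tuples.
    random_hits_total: assumed number of random peptide matches across the database.
    """
    total_length = sum(length for _, length, _ in proteins)
    pvalues = {}
    for name, length, peptide_count in proteins:
        expected_random = random_hits_total * length / total_length
        pvalues[name] = poisson_tail(peptide_count, expected_random)
    return pvalues


def bonferroni_accept(pvalues, alpha=0.05):
    """Keep identifications whose Bonferroni-adjusted p-value is at most alpha."""
    m = len(pvalues)
    return sorted(name for name, p in pvalues.items() if p * m <= alpha)


# Hypothetical toy data: (name, length in residues, peptides observed).
toy = [("A", 500, 5), ("B", 5000, 2), ("C", 500, 1), ("D", 4000, 1)]
pvals = length_adjusted_pvalues(toy, random_hits_total=2.0)
print(bonferroni_accept(pvals))
```

In this toy setting the long protein B is not accepted despite two peptide hits, because two random matches are unsurprising for a sequence of that length, whereas five hits on the short protein A are. This is the intuition behind weighting evidence by coding-region length.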
Acknowledgements
The collaborative HUPO Plasma Protein study and the data analysis presented here have been supported by a trans-National Institutes of Health grant supplement 84982 administered by the National Cancer Institute, by pharmaceutical and technology company sponsors and by voluntary efforts of collaborating laboratories.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Fig. 1
Accrual of identifications as a function of sampling. (PDF 20 kb)
Supplementary Fig. 2
Complement component 3 isoforms. (PDF 20 kb)
Supplementary Table 1
Numbers of protein identifications by specimen and by methodologies applied in individual laboratories. (PDF 90 kb)
Supplementary Table 2
List of high-confidence protein identifications. (PDF 116 kb)
Supplementary Table 3
Intragenic peptides not in an annotated exon. (PDF 15 kb)
About this article
Cite this article
States, D., Omenn, G., Blackwell, T. et al. Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat Biotechnol 24, 333–338 (2006). https://doi.org/10.1038/nbt1183
DOI: https://doi.org/10.1038/nbt1183