iPRG 2016 study submission

Dear iPRG 2016 Study Participant,

Thank you for participating in this year’s Proteome Informatics Research Group (iPRG) study. This letter provides the instructions needed to access the data files, complete your analysis, and submit your results. The deadline has been extended: results returned by Monday, January 16, 2017 will be included in the iPRG presentation at the next Annual ABRF Meeting, March 25-28, 2017.

News (April 2018): The answer key to this study is now available (see bottom of this page). For reference, the other pages are kept as they were when the study was conducted.


This study of bottom-up proteomics LC-MS/MS data analysis focuses on the identification and false-discovery rate (FDR) estimation of proteins, or more specifically, proteoforms (Smith et al. 2013). In this study, we have acquired data from four samples prepared by spiking different combinations of partially overlapping oligopeptides recombinantly expressed in the bacterium Escherichia coli (Figure 1 in the instruction file) into a common background. These oligos are here referred to as Protein Epitope Signature Tags (PrESTs, Figure 2 in the instruction file) and mimic protein homologs for the purpose of this study. Three technical replicate runs of each sample were acquired in random order. The goal of the study is to compare methods for inferring proteoform assignments in each of the samples and for estimating the confidence of those assignments. The participants are free to use any peptide and protein identification software, or a combination of several search engines. The participants are also free to use MS1 or MS2 data, or both.

Raw data is provided along with a FASTA sequence database that should be used without modification. The database contains 5,592 background proteins. However, only results on the PrESTs should be reported. These are the sequences with names beginning with 'HPRR' followed by a unique number.
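As an illustration of restricting reported results to the PrESTs, the following minimal Python sketch collects only the FASTA entries whose accession begins with 'HPRR'. It assumes that the accession is the first whitespace-delimited token on each header line; the function name, the example accessions, and the example sequences are hypothetical and are not taken from the study database. The FASTA file itself is only read, never modified.

```python
import re
from typing import Dict, Iterable

# Assumption: a PrEST accession is "HPRR" followed by digits and is the
# first token on the FASTA header line.
PREST_PATTERN = re.compile(r"^HPRR\d+")


def read_prest_sequences(lines: Iterable[str]) -> Dict[str, str]:
    """Return {accession: sequence} for the HPRR entries only."""
    sequences: Dict[str, str] = {}
    current = None  # accession of the PrEST entry being read, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            accession = tokens[0] if tokens else ""
            # Background proteins are skipped by leaving `current` unset.
            current = accession if PREST_PATTERN.match(accession) else None
            if current is not None:
                sequences[current] = ""
        elif current is not None:
            sequences[current] += line
    return sequences


# Made-up entries for demonstration; real headers may look different.
example = """>HPRR0000001 hypothetical PrEST entry
MKTAYIAK
QRQISFVK
>BACKGROUND_1 hypothetical background protein
MSEQNNTEMTF
>HPRR0000002
GGSLVPRGS
""".splitlines()

prests = read_prest_sequences(example)
print(sorted(prests))
```

With the real database, the same function could be called on an open file handle, for example `read_prest_sequences(open(path))`, where `path` points to the FASTA file in the study package.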

To evaluate the submissions in this and future studies, and to enable the participants themselves to compare their methods and results, an alternative, open notebook-style submission and evaluation system is being introduced in this study. As this is a novelty for 2016, we will also allow uploading of results in the form of a plain data table, as in previous studies. As part of the study data package, we also provide templates for R Markdown and an IPython notebook, defining the starting point and output data matrix to ensure all participants start from the same data and report using the same format. The participants are then free to insert their database search results, along with R, Java or Python scripts, to further analyze the data and visualize the results. We hope that this new submission format will be more transparent and facilitate sharing of methods for analysis and visualization, thereby extending the life of the study. Submitted files will be validated during submission to ensure they conform to the submission template.

Study Package

The study package can be downloaded from here. Unboxing the study package, you will find:

1 copy of these instructions
12 raw Q Exactive LC-MS/MS datasets
1 FASTA file to be used in this study
1 R Markdown containing one example solution
1 IPython notebook containing another example solution
1 example tab-separated data table containing results
1 Allen key


Please send questions to here. All identifying information will be removed prior to forwarding the question to the iPRG group members. For details, please refer to this PDF.

We thank you for your support of the ABRF and look forward to receiving your results for the study.


The ABRF Proteome Informatics Research Group (iPRG)

Magnus Palmblad (Chair) - Leiden University Medical Center, Netherlands
Henry Lam (Co-Chair) - Hong Kong University of Science and Technology
Michael Hoopmann - Institute for Systems Biology, Seattle, WA
Susan T. Weintraub - University of Texas Health Science Center at San Antonio, TX
Hyungwon Choi - National University of Singapore
Samuel Payne - Pacific Northwest National Laboratory, Richland, WA
Lukas Käll - KTH - Royal Institute of Technology, Stockholm, Sweden
Darryl Davis - Janssen Pharmaceuticals, Horsham, PA
Yasset Perez-Riverol - European Bioinformatics Institute, Hinxton, UK
Christopher Colangelo (EB Liaison) - Primary Ion, Old Lyme, CT
Answer key (revealed April 10, 2018): PrEST pool A (192 sequences) and PrEST pool B (191 sequences).
File 1 (.Rmd, .ipynb, .txt)

This file should be an R Markdown document or IPython notebook containing the analysis and explaining what was done. Alternatively, a free-text description of what was done can be provided here. Anonymized Markdowns and notebooks will be shared under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license on the ABRF iPRG website.

File 2 (.txt)

This file should be a tab-delimited table containing a data matrix with probabilities of presence (see appendix) for each identified proteoform, with each row containing one HPRR proteoform and each column the probabilities for one sample. The first column should contain the PrEST accession and the first row the sample names. The table format will be automatically validated at submission. The files created by the example Markdown document and notebook already conform to this format.
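The layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample names, the accessions, the probabilities, and the label in the top-left cell are all hypothetical, and the check shown here is a local sanity check, not the validation performed by the submission system. The example table in the study package remains the authoritative reference for the format.

```python
import csv
import io

# Hypothetical sample names; use the names given in the study templates.
SAMPLES = ["sample_1", "sample_2", "sample_3", "sample_4"]


def format_matrix(samples, probabilities):
    """Build a tab-delimited table: first row sample names, first column accession."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumption: the top-left cell holds a column label ("PrEST" here).
    writer.writerow(["PrEST"] + list(samples))
    for accession in sorted(probabilities):
        writer.writerow([accession] + [f"{p:.4f}" for p in probabilities[accession]])
    return buffer.getvalue()


def check_matrix(text):
    """Return a list of problems found in the table (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["table is empty"]
    width = len(rows[0])
    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {number}: expected {width} columns, found {len(row)}")
            continue
        if not row[0].startswith("HPRR"):
            problems.append(f"line {number}: accession {row[0]!r} is not an HPRR entry")
        for value in row[1:]:
            try:
                p = float(value)
            except ValueError:
                problems.append(f"line {number}: {value!r} is not a number")
                continue
            if not 0.0 <= p <= 1.0:
                problems.append(f"line {number}: probability {p} is outside [0, 1]")
    return problems


# Made-up probabilities of presence, one value per sample.
example = {
    "HPRR0000001": [0.99, 0.02, 0.97, 0.01],
    "HPRR0000002": [0.10, 0.95, 0.90, 0.00],
}
table = format_matrix(SAMPLES, example)
print(table)
print(check_matrix(table))
```

Running a check like this before uploading catches rows with a missing column, non-PrEST accessions, and values that are not probabilities.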


