Use of Shotgun Proteomics for the Identification, Confirmation, and Correction of C. elegans Gene Annotations

Gennifer E. Merrihew, Colleen Davis, Brent Ewing, Gary Williams, Lukas Käll, Barbara E. Frewen, William Stafford Noble, Phil Green, James H. Thomas, Michael J. MacCoss

Abstract

We describe a general mass spectrometry-based approach for gene annotation of any organism and demonstrate its effectiveness using the nematode C. elegans. We detected 6,779 C. elegans proteins (67,047 peptides), including 384 that, although annotated in WormBase WS150, lacked cDNA or other prior experimental support. We also identified 429 new coding sequences that were unannotated in WS150. Nearly half (192/429) of the new coding sequences were confirmed with RT-PCR data. Thirty-three (~8%) of the new coding sequences had been predicted to be pseudogenes, 151 (~35%) reveal apparent errors in gene models, and 245 (~57%) appear to be novel genes. In addition, we verified 6,010 exon-exon splice junctions within existing WormBase gene models. Our work confirms that mass spectrometry is a powerful experimental tool for annotating sequenced genomes. In addition, the collection of identified peptides should facilitate future proteomics experiments targeted at specific proteins of interest.

The complete collection of identified peptides has been mapped back to the C. elegans genome and is available through http://wormbase.org. The annotated spectra have been assembled into a reference spectrum library. These spectra and the software to make use of the libraries are available at http://proteome.gs.washington.edu. All additional supplementary information will be available from the journal’s website.


Supplementary Data

Protein Identification List

RAW and ms2 Mass Spectrometry Data files

SEQUEST database search files

BiblioSpec Reference Spectrum Library Containing Annotated and Searchable Spectra


The ms2 and sqt file formats are described in McDonald et al. (2004).
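
For readers who want to inspect the ms2 files directly, the following is a minimal sketch of a reader, not the software distributed with the data. It assumes the commonly used ms2 layout: "H" header lines, an "S" line per spectrum giving first scan, last scan, and precursor m/z, one or more "Z" lines giving charge and [M+H]+ mass, optional "I" and "D" annotation lines, and then one "m/z intensity" pair per line. The names Spectrum and parse_ms2 and the values in EXAMPLE are hypothetical and chosen only for illustration; McDonald et al. (2004) remains the authoritative description.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Spectrum:
    """One MS/MS spectrum as read from an ms2 file (illustrative structure)."""
    first_scan: int
    last_scan: int
    precursor_mz: float
    charges: List[Tuple[int, float]] = field(default_factory=list)  # (charge, [M+H]+ mass)
    peaks: List[Tuple[float, float]] = field(default_factory=list)  # (m/z, intensity)


def parse_ms2(lines: Iterable[str]) -> Iterator[Spectrum]:
    """Yield spectra from ms2-formatted lines (assumed layout, see lead-in)."""
    current: Optional[Spectrum] = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        tag = parts[0]
        if tag == "H":
            continue  # file-level header, ignored in this sketch
        if tag == "S":
            # An S line starts a new spectrum; emit the previous one first.
            if current is not None:
                yield current
            current = Spectrum(int(parts[1]), int(parts[2]), float(parts[3]))
        elif current is None:
            continue  # ignore anything that precedes the first S line
        elif tag == "Z":
            current.charges.append((int(parts[1]), float(parts[2])))
        elif tag in ("I", "D"):
            continue  # per-spectrum annotations, ignored in this sketch
        else:
            # Any remaining line is taken to be an "m/z intensity" pair.
            current.peaks.append((float(parts[0]), float(parts[1])))
    if current is not None:
        yield current


# Made-up example content, used only to exercise the parser.
EXAMPLE = """H\tCreationDate\t2008-01-01
S\t000012\t000012\t650.32
I\tRTime\t12.5
Z\t2\t1299.63
187.4 12.0
324.5 48.5
633.4 7.1
S\t000013\t000013\t433.88
Z\t2\t866.75
Z\t3\t1299.62
201.1 5.0
"""

if __name__ == "__main__":
    for spectrum in parse_ms2(EXAMPLE.splitlines()):
        print(spectrum.first_scan, spectrum.precursor_mz,
              spectrum.charges, len(spectrum.peaks))
```

A spectrum with several "Z" lines, as in the second example record, has an ambiguous precursor charge; the sketch keeps every candidate so that each can be searched separately.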