hermie: the spectrum-to-protein analysis pipeline

Description:  Analyze peptide spectra. The pipeline script hermie is designed to automate the process of analyzing peptide spectra beginning with MS2 files and ending with a summary of peptide and protein identifications. There are several steps in the pipeline.

  1. Set up. The first step is to check that all of the input files are found, to confirm that the user has permissions to write in the requested directories, and to list the steps that will be taken. At this point the user can confirm that the analysis will proceed as desired or abort and correct any errors.
  2. Charge state determination. (optional) There are two possibilities for this step; both are turned off by default. For low resolution MS1 data, the program charge_czar can be run on the input MS2 files to determine the charge state of multiply charged spectra. New MS2 files are written as an intermediate output. Use the --charge-czar option to turn on this step.
    For high resolution MS1 data, Hardklor can be run to determine charge state and a more accurate mass followed by Bullseye to filter out spectra not derived from peptides. Use the high-res-perc mode to run these steps.
  3. Library Search (optional) A library search is done on the MS2 files (or charge-state-determined files if charge state determination is turned on) using BiblioSpec. The library search results are written to a file as an intermediate output. If the library search is being used as a filter for the database search (see Search Modes section below) a new MS2 file is written containing only those spectra that did not pass the threshold criteria for positive identification. In the filtering mode, only these spectra will be used for the database search.
  4. Database Search (optional) A database search is done using SEQUEST. By default, the search is done on all spectra (or charge-state-determined spectra if charge state determination is turned on). Alternatively, the database search can be done only on spectra that were not identified in the library search (see Search Modes section below). The resulting SQT files are written as an intermediate output.
    A second or alternate database search can be run using parallel-crux, the multi-threaded implementation of crux. The crux search is turned off by default but can be run with the -crux option.
  5. Score Search Results (optional) A probability score is calculated for the database search results using percolator. To use this option, the database search must be done twice, once with a protein database and once with a randomized version of the same protein database (see Search Modes section below). A new SQT file is written, replacing the SEQUEST Xcorr with a discriminant score and replacing the Sp score with the log of the q-score.
  6. Analyze Search Results The results of the library and database searches are analyzed to select positive matches and assemble proteins from peptides. The SQT files generated by BiblioSpec and percolator or SEQUEST are summarized using DTASelect. Library matches are chosen based on the match score. If percolator is used, then matches are chosen based on a q-score. Otherwise, DTASelect can be used to choose identifications based on the usual criteria (Xcorr, deltaCn, proportion of ions, etc). A DTASelect-filter.txt and a DTASelect.html file are produced for viewing results.
  7. Update Library (optional) High-quality peptide identifications that are not in the library can be queued to be added to the library. An SSL file for building/updating a library is written to the library update queue.
  8. Upload Results to Database (not automated) Once your search has completed and you are satisfied with the results, you will want to upload them to the MSDaPl database. This step is not performed by hermie, but you are encouraged to make use of this database for storage, protein inference and results visualization. Follow these instructions.
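
The set-up checks in step 1 can be pictured with a short Python sketch. This is not hermie's own code: the function name preflight and its return format are hypothetical, and it illustrates only the two checks described above (the input files exist and the output location is writable).

```python
import os


def preflight(input_files, output_parent):
    """Return a list of problems; an empty list means the run may proceed.

    Hypothetical helper for illustration, not part of hermie.
    """
    problems = []
    # Every input file must exist before any step is scheduled.
    for path in input_files:
        if not os.path.isfile(path):
            problems.append(f"input file not found: {path}")
    # The user must be able to write where the output directory will go.
    if not os.path.isdir(output_parent):
        problems.append(f"output location is not a directory: {output_parent}")
    elif not os.access(output_parent, os.W_OK):
        problems.append(f"no write permission in: {output_parent}")
    return problems
```

Collecting every problem before reporting, rather than stopping at the first one, matches the idea that the user reviews the set-up once and then confirms or aborts.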

Usage:  hermie [options] <organism> <search mode>
             hermie [options] <organism> <search mode> <directory name> [<directory name>...]
             hermie [options] <organism> <search mode> <file name> [<file name>...]

Input:   In its simplest form, hermie takes only an organism and a search mode and analyzes all MS2 files in the current working directory, that is, all files with names ending in '.ms2', '.cms2' or '.bms2'. If a directory or list of directories is given, all MS2 files in those directories are analyzed. An individual MS2 file name or a list of file names may also be specified. Options, organisms, and search modes are described in the following sections.
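
The input rules above can be sketched in Python as follows. The function name collect_ms2_files is hypothetical and the code is an illustration under the stated rules, not hermie's implementation. In particular, taking explicitly named files as given without checking their extension is an assumption.

```python
import os

# File name endings that identify MS2 input, as listed in the text.
MS2_EXTENSIONS = (".ms2", ".cms2", ".bms2")


def collect_ms2_files(paths=None):
    """Expand directories and file names into a list of MS2 files.

    With no paths, the current working directory is scanned.
    Directory contents are returned in name order.
    """
    if not paths:
        paths = [os.getcwd()]
    found = []
    for path in paths:
        if os.path.isdir(path):
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                if os.path.isfile(full) and name.endswith(MS2_EXTENSIONS):
                    found.append(full)
        else:
            # Assumption: an explicitly named file is taken as given.
            found.append(path)
    return found
```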

Output:   A directory tree containing all intermediate output files and a file recording the specifics of each step taken. Details of how the analysis is progressing are printed to stdout (the screen) and can be adjusted with the verbosity option (see Options). All output files are written to a new directory whose name can be set using the name option (see Options). Within this directory are sub-directories for each step of the output. A text file named 'log' describing all actions taken is also written.

In some cases, users will want to re-run only select steps in the pipeline. Therefore, if the output directory tree already exists, only those sub-directories that are used in the analysis will be changed. Details of the analysis will be written to a new log file numbered sequentially.
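
A minimal Python sketch of sequential log naming is shown below. The text says only that new log files are numbered sequentially; the concrete names used here (log, then log.1, log.2, ...) are an assumption made for illustration, as is the function name next_log_path.

```python
import os


def next_log_path(output_dir):
    """Pick the first unused log file name in output_dir.

    Assumed naming scheme (not confirmed by the documentation):
    'log' for the first run, then 'log.1', 'log.2', and so on.
    """
    base = os.path.join(output_dir, "log")
    if not os.path.exists(base):
        return base
    n = 1
    while os.path.exists(f"{base}.{n}"):
        n += 1
    return f"{base}.{n}"
```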

Options:   The options below can be specified on the command line to control the pipeline analysis and return help messages. The search mode is a collection of options saved in a file. Several pre-defined search modes are available for typical searches or the user may define a custom search mode file.     
-name <output name> Specify the name of the directory in which the results are written. Default is 'pipeline'.
-check-setup Quit after printing out setup details. This option is useful for making sure that the input MS2 files are in place and that the mode and options have scheduled the steps you wish to perform.
-help Print the complete list of options available on the command line or in the mode file.
-list-organisms Print all of the predefined organisms. To specify an organism not listed, use "other" and include the -library and -fasta options in the mode.
-list-modes Print all of the default modes available. (To read the details of these modes, look in proteome:/mnt/local/pipeline/modes/<mode name>)
-verbose <0|1|2|3> Adjust the level of output to stdout. Default level is 2. 0 is silent. 1 prints the confirmation details from the set up step. 2 additionally displays progress by printing the name of the current step being run. 3 additionally prints all of the output from each step (which by default is written to a file).

Organisms and Search Modes:  The details of the pipeline analysis are defined by the organism of interest and what are called search modes. The organism defines which spectrum library and FASTA protein database are searched. Predefined organisms are already associated with particular databases and libraries. Use the organism other and a custom search mode to specify different database and/or library files. A search mode defines which steps of the analysis are performed, the options used for each step, and so on. There are five basic search modes defined (high-res-perc, standard, standard-perc, lib-as-filter, and lib-as-filter-perc) and custom modes may be defined as well. The pre-defined search mode files reside at /net/maccoss/vol2/software/pipeline/modes/

The standard mode searches all spectra with both the library and database searches. The database contains both real protein sequences and a shuffled version of each as decoys. This mode uses common DTASelect criteria for defining good matches. The standard-perc mode adds the use of percolator to define good matches. In this mode, two SEQUEST searches are done for each ms2 file, one with a standard protein database and one with a database of only shuffled sequences. The high-res-perc mode is the standard-perc mode with the addition of two pre-search steps. This mode is for high-resolution MS1 data and requires a .ms1 (.cms1 or .bms1) file for each .ms2 file. The lib-as-filter mode uses the library search as a way of reducing the number of spectra searched by SEQUEST. All spectra with a good match to the library are removed from the SEQUEST search. The lib-as-filter-perc mode adds the use of percolator to define good SEQUEST hits.
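
The differences between the five predefined modes can be summarized in a small Python table. This is a reading aid built only from the description above: the key names (library_as_filter, percolator, hardklor_bullseye) are illustrative labels, not hermie option names, and the actual mode files may set further options.

```python
# What each predefined mode adds, according to the description above.
MODES = {
    "standard": {
        "library_as_filter": False, "percolator": False, "hardklor_bullseye": False,
    },
    "standard-perc": {
        "library_as_filter": False, "percolator": True, "hardklor_bullseye": False,
    },
    "high-res-perc": {
        "library_as_filter": False, "percolator": True, "hardklor_bullseye": True,
    },
    "lib-as-filter": {
        "library_as_filter": True, "percolator": False, "hardklor_bullseye": False,
    },
    "lib-as-filter-perc": {
        "library_as_filter": True, "percolator": True, "hardklor_bullseye": False,
    },
}


def sequest_searches_per_file(mode):
    """Percolator modes search each MS2 file twice (real and shuffled database)."""
    return 2 if MODES[mode]["percolator"] else 1
```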

To define a new mode, create a text file that contains the desired parameters (described below). The mode is then referred to by the file name. The environment variable $MODEPATH defines where the program looks for mode files. If $MODEPATH is undefined, the default is /net/maccoss/vol2/pipeline/modes/. A custom sequest.params file can also be placed in your $MODEPATH to be used instead of the default one. The database in sequest.params will be replaced by the one defined by the organism or the one in the mode file.
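
The mode file lookup can be sketched in Python as below. The function name mode_file_path is hypothetical. The sketch assumes that $MODEPATH names a single directory and that an empty value counts as undefined; neither detail is stated in this document.

```python
import os

# Default location given in the text for an undefined $MODEPATH.
DEFAULT_MODE_DIR = "/net/maccoss/vol2/pipeline/modes/"


def mode_file_path(mode_name, environ=None):
    """Return the path where the named mode file is expected.

    Assumes $MODEPATH holds one directory; an empty value is
    treated the same as an undefined one.
    """
    environ = os.environ if environ is None else environ
    mode_dir = environ.get("MODEPATH") or DEFAULT_MODE_DIR
    return os.path.join(mode_dir, mode_name)
```

Passing the environment as an argument keeps the lookup easy to try out without changing the real $MODEPATH.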

The format of the mode file is the same as command-line options. Options may be separated by spaces, tabs, or new lines. Any line beginning with # will be ignored and can be used for comments. The following options may be specified.

-charge-czar Run the charge_czar step. Off by default.
-nocharge-czar Skip the charge_czar step.
-noblibSearch Skip the library search (BiblioSpec) step.
-nosequest Skip the database search with SEQUEST.
-queue <queue name> Submit jobs to a specific queue.
-crux Run a crux search. Turned off by default.
-nocrux Skip the crux search (Use to override mode).
-percolator A flag to select the use of percolator for scoring SEQUEST results. A decoy database must also be specified with the -decoy option. Not necessary with the standard-perc mode.
-nopercolator Skip the percolator step. Can be used to override a mode file, particularly when repeating select steps in a run.
-bullseye Use Hardklor/Bullseye to assign precursor mass. Requires that -hk-conf specify the configuration file to use.
-nobullseye Do not run Hardklor/Bullseye. Can be used to override a mode file.
-old-perc Use an older version of percolator, v1.07.
-noadd-lib Do not add new spectra to the library.
-nodtaselect Do not run DTASelect on SEQUEST or library search results.
-fasta <protein database file> (Only valid with the organism "other") Full path of the protein database file to be used by SEQUEST
-decoy <decoy database> (Only valid with the organism "other". Required for percolator) Full path of the random/shuffled protein database file to be used by SEQUEST. This option is used in conjunction with -percolator. Two separate SEQUEST searches are done on each MS2 file, one with the real protein database and one with a randomized database.
-index <index name> Use index for SEQUEST. Replaces -fasta option.
-decoy-index <index name> Use shuffled index for SEQUEST/percolator. Replaces -decoy option.
-seq-params <file> Specify a sequest.params file to be used by SEQUEST. The file can have any name and does not have to be in the $MODEPATH. The database in the specified file will NOT be used. The organism argument or -fasta option will define the database.
-tryptic <type> Use a tryptic digest in the SEQUEST search, where 'type' is full, partial, or non (i.e. full tryptic digest, partial tryptic digest, or non-specific digest). Default is partial.
-hk-conf <file> File to replace the default hardklor.conf.
-nodes <n> Number of nodes requested for the SEQUEST search of each MS2 file
-sleep-time <n> Seconds between checks on SEQUEST progress.
-library <library file> (Only valid with the organism "other". Required for library search and library update) Full path of the BiblioSpec library to be used in the search.
-lib-as-filter Use the library search results as a filter for the database search. Only search those spectra which did not have a confident ID based on the library search.
-mail <address> Send an email notification to the address when the hermie run ends, whether it completes successfully or stops because of a fatal error.
-web <path> Copy the DTASelect.html file from the SEQUEST search to the web directory given by path. Something like /mnt/www/localhost/name/myresults.html will rename the file to myresults.html. The rest of the directory structure must already be present.
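
The mode file format described above (options separated by spaces, tabs, or new lines, with lines beginning with # ignored) can be sketched in Python as follows. The function name parse_mode_text is hypothetical, and the sample mode is made up from options listed above purely to show the format; it is not one of the predefined modes.

```python
def parse_mode_text(text):
    """Turn the contents of a mode file into a flat list of option words."""
    tokens = []
    for line in text.splitlines():
        # Lines beginning with '#' are comments.
        if line.startswith("#"):
            continue
        # Spaces, tabs, and new lines all separate options.
        tokens.extend(line.split())
    return tokens


# A made-up custom mode using options from the list above.
example_mode = """# database search only, scored with percolator
-noblibSearch
-percolator\t-tryptic full
-nodes 4
"""

print(parse_mode_text(example_mode))
```

The resulting word list has the same shape as command-line arguments, which reflects the statement that the mode file format is the same as the command-line options.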
Options for component programs
Unlike the above options, these can be specified multiple times to add multiple values. There is no guarantee as to the order in which they will be used in the command. The <option> should include any dashes. Options that do not take arguments (e.g. the percolator -d flag) should still be followed by a '=' and then no <argument>. (Example: two different percolator options as they would be passed to hermie. --perc-option -d= --perc-option --sqt-out=myfile.sqt ) Note that all SEQUEST, Hardklor and Bullseye options are passed via their respective parameter files.
--cz <option>=<argument> Specify the options used with charge-czar. The format is the same as for the --dta-sequest option. For available options, see the charge-czar documentation
--lib <option>=<argument> Specify the options used with the library search (BlibSearch). The format is the same as for the --dta-sequest option. For available options, see the BlibSearch documentation
--crux-option <option>=<argument> Specify options to be used with crux. For available options, see the parallel-crux documentation.
--perc-option <option>=<argument> Specify options to be used with percolator. For available options, run percolator from a command prompt with no arguments.
--dta-sequest <option>=<argument> Specify the options used with DTASelect on the SEQUEST search results. A full list of available options can be found by running DTASelect with the --help option.
--dta-library <option>=<argument> Specify the options used with DTASelect on the library search data.
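
The <option>=<argument> convention can be illustrated with a short Python sketch. The function names are hypothetical, and how hermie actually assembles the component command line is not documented here; passing the option and its argument as separate words is an assumption.

```python
def parse_component_option(value):
    """Split '<option>=<argument>' into command-line words.

    A flag with no argument is written with a trailing '=' (e.g. '-d=')
    and yields just the option.
    """
    option, sep, argument = value.partition("=")
    if not sep:
        raise ValueError(f"expected <option>=<argument>, got {value!r}")
    # Assumption: option and argument are passed on as separate words.
    return [option, argument] if argument else [option]


def build_component_args(values):
    """Collect the words for every value given to a repeated option."""
    args = []
    for value in values:
        args.extend(parse_component_option(value))
    return args
```

For the two percolator options in the example above, build_component_args(["-d=", "--sqt-out=myfile.sqt"]) gives the words -d, --sqt-out and myfile.sqt.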

Bugs:  

Library search with no SEQUEST search. If a library search is done with no SEQUEST search, DTASelect will fail on the library results because there is no sequest.params file. As a work-around, begin by running hermie with the -check-setup option as though you would also run SEQUEST. This writes the sequest.params file. Then do the library search by running hermie without the -check-setup option and with the -nosequest option.

Symbolic links and ms1 files. When running Hardklor/Bullseye, there must be an .ms1 file for each .ms2 file and they must be in the same directory. Hermie follows symbolic links back to their source and discards the link location. So unfortunately, putting links to ms1s and ms2s in the same directory will not work. The actual files must be in the same place. This problem is scheduled to be fixed.
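
The effect of following symbolic links can be checked ahead of time with a Python sketch like the one below. The function names are hypothetical and this is not hermie's code. It resolves each .ms2 path to the real file and looks for the matching .ms1 (or .cms1, .bms1) file beside that real file, which is where the description above says it must be.

```python
import os


def expected_ms1_path(ms2_path):
    """Return the MS1 path expected next to the resolved (real) MS2 file."""
    real_ms2 = os.path.realpath(ms2_path)  # follow symbolic links to the source
    root, ext = os.path.splitext(real_ms2)
    # .ms2 -> .ms1, .cms2 -> .cms1, .bms2 -> .bms1
    return root + ext.replace("ms2", "ms1")


def has_ms1_partner(ms2_path):
    """True if the MS1 file sits in the same directory as the real MS2 file."""
    return os.path.isfile(expected_ms1_path(ms2_path))
```

A directory that holds only links to .ms1 and .ms2 files stored in different places fails this check even though the links sit side by side.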


Last updated March 30, 2010