hermie: the spectrum-to-protein analysis pipeline

On this Page

Before you begin
An Example Run
Choosing a search mode
Using MS2 files from different locations
Running longer analyses
Checking on your run's progress

Getting Started

This tutorial is designed to introduce MacCoss lab members and collaborators to hermie, the peptide mass spectrum data analysis pipeline. It assumes you know a few basic UNIX commands so that you can copy files, list the contents of a directory, and move to different directories. You can jump to a topic on this page with the links in the navigation bar to the left. Experienced users might try the examples page.

Before you begin

For this tutorial you will need an account on proteome and an MS2 file. In order to complete this tutorial quickly, you might want to use a small MS2 file with only a few spectra. There is a sample on proteome at /home/frewen/examples/sample.ms2 which you can copy to your home directory and use for this tutorial. Or if you prefer, you can generate your own. Here's how.

If you have a file, say platypus-01.ms2, you can create a shorter version of it by running this command:

$ head -n 15000 platypus-01.ms2 > short-platypus-01.ms2

You can count the number of spectra in your new file like this:

$ grep -c '^S' short-platypus-01.ms2
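
Cutting by line count can end the file partway through a spectrum. As an alternative, the sketch below keeps the first N complete spectra. It assumes that every spectrum begins with a line starting with S, which is the same assumption the grep command above relies on. The function name first_spectra is made up for this illustration and is not part of hermie.

```shell
# Hypothetical helper (not part of hermie): print the file header plus the
# first N complete spectra of an MS2 file.
# Assumes each spectrum starts with a line beginning with "S".
first_spectra() {
    n="$1"
    file="$2"
    awk -v n="$n" '
        /^S/ { count++ }        # a new spectrum begins on this line
        count > n { exit }      # stop before printing spectrum n+1
        { print }               # header lines and the spectra we keep
    ' "$file"
}

# Example: keep the first 100 spectra of platypus-01.ms2
# first_spectra 100 platypus-01.ms2 > short-platypus-01.ms2
```

Counting the S lines in the output with grep, as shown above, should report the number you asked for (or fewer if the input file is shorter).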

Hopefully, you were given a web directory on proteome so that you can view your HTML results. To check for that directory, run

$ ls ~/public_html

Either it will list the contents of the directory (which may be empty) or it will return an error telling you that no such directory exists.

Note: if you have never run SEQUEST before, you should follow the setup instructions.

Before you begin, you might want to check that there are processors available on the cluster. Run this command:

$ qstat

If it returns nothing, you are in good shape. If there are jobs listed that are in the queue and not yet running (state qw), you may have to wait a while for your run to complete.
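
If the listing is long, a small filter can count the waiting jobs for you. The sketch below assumes the Grid Engine style of qstat output, in which the fifth column holds the job state; check this against what qstat prints on proteome before relying on it. The function name count_waiting_jobs is made up for this illustration.

```shell
# Hypothetical helper (not part of hermie): count waiting jobs in qstat output
# read from standard input.
# Assumes the fifth whitespace-separated column is the job state and that
# "qw" marks a job that is queued but not yet running.
count_waiting_jobs() {
    awk '$5 == "qw" { n++ } END { print n + 0 }'
}

# Example: qstat | count_waiting_jobs
```

A result of 0 means nothing is waiting in the listing you piped in.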


An Example Run

Note: Hermie can read spectra from .ms2, .cms2 or .bms2 files. For this tutorial, ".ms2" will refer to all three types.

We will assume that you are using the sample MS2 file called sample.ms2 and that it contains spectra from yeast.

  1. Navigate to a directory where you would like to store your results. Copy your MS2 file to this directory.
  2. Make sure that hermie is in your $PATH environment variable by running this command.
    $ hermie
    If you get an error message saying command not found, try this
    $ export PATH=$PATH:/net/maccoss/vol2/software/bin
    $ hermie
    You should see the following text printed to your screen.
    FATAL: No organism and/or search mode given
    Usage:  hermie [options] <organism> <mode>
            hermie [options] <organism> <mode> <directory name> [<directory name>...]
            hermie [options] <organism> <mode> <file name> [<file name>...]
    
    Options: -name <output name>  Specify directory name for results.
                                    Default is 'pipeline'.
             -check-setup         Quit after printing out setup details.
             -help                Print the complete list of options.
             -list-organisms      Print a list of established organisms.
             -list-modes          Print a list of predefined modes.
             -verbose <0|1|2|3>   Adjust the level of output to stdout.
                                    Default level is 2. 0-silent. 1-setup details
                                    2-progress. 3-output of each step.
    

    This is the usage statement preceded by an error message.
  3. The error message on the first line is telling you that you did not provide the necessary arguments: the organism and the search mode. By specifying an organism, you are determining which protein database and library will be used for the search. By selecting a search mode, you are determining how the pipeline will run. There are several predefined organisms and search modes, or you can create your own (see Custom Modes and the documentation for details). To see a list of available organisms, run the command
    $ hermie -list-organisms
    You should see yeast in the list, which is the one we will use. Note that other is also in the list. This can be used with a custom mode to specify organisms not in the default list. To see a list of available search modes, run the command
    $ hermie -list-modes
    We will use the standard-perc mode (also see the Choosing a Search Mode section below).
  4. Look once more at the help message generated in step 2. The section beginning with Usage gives the overall format of hermie commands. The organism and mode are mandatory. Options supply information that isn't necessary but can be included, and some of them are listed in the usage statement. You can see all of the options with the command
    $ hermie -help
  5. Once we have chosen an organism, search mode, and options, we can get a preview of what hermie will do by using the -check-setup option. Run the command
    $ hermie -check-setup yeast standard-perc
    You should see the following
    hermie: Spectrum Analysis Pipeline
    
    Scheduled to run BiblioSpec SEQUEST percolator DTASelect update-library 
    Using ms2 files:
    	sample.ms2
    	
    Using library /net/maccoss/vol2/software/pipeline/libraries/yeast.lib
    Using protein database /net/maccoss/vol2/software/pipeline/dbase/yeast/yeast-200209-contam.fasta
      and decoy database /net/maccoss/vol2/software/pipeline/dbase/yeast/yeast-200209-contam-rev.fasta
    
    The preview is telling you that hermie will run five steps in the analysis, beginning with BiblioSpec and ending with update-library. It lists the MS2 files that will be used (by default, any .ms2 files in the current directory; in this case, sample.ms2) as well as the names of the library and the FASTA protein databases.
  6. Now we are ready for a real run. Run the analysis with the command
    $ hermie -sleep-time 1 -nodes 1 yeast standard-perc
    We added the -sleep-time option so you don't have to wait 20 minutes for hermie to finish and we added the -nodes option so that you are only requesting one node on the cluster. Usually, you will want to use the default values for these options.
    Once again, you should see the set-up information printed to the screen. Now it should also be followed by the name of the program being run as the analysis proceeds. Once it has completed, it will print Pipeline complete and give you a prompt.
  7. Take a look at the results. First look at the contents of the directory with ls. You should see two files named log and log-1 and a new directory named pipeline. This is the default name and it can be changed by using the option -name. For example, to name the directory storing the results tutorial, run the command
    $ hermie -name tutorial yeast standard
    The log files are a more detailed version of the information printed to the screen. The first one was produced when we did the check and the second one came from the actual run. You can control how much information is printed to the screen with the -verbose option. This does not affect the log file, so you could run hermie with no screen output (-verbose 0) and still have all the details saved to the log file.
  8. Within the pipeline directory are all of the intermediate outputs of the programs run. They are organized by sub-directories named for the program. For example, the output from SEQUEST is in pipeline/sequest.
    Note: Don't panic if there is no DTASelect-filter.txt file or if the DTASelect.html file is empty. Remember, we are only looking at a handful of spectra, so it's pretty unlikely that it found any good matches. Have a look at the end of the file pipeline/dtaselect/sequest/dtas-messages. There may be a line saying "No proteins passed criteria!"
    The file you are probably most interested in is pipeline/dtaselect/sequest/DTASelect-filter.txt. There is also the corresponding HTML file for viewing the results. The command
    $ cp pipeline/dtaselect/sequest/DTASelect.html ~/public_html/tutorial.html
    will put the HTML file in your proteome web directory (see Before you begin) and you can view it by pointing your web browser to proteome.gs.washington.edu/~<username>/tutorial.html. Don't forget to change <username> to your proteome login name.
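
The copy in step 8 can be wrapped in a small function that also checks that the report and the web directory exist. This is only a sketch: the function name publish_results is made up for this illustration, and it assumes that you run it on proteome so that your local login name is your proteome login name. The paths and the web address pattern are the ones given in the steps above.

```shell
# Hypothetical helper (not part of hermie): copy a DTASelect HTML report into
# a web directory and print the address where it should appear.
publish_results() {
    results_dir="${1:-pipeline}"            # directory created by hermie
    page_name="${2:-tutorial.html}"         # file name to use on the web
    web_dir="${3:-$HOME/public_html}"       # your proteome web directory
    report="$results_dir/dtaselect/sequest/DTASelect.html"

    if [ ! -f "$report" ]; then
        echo "No report found at $report" >&2
        return 1
    fi
    if [ ! -d "$web_dir" ]; then
        echo "Web directory $web_dir does not exist" >&2
        return 1
    fi
    # Assumes the local login name matches the proteome login name.
    cp "$report" "$web_dir/$page_name" &&
        echo "proteome.gs.washington.edu/~$(id -un)/$page_name"
}

# Example: publish_results pipeline tutorial.html
```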

Choosing a search mode

The search mode defines which steps are performed and the arguments passed to the programs at each step. There are five defined search modes: high-res-perc, standard, standard-perc, lib-as-filter, and lib-as-filter-perc. In most cases, you will want to use standard-perc. This mode runs BiblioSpec, SEQUEST, percolator, DTASelect (on both the BiblioSpec and SEQUEST search results), and update-library. The standard mode does not run percolator. Unless you have a compelling reason not to use it, you should probably choose to run percolator. The other critical difference between standard and standard-perc is the set of options passed to DTASelect. Without percolator, DTASelect chooses good matches based on features like XCorr and deltaCn. Percolator inserts its new score into the Sp field, so DTASelect must be configured to ignore all of the usual features and select primarily on the Sp score.

The mode high-res-perc is like standard-perc, but it is intended for high-resolution MS1 data. It runs two additional steps, Hardklor and Bullseye, which identify a more accurate precursor m/z and charge and filter out MS/MS spectra that do not appear to be derived from a peptide. To use this mode you must have an .ms1 file for each .ms2 file.
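
Before starting a high-res-perc run, it can help to confirm that no .ms1 file is missing. The sketch below assumes that each .ms1 file sits in the same directory as its .ms2 file and shares its base name (for example, run.ms2 and run.ms1). That pairing rule is an assumption made for this illustration, as is the function name missing_ms1.

```shell
# Hypothetical helper (not part of hermie): list .ms2 files in a directory
# that have no .ms1 file with the same base name beside them.
missing_ms1() {
    dir="${1:-.}"
    for ms2 in "$dir"/*.ms2; do
        [ -e "$ms2" ] || continue          # the directory has no .ms2 files
        ms1="${ms2%.ms2}.ms1"              # swap the extension
        [ -e "$ms1" ] || echo "$ms2"       # report files lacking a partner
    done
}

# Example: missing_ms1 .
```

If the function prints nothing, every .ms2 file it looked at has a matching .ms1 file.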

The other modes are lib-as-filter and lib-as-filter-perc. These modes run all the same steps as the two standard modes. The difference is that with lib-as-filter the library search results are used to limit which spectra are searched by SEQUEST. Any spectrum with a good library match is considered to have been identified, and only those without good hits are passed on to SEQUEST to be searched. The other major difference is that the search results will be combined by DTASelect. In theory, filtering out the easily identified spectra should speed up the SEQUEST search. However, even in the best circumstances, only a small fraction of the spectra will be identified by the library search, so the time savings are marginal.

These modes are meant to cover typical use of hermie and may not meet your specific needs. Any combination of features can be put together in a customized mode. See Custom Modes for more details.


Using MS2 files from different locations

In the above example, we put the MS2 files we wanted to analyze in our current working directory. This, however, is not always convenient, so hermie has two additional ways of looking for input files. You may specify any number of MS2 files or directories containing MS2 files on the command line. Consider this example:

$ hermie yeast standard ~/research/runs/best-run.ms2 ~/research/others/ ../another.ms2

This will analyze the two files best-run.ms2 and another.ms2 as well as all of the MS2 files in ~/research/others and NONE of the MS2 files in the current directory. (To also include the ones in the current directory, add . to the list of files.)
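
The selection rule can be illustrated with a few lines of shell. The function below only mimics the behavior described above; it is not hermie's own code, and the name list_inputs is made up. It assumes that a directory contributes only the spectrum files directly inside it (no subdirectories) and that .cms2 and .bms2 files are picked up along with .ms2 files.

```shell
# Illustration only (not hermie's code): print the files that the rule
# described above would select for a given argument list.
list_inputs() {
    for arg in "$@"; do
        if [ -d "$arg" ]; then
            # A directory contributes the spectrum files directly inside it.
            for f in "${arg%/}"/*.ms2 "${arg%/}"/*.cms2 "${arg%/}"/*.bms2; do
                if [ -e "$f" ]; then
                    echo "$f"
                fi
            done
        elif [ -f "$arg" ]; then
            # A file is taken exactly as given.
            echo "$arg"
        else
            echo "not found: $arg" >&2
        fi
    done
}

# Example:
# list_inputs ~/research/runs/best-run.ms2 ~/research/others/ ../another.ms2
```

To see what hermie itself will use, add the -check-setup option to your command. As shown in the example run, its preview includes the list of MS2 files.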


Running longer analyses

As you well know, SEQUEST can take hours or days to complete. While you and hermie are waiting for the results, the terminal window in which you are running hermie is occupied. You cannot work in that terminal and if you close it, the run will stop. There are several ways of dealing with this.
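
One general UNIX approach, not specific to hermie and offered here only as a sketch, is to start the command with nohup in the background so that it keeps running after the terminal is closed. The function name run_detached and the output file name in the example are made up for this illustration. Whether this approach suits the cluster setup on proteome is an assumption you should check.

```shell
# Hypothetical helper: run any command in the background, immune to hangups,
# with everything it prints saved to a file. Prints the process ID.
run_detached() {
    out_file="$1"
    shift
    nohup "$@" > "$out_file" 2>&1 < /dev/null &
    echo "$!"
}

# Example: run_detached hermie-screen.txt hermie yeast standard-perc
```

The terminal is free again as soon as the function returns, and the text hermie would have printed to the screen ends up in the file you named.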

Checking on your run's progress

There are several ways to check on your run.
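
As described in the example run, hermie writes log files (log, log-1, and so on) in the directory where it was started. One possible way to follow a run is to find the newest of those files and watch it with tail. The sketch below assumes that the most recently modified log file belongs to the current run. The function name newest_log is made up for this illustration.

```shell
# Hypothetical helper: print the name of the most recently modified entry
# whose name starts with "log" in the given directory (default: current).
newest_log() {
    dir="${1:-.}"
    ls -td "$dir"/log* 2>/dev/null | head -n 1
}

# Example: follow the newest log as hermie writes to it (Ctrl-C to stop)
# tail -f "$(newest_log)"
```

The qstat command from Before you begin can also show whether jobs are still waiting or running on the cluster.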

