Running distributed SEQUEST ('run_ms_ssh') on the Sun Grid Engine

Scope:

This document gives instructions for setting up your environment to run SEQUEST jobs on the computers proteome, grid, and sage (for the quartz cluster). It covers only a few details of the SEQUEST search itself. You should need to follow these instructions only once per computer, when you first receive your account.

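Conventions:

Commands to type at the shell prompt are shown on lines beginning with '%'. Do not type the '%' itself.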

Instructions:

The program that runs SEQUEST is run_ms_ssh. It distributes SEQUEST jobs over a set of nodes, using ssh calls to farm out the processing of each spectrum. Most of this setup process deals with ensuring that ssh is working properly.

Your computer keeps track of which other computers you have connected to by keeping a list of host "keys", one for each computer. If you try to connect to a new computer, ssh will give you a warning and ask you to confirm (by typing 'yes') that you really do want to connect. Since you will not be monitoring run_ms_ssh, you will not be able to confirm any new connections. To get around this, we generate an ssh key for your account and record the key of each node (computer) in the cluster ahead of time.

  1. Log on to the computer on which you would like to run SEQUEST: proteome, grid, or sage (for access to quartz). Generate a key with this command:
    % ssh-keygen -t dsa

    .. press return when prompted to save the key in /home/user/.ssh/id_dsa

    .. press return when prompted for a passphrase

  2. Now you'll authorize your key for use on your own account. The following commands change to the directory where your key was stored, add the new public key to the file of authorized keys, and set the permissions on that file:
    % cd ~/.ssh
    % cp authorized_keys authorized_keys.bak
    % cat id_dsa.pub authorized_keys.bak > authorized_keys
    % chmod 644 authorized_keys
    % cd ..

    NOTE: The command 'cp authorized_keys authorized_keys.bak' is an optional step that backs up any existing keys. If you are a very new user, you may not have an authorized_keys file, in which case you will get an error saying "No such file or directory". Ignore the error and continue with the next command.

    Next, we need to make sure you are not prompted to verify the keys of the nodes. Run this script:
    % setup_keys.pl

    Troubleshooting: one possible source of failure is that your authorized_keys file is writable by your group or by others. Check by running
    % ls -l ~/.ssh/authorized_keys
    If the permissions read '-rw-r--r--', you should be fine. If there is a 'w' anywhere past the third position, it will not work. To fix this, run
    % chmod 644 ~/.ssh/authorized_keys

  3. Now check to see if the SEQUEST executable is in your path.
    % ssh m001 search27
    If it returns
    
     SEQUEST v.27 (rev. 9), (c) 1993
     Molecular Biotechnology, Univ. of Washington, J.Eng/J.Yates
     Licensed to John Yates III @ Univ. of Washington
    
     SEQUEST usage:  search27 [options] [dtafiles]
     options = -Dstring  where string specifies the database to be searched
               -Pstring  where string specifies an alternate parameter file name
                            (sequest.params is the default parameters file)
               -S        sets SEQUEST to not re-search .dta files if .out files exists
    
     For example:  sequest *.dta
    
    
    then everything is fine. You may need to replace 'm001' with a different computer name if you are not on proteome. Take a name from the list that was generated from setup_keys.pl (could be something like 'maccoss001' or 'q1').

    If you connect correctly, but get a 'command not found' warning, your $PATH environment variable will need to be set. Normally, $PATH would be set in your ~/.bashrc file, but this does not work for the non-interactive sessions used by SEQUEST. Please ask for assistance.

  4. If you will be running SEQUEST searches via the data analysis pipeline, that's all you need to do. Everything described in the next section will be taken care of automatically. However, if you want to run a SEQUEST search manually, see below.
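
Steps 2 and 3 above can also be collected into a few shell functions. What follows is a minimal sketch, not part of run_ms_ssh or setup_keys.pl: the function names authorize_key, key_perms_ok, and check_nodes are made up for illustration. It assumes the key file is named id_dsa.pub as in step 1, that ssh reports a connection failure with exit status 255, and that the remote shell reports a missing command with exit status 127. What search27 itself exits with when run without arguments is not known here, so any other status is treated as "found".

```shell
# Illustrative helpers for the ssh set-up steps (hypothetical names).

# authorize_key SSH_DIR
# Add SSH_DIR/id_dsa.pub to SSH_DIR/authorized_keys. An existing file is
# backed up first, and the key is not added twice.
authorize_key() {
    ssh_dir=$1
    pub="$ssh_dir/id_dsa.pub"
    auth="$ssh_dir/authorized_keys"
    [ -f "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    if [ -f "$auth" ]; then
        cp "$auth" "$auth.bak" || return 1
        # Whole-line, fixed-string match: is the key already authorized?
        if grep -qxFf "$pub" "$auth"; then
            chmod 644 "$auth"
            return
        fi
    fi
    cat "$pub" >> "$auth" || return 1
    chmod 644 "$auth"
}

# key_perms_ok FILE
# Succeed only if FILE exists and is not writable by group or others.
key_perms_ok() {
    [ -e "$1" ] || return 1
    perms=$(ls -ld "$1" | cut -c1-10)
    case $perms in
        ?????w????|????????w?) return 1 ;;   # 'w' in the group or other field
        *) return 0 ;;
    esac
}

# check_nodes COMMAND NODE...
# Try to run COMMAND on each node without any interactive prompt.
check_nodes() {
    cmd=$1
    shift
    failed=0
    for node in "$@"; do
        rc=0
        # BatchMode=yes makes ssh fail instead of asking questions;
        # -n stops ssh from reading standard input.
        ssh -n -o BatchMode=yes "$node" "$cmd" >/dev/null 2>&1 || rc=$?
        case $rc in
            255) echo "$node: ssh connection failed"; failed=1 ;;
            127) echo "$node: $cmd not found in PATH"; failed=1 ;;
            *)   echo "$node: ok" ;;
        esac
    done
    return $failed
}
```

For example, after loading the functions into your shell session:
    % authorize_key ~/.ssh
    % key_perms_ok ~/.ssh/authorized_keys || chmod 644 ~/.ssh/authorized_keys
    % check_nodes search27 m001
Unlike the cat command in step 2, authorize_key appends the key rather than putting it first, and it does not complain when authorized_keys does not exist yet. The order of keys in the file does not matter to ssh.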

Notes on running run_ms_ssh on the SGE:

SEQUEST, which is run by run_ms_ssh, requires two inputs: an MS2 file (or files) containing the spectra to be searched, and a file called 'sequest.params'. In order to run run_ms_ssh on the cluster, you will also need a script. These are described below.
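
Before submitting a job, the inputs named above can be checked with a short shell function. This is a minimal sketch under stated assumptions, not a tool that ships with SEQUEST: the names get_param and check_run_dir are made up, and the parsing assumes that each sequest.params setting sits on its own line as 'name = value', optionally followed by a comment starting with ';'. It checks the two settings discussed in step 1 of the list that follows (database_name and xcorr_mode) and that at least one .ms2 file is present.

```shell
# Illustrative pre-submission checks (hypothetical names).

# get_param PARAMS_FILE NAME
# Print the value of the first "NAME = value" line, without any trailing
# ';' comment or trailing whitespace. The file layout is an assumption.
get_param() {
    sed -n "s/^$2[[:space:]]*=[[:space:]]*//p" "$1" |
        sed -e 's/;.*//' -e 's/[[:space:]]*$//' |
        head -n 1
}

# check_run_dir DIR
# Report anything missing from a run directory; succeed if nothing is.
check_run_dir() {
    dir=$1
    params="$dir/sequest.params"
    problems=0

    # An unmatched glob stays as the literal pattern, which does not exist.
    set -- "$dir"/*.ms2
    if [ ! -e "$1" ]; then
        echo "no .ms2 files found in $dir"
        problems=1
    fi

    if [ ! -f "$params" ]; then
        echo "missing $params"
        return 1
    fi

    db=$(get_param "$params" database_name)
    if [ -z "$db" ] || [ ! -e "$db" ]; then
        echo "database_name '$db' does not exist"
        problems=1
    fi

    mode=$(get_param "$params" xcorr_mode)
    if [ "$mode" != "0" ]; then
        echo "xcorr_mode is '$mode', expected 0"
        problems=1
    fi

    return $problems
}
```

One way to use it is to submit only when the check passes, for example
    % check_run_dir . && qsub -cwd pe_sequest.sh
The check only confirms that the database file exists. Whether it is the database you want is still something to verify by eye.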

  1. Check that your sequest.params file makes sense. (The best way to get a sequest.params file is to copy one from someone else. If you are running hermie, one will be created for you.) Towards the top of the file is a line that begins with 'database_name = '. Make sure that the database listed actually exists and is the one you want to use. Also look for the line beginning with 'xcorr_mode =' and make sure it equals 0.
  2. Create a script to submit to the cluster. (If you are running hermie, one will be created for you, called something like seq.number.sh.) Below is a script you could use. Copy the text into a file called something like 'pe_sequest.sh'.
    #!/bin/sh
    #
    #$ -S /bin/sh
    #
    #$ -N your_job_name
    #
    # Parallel Environment Request
    #$ -pe mpich 8-16
    #
    echo "Got $NSLOTS slots."
    PATH=$SGE_O_PATH:$PATH
    run_ms_ssh -f $TMPDIR/machines *.ms2
    
    Change the word 'your_job_name' to some word (no spaces) which you would like to name the run. The numbers '8-16' decide how many nodes your job will request. You can change it to be a single number (like 4) or a range as it is here.
  3. SEQUEST likes all of its inputs to be in one directory. Create a directory in which you would like to do the search. Put all of the MS2 files (or links to the files) in that directory. Put the sequest.params file and the script file in the same directory.
  4. Start the run (from the directory you just created).
    % qsub -cwd pe_sequest.sh
    Or use whatever name you gave to the script in step 2. To check on the status of your run use the command
    % qstat
    This should give you a list of all the jobs running and waiting to run. Yours will have the name you gave it in the script and your user name. If it is running (look for an 'r') it will also list the number of nodes it is using.
  5. The output of SEQUEST is files ending in '.sqt'. The output of run_ms_ssh, which has a status report for each spectrum, should be in your run directory as a file named jobname.o### (jobname being the name of your run and ### being a process ID number). For every spectrum successfully searched, there is a line that says 'stat=0'. A spectrum that failed will have a line that says 'status=' and then some number other than 0. Check for failed searches with the command below, whose pattern allows optional spaces before the '='.
    % grep -c 'status *=' filename
    This will return the number of spectra that failed.
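
The check in step 5 can be wrapped in a small function that reports both counts at once. This is a minimal sketch; the name summarize_run is made up, and the patterns rest on the description in step 5: 'stat=0' marks a successful search and 'status=' followed by a non-zero number marks a failure. Optional spaces around the '=' are tolerated because the exact layout of the log is not confirmed here.

```shell
# Illustrative summary of a run_ms_ssh output file (hypothetical name).

# summarize_run LOGFILE
# Print how many spectra were searched successfully and how many failed.
# Succeeds only when LOGFILE is readable and no failures were found.
summarize_run() {
    log=$1
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }

    # Lines reporting stat=0 (grep -c prints 0 but exits 1 on no match).
    searched=$(grep -cE 'stat *= *0([^0-9]|$)' "$log" || true)

    # Lines reporting status=N where N is anything other than 0.
    failed=$(grep -E 'status *=' "$log" |
        grep -vcE 'status *= *0([^0-9]|$)' || true)

    echo "searched=$searched failed=$failed"
    [ "$failed" -eq 0 ]
}
```

For example, with the job output file from step 5:
    % summarize_run jobname.o###
If the failed count is not zero, the lines themselves can be listed with grep 'status *=' on the same file.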