parallel-crux

Usage:
parallel-crux [parallel options] sequest-search [search options] <ms2 file> <protein input>
Description:

Runs crux sequest-search distributed across a number of nodes in a cluster. The spectra in the ms2 file are divided into blocks (100 spectra per block by default; see --block-size) and each block is searched separately. The first n blocks are sent out to the n available nodes. As each node finishes, the next unsearched block is sent to it. When all blocks have completed, the results are concatenated into one set of output files.
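
The block arithmetic described above can be sketched in a few lines of sh. This is only an illustration: the helper names count_blocks and block_range are hypothetical and not part of parallel-crux, and numbering spectra by their 1-based position in the file is an assumption about how the split is done.

```shell
#!/bin/sh
# Hypothetical helpers, not part of parallel-crux.

# Number of blocks needed for a given spectrum count (integer ceiling
# division). The block size defaults to 100, the documented default.
count_blocks() {
    n_spectra=$1
    block_size=${2:-100}
    echo $(( (n_spectra + block_size - 1) / block_size ))
}

# First and last spectrum covered by block number $1, counting spectra by
# their 1-based position in the file (an assumption, not scan numbers).
block_range() {
    block=$1
    n_spectra=$2
    block_size=${3:-100}
    first=$(( (block - 1) * block_size + 1 ))
    last=$(( block * block_size ))
    if [ "$last" -gt "$n_spectra" ]; then
        last=$n_spectra
    fi
    echo "$first $last"
}

count_blocks 250      # prints 3
block_range 3 250     # prints "201 250"
```

With 250 spectra and 3 nodes, for example, all three blocks would go out in the first round; with more blocks than nodes, the remaining blocks wait until a node becomes free.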

Input:

sequest-search – The name of the crux command to run. See the crux documentation for more details. (Note: soon search-for-matches will also be available.)
<ms2 file> – The name of the file (in ms2 or cms2 format) from which to parse the spectra.

<protein input> – The name of the fasta file containing protein sequences or the directory containing a protein index from which to retrieve proteins and peptides.

Output:

The output files are the same as those for crux sequest-search. All files are put in the directory named crux-output.

Parallel Options:

--nodes <filename> – A file containing the names of the nodes available to the process.
--block-size <num spec searched per call> – The number of spectra searched at a time. Default 100.
Search Options:

All of the options available to crux sequest-search can be used with parallel-crux with a few exceptions.

DISABLED Search Options:

--output-dir <directory name> – The results are always written to a directory named crux-output in the current working directory.
--fileroot <file prefix> – The file prefix will be the root name of the .ms2 file searched.
--overwrite <T | F> – Overwrite is always TRUE. Existing results for the same .ms2 files will be erased.
--scan-number <range> – All scans will be searched.
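
Because overwrite is always on and the output location is fixed, it may be worth checking for earlier results before starting a run. The following is a minimal sketch of such a check, not something parallel-crux provides: the helper names expected_fileroot and warn_if_results_exist are hypothetical, and stripping a .cms2 extension the same way as .ms2 is an assumption.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers, not part of parallel-crux.

# Print the file prefix that results for an input file are expected to use:
# the root name of the spectrum file. Handling of .cms2 is an assumption.
expected_fileroot() {
    name=$(basename "$1")
    case $name in
        *.cms2) echo "${name%.cms2}" ;;
        *.ms2)  echo "${name%.ms2}" ;;
        *)      echo "$name" ;;
    esac
}

# Warn and return 1 if the output directory already exists, since existing
# results for the same spectrum file would be erased by a new run.
# $1 = spectrum file, $2 = output directory (defaults to crux-output).
warn_if_results_exist() {
    out_dir=${2:-crux-output}
    if [ -d "$out_dir" ]; then
        echo "warning: $out_dir exists; results for '$(expected_fileroot "$1")' may be erased" >&2
        return 1
    fi
    return 0
}

expected_fileroot data/two-spec.ms2    # prints two-spec
```

A job script could call warn_if_results_exist two-spec.ms2 before the parallel-crux line and stop, or move the old directory aside, when it returns a non-zero status.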
Using the cluster
Users are encouraged to run parallel-crux from a script that is submitted to a cluster/queue management program, specifically the Sun Grid Engine (SGE). A user-written script might look like this:
#!/bin/sh
#$ -S /bin/sh
#$ -N pcrux
#$ -pe mpich 3
parallel-crux --nodes $TMPDIR/machines sequest-search two-spec.ms2 test.fasta
The script (suppose it is named runpc.sh) would be submitted to the queue from a directory containing the files two-spec.ms2 and test.fasta with a command such as:
qsub -cwd runpc.sh
In the example script, the -S option selects the sh shell by its full path, /bin/sh. The -N option names the job pcrux. The -pe option requests 3 nodes through the mpich parallel environment. The last line of the script is the actual parallel-crux command. SGE writes a file named machines in the directory given by $TMPDIR; the file contains the names of the nodes assigned to this job. Passing the name of this file to parallel-crux with --nodes makes it respect the node assignment determined by the queuing system.
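
A slightly more defensive version of the job script might confirm that the machines file exists before launching the search. The sketch below is one way to do that, under the assumption that the file lists one node name per line (the format is not described here); the helper names count_nodes, check_nodes_file and run_search are hypothetical.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -N pcrux
#$ -pe mpich 3

# Hypothetical helpers, not part of parallel-crux or SGE.

# Count the non-empty lines in a nodes file.
# Assumes one node name per line.
count_nodes() {
    grep -c . "$1"
}

# Return 1 with a message on stderr if the nodes file is missing or empty;
# otherwise report how many entries it lists.
check_nodes_file() {
    if [ ! -s "$1" ]; then
        echo "nodes file '$1' is missing or empty" >&2
        return 1
    fi
    echo "using $(count_nodes "$1") node(s) listed in $1"
}

# Run the search from the example only when the nodes file looks usable.
run_search() {
    check_nodes_file "$1" || return 1
    parallel-crux --nodes "$1" sequest-search two-spec.ms2 test.fasta
}

# In a real job script, the last line would be:
#   run_search "$TMPDIR/machines"
```

Keeping the check in a function lets the script fail early with a clear message if it is run outside the queue, where $TMPDIR/machines is usually not present.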