Hardklör

Tutorials


Preparing your data

Hardklör can read a variety of data files, but many mass spectrometers output data in proprietary formats. To solve this issue, spectrum data files are often converted to a common format. One standard is the mzXML format, which Hardklör can read. Instructions on how to configure Hardklör to read an mzXML file are given in the "Config file setup" tutorial.

If you are working with a Thermo Fisher Scientific instrument, your data is stored in the .RAW format. We provide an additional tool, MakeMS2, that extracts .RAW spectra into easily accessible formats and can filter the spectra along the way. This tutorial covers the basics of converting a .RAW file to the .CMS1 (compressed MS1) format.
The MakeMS2 .RAW file conversion tool.

Step 1: Open MakeMS2 and select the files you wish to convert. You can also specify a new destination for the output files.
Step 2: Select the type of file you would like to output. In this case, we are choosing to output precursor ion scans, so MakeMS2 will filter the MS/MS scans out of the output file. If both MS1 and MS2 are selected, MakeMS2 will output two files: one containing precursor ion scans and the other containing the fragmentation scans.
Step 3: Apply additional filters. Since we want to shrink our output file as much as possible to save space, we selected "Binary Output" and "Compress Data". This produces a .CMS1 file. If neither was selected, we would output a standard .MS1 file, which is in ASCII text format, but would occupy a lot more disk space. Selecting only "Binary Output" produces a .BMS1 file, which is smaller than a .MS1 because it is in binary format, but larger than .CMS1 because it is not compressed. Additionally, we can filter our spectra to only output data between specific m/z values and retention times.
Step 4: Execute. Simply press the "Execute" button to start the file conversion. Status bars indicate how far the conversion has progressed. Once it has completed, your files are ready for use by Hardklör.
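
The relationship between the Step 3 checkboxes and the resulting file type can be summarized in a small lookup. The Python sketch below is purely illustrative: ms1_extension is a hypothetical helper, not part of MakeMS2. The steps above do not describe selecting "Compress Data" without "Binary Output", so that combination is rejected here as an assumption.

```python
def ms1_extension(binary_output: bool, compress_data: bool) -> str:
    """Return the MS1 file extension implied by the two MakeMS2 options."""
    if binary_output and compress_data:
        return ".CMS1"  # binary and compressed: smallest file
    if binary_output:
        return ".BMS1"  # binary, not compressed
    if compress_data:
        # Assumption: this combination is not described in the tutorial.
        raise ValueError("compression without binary output is not covered here")
    return ".MS1"  # plain ASCII text: largest file


print(ms1_extension(True, True))
```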

Config file setup

For examples of full configuration files, please refer to the section on Sample Config Files. The following tutorial only highlights the most widely used features of the configuration files.

First, configuration (or simply config) files must be ASCII text files. On Windows systems, these files typically end with the .txt extension, but any file name or extension is acceptable. You can create and edit your config files with any simple text editor, such as Notepad or WordPad on Windows, or vi, vim, nano, pico, and numerous other tools on GNU/Linux.

Typically all config files should start with two important lines:

       -mdat \your_path\ISOTOPE.DAT
       -hdat \your_path\Hardklor.dat
      
These lines are paths to the data files required by Hardklör for operation. This is particularly important if you store Hardklör in a central location (such as /usr/bin/ or C:\Program Files\), but wish to operate it on mass spectrometry files stored in a different folder. In the above example, simply replace "your_path" with the path of your ISOTOPE.DAT and Hardklor.dat files.

Also essential to every config file is a line or several lines of execution. In their basic form, lines of execution need only an input file and an output file:

       MyData1.ms1 MyResults1.txt
       MyData2.ms1 MyResults2.txt
      
In the example above, two files will be analyzed: MyData1.ms1 and MyData2.ms1. MyData1.ms1 will be analyzed first and the results will be stored in a tab-delimited text file called MyResults1.txt. After analysis is finished on this first file, analysis will start again on MyData2.ms1 and will be stored in MyResults2.txt.

To optimize your analysis, it will be necessary to set parameters for your data analysis. Parameters are represented in two ways: Global and Local. Global parameters are applied to every data file being analyzed. Local parameters are applied to only one data file. You can mix your parameters so that some are applied globally, but others are applied locally.

Global parameters are represented as lines starting with a dash. You may put only one global parameter per line in your config file:

       -corr 0.90
       -d 3
      
The lines above set two parameters globally. The first sets the correlation threshold to 0.90. The second sets the maximum deconvolution depth to 3 overlapping peptides. These two parameters will be applied to every line of execution that comes after them.

Local parameters are set on the same line as the lines of execution. You can have as many parameters on a single line as necessary in these cases:

       MyData1.ms1 MyResults1.txt -corr 0.90 -d 3
       MyData2.ms1 MyResults2.txt -corr 0.95 -d 3
      
In the example above, two lines of execution are shown to have different parameters. MyData1.ms1 will be analyzed with a correlation threshold of 0.90 and a maximum depth of 3. MyData2.ms1 will be analyzed with a correlation threshold of 0.95 and a maximum depth of 3.

Notice in the last example that both MyData1.ms1 and MyData2.ms1 are analyzed with a maximum depth of 3. It would therefore be more efficient to set this parameter globally, but keep the correlation threshold parameters local. This can be done as shown in the following example:

       -d 3
       MyData1.ms1 MyResults1.txt -corr 0.90
       MyData2.ms1 MyResults2.txt -corr 0.95
      
In order for the maximum depth of 3 to be applied to both data files, it must be set before the lines of execution.
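
When many data files are involved, a config file with this layout can be generated rather than typed by hand. The following Python sketch reproduces the last example above; build_config and its arguments are hypothetical names chosen for illustration, and it assumes file names contain no spaces.

```python
def build_config(global_params, runs, mdat=None, hdat=None):
    """Return the text of a config file.

    global_params: dict such as {"-d": 3}, written one per line before the runs.
    runs: list of (input_file, output_file, local_params) tuples.
    mdat, hdat: optional paths for the -mdat and -hdat lines.
    """
    lines = []
    if mdat is not None:
        lines.append(f"-mdat {mdat}")
    if hdat is not None:
        lines.append(f"-hdat {hdat}")
    # Global parameters: one per line, placed before the lines of execution.
    for flag, value in global_params.items():
        lines.append(f"{flag} {value}")
    # Lines of execution: input file, output file, then any local parameters.
    for in_file, out_file, local_params in runs:
        parts = [in_file, out_file]
        for flag, value in local_params.items():
            parts += [flag, str(value)]
        lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"


config_text = build_config(
    {"-d": 3},
    [
        ("MyData1.ms1", "MyResults1.txt", {"-corr": "0.90"}),
        ("MyData2.ms1", "MyResults2.txt", {"-corr": "0.95"}),
    ],
)
print(config_text)
```

The correlation thresholds are passed as strings so that "0.90" is written exactly as given instead of being shortened to 0.9.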

Now that you have a basic understanding of how a config file works, you might want to know what the most commonly used parameters are:

       -a
       -cdm
       -chMin
       -chMax
       -corr
       -d
       -p
       -res
       -sn
       -snWin
       -win
      
For optimal Hardklör performance, you should always set the above parameters to values appropriate for your data. Details for each of these parameters can be found in the documentation on this website. These are the most important parameters because they 1) characterize your data files, 2) specify what you want found in your data files, and 3) set the basic thresholds for finding isotope distributions. Typically, these parameters are set globally when analyzing several data files from the same instrument.

For the fastest performance, two parameters should always be set as follows:

       -a FastFewestPeptides
       -cdm Q
      
-a FastFewestPeptides tells Hardklör to return the fewest overlapping peptides that fit a distribution, and to perform the analysis using as much memory as possible to maximize speed. This parameter should be changed only if the system has low memory. -cdm Q tells Hardklör to use the QuickCharge algorithm to find the charge state. This is much faster than other computation methods such as a fast Fourier transform.

Additionally, increasing the values of the -chMax, -d, or -p parameters increases computation time, so it is recommended not to set them higher than necessary for your analysis. Decreasing the value of -snWin also increases computation time.


Hardklör execution

Hardklör does not have a graphical user interface. It is available as a command-line binary for Windows and Linux. It is compiled for 32-bit processors, but supports 64-bit m/z precision and extended file sizes (>2GB).

There are two ways to run Hardklör from the command line. The first is to specify the data file and parameters you wish to use. The second is to specify a configuration file that contains parameters and a list of one or more files to analyze. The configuration file is the recommended method for two reasons: 1) it is simpler to set the large number of parameters this way, and 2) it allows you to batch multiple analyses without user interaction.

To execute Hardklör, simply type the following command from your command prompt:

       Hardklor -conf your_config.txt
      
Here, your_config.txt is the name of your configuration file. Please see the section on Sample Config Files or the config file setup tutorial for instructions on how to make a configuration file.
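
If you would rather launch Hardklör from a script than type the command by hand, the same invocation can be wrapped as shown in this minimal Python sketch. The function names are hypothetical, and the sketch assumes the executable is called Hardklor and can be found on your PATH; nothing is executed if it cannot be found.

```python
import shutil
import subprocess


def hardklor_command(config_path, executable="Hardklor"):
    """Build the argument list for `Hardklor -conf <config_path>`."""
    return [executable, "-conf", config_path]


def run_hardklor(config_path, executable="Hardklor"):
    """Run Hardklör and return its exit code, or None if it is not on PATH."""
    if shutil.which(executable) is None:
        return None  # binary not found; nothing is executed
    return subprocess.run(hardklor_command(config_path, executable)).returncode


print(hardklor_command("your_config.txt"))
```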


Special optimization: mzXML files

Hardklör reads mzXML files the same way as any other file, and you can set your Hardklör parameters for an mzXML file just as you would for any other file. A distinctive feature of mzXML files, however, is that they can contain mixed levels of mass spectrometer data. For example, an mzXML file can have MS data and MS/MS data mixed into the same file. It may be desirable to analyze only one level or the other, or to separate the Hardklör analysis into two files. This can be done with the -mF parameter, which only functions for mzXML files. Here is an example of how to use the -mF parameter:
       MyData.mzXML Results_MS.txt -mF MS1
       MyData.mzXML Results_MSMS.txt -mF MS2
      
In the above example, MyData.mzXML contains both MS and MS/MS data. The lines of execution tell Hardklör to analyze just the MS data first, and put the results in the Results_MS.txt file. Then it will analyze the MS/MS data and put the results in a different file, Results_MSMS.txt.
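
For a batch of mzXML files, the pair of execution lines can be generated per file. The Python sketch below is an illustration only: split_mzxml_runs is a hypothetical helper, and the output naming scheme (input name plus the MS level) is an assumption rather than a Hardklör convention.

```python
from pathlib import PurePath


def split_mzxml_runs(mzxml_file):
    """Return two lines of execution, one per MS level, for one mzXML file."""
    stem = PurePath(mzxml_file).stem  # "MyData.mzXML" -> "MyData"
    return [
        f"{mzxml_file} {stem}_{level}.txt -mF {level}"
        for level in ("MS1", "MS2")
    ]


for line in split_mzxml_runs("MyData.mzXML"):
    print(line)
```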


Reading Hardklör results

Hardklör results are output in a tab-delimited ASCII text file. Below is an excerpt of a typical output file and a key for interpreting the results:



Hardklör output has two line types: spectrum lines, designated with an "S", and protein lines, designated with a "P". There is a spectrum line for every spectrum in the data file Hardklör has analyzed. The protein lines under any spectrum line are the protein or peptide isotope distributions that Hardklör identified in that spectrum.

Here is a breakdown of a spectrum line:
A=Spectrum Tag
B=Scan Number
C=Retention Time
D=File Name

Here is a breakdown of a protein line:

A=Protein Tag
B=Mass
C=Charge State
D=Intensity of Base Peak
E=M/Z of Base Peak
F=Spectrum Window
G=Signal-To-Noise Threshold
H=Modifications
I=Dot Product Score

Please note the following:
  • The mass is the uncharged (zero-charge) monoisotopic mass of the protein or peptide.
  • The Base Peak refers to the base isotope peak of the model used to predict the protein or peptide.
  • The Signal-To-Noise threshold refers to the relative abundance a peak must exceed in the Spectrum Window to be considered in the scoring algorithm.
  • Modifications refer to deviations from the averagine model. If there are no modifications, the column is marked with an underscore.
  • The Dot Product Score applies to all predictions in a given Spectrum Window. Thus, if two protein or peptide predictions share the same Spectrum Window, then they have a single Dot Product Score that is the score of their combined peaks.
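
Because the output is tab-delimited, it is straightforward to load into a script. The Python sketch below groups protein lines under the spectrum line that precedes them. It assumes the columns appear in the order of the keys above (A through D for spectrum lines, A through I for protein lines); the field names are hypothetical labels, and the sample lines are made up for illustration and are not real Hardklör output. Check the column order against your own results before relying on it.

```python
SPECTRUM_FIELDS = ["tag", "scan_number", "retention_time", "file_name"]
PROTEIN_FIELDS = [
    "tag", "mass", "charge", "base_peak_intensity", "base_peak_mz",
    "spectrum_window", "sn_threshold", "modifications", "dot_product",
]


def parse_results(lines):
    """Group protein ("P") lines under the spectrum ("S") line before them."""
    spectra = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == "S":
            spectrum = dict(zip(SPECTRUM_FIELDS, cols))
            spectrum["proteins"] = []
            spectra.append(spectrum)
        elif cols[0] == "P" and spectra:
            spectra[-1]["proteins"].append(dict(zip(PROTEIN_FIELDS, cols)))
    return spectra


# Made-up lines for illustration only; values are kept as strings.
sample = [
    "S\t1\t0.01\tMyData1.ms1\n",
    "P\t1500.75\t2\t12000\t751.38\t750.0-754.0\t0.5\t_\t0.98\n",
    "S\t2\t0.02\tMyData1.ms1\n",
]
parsed = parse_results(sample)
print(len(parsed), "spectra;", len(parsed[0]["proteins"]), "protein line in the first")
```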


Auxiliary programs

Currently, there is only one auxiliary program available for post-Hardklör analysis, named Krönik. This tool allows you to reduce your Hardklör results to just the persistent peptide isotope distributions (PIDs). This is the same tool that was used in the Hardklör publication.

Krönik must be run from the command line. It is available for both Windows and GNU/Linux. For a quick help guide, type kronik at your command prompt and press Return. In the simplest case, Krönik can be run with the following command:

kronik hardklor_results.txt output.txt

where you can substitute your own file names for hardklor_results.txt and output.txt.

Additionally, you can adjust some parameters for Krönik. Parameters must be set on the command line prior to your Hardklör results file. There are currently four parameters as follows:
-c Sets the threshold for contaminants. If a peptide isotope distribution persists for more than this number of scans, it is excluded from further analysis. Default: 150
-m Sets the maximum mass allowed in daltons. If a peptide isotope distribution's mass is greater than this number, it is excluded from further analysis. Default: 8000
-n Sets the minimum mass allowed in daltons. If a peptide isotope distribution's mass is less than this number, it is excluded from further analysis. Default: 600
-p Sets the mass accuracy in parts per million (ppm). Peptides in adjacent scans must differ in mass by less than this amount to be called the same. Default: 10
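
The ordering rule (parameters first, then the results file, then the output file) can be captured when building the command from a script. In the Python sketch below, kronik_command is a hypothetical helper, and it assumes each flag and its value are passed as separate, space-delimited arguments; consult the Krönik README for the exact syntax.

```python
def kronik_command(results_file, output_file, **params):
    """Build a Krönik argument list with parameters before the results file.

    params maps a one-letter flag name to its value, e.g. c=150, p=10.
    """
    allowed = {"c", "m", "n", "p"}
    args = ["kronik"]
    for name, value in params.items():
        if name not in allowed:
            raise ValueError(f"unknown Krönik parameter: -{name}")
        # Assumption: flag and value are separate arguments.
        args += [f"-{name}", str(value)]
    return args + [results_file, output_file]


print(" ".join(kronik_command("hardklor_results.txt", "output.txt", c=150, p=10)))
```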

A persistent PID is defined as a set of PIDs that:
1. have identical charge states.
2. have identical mass (within a parts per million mass tolerance).
3. are contained in consecutive spectra (at least 3 of 4 consecutive scan events).
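
The second and third criteria can be made concrete with a short Python sketch. This illustrates the definition above and is not Krönik's actual implementation: the function names are hypothetical, the ppm difference is taken relative to the first mass, and scan events are assumed to be numbered with consecutive integers.

```python
def within_ppm(mass_a, mass_b, ppm=10.0):
    """True if the two masses differ by less than `ppm` parts per million."""
    return abs(mass_a - mass_b) / mass_a * 1e6 < ppm


def persists(scan_numbers, needed=3, window=4):
    """True if `needed` observations fall within `window` consecutive scans."""
    scans = sorted(set(scan_numbers))
    for i in range(len(scans) - needed + 1):
        # scans[i] .. scans[i + needed - 1] are `needed` distinct scans;
        # they fit in one window if their span is below the window length.
        if scans[i + needed - 1] - scans[i] < window:
            return True
    return False


print(within_ppm(1500.000, 1500.010))  # about 6.7 ppm apart
print(persists([10, 11, 13]))          # three of scans 10 to 13
```

With the default tolerance of 10 ppm, two masses near 1500 Da must agree to within about 0.015 Da to be treated as the same.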

A detailed README file is included in the download of Krönik.


Hardklör is Copyright ©2007 University of Washington. All rights reserved. Written by Michael R. Hoopmann, Michael J. MacCoss, in the Department of Genome Sciences at the University of Washington.