Analyzing a large set of LC-MS technical replicates in a differential proteomics experiment involves many data files, and not all of them can be held in memory at once. CRAWDAD therefore stores binned LC-MS data in a binary file format that allows extracted ion chromatograms (XICs) to be retrieved quickly for statistical analysis. Existing file formats such as mzXML are organized around scans, which motivated the development of a format that stores data as XICs in binary form for rapid access. The format is designed for data that is binned in the m/z dimension, so the data is stored as a binary matrix. The matrix can be stored in XIC row-order (default) or scan row-order, depending on which type of extraction needs to be fast.
Binary values are currently encoded in little-endian format. Support for big-endian (network) byte order is planned.
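To make the two row orders concrete, here is a minimal Python sketch of pulling one XIC out of a binned matrix stored either way. It assumes 4-byte little-endian IEEE 754 floats and a plain row-major matrix with no padding. The value size, the function name read_xic, and its parameters are illustrative assumptions and not part of the MSMAT definition.

```python
import io
import struct

FLOAT_SIZE = 4  # assumption: each matrix cell is a 4-byte float


def read_xic(stream, data_start, n_bins, n_scans, bin_index, xic_order=True):
    """Return the intensities of one m/z bin across all scans.

    data_start is the byte offset of the first matrix cell (hypothetical
    parameter; in practice it would be the first byte after the header).
    """
    if xic_order:
        # XIC row-order: each row is one XIC, so a single seek and a single
        # contiguous read are enough.
        stream.seek(data_start + bin_index * n_scans * FLOAT_SIZE)
        raw = stream.read(n_scans * FLOAT_SIZE)
        return list(struct.unpack("<%df" % n_scans, raw))
    # Scan row-order: each row is one scan, so the XIC is spread over the
    # file with a stride of n_bins cells and needs one seek per scan.
    values = []
    for scan in range(n_scans):
        stream.seek(data_start + (scan * n_bins + bin_index) * FLOAT_SIZE)
        values.append(struct.unpack("<f", stream.read(FLOAT_SIZE))[0])
    return values


# Two bins by three scans, stored both ways.
xic_major = struct.pack("<6f", 1, 2, 3, 4, 5, 6)
scan_major = struct.pack("<6f", 1, 4, 2, 5, 3, 6)
xic_a = read_xic(io.BytesIO(xic_major), 0, 2, 3, 1, xic_order=True)
xic_b = read_xic(io.BytesIO(scan_major), 0, 2, 3, 1, xic_order=False)
```

Both calls return the same XIC. The difference is the access pattern: one contiguous read in XIC row-order versus one seek per scan in scan row-order.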
MSMAT files begin with a plaintext line of '_MSMAT_V#_' where # is a version number. Lines in the header are terminated with the '\n' character (UNIX newline). A series of header fields follows, stored as:
FIELD_NAME:BYTE_LEN,BINARY_DATA\n
Where FIELD_NAME is a plaintext label for a field (fields defined below), BINARY_DATA is binary data, and BYTE_LEN the length of the binary data. The field is terminated with a newline character.
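As an illustration of the field layout, the following Python sketch serializes a single header field. It assumes that BYTE_LEN is written as plaintext decimal digits and counts only the bytes of BINARY_DATA, not the terminating newline. Both points are assumptions, as are the function name encode_field and the field label "comment".

```python
def encode_field(name, payload):
    # Serialize one header field as FIELD_NAME:BYTE_LEN,BINARY_DATA plus newline.
    label = name.encode("ascii")
    if b":" in label or b"\n" in label:
        raise ValueError("field name must not contain ':' or a newline")
    # Assumption: BYTE_LEN is ASCII decimal and excludes the final newline.
    length = str(len(payload)).encode("ascii")
    return label + b":" + length + b"," + payload + b"\n"


# The payload may itself contain newline bytes; the explicit length lets a
# reader consume it without scanning for a terminator.
encoded = encode_field("comment", b"hi\nthere")
```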
The header is terminated with the token '_END_MSMAT_HEADER_' followed by a newline character. Binary data encoding scans or XICs follows until the end of the file.
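The header layout described above can be read with a short loop. The Python sketch below is one possible reader, not the CRAWDAD parser. It assumes that BYTE_LEN is plaintext decimal, that it counts only the BINARY_DATA bytes, and that field labels never contain ':' or a newline. The names read_msmat_header and _read_until are hypothetical.

```python
import io

MAGIC_PREFIX = b"_MSMAT_V"
END_TOKEN = b"_END_MSMAT_HEADER_"


def _read_until(stream, stops):
    """Read byte by byte until one of `stops` appears; return (bytes, stop)."""
    out = bytearray()
    while True:
        ch = stream.read(1)
        if not ch:
            raise EOFError("file ended inside the MSMAT header")
        if ch in stops:
            return bytes(out), ch
        out += ch


def read_msmat_header(stream):
    """Return (version, fields) and leave the stream at the first data byte."""
    first, _ = _read_until(stream, (b"\n",))
    if not (first.startswith(MAGIC_PREFIX) and first.endswith(b"_")):
        raise ValueError("not an MSMAT file")
    version = first[len(MAGIC_PREFIX):-1].decode("ascii")
    fields = {}
    while True:
        label, stop = _read_until(stream, (b":", b"\n"))
        if stop == b"\n":
            if label == END_TOKEN:
                return version, fields
            raise ValueError("header line without a ':' separator")
        length_text, _ = _read_until(stream, (b",",))
        byte_len = int(length_text.decode("ascii"))
        # Read the payload by length, since it may contain newline bytes.
        payload = stream.read(byte_len)
        if len(payload) != byte_len or stream.read(1) != b"\n":
            raise ValueError("field %r is truncated or unterminated" % label)
        fields[label.decode("ascii")] = payload


raw = (b"_MSMAT_V1_\n"
       b"demo:4,\x00\x01\n\x02\n"
       b"_END_MSMAT_HEADER_\n"
       b"BODY")
stream = io.BytesIO(raw)
version, fields = read_msmat_header(stream)
rest = stream.read()
```

Reading each payload by its declared length, instead of searching for the next newline, is what allows arbitrary binary data inside a newline-terminated header.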
MSMAT header fields can encode strings, arrays of floats, associative arrays of floats, or associative arrays of strings to floats.
100,60<1>abc<0>def<0>ghi<0>30<1>flt1flt2flt3
total_bytes,string_bytes<1>str1<0>str2<0>float_bytes<1>float1float2

total_bytes -- the complete size of this field, including all separator characters
string_bytes -- the length in bytes of the string field, terminated with a character of value one (i.e. (char)1 in C)
float_bytes -- the length in bytes of the float field, terminated with a (char)1 character
str1, str2 -- byte strings terminated with a null character
float1, float2 -- floating point values, not terminated due to fixed size
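A decoder for the string-to-float layout above might look like the following Python sketch. It works on the part of the field after the 'total_bytes,' prefix and assumes that string_bytes and float_bytes are plaintext decimal and that each float is a 4-byte little-endian IEEE 754 value. The float width and the function name decode_string_float_map are assumptions made for illustration.

```python
import struct


def decode_string_float_map(data, float_fmt="f"):
    """Decode string_bytes<1>strings float_bytes<1>floats into a dict."""
    # Length of the string block, terminated by a byte of value one.
    sep = data.index(b"\x01")
    string_bytes = int(data[:sep].decode("ascii"))
    start = sep + 1
    str_block = data[start:start + string_bytes]
    if len(str_block) != string_bytes or not str_block.endswith(b"\x00"):
        raise ValueError("string block is truncated or not null-terminated")
    # Every string ends with a null byte, so drop the empty trailing piece.
    keys = [s.decode("ascii") for s in str_block.split(b"\x00")[:-1]]

    # Length of the float block, again terminated by a byte of value one.
    pos = start + string_bytes
    sep = data.index(b"\x01", pos)
    float_bytes = int(data[pos:sep].decode("ascii"))
    flt_block = data[sep + 1:sep + 1 + float_bytes]
    count = float_bytes // struct.calcsize("<" + float_fmt)
    values = struct.unpack("<%d%s" % (count, float_fmt), flt_block)
    if len(keys) != len(values):
        raise ValueError("number of strings and floats differ")
    return dict(zip(keys, values))


payload = (b"12\x01abc\x00def\x00ghi\x00"
           b"12\x01" + struct.pack("<3f", 1.0, 2.5, -3.0))
mapping = decode_string_float_map(payload)
```

The floats need no separators because their size is fixed, so the count follows directly from float_bytes.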
The ordering of header fields is not defined. New fields can be introduced -- the existing parser skips any field label that it does not recognize and continues until the '_END_MSMAT_HEADER_' token is encountered.
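The forward compatibility described here can be sketched as a lookup table of decoders keyed by field label. In the Python example below, raw_fields stands for a mapping from field labels to raw payload bytes, however it was obtained. The function name decode_known_fields and the labels "comment" and "future_field" are hypothetical.

```python
def decode_known_fields(raw_fields, decoders):
    """Decode recognized fields and report, but otherwise ignore, the rest."""
    decoded = {}
    skipped = []
    # Iteration order does not matter, since field order is not defined.
    for label, payload in raw_fields.items():
        decoder = decoders.get(label)
        if decoder is None:
            skipped.append(label)  # unknown label: pass over it
            continue
        decoded[label] = decoder(payload)
    return decoded, skipped


raw_fields = {"future_field": b"\x00\x01", "comment": b"hello"}
decoders = {"comment": lambda b: b.decode("ascii")}
decoded, skipped = decode_known_fields(raw_fields, decoders)
```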
While the MSMAT header format does not have the ease of use or the fully self-documenting nature of XML, it retains extensibility. A field could also be stored as XML itself -- an example application would be an audit trail of actions performed on the file.
Data on the speedup of storing data as XICs rather than scans shall be presented.
Forthcoming