MSMAT File Format

Overview

Differential proteomics experiments that analyze a large set of LC-MS technical replicates involve many data files, and not all of them can be held in memory at once. A binary file format was therefore chosen to store binned LC-MS data for CRAWDAD, so that extracted ion chromatograms (XICs) can be pulled out quickly for statistical analysis. Existing file formats such as mzXML are organized around scans, which motivated a format that stores data as XICs in binary form for rapid access. The format is designed for data that is binned in the m/z dimension, so the data is stored as a binary matrix. The matrix can be stored in XIC row-order (the default) or scan row-order, depending on which type of extraction needs to be fast.

Organization

MSMAT files consist of an identifying plaintext line, followed by a header that encodes the binned m/z values, the retention time values, and other metadata about the LC-MS run. Scans or XICs follow. Because every scan holds the same number of m/z bins and every XIC the same number of retention times, the basic unit of data has a fixed length. No index is needed: the offset of any scan or XIC can be calculated directly.
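
The offset calculation described above can be sketched as follows. This is a minimal illustration, not the CRAWDAD implementation: the function name row_offset and its parameters are hypothetical, and the 4-byte value size is an assumption, since this document does not state the width of a stored value.

```python
def row_offset(data_start, row_index, row_len, value_size=4):
    """Byte offset of one row (a scan or an XIC) in the data block.

    data_start: byte position where the binary matrix begins,
                i.e. just after the header terminator line
    row_index:  zero-based index of the scan or XIC wanted
    row_len:    number of values in every row
    value_size: bytes per stored value (assumed to be 4 here;
                the format text does not specify it)
    """
    if row_index < 0:
        raise ValueError("row_index must be non-negative")
    # Every row has the same length, so no index table is required.
    return data_start + row_index * row_len * value_size
```

In XIC row-order, row_index would be the m/z bin and row_len the number of retention times; in scan row-order, row_index would be the scan and row_len the number of m/z bins.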

Byte Encoding

Binary values are currently encoded in little-endian format. Support for big-endian (network) byte order is planned for future versions.
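
As an illustration of little-endian encoding, the sketch below packs and unpacks a row of values with Python's struct module. The helper names are hypothetical, and the choice of 4-byte floats is an assumption made for the example; this document does not state the stored value type.

```python
import struct

def pack_values(values):
    """Encode a sequence of numbers as little-endian 4-byte floats."""
    # "<" selects little-endian; "f" is a 4-byte float (assumed type).
    return struct.pack("<%df" % len(values), *values)

def unpack_values(raw):
    """Decode little-endian 4-byte floats back into a list."""
    count = len(raw) // 4
    return list(struct.unpack("<%df" % count, raw[:count * 4]))
```

Spelling out the "<" prefix keeps the result independent of the byte order of the machine that reads or writes the file.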

Header

MSMAT files begin with a plaintext line of '_MSMAT_V#_' where # is a version number. Lines in the header are terminated with the '\n' character (UNIX newline). A series of header fields follows, stored as:

FIELD_NAME:BYTE_LEN,BINARY_DATA\n

Where FIELD_NAME is a plaintext label for a field (fields defined below), BINARY_DATA is binary data, and BYTE_LEN the length of the binary data. The field is terminated with a newline character.

The header ends with the line '_END_MSMAT_HEADER_', terminated with a newline character. Binary data encoding scans or XICs follows until the end of the file.
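
The header layout described above could be read along the following lines. This is a sketch under stated assumptions, not a reference parser: it assumes BYTE_LEN is written as ASCII decimal digits, which this document does not specify, and the function names are hypothetical. Because BINARY_DATA may itself contain newline bytes, the payload is read by length rather than by scanning for '\n'.

```python
END_MARKER = b"_END_MSMAT_HEADER_"

def _read_until(stream, stops):
    """Read bytes up to one of `stops`; return (token, stop_byte)."""
    token = bytearray()
    while True:
        ch = stream.read(1)
        if not ch:
            raise ValueError("unexpected end of file in header")
        if ch in stops:
            return bytes(token), ch
        token += ch

def read_header(stream):
    """Parse an MSMAT header from a binary stream.

    Returns (version_line, fields, data_start), where fields maps
    each field name to its raw bytes and data_start is the offset
    of the first byte after the header terminator line.
    """
    magic, _ = _read_until(stream, (b"\n",))
    if not magic.startswith(b"_MSMAT_V"):
        raise ValueError("missing MSMAT identifying line")
    fields = {}
    while True:
        name, stop = _read_until(stream, (b":", b"\n"))
        if stop == b"\n":
            if name == END_MARKER:
                return magic.decode("ascii"), fields, stream.tell()
            raise ValueError("malformed header line: %r" % name)
        length_text, _ = _read_until(stream, (b",",))
        # Assumption: BYTE_LEN is ASCII decimal text.
        length = int(length_text.decode("ascii"))
        data = stream.read(length)
        if len(data) != length or stream.read(1) != b"\n":
            raise ValueError("truncated or unterminated field")
        fields[name.decode("ascii")] = data
```

The returned data_start is the position from which scan or XIC offsets would be measured.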

Comments

While the MSMAT header format lacks the ease of use and fully self-documenting nature of XML, it remains extensible. A field could also hold XML itself; one example application would be an audit trail of actions performed on the file.

Data on the speedup of storing data as XICs rather than scans shall be presented.

Example Code

Forthcoming

Older Formats

The previous format stored the header data in YAML, with some data fields stored as Python cPickle string output. YAML was flexible but slow, and the transition to C++ made supporting Python cPickle objects problematic.