Members of the ISAC Image Cytometry Data Standards Task Force (ISAC ICDSTF) include representatives of companies selling image cytometry instrumentation and software as well as academic researchers. ICEFormat was developed in an open manner with full and equal participation and approval by all members. The ISAC ICDSTF membership is open to members of the ISAC community.
In image cytometry, most instruments produce image and feature data in proprietary file formats that are tightly bound to analytical software provided by the specific hardware manufacturer. Obtaining specifications for these formats can be problematic and as such, these formats are difficult or even impossible to process by third party software applications. Consequently, amalgamating data from different instruments and further processing beyond the capabilities of the software provided by the manufacturer is nearly impossible, which hinders interoperability and independent validation of experimental results. This issue has been widely recognized by the community, which resulted in efforts to increase the readability of binary image data.
In 1990, the International Society for Advancement of Cytometry (ISAC) proposed a file format for image cytometry data files (1), which included storing binary image files, basic metadata about images as well as underlying experiments. Several other groups also introduced specifications tailored to their particular needs. In addition, several dozen image file formats have been developed for general image interchange purposes. Many of these are currently widely used by common image processing software tools. In the clinical and pathology environment, digital image standardization efforts are coordinated by the National Electrical Manufacturers Association (NEMA), which manages the Digital Imaging and Communications in Medicine (DICOM) (2) specification. While DICOM is a standard for handling, storing, and transmitting medical imaging information, it incorporates dozens of file formats used in digital microscopy and image cytometry in general, which significantly complicates the development of analytical tools. This issue is being addressed by the Open Microscopy Environment (OME) (3), a collaborative effort of academic laboratories and commercial entities that produces open tools to support data management for biological light microscopy. Currently, OME's bio-formats library (4) supports 125 image file formats.
Unfortunately, none of the mentioned efforts targets the computational exchange of feature definitions, object masks and feature values derived from these images. In order to address this need, ISAC's Image Cytometry Data Standards Task Force (ICDSTF) developed the open image cytometry experiment format (ICEFormat). ICEFormat captures image cytometry metadata, masks and features in a standardized manner so that they can be electronically interchanged and shared. The normative version of the ICEFormat specification is included in supplementary material to this manuscript and is available from the standards section of the ISAC website (http://www.isac-net.org/) under Resources for Cytometrists > Standards > Data File Standards > Image Cytometry Data File Format Standards > ICEFormat. It was developed to facilitate the exchange of image cytometry data, including images, segmentation masks and features, among different analytical tools. This scope was set to fill in evident gaps without duplicating existing efforts. The specification uses the methodology and best practices from international standardization bodies such as the World Wide Web Consortium (W3C), the Institute of Electrical and Electronics Engineers (IEEE), and the Internet Engineering Task Force (IETF), including the Extensible Markup Language (XML), XML schema and open image file formats.
Image cytometry commonly produces large data sets. An appropriate organization is needed both for performance and to keep the data easily readable. Therefore, ICEFormat components such as images, masks and feature values are kept in separate files in an “ICEFormat structure.” As shown in Figure 1, basic metadata and organizational information are separated out into an XML “ICEFormat data directory” (IDD) according to the ICE XML schema (included in supplemental information). For software tools, this XML file is the main file to process in order to locate all other components of the ICEFormat structure and determine their roles and mutual relationships. The actual images, segmentation masks, and feature values are stored in separate files using standardized file formats. This design allows for the IDD to remain reasonably small so that it can be held in memory for fast processing. Moreover, it allows for reuse of existing file formats for the image data and for the use of efficient binary representations for segmentation masks and feature values.
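The role of the IDD as a small, in-memory index of external files can be illustrated with a minimal Python sketch. The element and attribute names used here (idd, image, mask, feature, id, file) and the file names are hypothetical and chosen for illustration only; they are not taken from the ICE XML schema, which should be consulted for the normative structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical directory content. Element names, attribute names and
# file names are illustrative assumptions, NOT the ICE XML schema.
EXAMPLE_IDD = """
<idd>
  <image id="img1" channel="DAPI" file="well_A01_dapi.tif"/>
  <mask id="nuclei" image="img1" file="well_A01_nuclei.mask"/>
  <feature id="area" mask="nuclei" file="well_A01_area.bin"/>
</idd>
"""


def list_components(idd_xml):
    """Return (role, id, file) for every component the directory references."""
    root = ET.fromstring(idd_xml)
    # Only the small XML directory is parsed; the referenced image, mask
    # and feature files are located here but not opened.
    return [(el.tag, el.get("id"), el.get("file")) for el in root]


components = list_components(EXAMPLE_IDD)
for role, ident, path in components:
    print(role, ident, path)
```

A reader built along these lines would parse the directory once and then open the larger binary files only on demand, which is the performance rationale given above.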
An IDD includes definitions of channels, segmentation masks, features and datasets. For the purpose of this specification, a channel is a combination of specific excitation, emission, and detection parameters and settings that resulted in an image. Each image may optionally be assigned to a channel in order to allow different images to be grouped as acquired via a particular channel. Segmentation is the process of partitioning a digital image to locate objects within the image. A segmentation algorithm typically performs this process computationally and objects are the results of a segmentation algorithm applied to an image. Technically, these results are captured as masks specifying the position of objects identified in that image. A feature is a characteristic of objects (e.g., nuclear intensity, cytoplasmic intensity) that is usually calculated on the individual objects derived from the segmentation algorithm. A feature value is the value of that characteristic for a specific object. These values may be floating point numbers, integers, character strings, Boolean values, and object classifications. An image of an object is a two-dimensional digital representation of the object captured by an optical device. Consequently, an image is also a type of feature. Values of these image features are either the actual image data files (in the case of individual images), or a combination of image data files with masks (in case of composite images). Associations between objects (e.g., an assignment of granules, organelles or nuclei to cells) are also captured as features. Associated objects may come from different datasets, segmentations, etc. Both one-to-one and one-to-many association multiplicities are supported.
A dataset is understood as a collection consisting of dataset-specific metadata, a set of feature values and associations between objects and these feature values. A dataset contains a value for each of its objects for each feature. It can be envisioned as a metadata-accompanied table with objects in rows, features in columns, and feature values in the data cells of the table. The IDD supports the description of individual as well as multiple datasets. Multiple datasets may be created as a result of plate based data acquisition and/or when multiple acquisition sites are used. Dataset metadata may include a timestamp, z-position, reference to a site, and any custom metadata. Dataset features may either be specific for a particular dataset or shared among several datasets described in the IDD.
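The table view of a dataset described above can be sketched in Python as follows. The class name, field names and example values are hypothetical; the sketch shows only the logical model (objects in rows, features in columns, a value in every cell) and says nothing about how ICEFormat stores datasets on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Logical model only: objects in rows, features in columns."""
    metadata: dict                             # e.g. timestamp, z-position, site
    features: list                             # column names
    rows: dict = field(default_factory=dict)   # object id -> list of values

    def add_object(self, object_id, values):
        # A dataset holds a value for each of its objects for each feature.
        if len(values) != len(self.features):
            raise ValueError("every object needs a value for every feature")
        self.rows[object_id] = list(values)

    def value(self, object_id, feature):
        return self.rows[object_id][self.features.index(feature)]


# Illustrative values only.
ds = Dataset(metadata={"z_position": 0.0, "site": "A01"},
             features=["nuclear_intensity", "cytoplasmic_intensity"])
ds.add_object(1, [812.5, 240.0])
ds.add_object(2, [790.0, 255.5])
print(ds.value(2, "cytoplasmic_intensity"))
```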
Segmentation masks, feature values and images are stored in additional files outside of the IDD. Segmentation masks are stored as a sequence of binary-encoded unsigned integer values. Values of integer-based, floating-point-based, and Boolean-based features, object classifications as well as object associations are captured as little-endian encoded binary files. Any established open image file format may be used to store the actual image data files. Keeping the set of supported image file formats open allows for an easy adoption of this specification by vendors producing images in various file formats already. These files should have a descriptive name and a correct file name extension to ensure that these images are easily found and opened from the file system by other software. Image file formats that are proprietary, vendor-specific, that do not have a publicly available specification, or that require the acquisition of a license in order to decode the image shall not be used.
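As an illustration of these binary representations, the following Python sketch encodes a mask and a floating-point feature using only the standard library. Several details are assumptions made for the example rather than statements of the specification: masks are written little-endian like the feature values, pixels are given as a flat sequence, label 0 denotes background, and floating-point values are 64 bits wide. The function names are hypothetical.

```python
import struct

# struct format codes for 1-, 2- and 4-byte unsigned integers.
_UINT_CODE = {1: "B", 2: "H", 4: "I"}


def encode_mask(labels, bytes_per_pixel=1):
    """Pack per-pixel object labels as little-endian unsigned integers.

    Assumptions: flat pixel sequence, 0 = background, little-endian order.
    """
    code = _UINT_CODE[bytes_per_pixel]
    return struct.pack("<%d%s" % (len(labels), code), *labels)


def decode_mask(data, bytes_per_pixel=1):
    """Inverse of encode_mask."""
    code = _UINT_CODE[bytes_per_pixel]
    count = len(data) // bytes_per_pixel
    return list(struct.unpack("<%d%s" % (count, code), data))


def encode_float_feature(values):
    """Pack feature values as little-endian 64-bit floats (width assumed)."""
    return struct.pack("<%dd" % len(values), *values)


mask_bytes = encode_mask([0, 1, 1, 2])
wide_bytes = encode_mask([0, 300], bytes_per_pixel=2)
print(mask_bytes, wide_bytes, decode_mask(wide_bytes, 2))
```

Because the encoding is a plain stream of fixed-width integers, no third party library is needed to read or write it.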
CONCLUSIONS AND ALTERNATIVES
History has shown that the adoption of a single standardized file format facilitates interoperability to a greater extent than the support of dozens of proprietary file formats by all affected software applications. The FCS (5) data file standard serves as a clear example of advantages that standardization can bring to researchers in the field of cytometry. Therefore, the ISAC ICDSTF developed the ICEFormat to address this need. The adoption of ICEFormat will allow image cytometry data to be processed and analyzed by independently developed software applications and therefore, it will facilitate the interchange of image cytometry data between different software packages with the potential to increase interoperability and support scientific collaboration and validation.
The current version of the ICEFormat specification is an ISAC candidate recommendation, released to solicit additional feedback from ISAC membership and other interested parties. There are some limitations resulting from the chosen design that bear considering. The current mask representation does not allow for overlapping objects in a single mask. Multiple masks have to be used to encode overlapping objects. According to our experience, this design does not represent a significant drawback, and it has the advantage of being simple and reasonably effective for storing masks. With a small requirement of 1 byte per pixel, it allows for 255 objects to be captured in a mask. Alternatively, 2 or 4 bytes per pixel may be used allowing the encoding of up to 65,535 and 4,294,967,295 objects, respectively. A general mask that would allow any combination of objects to be present at any given pixel was considered by ICDSTF. However, according to our experience, overlapping objects are not commonly required and such a mask would require as many bits per pixel as there were objects in the dataset, and compression would be required. Finally, the ICDSTF also considered utilizing one of the widely available image file formats, such as PNG (6), to encode masks rather than using a simple stream of bytes. While this would have some advantages, such as built-in compression, it would also alter the semantics of the image file format as values would encode objects rather than colors, which could contribute to confusion. Even more importantly, it would force vendors (ICEFormat producers) to support a specific image file format that they may not be familiar with. After considering all the pros and cons, the ICDSTF chose a very simple binary mask representation that can be implemented easily and directly without any need for third party software libraries.
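The object limits quoted above follow from the number of distinct nonzero values an n-byte unsigned integer can hold, 2^(8n) - 1, under the assumption that the value 0 is reserved for pixels belonging to no object. The short Python sketch below reproduces that arithmetic and contrasts it with the storage cost of the rejected general mask (one bit per object per pixel, uncompressed). The function names and the example image size are hypothetical.

```python
def max_objects(bytes_per_pixel):
    """Distinct object labels available, assuming 0 means 'no object'."""
    return 2 ** (8 * bytes_per_pixel) - 1


def smallest_width(object_count):
    """Smallest of the 1-, 2- or 4-byte widths that can label all objects."""
    for width in (1, 2, 4):
        if object_count <= max_objects(width):
            return width
    raise ValueError("too many objects for a 4-byte mask")


def general_mask_bytes(pixels, object_count):
    """Uncompressed size of a mask with one bit per object per pixel."""
    return (pixels * object_count + 7) // 8


pixels = 1000 * 1000                        # illustrative image size
simple = pixels * smallest_width(255)       # bytes for the simple mask
general = general_mask_bytes(pixels, 255)   # bytes for the general mask
print(max_objects(1), max_objects(2), max_objects(4))
print(simple, general)
```

For this illustrative one-megapixel image with 255 objects, the simple mask occupies 1,000,000 bytes, whereas the general mask would occupy 31,875,000 bytes before compression.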
Floating point numbers, integers, character strings, Boolean values and object classifications, associations and images are the currently supported types of features. Additional data types have been considered, such as binary blobs, vectors, and complex numbers. However, either no proper use case for these values has been recognized or it has been determined that the appropriate data type may be represented using one of the existing primitive types. In these cases, the ICDSTF decided not to complicate the ICEFormat by including additional data types until proper use cases are identified.
Currently, the specification allows for images to be stored in any suitable established open image file format. While it would be favorable from the ICEFormat consumers' perspective if a single image file format was used for all image types, based on our assessment, the specification would not be adopted if it asked data producers to accommodate to any file format different from the one they are currently using. Moreover, different file formats are likely better suited for different use cases. Consequently, we chose to allow common existing image file formats as long as they are freely implementable according to some open specification. In order to facilitate interoperability, we are asking data producers to choose the simplest representation possible that will allow capturing of the acquired images. It is our hope that over time, only a few file formats and file format variants will become routinely used so that it will be feasible for ICEFormat readers to process all images. However, we would also like to welcome additional feedback from the image cytometry community and we may restrict the spectrum of allowed image file formats in a future revision if a set of file formats covering all use cases and supported by all involved parties is identified.
Alternative binary storage mechanisms have been proposed during the ICEFormat review process. One of these is HDF5 (7) which is designed to provide efficient, parallel and highly scalable access to large binary datasets such as those generated by high-throughput screens (8). While HDF5 has some characteristics that are similar to the ICEFormat, the underlying binary format is complex and it imposes a burden on implementers in that it introduces a dependency on the HDF5 library within their code base.
The way associations are implemented is relatively simple while allowing for common types of associations mirroring the hierarchical structure of identified objects. Associated objects may be captured in different data sets, such as based on different segmentation algorithms tweaked to identify different cellular and subcellular structures. Currently, typical “one-to-one” and “one-to-many” associations may be expressed. However, the current design would have to be complicated significantly in order to allow for generic relations with the “many-to-many” multiplicity or some nontypical relations within a single data set, such as nontransitive or nonreflexive relations. According to our experience, these types of associations are not represented among typical use cases for referencing objects in image cytometry and therefore, they are not supported by the current ICEFormat specification.
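A hierarchical one-to-many association of this kind can be modeled as a feature that stores, for each child object, the identifier of its single parent. The Python sketch below is illustrative only: the 1-based object numbering, the use of 0 for an unassigned object, and the names are assumptions, and the on-disk encoding of associations is defined by the specification rather than by this example.

```python
from collections import defaultdict


def children_by_parent(parent_of):
    """Invert a child -> parent list into a parent -> [children] mapping.

    parent_of[i - 1] is the parent of child i (1-based, assumed);
    0 marks a child without a parent (assumed).
    """
    groups = defaultdict(list)
    for child, parent in enumerate(parent_of, start=1):
        if parent != 0:
            groups[parent].append(child)
    return dict(groups)


# Hypothetical example: five granules assigned to two cells.
granule_cell = [1, 1, 2, 0, 2]
by_cell = children_by_parent(granule_cell)
print(by_cell)
```

Because each child carries exactly one parent value, one-to-one and one-to-many relations fit this model directly, whereas a many-to-many relation would need a variable number of values per object, which is the complication the current design avoids.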
The ICEFormat specification tries to strike a balance between limitations and simplicity of the format to make the initial implementations easy. Unlike image data file formats, there are no established file formats for feature descriptions and therefore we hope this proposal will become readily adopted by the community. At this point, the ISAC ICDSTF is satisfied that the specification serves its purpose and would like to encourage all related software and hardware vendors to proceed with ICEFormat implementation to ensure its applicability, as well as to provide any additional comments or suggestions. One of the main purposes of this ISAC candidate recommendation is to solicit additional feedback from ISAC membership and other interested parties.
The authors would like to thank members of the ISAC Image Cytometry Data Standards Task Force, mainly Dr. Ryan Brinkman, Dr. Lee Kamentsky, Dr. Mel Henriksen, and Dr. Kim Blenman for their contribution to the development of the ICEFormat specification.