Leif et al. (1) have raised a number of issues that will confront many cytometry laboratories. Their most basic contention is that cytometry data need to be more readily integrated into other systems and thus be more amenable to standard database techniques. To accomplish the needed integration, they suggest that the changes must begin with altering the way instrument list mode data are recorded and stored. In support of changing the list mode format, they argue that (a) the Flow Cytometry Standard (FCS) (2) specification is fatally flawed and (b) because flow cytometry (FCM) data are stored in such a specialized format, the data are unnecessarily excluded from existing software that is capable of handling more standardized formats. They then propose an integrated alternative approach based on extensible markup language (XML) and a newly defined derivative of the Digital Imaging and Communications in Medicine (DICOM) file format to be used for storage of list mode FCM data. They contend that by using their proposed solution FCM data would be much more accessible and could be handled readily by many types of existing software. In my opinion, their most basic point that FCM data must be more readily accessible and integrated into other types of systems and software is entirely correct. However, I do not agree with some of their arguments or with some of their proposed solutions. In this commentary, the areas of disagreement are discussed first. Also, the discussion is restricted to issues concerning FCM and especially to list mode files of multiparameter data.
IS THE FCS SPECIFICATION FATALLY FLAWED?
Although Leif et al. have actually raised much broader issues, the point that will likely stimulate the most intense discussion is the proposal for storing data in another format. Part of their rationale for abandoning the FCS format is their contention that the FCS specification is fatally flawed. In support of that contention, their principal arguments concern the recursive provision in FCS 3.0, problems and ambiguities with parsing keywords in the TEXT section, a stated over-reliance on the C programming language, unneeded complexity from values of the $BYTEORD keyword, the absence of additional needed keywords in the FCS specification, and an overall lack of verification of the validity of data files. They devote considerable discussion to these issues and these are considered first.
In my opinion, one of the most significant issues they have raised is the recursive structure that can be created in FCS 3.0 files via nonzero values of the $NEXTDATA keyword. Some of us who were involved in the FCS 3.0 revision were not in favor of this approach at that time. I am in agreement with Leif et al. with regard to this keyword and the types of files that can be created with it. FCM files would be improved by abandoning this recursive structure so that each list mode file would contain no more than one instance of the HEADER, TEXT, and DATA sections.
They also cite problems in parsing FCS TEXT sections that can arise from the freely selectable separator character that is used to delineate keywords and their values. They are correct that the FCS specification does not explicitly define the separator character. However, the separator character is explicitly defined in each file by being the character that occurs at a specific place in the file. Thus, although one must include this dynamic definition when constructing a parser, this approach need not create serious problems. There are considerably more problems that arise from occurrences of the separator character within keyword values or with empty keyword values. The widespread use of forward or backward slash characters, i.e., / or \, as separators is particularly problematic because of their presence in file specifications listed as keyword values. One simple solution to this problem would be to require that the separator character as defined in a file be used for no other purpose in that file. One suggested example that should work without causing hardship would be to use the ! character as the separator. A second minor change would be to require that keywords having no value either not be listed or be required to list some sort of null value instead of simply having consecutive separator characters. These minor changes would alleviate some of the more difficult problems in creating robust parsers. Nevertheless, there are presently many examples of FCS parsers that have successfully handled the problems that can arise with files written under the FCS specifications.
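As a minimal sketch of the dynamic separator definition discussed above, the following Python function reads the separator from the first character of a TEXT segment and splits the remainder into keyword/value pairs. The function name parse_fcs_text is hypothetical, and the sketch assumes the convention that a doubled separator inside a value stands for one literal separator character. It is an illustration only, not a complete or validated FCS parser.

```python
def parse_fcs_text(segment: str) -> dict:
    """Split an FCS TEXT segment into a keyword -> value dictionary.

    Assumption: a doubled separator inside a keyword or value represents
    one literal separator character.
    """
    if len(segment) < 2:
        raise ValueError("TEXT segment too short")
    sep = segment[0]          # the separator is whatever character comes first
    body = segment[1:]
    tokens, current = [], []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == sep:
            if i + 1 < len(body) and body[i + 1] == sep:
                current.append(sep)   # doubled separator: keep one literal copy
                i += 2
                continue
            tokens.append("".join(current))   # single separator ends the token
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        tokens.append("".join(current))       # tolerate a missing final separator
    if len(tokens) % 2 != 0:
        raise ValueError("keyword without a value (or value without a keyword)")
    # Keywords are stored upper-cased so lookups are case-insensitive.
    return {tokens[k].upper(): tokens[k + 1] for k in range(0, len(tokens), 2)}


# A file path containing backslashes is harmless when "!" is the separator.
example = parse_fcs_text("!$BYTEORD!1,2,3,4!$FIL!C:\\data\\run1.fcs!")
```

Under the doubled-separator assumption, an empty value cannot be distinguished from an escaped separator: the segment "/A//B/x/" is read as the single keyword "A/B" with value "x" rather than as an empty value for "A". This is the ambiguity that a reserved separator and an explicit null value would remove.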
Three other complaints they have raised seem to have less merit. For example, it is argued that the FCS specification is somehow too reliant on the C programming language and its dialects. Although acquisition and analysis software may be written in C, it is not apparent that the FCS specification itself is uniquely tied to C. I have coded FCS parsers in other computer languages without feeling handicapped. I also find the complaints about the $BYTEORD keyword not to be persuasive. The keyword does indeed permit many encodings of 32- or 64-bit values. However, the keyword is not included in the specification to encourage the use of obscure data encodings but rather to specify explicitly how data are encoded. In other words, it is more of a strength than a weakness. Another stated flaw in the FCS specification is the absence of additional needed keywords. In fact, the specification clearly allows new keywords and values to be defined and included. However, I do agree completely that the creators of acquisition software must publish these keywords and explain their meanings. An issue that was not raised is the practice of some manufacturers of using proprietary keywords in place of existing standard keywords. This practice does not necessarily violate the letter of the specification but does add complexity to the construction of a parser.
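The role of $BYTEORD as an explicit description of the encoding can be illustrated with a short Python sketch. It assumes that the keyword value lists, for each stored byte in file order, that byte's significance rank, with 1 denoting the least significant byte, so that "1,2,3,4" is little-endian and "4,3,2,1" is big-endian. The function name decode_uint is hypothetical, and the reading of mixed orders should be checked against the specification before use.

```python
def decode_uint(raw: bytes, byteord: str) -> int:
    """Decode one unsigned integer laid out as described by a $BYTEORD value.

    Assumption: the i-th number in byteord is the significance rank
    (1 = least significant) of the i-th byte as stored in the file.
    """
    order = [int(n) for n in byteord.split(",")]
    if sorted(order) != list(range(1, len(order) + 1)):
        raise ValueError("byte order must be a permutation of 1..n")
    if len(raw) != len(order):
        raise ValueError("data width does not match byte order")
    value = 0
    for stored_byte, rank in zip(raw, order):
        value |= stored_byte << (8 * (rank - 1))   # place byte at its rank
    return value


raw = bytes([0x01, 0x02, 0x03, 0x04])
little = decode_uint(raw, "1,2,3,4")   # least significant byte stored first
big = decode_uint(raw, "4,3,2,1")      # most significant byte stored first
```

Because the ordering is read from the file rather than assumed, the same routine serves files written on either kind of hardware.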
An additional perceived flaw in the FCS specification is that the file format does not adequately protect the integrity of the data. At the first level, we will always be dependent on the acquisition software to record data faithfully. In addition to the basic tools of the operating system that are used to create files, some manufacturers use other techniques to verify data. For example, some acquisition software that stores typical 1,024-channel (10-bit) data use the high bits of 16-bit integers to record codes that can be used to check the validity of that value. These schemes should be published, and parsers and analysis software should use the information to verify the data. Another safeguard would be to treat list mode files as read-only after they are created. A related issue is maintaining the integrity of data files as they are transmitted electronically. In fact, the FCS definition includes several features that protect against nonapparent corruption of files by specifying explicitly in the file where various sections begin and end. In addition, if required keywords or their values are altered, then the parser would likely fail to read the file. It is clearly crucial that the integrity of data files be maintained, but it is not clear to me that other formats for list mode data would necessarily be more likely to ensure data integrity.
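The protection offered by explicit section boundaries can be sketched as follows. The sketch assumes the fixed HEADER layout of the FCS format, i.e., a 6-character version string, 4 spaces, and then six right-justified 8-character ASCII offsets giving the first and last byte of the TEXT, DATA, and ANALYSIS sections. The function names read_header and check_offsets are hypothetical, and a real reader would also need to consult the TEXT keywords when the HEADER offsets are recorded as zero.

```python
def read_header(header: bytes) -> dict:
    """Extract the version string and section offsets from an FCS HEADER."""
    if len(header) < 58:
        raise ValueError("HEADER must be at least 58 bytes")
    version = header[0:6].decode("ascii")
    if not version.startswith("FCS"):
        raise ValueError("not an FCS file")
    info = {"version": version}
    for idx, name in enumerate(("text", "data", "analysis")):
        start = 10 + 16 * idx                       # each section has two 8-byte fields
        first = int(header[start:start + 8].decode("ascii").strip() or 0)
        last = int(header[start + 8:start + 16].decode("ascii").strip() or 0)
        info[name] = (first, last)
    return info


def check_offsets(info: dict, file_size: int) -> list:
    """Return a list of problems; an empty list means the offsets are plausible."""
    problems = []
    for name in ("text", "data", "analysis"):
        first, last = info[name]
        if first == 0 and last == 0:
            continue                                # section absent or defined in TEXT
        if not (58 <= first <= last < file_size):
            problems.append(f"{name} offsets {first}-{last} fall outside the file")
    return problems


# Hypothetical header: TEXT at bytes 58-157, DATA at 158-1157, no ANALYSIS.
demo = b"FCS3.0" + b"    " + f"{58:>8}{157:>8}{158:>8}{1157:>8}{0:>8}{0:>8}".encode()
info = read_header(demo)
```

A file truncated during transmission would fail such a check because the recorded end of the DATA section would lie beyond the actual end of the file.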
The preceding sections have summarized the principal arguments in support of the contention that the FCS specification is flawed and that the flaws are so serious that the FCS format should be abandoned. Although some of their complaints with the FCS format do not appear to be correct, there are several legitimate concerns presented in their arguments. However, although some of these concerns may be valid and significant, the issues could be addressed with minor amendments to the FCS specification. Thus, I do not agree that the identified problems in the FCS specification are irreparable and require that the format be replaced.
ARE THERE OVERWHELMING ADVANTAGES TO OTHER LIST MODE FORMATS?
The second principal argument in favor of abandoning the FCS specification is that its specialized nature prevents the use of existing software. Specific software is not named, but it appears that spreadsheet, image manipulation, and plotting packages are intended. In my opinion, the greatest flaw in this argument stems from too little distinction being made between “raw” list mode data, as one would find in the DATA sections of FCS files, and other types of information and/or calculated results. Although it is correct that list mode data cannot be directly read by commonly available software, it is not apparent to me that this state represents a hardship to the cytometry community. Performing meaningful analyses of multiparameter FCM data requires specialized methods that are not available in most software packages. There are some software packages that are not dedicated FCM analysis programs that could be used, e.g., IDL from Research Systems. IDL is a high-level language with complex capabilities for analyzing arrays of data as one finds in list mode FCM files. IDL runs on many platforms and is available in many academic settings. In the mid-1990s, Rob Habbersett from Los Alamos created an IDL FCM analytical program. It was discussed at congresses of the International Society of Analytical Cytometry (ISAC), described at cytometry development workshops, and demonstrated at annual flow cytometry courses. The code was made freely available and could be run on any of the many computer platforms supported by IDL. Nevertheless, despite the sophisticated capabilities of the total software package and despite having a parser for FCS files freely available, analysis of FCM data by IDL was not widely used. The arguments that are presented do not make a compelling case that storing FCM list mode data in an alternative format would automatically open FCM data to useful analyses by widely available software.
That said, there is a real need to make other types of FCM-associated data, e.g., assay details, demographic information, derived results, etc., more readily available, as discussed below.
WHAT IS THE VALUE OF REPLACING THE FCS SPECIFICATION?
In summary, I do not find the arguments to abandon the FCS specification to be persuasive. Although Leif et al. have correctly identified some problems with the FCS specification, there are minor amendments to the file format that would address these problems. The second argument that having FCM list mode data in a more widely supported format would open it to many existing software packages does not adequately recognize the specialized analyses that are necessary with multiparameter data. Widely available software such as spreadsheets would handle multiparameter data poorly regardless of the difficulties in reading the data. In the case of one software package that could perform meaningful analyses, namely IDL, the availability of FCS parsers and the code for list mode analyses did not lead to widespread acceptance. I find neither line of argument to abandon the FCS specification to be convincing.
MORE EXTENSIVE EFFORTS TO INTEGRATE FCM RESULTS WITH DATABASES
Although I do not agree with the arguments that the FCS specification must be replaced, I do see a great deal of merit in many of the issues that are raised regarding the integration of FCM data into other systems. As Leif et al. have noted, these issues revolve around the creation and use of databases. At the simplest level, one needs to be able to manage list mode files. For example, one could have a need to find the files relevant to a particular subject or the files relevant to a particular type of cell. Creating a database containing information gathered from the TEXT sections of FCS files would address this need. One could use a simple FCS parser to extract relevant keyword values from FCS files and write them into a database. I know from my own experience in coding software to extract keyword values and create database files that such a strategy is fully feasible with the existing FCS specification. The software that I wrote created files compatible with the xBASE structure so it could be read by many types of software. (It is even simpler to write extracted data into comma-delimited files that can also be handled by many types of software.) The approach could be readily extended to create files in an XML format and would in no way require that the list mode data format be reinvented. The database approach could be easily extended to the results of analyses, to descriptions of experimental details, to demographic information, etc. Using a relational database model with separate tables listing various types of information will become more useful in meeting the requirements of the Health Insurance Portability and Accountability Act (HIPAA) that recently became effective. For example, the HIPAA specifications list explicit criteria for creating de-identified records that could be assembled readily by the proper creation of relational database tables. 
Biomedical institutional review boards are now beginning to deal with HIPAA-related issues, and it is clear that studies using de-identified records will have many fewer requirements to consider. The CytometryML schema could be of great utility in creating and managing databases associated with FCM data and analyses.
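The keyword-extraction strategy described above can be sketched in a few lines of Python. The sketch assumes that the TEXT sections have already been parsed into dictionaries and writes selected keyword values to a comma-delimited table with one row per list mode file. The function name write_keyword_table, the file names, and the keyword values are hypothetical and serve only as illustration.

```python
import csv
import io


def write_keyword_table(records: dict, keywords: list, out) -> None:
    """Write one row per list mode file with the chosen keyword values.

    records maps a file name to its parsed TEXT keywords; a keyword that
    is missing from a file is written as an empty field.
    """
    writer = csv.writer(out)
    writer.writerow(["file"] + list(keywords))       # column headings
    for filename, text in sorted(records.items()):
        writer.writerow([filename] + [text.get(k, "") for k in keywords])


# Hypothetical keyword values for two list mode files.
records = {
    "a.fcs": {"$TOT": "10000", "$CYT": "X"},
    "b.fcs": {"$TOT": "500"},
}
buf = io.StringIO()
write_keyword_table(records, ["$TOT", "$CYT"], buf)
rows = buf.getvalue().splitlines()
```

Because the table holds only the selected keywords, identifying fields can simply be left out of a table intended for de-identified use, while the list mode files themselves remain unchanged.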
Another important issue is the creation of audit trails for FCM data. Audit trails will include the basic data, e.g., the list mode FCM files, regardless of the format in which they are created, the associated information as noted above, and the results and/or methods of analysis. In some contexts, such records would need to be maintained for years. It is fairly simple to keep records of analytical methods and experimental descriptions because these can be stored electronically in simple formats and/or as printed copies. Storing list mode data or results on CD-ROM should have the required degree of physical longevity. However, another consideration is having the appropriate hardware and software to be able to read the data, including any existing files in FCS format. The pace of technological change is such that one may have to consider archiving a computer and software to be sure of being able to retrieve stored data. Leif et al. allude to these issues, but their proposed XML-based strategy does not directly solve the problem.
The involvement of regulatory agencies in various FCM-related areas constitutes another important area for consideration. Clinical cytometry tests already fall under the purview of the U.S. Food and Drug Administration (FDA). To comply with FDA regulations, these tests must be done by using the reagents and methods specified in the FDA-approved procedure. Although not implemented, there have been discussions for at least 10 years regarding direct oversight of FCM software by the FDA. Leif et al. make the argument that ISAC and the cytometry community can facilitate handling regulatory concerns by having the data stored in established formats such as they propose. Is this contention correct?
Regulatory oversight could involve at least three levels: (a) verifying that data files accurately record the values measured by the instrument, (b) verifying that analytical software does what it is specified to do, and (c) overseeing how analyses are actually performed. As noted elsewhere, verifying that instruments correctly record values in data files will have to be handled by the manufacturers of instruments and the creators of acquisition software. It probably makes little difference whether the raw data are written into FCS-compliant files, the proposed DICOM-derived structures, or any other format. The second level of regulatory involvement, namely verifying that analytical software actually does what is stated, is also independent of data file formats. If FCM files could be handled by general-purpose software, then a general-purpose certification could obviate specialized FCM verification. However, as discussed above, widely available software packages really cannot handle multiparameter FCM data. Therefore, if there is regulatory oversight of proper FCM software operation, then it will not matter how the raw data are stored. The third level of oversight, namely assuring that analyses are done “correctly,” is the most difficult of all. Oversight of analytical methods could use at least two strategies that would have differing involvements with the format used for data storage. First, it would be fully feasible with existing technology to create completely automated analyses of routine clinical tests. By including specified standards (if such standards were available) and then having the software set analytical regions based on those standards, one could obtain results without needing most of the subjective judgments that one normally must use. The existing hardware and software could be modified to perform automated analyses, but the availability of the needed standards is a bigger issue.
A second approach would be to continue using current subjective methods of analysis but to create audit trails that would include all the information to identify populations or even individual cells. It would not be necessary to reinvent the list mode file format for either strategy. However, as Leif et al. have noted, it would be necessary to create structures to link the basic list mode data with the results of the analyses. The CytometryML approach they have proposed would be one way to create results files with wider accessibility.
Overall, Leif et al. have identified problems in the FCS 3.0 specification that should be addressed. However, I do not concur that these problems require abandoning the FCS specification because fairly minor alterations to the specification would be adequate. Furthermore, I do not think that a compelling case has been made for a new data file standard to make list mode data more accessible. On the one hand, multiparameter FCM data have specialized requirements for meaningful analysis, and there are few, if any, widely available software packages such as spreadsheets that could be used effectively. On the other hand, what is being discussed and proposed goes well beyond the issue of replacing the FCS specification. The argument that FCM data must be better integrated into other systems seems irrefutable. Likewise, there is also a need to create better methods to manage FCM data and results. The strategies that are proposed could be very useful in these areas. In summary, although I do not agree with all the contentions made by Leif et al., they have raised important issues that are worthy of consideration and discussion.
DISCLOSURES AND BACKGROUND
My direct involvement with the details of the FCS specification began in the early 1990s, when I created some simple parsing and reading routines to get list mode data into neural network software. Those initial forays into programming to handle FCM data ultimately developed into a full-fledged analysis program that continued to evolve through 2001. The neural network analyses also led to the development of strategies to embed cell identifiers in FCS files. A few copies of the software were sold along the way, but mostly it was developed to handle my own needs. Those needs included being able to analyze FCM data saved in other formats, so routines were developed to write FCS-compliant files from BryteHS and from Profile II list mode data. It was also apparent that two key requirements were an ability to create databases of list mode files and to develop simple ways to get results of analyses to files that could be handled by spreadsheets and other applications. Therefore, those capabilities were also built into the software. Although I was part of the group that revised the FCS specification to version 3.0, I do not have “ownership” issues with the specification. I currently have no financial interests that would be directly affected by maintaining, modifying, or abandoning the FCS specification.