• zebrafish image analysis;
  • image classification;
  • cell segmentation;
  • feature extraction;
  • SPIN descriptor;
  • threshold selection;
  • kernel combination



Zebrafish somitogenesis is governed by a segmentation clock that generates oscillations of gene expression in zebrafish presomitic mesoderm (PSM) cells. The segmentation clock causes cells to undergo repeated cycles of transcriptional activation and repression, which can be divided into eight phases based on their distinct mRNA co-localizations. Recognizing the different gene oscillation phases of cells is important in zebrafish research, but manual analysis is time-consuming and difficult. In this article, an effective automated gene oscillation phase classification framework is established for zebrafish PSM cell images. The framework consists of three major steps: (1) identify individual cells by a two-stage segmentation procedure; (2) extract multiple features from each cell patch to measure the subcellular mRNA distribution; (3) employ a support vector machine (SVM) with a combined kernel to complete feature fusion and classification. To evaluate the effectiveness of this framework, a dataset containing 2,227 cell samples was constructed. Experimental results on this dataset indicate that our approach achieves reasonably good performance on this gene oscillation classification problem. The feature sets NF9 and SPIN introduced in this article proved superior to other cell features for this problem. In addition, the kernel fusion method used in the third step provides a way to combine heterogeneous features, i.e., a numerical feature set and a histogram-based feature set, and classification performance with the combined kernel is better than that with any single feature set. © 2011 International Society for Advancement of Cytometry

The zebrafish (Danio rerio) has recently emerged as an important vertebrate model organism in developmental biology, functional genomics, disease modeling, and drug discovery (1–8). The rapid embryonic development and transparency of the zebrafish embryo make it possible to visualize early developmental events in vivo. Digital microscopic imaging is a critical step in zebrafish research, but the tremendous number of images generated from the large numbers of embryos used in experiments has become a bottleneck in data analysis and result interpretation. Many existing methods for zebrafish image quantification are based on visual inspection and manual segmentation, which are labor-intensive and time-consuming, and even suffer from notable inter-observer variation. Recently, there has been significant interest in the development of computerized algorithms and software systems (9–12) for automated zebrafish image interpretation.

Zebrafish somitogenesis is governed by a segmentation clock that generates oscillations in the expression of several Notch pathway genes (8). The segmentation clock causes cells to undergo repeated cycles of transcriptional activation and repression. These oscillations can be visualized by fluorescent in situ hybridization (6, 7). Figure 1 shows two examples of zebrafish presomitic mesoderm (PSM) images; the red blobs are cell nuclei, and the green fluorescent stain appears in or around nuclei, showing the mRNA localization that reveals the different oscillation phases. Experts divided these oscillations into eight phases based on their distinct subcellular mRNA patterns (8). Figure 2 shows models for the eight oscillation phases, together with representative cells for each phase from a zebrafish image. From Figure 2, it can be seen that the eight phases form a cycle from Phase 7 to Phase 0, with a gradual transition between adjacent phases of the oscillation cycle. A zebrafish microscopy image may contain over 1,000 tiny cells, and real cells in images are not as regular as the cell phase models (e.g., compare the real cells with the cell models shown in Figure 2). Given these characteristics, classifying cell phases manually is not only time-consuming but also intractable, so developing a reliable automated cell phase classification system is a challenging and valuable task.


Figure 1. Zebrafish presomitic mesoderm images. The red blobs are cell nuclei, and the green fluorescent stain appears in or around nuclei, showing the mRNA localization. Cells labeled with yellow numbers have been manually classified by experts (8). See Supporting Information Figure S1 for more details of a zebrafish presomitic mesoderm image. The scale bars represent 75 μm.



Figure 2. Gene oscillation phase models from Phase 0 to Phase 7 (left column) and corresponding representative cells for each phase in a zebrafish PSM image. The scale bar represents 75 μm.


Automated subcellular pattern classification on microscopic images is an interesting bioinformatics problem that has received much attention (13–18) over the past decade. Several systems achieving good classification accuracy have been developed for different subcellular pattern datasets such as HeLa cell images (13), Human Protein Atlas (HPA) images (14), Drosophila cell images (15), and a yeast image collection (16), and cell image analysis software such as CellProfiler (17) has been developed to provide a public-domain platform for cellular phenotype analysis. The mainstream methodology of automated cell image classification generally consists of three steps: cell region segmentation; cell feature extraction and selection; and cell classifier training and evaluation. Many newly arising image processing techniques and machine learning methods have been incorporated into each step to improve the final classification accuracy. However, due to the diversity of cell patterns, no universal method is applicable to all of these different kinds of problems.

In this article, a framework for gene oscillation phase classification is investigated for zebrafish PSM images; our preliminary work on this problem can be found in Ref. 19. The proposed framework has the three steps used in the mainstream methodology (13–15). Our contributions include: (1) improved methods, combined with classical methods, are employed to overcome the difficulties in this image set; (2) in cell feature extraction, two feature sets, namely NF9 and the SPIN descriptor, are designed to measure the mRNA distribution of each cell in the low-resolution images; (3) in cell classification, a kernel combination method with two strategies is used to combine heterogeneous features. The total classification accuracy by support vector machine (SVM) is over 76%, and as far as we know, no similar work for this problem has been reported in the published literature.


Materials and Methods

Zebrafish PSM Images

The zebrafish image dataset used in this work contains 10 images of zebrafish PSM from Dr. Holley's lab at Yale University (8). The wild-type zebrafish used in the microscopic imaging were Tü, TLF, or WIK. Zebrafish were prepared in accordance with protocols approved by the Yale University Institutional Animal Care and Use Committee. Zebrafish embryos were manually cultured 20-per-well in glass depression slides in ∼1.5 ml E3 media (8). Fluorescent in situ hybridizations were performed using the methods described in Ref. 7. The embryos were flat-mounted in glycerol, and confocal stacks were collected on a Bio-Rad 1024 confocal microscope using a Zeiss Neofluor 25× objective (8). The selected optical sections were generally ∼20 μm apart and had an optical thickness of <3 μm. As shown in Figure 1, the red channel, referred to as the nuclei channel, records nuclei information by staining cell nuclei with propidium iodide, while the green channel, referred to as the mRNA channel, records the subcellular mRNA distribution using high-resolution fluorescent in situ hybridization (6–8). A mask image and a lookup table were also supplied for each zebrafish PSM image in the dataset, providing manual annotation and classification results for the cells in the corresponding image (8). These expert-labeled data are used as the standard benchmark for algorithm evaluations.

Framework Overview

The proposed framework consists of three sequential steps: cell region segmentation, cell feature extraction, and classifier training and evaluation. In the first step, a two-stage cell segmentation procedure divides each zebrafish image into patches containing a single cell each: nuclei are first segmented from the nuclei channel, and the extent of the cytoplasm is then estimated for each nucleus to obtain the whole-cell segmentation. In the second step, multiple feature sets are extracted from each cell patch to describe the subcellular mRNA distribution from different perspectives. These consist of two parts: the two feature sets we designed, and three feature sets often used in cell pattern classification. In the last step, two simple multiple-kernel combination strategies are used to combine the feature sets, and the fused kernels are embedded into SVMs to complete the classification task. The methods used in each step are introduced in the rest of this section.

Cell Patch Segmentation

This step extracts single cells. In subcellular pattern classification with multi-channel microscopic images, cell segmentation is usually solved by a two-stage method (15). In the first stage, the nuclei regions are located based on the nuclei channel of the cell image; in the second stage, the cell regions are generated based on a cytoplasm or cell protein skeleton channel, using the nuclei locations as initial seeds. We conform to this two-stage strategy in our problem. However, the boundaries of cell nuclei in zebrafish PSM images are often vague and touching, and no information about cytoplasm or membrane was recorded in the microscopic images, both of which aggravate the difficulty of this cell segmentation task.

In the first stage of cell segmentation, an effective method based on gradient flow tracking (11, 12) is used to obtain cell nuclei positions. This method first generates a diffused gradient vector flow field on the nuclei channel image, and then performs a gradient flow tracking procedure to attract each point's flow toward a sink corresponding to the individual nuclei centers, which separates the image into small regions each containing one nucleus. This algorithmic pipeline finally executes local adaptive thresholding on each region to extract the cell nucleus. The gradient flow tracking used in this method produces a natural segmentation for images containing nuclei touching each other without suffering from the sensitivity to heuristic rules or ad hoc parameter selections (11, 12). Figure 3 shows an example of nuclei segmentation.
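The flow-tracking idea can be illustrated with a deliberately simplified sketch. This is not the diffused-gradient-vector-flow algorithm of Refs. 11 and 12 (no diffusion, no local adaptive thresholding); it only shows the core intuition: each pixel drifts uphill along a smoothed intensity gradient, and pixels that converge to the same sink (an intensity peak, i.e., a nucleus center) share one label. The function name `track_to_sinks` and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def track_to_sinks(img, n_steps=50, sigma=2.0):
    """Toy gradient-flow tracking: every pixel moves one step per
    iteration in the uphill direction of the smoothed gradient; pixels
    that end at the same sink receive the same label."""
    smooth = gaussian_filter(np.asarray(img, dtype=float), sigma)
    gy, gx = np.gradient(smooth)                     # gradient field
    h, w = smooth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    for _ in range(n_steps):
        iy = np.clip(np.rint(ys).astype(int), 0, h - 1)
        ix = np.clip(np.rint(xs).astype(int), 0, w - 1)
        # step uphill; stop where the gradient is (numerically) zero
        sy = np.where(np.abs(gy[iy, ix]) > 1e-6, np.sign(gy[iy, ix]), 0.0)
        sx = np.where(np.abs(gx[iy, ix]) > 1e-6, np.sign(gx[iy, ix]), 0.0)
        ys = np.clip(ys + sy, 0, h - 1)
        xs = np.clip(xs + sx, 0, w - 1)
    # label each pixel by the sink it reached
    sinks = np.rint(ys).astype(int) * w + np.rint(xs).astype(int)
    _, labels = np.unique(sinks, return_inverse=True)
    return labels.reshape(h, w)
```

On an image with two bright blobs, pixels near each blob track to that blob's peak, yielding one label per nucleus-like region.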


Figure 3. The nuclei channel of a zebrafish PSM image (left) and its nuclei segmentation result (right). The scale bar in the left image represents 75 μm.


Because of the lack of information about cytoplasm or membrane, an offset around the nucleus was used to estimate the region of cytoplasm. Fortunately, the imaging conditions were strictly controlled (8), and the cell sizes appear similar. Considering the distances between cell boundaries in the image, three pixels was chosen as the most appropriate offset value. The upper part of Figure 4 shows some cell segmentation results: the images from left to right correspond to cell examples from Phase 0 to Phase 7, and the cytoplasm region of each cell is shaded purple for ease of viewing. The average cell-patch size is about 10 pixels by 10 pixels.
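The fixed-offset idea can be sketched as a morphological dilation, assuming a boolean nucleus mask is available from the first stage; `cytoplasm_ring` is a hypothetical helper name, and the 3-pixel default matches the offset chosen above.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def cytoplasm_ring(nucleus_mask, offset=3):
    """Estimate the cytoplasm as a fixed-width ring around the nucleus:
    dilate the nucleus mask by `offset` pixels and subtract the nucleus."""
    cell = binary_dilation(nucleus_mask, iterations=offset)
    return cell & ~nucleus_mask

# Toy example: a 3x3 "nucleus" in an 11x11 patch
nucleus = np.zeros((11, 11), dtype=bool)
nucleus[4:7, 4:7] = True
ring = cytoplasm_ring(nucleus)
```

With the default cross-shaped structuring element, three dilation iterations reach pixels within Manhattan distance 3 of the nucleus, so the ring is three pixels wide.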


Figure 4. Cell patches (upper part) and corresponding mRNA pixel segmentation results on their mRNA channel images (bottom part). Images from left to right correspond to the cell examples from Phase 0 to Phase 7. The cytoplasm regions in cell patches were labeled in purple shadow for convenience of viewing. See Supporting Information Figure S4 for more examples of cell patches and their mRNA pixel segmentation results.


Cell Feature Extraction

The differences among the phases of the oscillation mainly lie in the amount and intensity of the mRNA distribution in the nucleus region and cytoplasm region. As shown in Figure 2, the pattern progresses from Phase 7 to Phase 0 as follows: small discrete dot staining only in the nucleus; large single dot only in the nucleus; full nucleus with no cytoplasmic staining; staining throughout the nucleus and cytoplasm; primarily high cytoplasmic staining; moderate cytoplasmic staining only; light cytoplasmic staining only; and no staining.

We designed a problem-specific feature set, NF9, and a histogram-based feature set, SPIN, to capture the cell pattern characteristics; both are introduced below. In addition, three texture feature sets are adopted for their proven effectiveness in cell classification problems: Haralick features (20), Gabor features (21), and threshold adjacency statistics (TAS) (22).

Problem-specific feature set NF9

In this subsection, a feature set named NF9 is designed to measure the mRNA localization in each cell. We first segment mRNA pixels on the mRNA channel of each cell image patch, then extract NF9 from the mRNA pixel segmentation results. In the following, a threshold selection method based on the Fisher criterion is introduced to segment mRNA pixels in cells, and the definition of NF9 is presented thereafter.

In this article, we handle mRNA pixel segmentation through a threshold optimization procedure. Since NF9 is extracted from the mRNA pixel segmentation results, the values of NF9 are highly influenced by the mRNA pixel segmentation threshold. On the other hand, NF9 is expected to separate the different cell phase categories as well as possible. Therefore, the optimal threshold is the one that makes the resulting NF9 vectors of samples from different cell phase categories as well separated as possible. This requirement is similar to the original idea of Fisher discriminant analysis (23, 24), so the Fisher criterion is employed as the evaluation function in this threshold optimization problem.

Let $\{X_t^{(i)}(T),\ t = 1, 2, \ldots, N_i\}$ be the features (i.e., NF9) of the cell samples in phase $C_i$ extracted with threshold $T$, where $N_i$ is the number of cells in phase $C_i$. The Fisher criterion function is defined as

$$J(T) = \frac{S_W(T)}{S_B(T)} \tag{1}$$

where $S_B(T) = \sum_i \lVert m_i(T) - m(T) \rVert^2$ is the between-class feature distance with threshold $T$; $S_W(T) = \sum_i \sigma_i^2(T)$ is the total within-class distance; $\sigma_i^2(T)$ is the variance of the features in phase $C_i$; $m_i(T)$ is the mean of the features in phase $C_i$; and $m(T)$ is the mean of all features.

The optimal threshold is the one that minimizes the evaluation function $J(T)$; this threshold is then used to segment mRNA pixels on the green channel of the cell patches for NF9 feature extraction.
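The threshold search can be sketched as an exhaustive scan that scores each candidate threshold with Eq. (1) and keeps the minimizer. Here `extract(patch, T)` is a placeholder for NF9 extraction at threshold `T`, and the small epsilon guarding a degenerate denominator is our own addition.

```python
import numpy as np

def fisher_J(features_by_phase):
    """Eq. (1): J = total within-class scatter / between-class scatter
    (an epsilon guards thresholds where all class means coincide)."""
    means = [np.mean(f, axis=0) for f in features_by_phase]
    grand = np.mean(np.vstack(features_by_phase), axis=0)
    s_w = sum(np.var(f, axis=0).sum() for f in features_by_phase)
    s_b = sum(((m - grand) ** 2).sum() for m in means)
    return s_w / (s_b + 1e-12)

def best_threshold(patches_by_phase, extract, thresholds):
    """Return the candidate threshold whose extracted features minimize J."""
    scores = {T: fisher_J([np.array([extract(p, T) for p in phase])
                           for phase in patches_by_phase])
              for T in thresholds}
    return min(scores, key=scores.get)
```

On toy data with a "bright" and a "dim" class, the selected threshold falls between the two intensity levels, where the classes are best separated.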

The differences among the eight phases mainly lie in the amount and intensity of the mRNA distribution in different cell regions. Therefore, we designed three numerical features to measure the mRNA distribution in a given cell region. The definitions of the three features are presented below for the entire cell region; the definitions on the nucleus region and the cytoplasm region are analogous.

Let $I(x, y)$ be the mRNA channel image of the single-cell patch, $R(x, y)$ be the cell region indicator function, and $f(x, y)$ be the mRNA pixel indicator function in the cell region. The value of $R(x, y)$ equals 1 when the pixel $(x, y)$ is located within the cell area; otherwise, $R(x, y)$ equals 0. The value of $f(x, y)$ obeys the following rule:

$$f(x, y) = \begin{cases} 1, & I(x, y) \geq Th \text{ and } R(x, y) = 1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $Th$ is the optimal mRNA pixel threshold obtained by the method introduced above.

The first feature is the proportion of mRNA pixels within the cell region, which measures the amount of the mRNA distribution:

$$F_1 = \frac{\sum_{x,y} f(x, y)}{\sum_{x,y} R(x, y)} \tag{3}$$

The remaining two features are the first- and second-order statistics of the values of the mRNA pixels within the cell region, which measure the intensity of the mRNA distribution:

$$F_2 = \frac{1}{\sum_{x,y} f(x, y)} \sum_{x,y} I(x, y)\, f(x, y) \tag{4}$$

$$F_3 = \frac{1}{\sum_{x,y} f(x, y)} \sum_{x,y} \left[ I(x, y) - F_2 \right]^2 f(x, y) \tag{5}$$

These three features are computed on the nucleus region, the cytoplasm region, and the entire cell region, respectively. The nine features from the three regions form a feature vector that we named Numerical Features NF9.
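The NF9 construction can be sketched directly from Eqs. (2)–(5), assuming boolean nucleus and cytoplasm masks from the segmentation step; `nf9` is a hypothetical helper name.

```python
import numpy as np

def nf9(mrna, nucleus_mask, cyto_mask, th):
    """Sketch of NF9: for each region (nucleus, cytoplasm, whole cell),
    compute Eq. (3)-(5): the proportion of mRNA pixels, and the mean and
    variance of the mRNA-pixel intensities. All masks are boolean."""
    feats = []
    for region in (nucleus_mask, cyto_mask, nucleus_mask | cyto_mask):
        f = (mrna >= th) & region               # Eq. (2): mRNA-pixel indicator
        prop = f.sum() / max(region.sum(), 1)   # Eq. (3)
        vals = mrna[f].astype(float)
        mean = vals.mean() if vals.size else 0.0   # Eq. (4)
        var = vals.var() if vals.size else 0.0     # Eq. (5)
        feats += [prop, mean, var]
    return np.array(feats)                      # the 9-dimensional NF9 vector
```

The empty-region fallbacks (zero mean and variance when no pixel exceeds the threshold) are our own convention for patches with no detected mRNA.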

SPIN descriptor for cell image

The spin image descriptor, first proposed as a shape representation for matching range data (25), was later used as a texture descriptor (26) on intensity images. As shown in Figure 5, the spin image descriptor (SPIN descriptor for short) is a rotation-invariant two-dimensional histogram coding the distribution of image brightness values in the neighborhood of a particular reference (center) point. The two dimensions of the histogram are d, the distance from the center point, and i, the intensity value. The "slice" of the spin image corresponding to a fixed d is simply the histogram of the intensity values of pixels located at distance d from the center.


Figure 5. Construction of a spin image (26). Three sample points in the normalized patch (left) map to three different locations in the descriptor (right), and the gray level of each location in the spin image reveals the number of points mapped to that location.


The SPIN descriptor is composed of histograms computed from a series of concentric ring regions. Conceptually, this concentric ring structure coincides well with the nucleus–cytoplasm cell structure. Furthermore, the main difference among the eight phases is the amount and intensity of the mRNA distribution in different cell structures, which can be well characterized by the histograms of local regions. The SPIN descriptor, used as a new cell feature set, is therefore extracted directly from the mRNA channel of each cell image patch, without the mRNA pixel segmentation step that NF9 extraction requires.

SPIN descriptor extraction for a cell patch consists of three steps. First, cell centroid is localized based on the cell shape. Second, cell region is normalized to a round region according to the cell centroid. Finally, SPIN descriptor is computed on the normalized cell patch. In this article, four bins for distance and 16 bins for intensity value are chosen as the SPIN parameters, thus generating a 64-dimensional feature vector. Spin images for stereotypical cells from every phase are shown in Supporting Information Figure S2, and significant differences among these spin images can be observed.
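The descriptor computation can be sketched as a hard-binned 2-D (distance, intensity) histogram; the shape-based centroid localization, circular normalization, and any soft binning are omitted for brevity, so this is an approximation of the extraction steps above rather than the exact procedure. The bin counts match the 4 × 16 = 64-dimensional choice stated above.

```python
import numpy as np

def spin_descriptor(patch, d_bins=4, i_bins=16):
    """Sketch of the SPIN descriptor: bin every pixel by its distance
    from the patch center and its (nonnegative) intensity, then flatten
    the normalized d_bins x i_bins histogram."""
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - cy, xx - cx)
    d_idx = np.minimum((dist / (dist.max() + 1e-9) * d_bins).astype(int), d_bins - 1)
    i_idx = np.minimum((patch / (patch.max() + 1e-9) * i_bins).astype(int), i_bins - 1)
    hist = np.zeros((d_bins, i_bins))
    np.add.at(hist, (d_idx.ravel(), i_idx.ravel()), 1)  # count (d, i) pairs
    return (hist / hist.sum()).ravel()          # normalized 64-d histogram
```

Because both the distance map and the intensity binning are unchanged by a 90° rotation of a square patch, the descriptor is rotation-invariant in that sense.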

Support Vector Machine Classification With Combined Kernels

In the classification step, SVMs with a radial basis function (RBF) kernel (27) are employed as the cell classifier. Since SVMs are binary classifiers, the "one versus one" strategy is adopted to deal with this multi-class problem. The binary SVMs used in the multi-class SVM (1 vs. 1) were implemented using LIBSVM (28). Because different types of feature sets (e.g., numerical features vs. histogram-based features) are used in our problem, two aspects need to be considered in addition to the standard SVM framework.

Extended Gaussian kernel

When dealing with feature sets having particular distance metrics, an extended Gaussian kernel (29, 30) is preferred to incorporate these metrics into the SVM framework. The extended Gaussian kernel function can be defined as

$$K(x_i, x_j) = \exp\left( -\frac{1}{A} D(x_i, x_j) \right) \tag{6}$$

where $D(\cdot,\cdot)$ can be any distance in the input space, and $A$ is the kernel parameter, playing the same role as in the RBF kernel.

Among the five feature sets used in this article, Gabor feature set has its own distance metric (21) that can be embedded into extended Gaussian kernel directly. However, a distance metric of histogram is required for SPIN descriptors.

There are several distance metrics to measure the similarity or dissimilarity between two histograms, and the χ2 distance is a simple and effective one. In this article, the χ2 distance is chosen for embedding into the SVM framework, since the χ2 kernel is proved to be a Mercer kernel (31). Given two histograms $h_i(k)$ and $h_j(k)$ with $K$ bins, the χ2 distance between them can be defined as

$$\chi^2(h_i, h_j) = \frac{1}{2} \sum_{k=1}^{K} \frac{\left[ h_i(k) - h_j(k) \right]^2}{h_i(k) + h_j(k)} \tag{7}$$
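Eqs. (6) and (7) can be sketched together in a few lines; the epsilon in the denominator (guarding empty histogram bins) is our own addition, and the function names are illustrative.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Eq. (7): chi-squared distance between two histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def extended_gaussian_kernel(X, Y, dist, A):
    """Eq. (6): K(x_i, x_j) = exp(-D(x_i, x_j) / A) for any distance D."""
    D = np.array([[dist(x, y) for y in Y] for x in X])
    return np.exp(-D / A)
```

Any other metric (e.g., the Gabor distance mentioned above) can be plugged in as `dist` the same way.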
Feature fusion by kernel combination

In cell classification problems, feature selection methods are usually employed after the feature extraction step to combine different features. Feature selection methods treat all features as homogeneous numerical features and select a subset that is most informative in discriminating the various classes. But when heterogeneous features exist (like the SPIN descriptor in our problem), this methodology may not be suitable.

In this article, a kernel fusion method with two simple strategies is applied to combine features of different types. Suppose $M$ different kinds of feature sets have been extracted from each sample in the dataset, and let $K_m(\cdot,\cdot)$ be the extended Gaussian kernel embedded with the appropriate distance metric for the $m$th feature set. The combined kernel $K(\cdot,\cdot)$ can be computed by the "averaging" strategy or the "product" strategy shown below; the resulting kernel can then be used as a single kernel for SVM training.

$$K_{\mathrm{avg}}(x_i, x_j) = \frac{1}{M} \sum_{m=1}^{M} K_m(x_i, x_j) \tag{8}$$

$$K_{\mathrm{prod}}(x_i, x_j) = \prod_{m=1}^{M} K_m(x_i, x_j) \tag{9}$$
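Both strategies reduce to element-wise operations on precomputed kernel matrices, as in this minimal sketch (`combine_kernels` is a hypothetical helper; the toy matrices stand in for per-feature-set kernels):

```python
import numpy as np

def combine_kernels(kernels, strategy="averaging"):
    """Eq. (8)/(9): combine per-feature-set kernel matrices by
    element-wise averaging or element-wise product; the result can be
    fed to any SVM implementation that accepts precomputed kernels."""
    K = np.stack([np.asarray(k, dtype=float) for k in kernels])
    return K.mean(axis=0) if strategy == "averaging" else K.prod(axis=0)

# Two toy 2x2 base kernel matrices standing in for the per-feature kernels
Ka = np.array([[1.0, 0.5], [0.5, 1.0]])
Kb = np.array([[1.0, 0.2], [0.2, 1.0]])
K_avg = combine_kernels([Ka, Kb], "averaging")
K_prod = combine_kernels([Ka, Kb], "product")
```

Since sums and products of Mercer kernels are Mercer kernels, both combined matrices remain valid SVM kernels.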


Results

As shown in the previous section, our approach consists of three consecutive steps, and the performance of each step influences the overall performance of the framework. Since the first stage of cell segmentation in our framework has been thoroughly evaluated in Ref. 12, cell segmentation errors are excluded from the overall classification performance evaluation.

After cell segmentation, a sample set of cell patches was constructed as the benchmark dataset, and feature extraction and classification experiments were then conducted on it. The classification results shown in this section are based on this sample set.

Sample Set Construction

To evaluate the classification performance of our approach, a benchmark dataset is required. Unlike other subcellular pattern classification problems, there is no existing dataset for this problem. A sample set was therefore constructed from the cell patches produced by the segmentation step: first, each cell patch was manually assigned to its corresponding phase according to the expert-annotated labels in Ref. 8, and then samples obviously inconsistent with the models in Figure 2 were removed from the collection to avoid introducing manual misclassification errors into the dataset.

In this article, a zebrafish image set with 10 images was used to construct the sample set, which produced nearly 10,000 cell patches (many of them unlabeled). After label assignment and sample selection, a sample set containing 2,227 patches was generated based on the criteria in Ref. 8, and the selection was double-checked by two experts. The sample numbers for each phase are listed in Table 1.

Table 1. Sample number of each phase

Threshold Selection for mRNA Pixel Segmentation

When selecting the threshold for mRNA pixel segmentation for NF9 extraction, the evaluation function J was computed on 12% of the samples, randomly selected from the sample set, and the threshold value with the minimal J was taken as the optimal mRNA threshold. See Supporting Information Figure S3 for a plot of ln(J) values computed on one randomly selected subset, for which the optimal threshold is 19. To make the threshold more robust, this procedure was repeated 10 times, and the mean value of 20 was taken as the final mRNA threshold used for NF9 extraction. The lower part of Figure 4 presents the mRNA pixel segmentation results for samples from each phase under threshold 20. The mRNA pixel segmentation results are quite consistent with the phase models shown in Figure 2.

Classification Results and Analysis

In this section, 10-fold cross-validation was performed to estimate the classification performance. The sample set is first randomly partitioned into 10 folds with equal or nearly equal size. Subsequently, 10 iterations of training and validation are performed such that within each iteration a different fold of the data is held-out for testing while the remaining nine folds are used for training. Classification accuracy was calculated by aggregating the predictions on all 10 testing sets. This procedure was repeated 10 times and the mean accuracy was taken as the final result.
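The evaluation protocol above can be sketched as follows; `train_and_predict` is a placeholder for any classifier wrapper (an SVM in the article), and the nearest-class-mean toy classifier below is only for illustration.

```python
import numpy as np

def repeated_kfold_accuracy(X, y, train_and_predict, k=10, repeats=10, seed=0):
    """Sketch of the protocol: k-fold cross-validation with predictions
    pooled over the k held-out folds, repeated `repeats` times; the
    mean pooled accuracy is returned."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(repeats):
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, k)
        correct = 0
        for i in range(k):
            test = folds[i]
            train = np.concatenate(folds[:i] + folds[i + 1:])
            pred = train_and_predict(X[train], y[train], X[test])
            correct += int(np.sum(pred == y[test]))
        accs.append(correct / len(y))  # accuracy pooled over all folds
    return float(np.mean(accs))

# Toy classifier: assign each test point to the nearer class mean.
def nearest_mean(Xtr, ytr, Xte):
    m0, m1 = Xtr[ytr == 0].mean(), Xtr[ytr == 1].mean()
    return (np.abs(Xte[:, 0] - m1) < np.abs(Xte[:, 0] - m0)).astype(int)
```

On perfectly separable toy data this protocol returns an accuracy of 1.0, confirming the bookkeeping.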

Classification performance with single feature set

Classification experiments with single feature sets were first conducted. Besides NF9 and SPIN, the Haralick, Gabor, and TAS features were also extracted on the 2,227-cell-patch sample set. We computed the 14-dimensional Haralick feature set (20) on the nuclei, cytoplasm, and whole-cell regions, respectively, obtaining a 42-dimensional Haralick feature set. The Gabor and TAS features were computed as suggested in Refs. 21 and 22. After feature extraction, the five feature sets were fed into the SVM classifier separately to evaluate their individual effectiveness. The optimal SVM parameters (the slack penalty and the kernel parameter) were chosen by a grid search with 5-fold cross-validation on the training data in each iteration.

The overall accuracies of the five feature sets are shown in Table 2. The mean overall accuracies of SPIN and NF9 are 75.91% and 75.07%, respectively, which is superior to that of the Haralick features (73.79%) and much better than those of the Gabor features (60.80%) and TAS (44.10%). TAS is a recently introduced cell feature that only characterizes the binary pattern of the cell image, which may explain its very low accuracy. Two-sample t-tests between the overall accuracies of SPIN and the other four feature sets were performed to validate the performance differences. Asterisks in the table indicate performance differences that are statistically significant at the 5% level. The t-test results show the superiority of SPIN over the other four feature sets.

Table 2. Overall accuracies of the five feature sets (mean ± standard deviation over the 10 experiments)

| | SPIN | NF9 | Haralick | Gabor | TAS |
| --- | --- | --- | --- | --- | --- |
| Accuracy (%) | **75.91 ± 0.39** | 75.07* ± 0.21 | 73.79* ± 0.28 | 60.80* ± 0.29 | 44.10* ± 0.13 |

\* Asterisks indicate performance differences that are statistically significant at the 5% level between the given method and the best result, indicated in bold.

Besides the overall accuracy, the per-phase accuracies are also used to measure classification performance. Table 3 gives the per-phase accuracies of the five feature sets, with the best result for every phase in bold. Comparing the accuracy values of the eight phases in Table 3, we can draw the following conclusions. First, the performances of SPIN, NF9, and the Haralick features are much better than those of the Gabor features and TAS, since the accuracies of three phases with the Gabor features and four phases with TAS are zero or almost zero, meaning these phases cannot be identified by these two kinds of features. Second, from the first three columns in Table 3, the classification abilities for different phases differ. The accuracies for Phase 0 and Phase 4 with these three features are over 80%, showing good classification ability for these two phases. On the other hand, the accuracies for Phase 5 and Phase 6 with these three features are below 60%, showing poor classification ability for these two phases.

Table 3. Per-phase accuracies of the five feature sets (the best result for each phase is in bold)

| Accuracy (%) | SPIN | NF9 | Haralick | Gabor | TAS |
| --- | --- | --- | --- | --- | --- |
| Phase 0 | 82.17 ± 0.50 | **84.79 ± 0.77** | 79.73 ± 1.32 | 78.12 ± 1.02 | 0 |
| Phase 1 | **63.74 ± 0.80** | 61.35 ± 0.96 | 52.73 ± 1.11 | 49.14 ± 0.63 | 0 |
| Phase 2 | **72.41 ± 0.53** | 71.59 ± 0.37 | 70.02 ± 0.60 | 57.86 ± 0.37 | 34.79 ± 0.13 |
| Phase 3 | 79.39 ± 0.68 | **79.74 ± 0.79** | 79.44 ± 0.29 | 51.06 ± 0.71 | 43.74 ± 3.52 |
| Phase 4 | **85.46 ± 0.41** | 83.67 ± 0.45 | 84.60 ± 0.53 | 65.78 ± 0.27 | 53.51 ± 0.12 |
| Phase 5 | 52.05 ± 3.22 | 49.24 ± 2.98 | **54.51 ± 2.45** | 0 | 0 |
| Phase 6 | **56.46 ± 2.76** | 52.15 ± 3.08 | 56.29 ± 2.33 | 0 | 0 |
| Phase 7 | 66.36 ± 2.17 | 67.01 ± 1.83 | **69.09 ± 1.82** | 3.33 ± 10.54 | 0 |

The single-feature-set experiments demonstrate that the feature sets NF9 and SPIN we designed are effective for this problem compared with the three other feature sets commonly used in cell classification. SPIN is better than NF9 not only for its superior performance but also for its generality: it can be used in other cell classification problems. Moreover, since the highest accuracy for each phase appears in different feature sets (bold values in Table 3), it should be possible to improve classification performance by combining multiple feature sets.

Classification performance with combined kernel

Classification experiments with combined kernels were conducted next. Because of its very low performance, TAS was excluded from the combination, and we combined the kernels of the remaining four feature sets using the two strategies introduced in the previous section. To reduce the computational cost, the kernel parameter of each kernel was set to the mean of the pairwise distances between samples for that feature set, a rule of thumb suggested by Ref. 32. The only parameter in SVM training is then the slack penalty, which was chosen by 5-fold cross-validation on the training data in each iteration.

The first two columns in Table 4 show the performance of the two combined kernels. The mean overall accuracies of the averaging kernel and the product kernel are 76.52% and 76.71%, respectively, a small improvement over SPIN (also verified by t-tests between the overall accuracies of the two combined kernels and SPIN). The performance of the product kernel is slightly better than that of the averaging kernel, but its computational cost is higher.

Table 4. Accuracies of the averaging kernel, the product kernel, and manual classification (the first two columns are mean ± standard deviation over the 10 experiments)

| Accuracy (%) | Averaging | Product | Manual |
| --- | --- | --- | --- |
| Phase 0 | 89.85 ± 1.04 | 90.05 ± 0.91 | 92.13 |
| Phase 1 | 63.31 ± 0.76 | 63.46 ± 1.15 | 57.80 |
| Phase 2 | 74.29 ± 0.55 | 74.17 ± 0.29 | 65.10 |
| Phase 3 | 78.24 ± 0.61 | 78.16 ± 0.66 | 68.48 |
| Phase 4 | 84.88 ± 0.78 | 85.32 ± 0.75 | 87.69 |
| Phase 5 | 50.90 ± 3.09 | 53.28 ± 2.79 | 38.04 |
| Phase 6 | 57.50 ± 1.98 | 58.38 ± 2.09 | 45.78 |
| Phase 7 | 67.32 ± 1.51 | 67.06 ± 1.49 | 56.57 |
| Overall | 76.52* ± 0.45 | 76.71* ± 0.46 | 67.96 |

\* Asterisks indicate performance differences that are statistically significant at the 5% level between the given method and the single feature set with the best result in Table 2.

In addition, we performed a manual classification experiment for comparison with the automatic classification results. A graduate student trained in the relevant background was asked to classify a cell sample set containing 60% of the samples from the whole sample set; the manual classification result, compared against the benchmark data in Ref. 8, is shown in the last column of Table 4. The overall accuracy of manual classification is around 68%, while those of the combined kernels are over 76%.

Looking at the per-phase accuracies of the human classifier, the differences in classification ability among phases persist, which confirms the difficulty of this classification task. These differences mainly result from the mismatch between the low resolution of the cell images and the cell phase models. In Figure 2, the eight phase models seem to show notable differences, but the models are idealized; in fact, the difference between two neighboring phases is not so pronounced for real cells (compare the representative cells of each phase with their models in Figure 2). For example, according to the cell models in Figure 2, the mRNA in cells from Phase 5 and Phase 6 is located mainly in the nucleus region, and the difference between them lies in the amount of mRNA in the nucleus. But the mRNA in real cells of these two phases often also appears to some extent in the cytoplasm region, which inevitably results in misclassification. This is the reason for the low performance on Phase 5 and Phase 6. Conversely, the mRNA distributions in Phase 0 and Phase 4 are two extreme cases (no staining vs. staining throughout the entire cell), so the performance for these two phases is the best.

In short, accurate classification of these ambiguous phases is hard for both computers and humans. Our experimental results show that the combined kernels achieve reasonable classification performance.


Conclusions

In this article, we introduced a framework for automated gene oscillation phase classification in zebrafish PSM images. Two kinds of feature sets were designed to measure the mRNA distribution in cells: the numerical feature set NF9, derived from the cell phase models, and the histogram-based feature set SPIN, which can serve as a general descriptor for other cell patterns and extends easily to 3D images. In addition, two kernel combination methods are used to fuse the multiple heterogeneous feature sets.
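The fusion of heterogeneous feature sets via a combined kernel can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the feature values, kernel choices (RBF on the numerical features, chi-square on the histograms), and the equal weighting are all assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, chi2_kernel

rng = np.random.default_rng(0)
n = 40
X_nf9 = rng.normal(size=(n, 9))              # stand-in numerical features (NF9-like)
X_spin = rng.random((n, 32))                 # stand-in SPIN-like histograms
X_spin /= X_spin.sum(axis=1, keepdims=True)  # normalize each histogram to sum to 1
y = np.arange(n) % 2                         # alternating stand-in phase labels

# Weighted-sum combination of the two heterogeneous kernels (alpha is a tunable weight).
alpha = 0.5
K = alpha * rbf_kernel(X_nf9, gamma=0.5) + (1 - alpha) * chi2_kernel(X_spin, gamma=1.0)

# SVM on the precomputed combined Gram matrix.
clf = SVC(kernel="precomputed").fit(K, y)
pred = clf.predict(K)                        # training-set predictions (sanity check)
print(pred.shape)
```

A product of the two kernels (elementwise `K1 * K2`) is the other common combination rule; both the sum and the product of valid kernels remain valid kernels.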

The classification task in this article is hard even for humans because of the low image resolution, the subtle differences between neighboring phases, and the irregularity of real cell images. Experimental results indicate that the feature sets SPIN and NF9 describe the different cell phases effectively, and that the kernel combination methods improve classification performance over any single feature set.

Since some phases cannot yet be recognized reliably, our further research will develop features that characterize these phases more finely. We will also investigate multiple kernel learning for optimal feature combination, and design classifiers tailored to this problem, in which adjacent classes transition gradually into one another.


Acknowledgements

The authors thank Dr. Scott Holley for providing the zebrafish dataset used in this paper. They also thank Jiao Zhang, Chen Chen, and Zhenwen Zhu for their cell annotation work in sample set construction.


Literature Cited
  1. Stern HM, Zon LI. Cancer genetics and drug discovery in the zebrafish. Nature Reviews Cancer 2003; 3: 533–539.
  2. Mayden RL, Tang KL, Conway KW, Freyhof J, Chamberlain S, Haskins M, Schneider L, Sudkamp M, Wood RM, Agnew M, et al. Phylogenetic relationships of Danio within the order Cypriniformes: A framework for comparative and evolutionary studies of a model species. J Exp Zool B Mol Dev Evol 2007; 308B: 642–654.
  3. Xiang J, Yang H, Che C, Zou H, Yang H, Wei Y, Quan J, Zhang H, Yang Z, Lin S. Identifying tumor cell growth inhibitors by combinatorial chemistry and zebrafish assays. PLoS ONE 2009; 4: e4361.
  4. Major RJ, Poss KD. Zebrafish heart regeneration as a model for cardiac tissue repair. Drug Discov Today Dis Models 2007; 4: 219–225.
  5. Tsujikawa M, Malicki J. Intraflagellar transport genes are essential for differentiation and survival of vertebrate sensory neurons. Neuron 2004; 42: 703–716.
  6. Holley SA, Geisler R, Nüsslein-Volhard C. Control of her1 expression during zebrafish somitogenesis by a delta-dependent oscillator and an independent wave-front activity. Genes Dev 2000; 14: 1678–1690.
  7. Jülich D, Lim CH, Round J, Nicolaije C, Schroeder J, Davies A, Geisler R, Lewis J, Jiang Y, Holley SA. beamter/deltaC and the role of Notch ligands in the zebrafish somite segmentation, hindbrain neurogenesis and hypochord differentiation. Dev Biol 2005; 286: 391–404.
  8. Mara A, Schroeder J, Chalouni C, Holley SA. Priming, initiation and synchronization of the segmentation clock by deltaD and deltaC. Nat Cell Biol 2007; 9: 523–530.
  9. Liu T, Nie J, Li G, Guo L, Wong STC. ZFIQ: A software package for zebrafish biology. Bioinformatics 2008; 24: 438–439.
  10. Liu T, Lu J, Wang Y, Campbell WA, Huang L, Zhu J, Xia W, Wong STC. Computerized image analysis for quantitative neuronal phenotyping in zebrafish. J Neurosci Methods 2006; 153: 190–202.
  11. Li G, Liu T, Nie J, Guo L, Malicki J, Mara A, Holley SA, Xia W, Wong STC. Detection of blob objects in microscopic zebrafish images based on gradient vector diffusion. Cytometry Part A 2007; 71A: 835–845.
  12. Li G, Liu T, Nie J, Guo L, Chen J, Zhu J, Xia W, Mara A, Holley S, Wong STC. Segmentation of touching cell nuclei using gradient flow tracking. J Microsc 2008; 231: 47–58.
  13. Boland MV, Murphy RF. A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics 2001; 17: 1213–1223.
  14. Newberg JY, Li J, Rao A, Pontén F, Uhlén M, Lundberg E, Murphy RF. Automated analysis of human protein atlas immunofluorescence images. In: Proceedings of the Sixth IEEE International Conference on Symposium on Biomedical Imaging: From Nano to Macro. Piscataway: IEEE Press; 2009. pp 1023–1026.
  15. Wang J, Zhou X, Bradley PL, Chang SF, Perrimon N, Wong STC. Cellular phenotype recognition for high-content RNA interference genome-wide screening. J Biomol Screen 2008; 13: 29–39.
  16. Huh S, Lee D, Murphy RF. Efficient framework for automated classification of subcellular patterns in budding yeast. Cytometry Part A 2009; 75A: 934–940.
  17. Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, Guertin DA, Chang JH, Lindquist RA, Moffat J, et al. CellProfiler: Image analysis software for identifying and quantifying cell phenotypes. Genome Biol 2006; 7: R100.
  18. Peng H, Long F, Zhou J, Leung G, Eisen MB, Myers EW. Automatic image analysis for gene expression patterns of fly embryos. BMC Cell Biol 2007; 8: S7.
  19. Lu Y, Lu J, Liu T, Yang J. Automated cell phase classification for zebrafish fluorescence microscope images. In: Proceedings of 20th International Conference on Pattern Recognition. Washington: IEEE Computer Society; 2010. pp 2584–2587.
  20. Haralick RM, Shanmugam K, Dinstein I. Textural features for image classification. IEEE Trans Syst Man Cybern 1973; 3: 610–621.
  21. Manjunath BS, Ma WY. Texture features for browsing and retrieval of image data. IEEE Trans Pattern Anal Mach Intell 1996; 18: 837–842.
  22. Hamilton NA, Pantelic RS, Hanson KH, Teasdale RD. Fast automated cell phenotype image classification. BMC Bioinformatics 2007; 8: 110.
  23. Fisher RA. The use of multiple measurements in taxonomic problems. Annal Eug 1936; 7: 179–188.
  24. Duda RO, Hart PE, Stork DG. Pattern Classification, 2nd ed. New York: John Wiley & Sons Inc.; 2001. p 117.
  25. Johnson AE, Hebert M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans Pattern Anal Mach Intell 1999; 21: 433–449.
  26. Lazebnik S, Schmid C, Ponce J. A sparse texture representation using local affine regions. IEEE Trans Pattern Anal Mach Intell 2005; 27: 1265–1278.
  27. Alpaydin E. Introduction to Machine Learning. London: The MIT Press; 2004. p 225.
  28. Chang C, Lin C. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2011; 2: 27.
  29. Chapelle O, Haffner P, Vapnik VN. Support vector machines for histogram-based image classification. IEEE Trans Neural Netw 1999; 10: 1055–1064.
  30. Jing F, Li M, Zhang H, Zhang B. Support vector machines for region-based image retrieval. In: Proceedings of 2003 International Conference on Multimedia and Expo. Washington: IEEE Computer Society Press; 2003. pp 21–24.
  31. Fowlkes C, Belongie S, Chung F, Malik J. Spectral grouping using the Nyström method. IEEE Trans Pattern Anal Mach Intell 2004; 26: 214–225.
  32. Zhang J, Marszalek M, Lazebnik S, Schmid C. Local features and kernels for classification of texture and object categories: A comprehensive study. Int J Comp Vision 2007; 73: 213–238.

Supporting Information


Additional Supporting Information may be found in the online version of this article.

CYTO_21097_sm_suppinfofig1.tif (4298K): Supporting Information Figure 1
CYTO_21097_sm_suppinfofig2.tif (1751K): Supporting Information Figure 2
CYTO_21097_sm_suppinfofig3.tif (26971K): Supporting Information Figure 3
CYTO_21097_sm_suppinfofig4.tif (1229K): Supporting Information Figure 4

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.