FlyExpress Discovery Platform: Gene Expression Images and Image-Based Searching
Digital images revealing patterns of D. melanogaster gene expression, captured by RNA in situ hybridization, were retrieved from BDGP Release 2 (Tomancak et al., 2002) and the Fly-FISH September 2007 release (Lécuyer et al., 2007). All images were processed using an in-house, semi-automated pipeline to standardize and align embryos; multi-embryo images were manually divided into separate images and partial embryo images were discarded. Image processing was carried out using Matlab and our own image processing routines. In this pipeline, embryo images were extracted by means of the following procedure (Matlab function names are given for each step): read the original image with imread; convert it to grayscale with rgb2gray; apply a windowed low-pass Gaussian filter (imfilter and fspecial) to blur the image slightly, so that edge detection finds only the main outside edges of the embryo and not edges caused by expression patterns, shadows, etc.; delineate embryo edges using Canny edge detection (edge); expand points with imdilate; fill holes in the image with imfill; and shrink the dilations, blur again, and sharpen the image edges with imerode, strel, and medfilt2. The bwlabel function was then used to identify individual embryo boundaries. Finally, all pixels outside the selection region (outside the embryo) were set to pure white, and the resulting image was saved in RGB color as a bitmap file.
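The last two steps of this procedure (labeling connected regions and whitening everything outside the embryo) can be illustrated with a minimal Python sketch. It is not the pipeline's Matlab code. The function names are hypothetical, the 8-connected labeling mirrors the default behavior of bwlabel, and keeping only the largest labeled region is an assumption made for illustration.

```python
from collections import deque

WHITE = (255, 255, 255)

def label_components(mask):
    """Label 8-connected foreground regions of a binary mask (list of rows)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    sizes = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            # Start a new region and grow it with a breadth-first search.
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            sizes[next_label] = size
    return labels, sizes

def whiten_outside_largest(image, mask):
    """Set every pixel outside the largest labeled region to pure white.

    Assumption: the largest region is the embryo of interest.
    """
    labels, sizes = label_components(mask)
    if not sizes:
        return [[WHITE for _ in row] for row in image]
    keep = max(sizes, key=sizes.get)
    return [[px if lab == keep else WHITE
             for px, lab in zip(img_row, lab_row)]
            for img_row, lab_row in zip(image, labels)]
```

Here image is a list of rows of RGB tuples and mask is a same-sized list of rows of 0/1 values, such as the filled edge mask produced by the earlier steps.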
The next standardization step was embryo alignment: embryos were rotated using imrotate so that the major axis of the embryo was parallel to the horizontal, a bounding box was drawn to enclose the embryo, and the embryo was cropped using imcrop to automatically remove background external to the smallest embryo bounding box. For consistent orientation, we used the anterior-on-the-left and dorsal-toward-the-top format for lateral images and anterior-on-the-left for all other views (e.g., dorsal and ventral; see also Kumar et al., 2002). During quality control, experienced biologists corrected the orientation and alignment of images using flipdim, as necessary.
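The cropping and flipping operations can be sketched in a few lines of Python. This is an illustrative sketch only: rotation to the major axis is omitted, the function names are hypothetical, and the image is assumed to be a list of pixel rows with a same-sized 0/1 embryo mask.

```python
def bounding_box(mask):
    """Return (top, bottom, left, right) of the smallest box enclosing all
    foreground pixels, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop(image, box):
    """Crop an image (list of rows) to an inclusive bounding box."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]

def flip_horizontal(image):
    """Mirror left-right, e.g. to place the anterior end on the left."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror top-bottom, e.g. to place the dorsal side toward the top."""
    return image[::-1]
```

The two flip functions play the role that flipdim plays in the Matlab pipeline; which flip an image needs is a manual decision made during quality control.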
To standardize the size of all images, we chose a cellular aspect ratio of 2.5 (320 × 128 pixels). This choice was based on natural aspect ratios (Markow et al., 2009), on the need to avoid pixel padding in image representation, and on the need to ensure that each line of pixels and every image ends on byte, word, and long word boundaries. All embryo images were resized using the imresize function to create a standardized collection.
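The boundary argument can be checked with simple arithmetic. The sketch below assumes that byte, word, and long word mean 1, 2, and 4 bytes (these sizes vary by platform and are an assumption here), and it considers three common pixel depths: 1-bit binary patterns, 8-bit grayscale, and 24-bit RGB.

```python
WIDTH, HEIGHT = 320, 128

def row_bytes(width, bits_per_pixel):
    """Bytes needed to store one row of pixels, rounded up to whole bytes."""
    return (width * bits_per_pixel + 7) // 8

def padding_needed(n_bytes, boundary):
    """Extra bytes required to reach the next multiple of `boundary`."""
    return (-n_bytes) % boundary

# Assumed boundary sizes in bytes: byte = 1, word = 2, long word = 4.
BOUNDARIES = (1, 2, 4)

for bpp in (1, 8, 24):
    n = row_bytes(WIDTH, bpp)
    pads = [padding_needed(n, b) for b in BOUNDARIES]
    print(f"{bpp:2d} bpp: {n} bytes per row, padding {pads}")
```

At 320 pixels wide, a row occupies 40, 320, or 960 bytes at 1, 8, and 24 bits per pixel, each a multiple of 4, so no padding is required at any of these depths.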
Developmental stages for Fly-FISH and BDGP embryos are available from the image source and were thus assigned to embryos. BDGP embryos are annotated with developmental stage ranges (1–3, 4–6, 7–8, 9–10, 11–12, or 13–16) using the Bownes system (16 stages; Roberts, 1986) whereas Fly-FISH embryos are classified into stage ranges (1–3, 4–5, 6–7, 8–9, and later) using the Campos-Ortega and Hartenstein (1985; 17 stages) system. We found that a large fraction of Fly-FISH images contained multiple embryos from different developmental stages. We carefully reviewed and assigned appropriate stage ranges to individual embryos for these cases. We also assigned anatomical embryo views (e.g., lateral, dorsal, and ventral) for all images, because computational comparison of spatial expression patterns is only biologically meaningful if conducted within each view. These stage and view assignments were added during our quality control process, where all embryos were examined by at least one developmental biologist and image standardization and expression extraction were carried out manually, when needed. This produced a total of 99,148 standardized embryo images: 42,065 Fly-FISH and 57,083 BDGP. Within BDGP, there are 14,257 dorsal, 37,825 lateral, and 5,964 ventral views, and within Fly-FISH, there are 3,494 dorsal, 37,211 lateral, and 1,360 ventral views.
Comparative expression analysis to identify images (and thus genes) with overlapping expression patterns requires digital descriptions of spatial expression patterns (Kumar et al., 2002). We used adaptive intensity and color thresholding to automatically delineate the expression profile from the embryo background. For Fly-FISH images, expression patterns are captured in green and yellow colors, and in BDGP images, blue color captures the spatial profile. The extraction process has four parameters: the color layer (red, green, or blue), whether to reverse intensity, an adjustment parameter that shifts the threshold, and the amount of variance between the three expression patterns produced. The RGB image is loaded with imread. Based on the color channel (layer) parameter, a histogram of intensities for just that color layer is created with imhist. The histogram is “pre-cut” to include only intensities between 5 and 250. The sum of the intensities in the pre-cut histogram is calculated with sum. Upper and lower limits of the histogram are computed for the range between 10% and 90% of this sum. An area (“integration”) vector is calculated for the color layer (specified by the color parameter) of the original image. Using this area vector and the original intensities, we calculate the first moment and the centroid of the area under the curve about the y-axis, and the first moment and the centroid of the area under the curve about the x-axis. A threshold value is calculated as the centroid about the y-axis plus the lower limit calculated above. The minimum distance from the x–y center is calculated for each point in the intensity histogram. The plus_da (the differential best area + variance) and minus_da (the differential best area − variance) values are calculated, and the plus_da and minus_da thresholds are derived from these. The image background is cleared using imdilate(Image, SE).
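The histogram steps can be sketched as follows. This is one possible reading of the description, not the authors' code: it assumes the "sum" is the total pixel count in the pre-cut histogram and that the lower and upper limits are the intensities at which the cumulative count first reaches 10% and 90% of that total. The function names are hypothetical, and the centroid and plus_da/minus_da calculations are not reproduced.

```python
def channel_histogram(pixels, channel):
    """256-bin intensity histogram of one color layer (0=R, 1=G, 2=B)."""
    hist = [0] * 256
    for px in pixels:
        hist[px[channel]] += 1
    return hist

def precut_limits(hist, lo=5, hi=250, lower_frac=0.10, upper_frac=0.90):
    """Return (lower, upper) intensity limits of the pre-cut histogram.

    Assumed interpretation: limits are the intensities where the cumulative
    count first reaches lower_frac and upper_frac of the pre-cut total.
    """
    total = sum(hist[lo:hi + 1])
    if total == 0:
        return None
    lower = upper = None
    running = 0
    for intensity in range(lo, hi + 1):
        running += hist[intensity]
        if lower is None and running >= lower_frac * total:
            lower = intensity
        if running >= upper_frac * total:
            upper = intensity
            break
    return lower, upper
```

For a BDGP image, for example, one would pass the flattened list of RGB pixels with channel=2 to build the blue-layer histogram before computing the limits.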
The three binary extraction patterns are extracted from this background-cleared image based on the adjustment parameter and the threshold (for the best extraction), the minus_da threshold (for the minus extraction), and the plus_da threshold (for the plus extraction). The code for this algorithm is available upon request. Three binary (black and white) patterns enable searches of the database using patterns at different levels of expression intensity (a, b, and c patterns). Embryos with ubiquitous expression in the earliest stages of development, as well as those with no expression, were noted and marked for exclusion from comparative image analysis.
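Producing three binary patterns from one color layer amounts to thresholding it three times. The sketch below assumes the three threshold values have already been computed; the function names, the dictionary keys, and the convention that a pixel counts as expression when its (optionally reversed) intensity exceeds the threshold are all assumptions for illustration.

```python
def binarize(channel_rows, threshold, reverse=False):
    """Return a 0/1 pattern: 1 where intensity exceeds the threshold.

    With reverse=True the intensity is inverted (255 - value) first, which
    suits dark stains on a light background.
    """
    pattern = []
    for row in channel_rows:
        values = [(255 - v) if reverse else v for v in row]
        pattern.append([1 if v > threshold else 0 for v in values])
    return pattern

def three_patterns(channel_rows, best, minus_da, plus_da, reverse=False):
    """Build the best, minus, and plus extractions from one color layer."""
    return {
        "best": binarize(channel_rows, best, reverse),
        "minus": binarize(channel_rows, minus_da, reverse),
        "plus": binarize(channel_rows, plus_da, reverse),
    }
```

A lower threshold yields a more inclusive pattern and a higher threshold a more restrictive one, which is what allows searching at different levels of expression intensity.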
To identify the degree of spatial overlap between patterns, we used the low-level bitmap Jaccard similarity index (Kumar et al., 2002), which traces its roots to the Tanimoto measure (Tanimoto, 1958) and is a member of a family of similarity measures that includes the Tversky, Euclidean, Hamming, and Ochiai measures (Bradshaw, 2001). In this case, the similarity score (S) between two images (Q and D) is given by S_QD = |Q∩D|/|Q∪D|, where |Q∩D| is the size of the intersection of expression (count of black pixels) between images Q and D and |Q∪D| is the size of the union between images Q and D. We have previously shown that this approach emphasizes spatial overlap, which is biologically more meaningful than shape matching and invariant moment based features (Kumar et al., 2002; Gurunathan et al., 2004). We have also found that it performs with an effectiveness similar to that of the computationally more intensive Gaussian Mixture Model method (Peng et al., 2007), which in our hands is very sensitive to shifts in image properties, such as color and contrast (Gargesha et al., 2008, 2009a,b; Roy et al., 2009). Thus, we used multiple binary feature vectors to represent the expression information in our analysis and provide the greatest flexibility. For each S-score, we also computed the probability that any pair of images will show an equal or higher value by chance alone (P-value). Empirical S-score distributions derived from image pairs from the same data source (BDGP or Fly-FISH), developmental stage, and anatomical view are used to determine P-values (Supp. Fig. S2).
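As an illustration, the S-score and an empirical P-value can be computed as in the following Python sketch. The function names are hypothetical, binary patterns are assumed to be same-sized lists of 0/1 rows (1 marking an expression pixel), and returning 0.0 when both patterns are empty is an assumption, since the source does not define that case.

```python
def jaccard(q, d):
    """S_QD = |Q intersect D| / |Q union D| for two same-sized 0/1 patterns."""
    inter = union = 0
    for q_row, d_row in zip(q, d):
        for a, b in zip(q_row, d_row):
            if a and b:
                inter += 1
            if a or b:
                union += 1
    # Assumption: two empty patterns score 0.0 rather than being undefined.
    return inter / union if union else 0.0

def empirical_p_value(score, null_scores):
    """Fraction of scores in an empirical distribution that are >= score.

    null_scores should come from image pairs of the same data source,
    developmental stage, and anatomical view.
    """
    hits = sum(1 for s in null_scores if s >= score)
    return hits / len(null_scores)
```

For example, two patterns that share one expression pixel out of three covered by either pattern have S = 1/3.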
For images from PubMed publications, gene expression patterns were manually extracted from the PDF file by our pipeline team. Extracted images were then standardized and expression pattern representations created using the same methods described above for BDGP and Fly-FISH. Images were manually annotated (for example, with developmental stage and anatomical view: lateral, dorsal, or ventral) by our team of curators at Harvard (FlyBase) and by biologists on our pipeline team. Images from publications present special challenges due to their varying quality. For images of low resolution or poor quality, it is not possible to determine a specific developmental stage. In such cases, the best possible stage range (from-stage/to-stage) is determined for the annotation. Publication information (for example, author, year, and title) for the papers was obtained from the publicly available FlyBase database.
A detailed description of all computational aspects of FlyExpress and of the extracted gene expression pattern database can be found in Kumar et al. (2011) and the source code for all algorithms is available upon request.
Expression Assessment
For each gene pair, the standardized BDGP or Fly-FISH expression patterns in FlyExpress were manually compared to determine if the same or different expression patterns were present, either spatially or temporally. Manual comparison was necessary because we could not identify an S-score cut-off that adequately represented spatial and temporal divergence in comparisons involving embryos with artifact staining, high backgrounds, or embryo pairs with three-dimensional expression considerations. Even when using manual inspection, posterior spiracles and other openings that are frequently a source of artifacts must show expression in at least two images of the same gene at the same stage and view to be considered legitimate.
Image comparisons were performed without any additional information, but comparisons within a multigene family were verified to be free of logical inconsistencies. Pattern similarity or difference was determined on the basis of the presence or absence of expression within the same region of the embryo. Although only images of the same data source, developmental stage and view were compared, gene pairs were binned into BDGP stage ranges and each pair given an overall assessment. The exception was stage 1–3 BDGP embryos where the exact stage cannot be assigned due to inability to visualize the number of nuclei. Gene pairs were assigned an overall spatial and temporal assessment of Same (0), Different (1), or Ambiguous/No Data (null).
Spatially, only images of the same data source, developmental stage, and anatomical view were compared. A gene pair was classified as having the same expression in a stage if all image pairs showed expression in the same embryonic regions. A gene pair was classified as having divergent expression if image pairs within a stage showed expression in different embryonic regions. Gene pairs with no images meeting the above criteria (including those with no image data) were classified as Ambiguous/No Data.
Temporally, images of the same data source and stage were examined. Two genes were classified as having the same expression in a stage if all image pairs showed expression presence or absence during that stage, accompanied by corroborating microarray data (Arbeitman et al., 2002). Two genes were classified as having different expression in a stage if at least one image pair showing expression presence for one gene and absence for the other was present, again accompanied by corroborating microarray data. Gene pairs not falling into one of the above categories were classified as Ambiguous/No Data. Gene pairs exhibiting temporal expression differences in a stage were not assessed spatially for that stage.
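The assessment rules for a single stage can be summarized in a short Python sketch. The actual comparisons were made manually; this code only encodes the decision logic as described. The function names and input shapes are hypothetical, and two points are assumptions: microarray corroboration is modeled as a single boolean per stage, and a spatial assessment that is skipped because of a temporal difference is reported as null.

```python
SAME, DIFFERENT, AMBIGUOUS = 0, 1, None

def temporal_assessment(pairs, microarray_corroborates):
    """pairs: list of (expressed_a, expressed_b) booleans, one per image
    pair of the same data source and stage."""
    if not pairs or not microarray_corroborates:
        return AMBIGUOUS
    # At least one pair with presence for one gene and absence for the other.
    if any(a != b for a, b in pairs):
        return DIFFERENT
    return SAME

def spatial_assessment(region_matches):
    """region_matches: list of booleans, one per image pair of the same
    data source, stage, and view; True if expression is in the same regions."""
    if not region_matches:
        return AMBIGUOUS
    return SAME if all(region_matches) else DIFFERENT

def assess_stage(pairs, microarray_corroborates, region_matches):
    """Return (temporal, spatial) for one stage range."""
    temporal = temporal_assessment(pairs, microarray_corroborates)
    if temporal == DIFFERENT:
        # Temporal differences are not assessed spatially for that stage.
        return temporal, AMBIGUOUS
    return temporal, spatial_assessment(region_matches)
```

Each gene pair would be run through assess_stage once per BDGP stage range, giving one temporal and one spatial call per stage.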