Label‐free quantitative chemical imaging and classification analysis of adipogenesis using mouse embryonic stem cells

Stem cells have received much attention recently for their potential utility in regenerative medicine. The identification of their differentiated progeny often requires complex staining procedures, and is challenging for intermediary stages which are a priori unknown. In this work, the ability of label-free quantitative coherent anti-Stokes Raman scattering (CARS) micro-spectroscopy to identify populations of intermediate cell states during the differentiation of murine embryonic stem cells into adipocytes is assessed. Cells were imaged at different days of differentiation by hyperspectral CARS, and images were analysed with an unsupervised factorization algorithm providing Raman-like spectra and spatially resolved maps of chemical components. Chemical decomposition combined with a statistical analysis of their spatial distributions provided a set of parameters that were used for classification analysis. The first 2 principal components of these parameters indicated 3 main groups, attributed to undifferentiated cells, cells differentiated into committed white pre-adipocytes, and differentiating cells exhibiting a distinct protein globular structure with adjacent lipid droplets. An unsupervised classification methodology was developed, separating undifferentiated cells from cells in other stages, using a novel method to estimate the optimal number of clusters. The proposed unsupervised classification pipeline of hyperspectral CARS data offers a promising new tool for automated cell sorting in lineage analysis.


KEYWORDS
adipogenesis, classification analysis, coherent anti-Stokes Raman Scattering, hyperspectral imaging, stem cells

| INTRODUCTION
A significant challenge when investigating biological phenomena is that biological systems exhibit complex dynamic interactions, with inter-cellular and intra-cellular states fluctuating and changing to accommodate feedback from the surrounding micro-environment in order to enable tissue homoeostasis. To some extent, this complexity can be recreated by growing cells in in-vitro environments in order to ask fundamental biological questions regarding the control of these processes [1,2]. One class of cells which has received much attention in recent years, primarily for its potential utility in regenerative medicine applications, is stem cells. Stem cells can be derived from a variety of tissue types at multiple developmental stages (such as embryonic, foetal) and during adulthood (such as haematopoietic, mesenchymal stem cells and neural crest derived stem cells [3]). Each stem cell type, normally loosely defined by its source of origin, has its advantages and disadvantages. The desirable properties of stem cells, which tissue engineers aim to exploit, are their ability to self-renew in-vitro (potentially indefinitely, dependent on stem cell type) and their inherent ability to transition through several intermediate or precursor states (progenitor cells) to produce multiple specialised cell types (differentiation). Stem cells and their differentiated progeny are currently primarily identified based on a combination of in-vitro morphology and marker profiles (sometimes destructive). This represents a significant hurdle for the discovery of novel stem cell types since it requires that markers for cellular states have been pre-identified and that these markers accurately reflect the cellular phenotype [4,5]. Furthermore, markers of intermediary stages of cellular differentiation are frequently lacking due to the heterogeneous nature of stem cell cultures, resulting from their capacity to generate multiple cell types.
Also, intermediate cell states are often ill-defined as a result of current experimental approaches which, while producing large volumes of data, tend to average cell population behaviour (e.g. in microarrays, proteomics and other high-throughput marker discovery approaches). Generating well-defined markers such as antibodies typically requires well-defined populations of intermediate cell states.
Destructive cellular staining procedures utilising validated fluorescently labelled antibodies against targets of interest remain the mainstay method to visualise and assess the localisation of proteins and other cellular constituents in order to infer cellular state. Significant limitations of immunofluorescent techniques are that the generation of novel antibodies is a time-consuming, technically demanding and labour-intensive task that still often utilises animals, and that it can result in antibodies that are either not specific for the intended target or bind to targets which bear no relevance for the phenotype being investigated. Furthermore, fluorescence intensity readouts are not quantitative due to photobleaching. Other classification techniques based on gene expression profiling are equally invasive, as they require destruction of the samples.
Label-free techniques, such as bright-field or dark-field [6] and quantitative phase imaging [7], have been used for cell sorting by extracting morphological features [6,7] (for example, size, shape and granularity) or phase metrics [7] (for example, dry mass), but they are unable to resolve the chemical composition of the cell.
In order to overcome some of these limitations, we sought to develop a label-free micro-spectroscopy platform which could assess intermediary cell states through the identification of novel, non-destructive, markers of cellular phenotype by utilising chemically specific vibrational imaging based on coherent Raman scattering (CRS) and the model of mouse embryonic stem cells differentiating towards the adipocytic lineage. In the last decade, CRS microspectroscopy has emerged as a powerful technique which overcomes limitations of spontaneous Raman and offers high acquisition speed, compatibility with imaging living cells, and intrinsic three dimensional spatial resolution [8]. CRS utilises the interference between two optical fields to resonantly drive molecular vibrations in the focal volume, and a third field to read out the Raman scattering. Due to the coherent driving of the vibrational excitation, Raman scattering by identical chemical bonds in the focal volume constructively interferes, generating a signal enhancement compared to spontaneous Raman.
To date, only a few studies of stem cell differentiation using CRS micro-spectroscopy have been reported [9][10][11][12][13]. In Ref. [9], an increase of the protein:RNA ratio was found after the differentiation of murine embryonic stem cells (mESc), similar to what was observed in spontaneous Raman studies [14]. The ability of CRS micro-spectroscopy to detect lipid content and volume concentrations within lipid droplets formed during adipogenesis was demonstrated for adipose-derived stem cells in Refs. [10,12]. Hofemeier and co-workers used Raman and CARS micro-spectroscopy to identify early signs of calcium deposition during osteogenic differentiation of human stem cells [13]. A hyperspectral CARS investigation of adipogenic and osteogenic differentiation of human mesenchymal stromal cells (MSCs) in Ref. [11] showed an increase of the lipid (for the adipogenesis) and mineral (for the osteogenesis) content during differentiation. Although promising, these works either studied differentiation into well-defined cell lineages with little heterogeneity (as is the case for MSCs), or lacked the ability to classify intermediate cell types in the absence of a priori knowledge.
In the present work, we have investigated the heterogeneous differentiation of mouse embryonic stem cells using hyperspectral coherent anti-Stokes Raman scattering (CARS) micro-spectroscopy combined with an advanced quantitative data analysis tool for unsupervised classification analysis. Cells were imaged at 14, 16, 18, 20, 23 and 29 days of differentiation by hyperspectral CARS, and images were analysed with an unsupervised factorization algorithm providing Raman-like spectra and spatially resolved maps of chemical components in vol:vol concentration units [15][16][17]. Chemical decomposition into protein, lipid and aqueous components combined with a statistical analysis of their spatial patterns enabled us to extract a set of parameters that were used for classification analysis. We found 3 main clusters that were attributed to undifferentiated cells, cells differentiated into committed white preadipocytes (as discussed below) and intermediate differentiating cells exhibiting a distinct protein globular structure with adjacent lipid droplets.

| Cell growth, differentiation and staining
IMT11 mouse embryonic stem cells [18] were maintained and differentiated towards an adipogenic lineage as previously described [15,19] on gelatin (0.1% v/v) coated glass slides. Samples were fixed in 4% (w/v) paraformaldehyde at 14, 16, 18, 20, 23 and 29 days of differentiation and stored in phosphate-buffered saline (PBS)/50 units/ml penicillin/streptomycin prior to immunostaining. All time points were immunostained concurrently with an anti-Fatty Acid Binding Protein 4 antibody (rabbit polyclonal IgG anti-FABP4, ab66682, abcam, Cambridge, UK), a well described marker of adipogenesis [20], or with a secondary-only control (fluorescein isothiocyanate isomer 1 [FITC]-conjugated swine anti-rabbit immunoglobulins, F0205, Dako, Cambridgeshire, UK), where omission of the primary antibody acts as a control for non-specific binding of the secondary antibody to the sample. For immunostaining, samples were washed 3 times in PBS, blocked with blocking buffer comprising 1% (w/v) BSA in PBS/Tween (0.1% v/v) for 15 minutes, and stained with or without primary antibody (1:200, 0.7 mg/ml) for 1 hour at 4 °C. Samples were washed again 3 times in PBS, the secondary antibody was diluted in blocking buffer (1:200), and samples were stained with secondary antibody overnight at 4 °C. Samples were subsequently washed 3 times in PBS prior to CARS analysis.

| CARS micro-spectroscopy
CARS is a non-linear process, where a third-order signal is emitted when two electromagnetic fields of different frequency, historically called pump and Stokes, coherently excite a molecular vibration resonant at their frequency difference. In our experiment, CARS hyperspectral images have been acquired on a home-built multi-modal laser-scanning microscope based on an inverted Nikon Ti-U. A description of the set-up can be found in [21]. Briefly, pump and Stokes beams for CARS excitation are obtained by splitting a broadband (660-970 nm) laser beam from a 5 fs Ti:Sa laser into the wavelength ranges of 660 to 730 nm and 730 to 900 nm, respectively. Hyperspectral imaging is achieved by spectral focussing [22][23][24][25]. In this technique, the pump and Stokes pulses have equal linear chirp, so that their frequency difference is constant. By changing the delay between the two pulses, the frequency difference can be tuned to record a CARS spectrum. The CARS signal is collected in forward direction, discriminated by a pair of band-pass filters (Semrock FF01-562/40) and detected by a photomultiplier (Hamamatsu H7422-40). The vibrational frequencies which can be addressed in our set-up are in the range 1200-3800 cm$^{-1}$ with a spectral resolution of 10 cm$^{-1}$. The data discussed in this paper were taken over a 2600-3700 cm$^{-1}$ range with a 20×, 0.75 NA dry objective and a 0.72 NA dry condenser. The excitation beam fill factor $w_0/(f\,\mathrm{NA})$ was 0.55, where $w_0$ is the Gaussian waist parameter of the excitation beam at the objective entrance and $f$ is the objective focal length [17]. The data were acquired by scanning the spatial positions for each vibrational frequency step, with a dwell time for each pixel of 10 μs. The measured spatial resolutions for the CARS intensity (full-width at half-maximum [FWHM]) are 0.6 (1.1) μm in the lateral (axial) direction. The measured spatial resolutions for the retrieved CARS susceptibility (FWHM) are 1.0 (4.4) μm in the lateral (axial) direction [17].
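The delay tuning used in spectral focussing can be made explicit with a short worked relation. This is only a sketch for ideal, equal linear chirp: the chirp parameter $\beta$, the centre frequencies $\omega_{\mathrm{P}}$ and $\omega_{\mathrm{S}}$, and the pump-Stokes delay $t_0$ are symbols introduced here for illustration and are not taken from the set-up description.

```latex
\begin{align}
\omega_{\mathrm{P}}(t) &= \omega_{\mathrm{P}} + 2\beta t, \\
\omega_{\mathrm{S}}(t) &= \omega_{\mathrm{S}} + 2\beta\,(t - t_0), \\
\Omega(t_0) &= \omega_{\mathrm{P}}(t) - \omega_{\mathrm{S}}(t)
             = \omega_{\mathrm{P}} - \omega_{\mathrm{S}} + 2\beta t_0 .
\end{align}
```

Under these assumptions the instantaneous frequency difference $\Omega$ does not depend on $t$, and it shifts linearly with the delay $t_0$, which is what allows a spectrum to be recorded by stepping the delay.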

| Hyperspectral image analysis
The CARS hyperspectral images were analysed using the methodology described in [15,16]. Briefly, the CARS intensity $I_C$ data are first de-noised using an unbiased singular value decomposition approach [15]. The data are then normalised using the CARS intensity spectrum $I_{\mathrm{ref}}$ measured in a reference material (glass) which does not show vibrational resonances in the frequency range investigated [15,17]. This approach allows us to correct for the spectrally dependent transduction coefficient of the set-up and to reference the signal to a known response. The resulting CARS ratio $\bar{I}_C = I_C/I_{\mathrm{ref}}$ is given by the absolute square $|\bar{\chi}|^2$ of the complex susceptibility $\bar{\chi}$ normalised to the reference material susceptibility. In order to obtain $\bar{\chi}$, which is proportional to the concentration of the chemical species present in the focal volume, we retrieve $\bar{\chi}$ in amplitude and phase using a phase-corrected Kramers-Kronig method [15]. The obtained spectra $\Im(\bar{\chi})$ resemble spontaneous Raman spectra and can be described as a linear combination of chemical components with spatially dependent concentrations. To infer the chemical components and their concentration maps, we use the FSC$^3$ algorithm (Factorization into Spectra and Concentrations of Chemical Components) on the retrieved hyperspectral susceptibility $\bar{\chi}(\omega)$ [15,16]. We refer to the resulting components as FSC$^3$ components, and to their spectra as FSC$^3$ spectra.
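The first two steps of this pipeline (de-noising and referencing) can be illustrated with a minimal sketch. It assumes the hyperspectral data are arranged as a pixels × frequencies matrix, and it uses a plain truncated SVD as a stand-in for the unbiased SVD approach of [15]; the function names `svd_denoise` and `cars_ratio` and the synthetic data are hypothetical.

```python
import numpy as np

def svd_denoise(data, rank):
    """Keep only the `rank` largest singular components of a
    (pixels x frequencies) matrix (plain truncated SVD, an assumption)."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def cars_ratio(i_c, i_ref):
    """Divide every pixel spectrum by the reference spectrum."""
    return i_c / i_ref  # i_ref broadcasts over the pixel axis

# Synthetic example: two spectral components plus noise (made-up numbers).
rng = np.random.default_rng(0)
n_pix, n_freq = 200, 50
spectra = np.abs(rng.normal(size=(2, n_freq))) + 1.0
conc = rng.uniform(size=(n_pix, 2))
clean = conc @ spectra
noisy = clean + 0.05 * rng.normal(size=clean.shape)

denoised = svd_denoise(noisy, rank=2)
i_ref = np.linspace(1.0, 2.0, n_freq)   # hypothetical reference spectrum
ratio = cars_ratio(denoised, i_ref)
```

The phase retrieval that follows in the actual pipeline is not covered by this sketch.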
In the factorization, the frequency range was reduced to 2675-3200 cm$^{-1}$ to limit the influence of the water component, which dominates $\Im(\bar{\chi})$ for frequencies above 3100 cm$^{-1}$. All shown data have been factorised together to provide a common spectral basis, and 5 components were used. The factorization is unsupervised and starts from random spectra and concentrations. No pure known compounds are used to train the algorithm. To provide sufficient repeatability of the FSC$^3$, the knock-out approach [16] was used, where multiple runs of the algorithm are performed and the resulting factorization with the smallest error is selected. In Ref. [15], we demonstrated that FSC$^3$ is able to retrieve the spectra and the concentration of chemical components in a model system (lipid mixture) without pre-knowledge.
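To make the idea of an unsupervised factorization with random starts more concrete, the following sketch uses a generic non-negative matrix factorization with multiplicative updates. This is a swapped-in textbook technique, not the FSC$^3$ algorithm or its knock-out approach; it only mirrors the two properties stated above (random initial spectra and concentrations, and keeping the run with the smallest error). All names and the synthetic data are hypothetical.

```python
import numpy as np

def nmf(data, n_comp, n_iter=500, seed=0):
    """Factorise non-negative data (pixels x frequencies) into
    concentrations (pixels x n_comp) and spectra (n_comp x frequencies)."""
    rng = np.random.default_rng(seed)
    n_pix, n_freq = data.shape
    conc = rng.uniform(0.1, 1.0, size=(n_pix, n_comp))   # random start
    spec = rng.uniform(0.1, 1.0, size=(n_comp, n_freq))  # random start
    eps = 1e-12  # avoids division by zero
    for _ in range(n_iter):
        conc *= (data @ spec.T) / (conc @ spec @ spec.T + eps)
        spec *= (conc.T @ data) / (conc.T @ conc @ spec + eps)
    err = np.linalg.norm(data - conc @ spec)
    return conc, spec, err

def best_of_runs(data, n_comp, n_runs=5):
    """Repeat the factorization from different random starts and
    keep the run with the smallest residual error."""
    runs = [nmf(data, n_comp, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda run: run[2])

# Synthetic two-component data set (made-up numbers).
rng = np.random.default_rng(0)
data = rng.uniform(size=(100, 2)) @ rng.uniform(size=(2, 40))
conc, spec, err = best_of_runs(data, n_comp=2)
rel_err = err / np.linalg.norm(data)
```

Running several seeds and selecting on the residual is a simple way to reduce the dependence of the result on the random initialisation.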

| Hyperspectral CARS imaging and factorization
Representative results from hyperspectral CARS acquisition and analysis (see also section 2) for cells at different stages of differentiation are shown in Figure 1. The data analysis pipeline retrieves the imaginary part of the CARS susceptibility $\Im(\bar{\chi})$, providing a Raman-like spectrum [8], which is then factorised into the superposition of chemical components with spatially resolved concentration maps (see section 2.3). Three major components corresponding to water, protein and lipid were distinguished (see also Figure S1 in File S1, Supporting Information). The concentration maps $C_L$ of lipids and $C_P$ of proteins are shown in Figure 1 using the sum concentration $C_S = C_L + C_P$ as brightness and the lipid fraction $\eta_L = C_L/C_S$ as saturation with a green hue. In this scale, regions with high sum concentration of proteins and lipids (i.e. dry mass) appear bright. The lipid fraction is instead given by the colour saturation, that is, the colour changes from white to green as $\eta_L$ increases from 0 to 0.5. Regions with $\eta_L > 0.5$, indicative of lipid droplets, are red.
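The colour encoding described above can be written down as a small per-pixel function. This is a sketch under assumptions: the saturation is taken to rise linearly from 0 to 1 as the lipid fraction goes from 0 to 0.5, the brightness is the sum concentration divided by a chosen maximum `c_max`, and the function name is hypothetical.

```python
def lipid_protein_colour(c_lipid, c_protein, c_max=1.0):
    """Map lipid and protein concentrations of one pixel to an RGB triple.

    Brightness follows the sum concentration, saturation of the green hue
    follows the lipid fraction, and pixels with a lipid fraction above 0.5
    are drawn in red (linear scalings are an assumption).
    """
    c_sum = c_lipid + c_protein
    if c_sum <= 0:
        return (0.0, 0.0, 0.0)              # no dry mass: black
    brightness = min(c_sum / c_max, 1.0)
    eta = c_lipid / c_sum                   # lipid fraction
    if eta > 0.5:
        return (brightness, 0.0, 0.0)       # lipid droplet: red
    sat = eta / 0.5                         # 0 (white) to 1 (green)
    faded = brightness * (1.0 - sat)
    return (faded, brightness, faded)

pure_protein = lipid_protein_colour(0.0, 1.0)    # white
half_lipid = lipid_protein_colour(0.5, 0.5)      # fully saturated green
droplet = lipid_protein_colour(0.75, 0.25)       # red
```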
The correlation between morphology and the number of days in the differentiation medium was found to be not strict. Even within a single plate, the morphology of the cells was heterogeneous (more data are shown in Figure S2 in File S1). We also observed a low efficiency of the differentiation protocol, with less than 1% of the cells showing pre-adipocyte morphology containing large lipid droplets. These findings are expected for pluripotent stem cells undergoing differentiation and reflect the known limitations of differentiation protocols [26,27]. Samples prepared under the same differentiation induction agents showed clear expression of adiponectin (a marker for differentiation into adipocytes). In Figure 1, we ordered the cells according to the morphology and lipid concentration $C_L$ to approximate the development through the sequential stages. At early stages of differentiation (see Figure 1a,b), we observe a small lipid concentration $C_L < 15\%$ and a lipid fraction $\eta_L < 70\%$. At later stages, the lipid concentration increases as small lipid droplets with $\eta_L$ reaching unity form in the proximity of globular structures (see, e.g., Figure 1c). For pre-adipocytes, identified by the presence of large lipid droplets as shown in Figure 1e, such proteic structures are not observed and $C_P$ is homogeneous in the cytosol. The large size and the limited number of lipid droplets observed in those cells indicate that they are pre-adipocytes committed to differentiation into white adipocytes [28]. This assignment is supported by the comparison between spontaneous Raman spectra (see Figure S4 in File S1) acquired on mES cell-derived adipocytes and on white fat tissue. In both cases, the fingerprint region spectra show the absence of typical features observed in brown fat tissues, that is, peaks between 1500 and 1600 cm$^{-1}$ [29].
Cells showing protein vesicles are unlikely to be beige [30] or brown (pre)-adipocytes since, first, they do not show a significant concentration of lipids in smaller lipid droplets and, second, both beige and brown adipocytes are characterised by the presence of a large concentration of mitochondria. From measurements comparing CARS and fluorescence images of human cells stained with MitoTracker, such a large density of mitochondria gives rise to small micron-sized structures visible in the protein channel, much smaller than the vesicles observed here.

| Cell classification
Fatty acid binding protein 4 (FABP4) is thought to be responsible for the formation of adipocytes [31]. We therefore investigated FABP4 levels during the differentiation of embryonic stem cells, and the extent to which they correlate with the chemical composition measured by FSC$^3$.

[Figure caption fragment: ... Figure 4 and Figure S15 in File S1, respectively. (f) FSC$^3$ susceptibility spectra for the protein component $C_P$ (black) and the lipid component $C_L$ (red). The solid (dashed) lines refer to the imaginary part (mean value of the real part over the spectral range, giving the non-resonant contribution to the susceptibility), respectively.]
To this end, we immunostained cell samples at different stages of differentiation and imaged them using widefield fluorescence microscopy prior to CARS imaging (see Supporting Information). We found that immunofluorescence can be detected only for pre-adipocytes where large lipid droplets are already formed, while the signal observed in undifferentiated cells seems to be dominated by autofluorescence and non-specific aggregation of the stain. We correlated the immunofluorescence observed in the pre-adipocytes with the chemical components identified by FSC$^3$ and found no significant correlation, indicating that the expression level of FABP4 is below the CARS detection limit. Driven by the need to identify label-free markers for early and intermediate stages of differentiation, we studied the spatial distribution of the chemical components obtained by the FSC$^3$ analysis, and developed a methodology to use these for classification analysis as detailed in the following. Cells which were contained in the field of view and not overlapping were segmented as individual objects using the image analysis software CellProfiler. Statistical features not explicitly depending on the cell size were calculated from the concentration maps of the lipid and protein component for each of those cells (see a list in Table S1 in File S1). The distribution of each feature across the cell ensemble was offset and normalised to have zero mean and unit variance, in order to allow relative comparison between the features (a common procedure in classification analysis [32]). To reduce the dimensionality, we applied principal component analysis (PCA) to the normalised features and we retained the principal components (PCs) with the largest variance, carrying 90% of the total variance (in the present case this resulted in 5 PCs, see also the dependence of the variance as a function of the number of PCs shown in Figure S11 in File S1). 
Note that the PCA is applied here on the statistical features of the chemical concentration distribution, and not directly on the spectral domain data as often done in Raman spectroscopy [33]. Figure 2 shows the analysed cells vs the first two PCs. The first PC contains all features except the average lipid concentration and its standard deviation (SD), while in the second component, those two features are dominant (see Table S1 in File S1). A visual inspection of Figure 2 shows most cells in the lower left part. Two cells (indexes 19 and 20) are well separated at large PC2; they are committed white pre-adipocytes, having a large PC2 due to their large lipid content. A row of cells towards larger PC1 is also distinguishable from the main distribution. These contain the differentiating cells with large protein globular structures (cells 33, 34 and 29).
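The feature normalisation and the selection of PCs carrying 90% of the variance can be sketched as follows. The feature matrix here is synthetic (cells × features, with made-up numbers), and the function names `standardise` and `pca_retain` are hypothetical; the PCA is computed from the SVD of the standardised matrix.

```python
import numpy as np

def standardise(features):
    """Offset and scale each feature column to zero mean and unit variance."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

def pca_retain(z, var_fraction=0.90):
    """PCA of standardised features; keep the leading PCs whose cumulative
    variance first reaches `var_fraction` of the total."""
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    n_keep = int(np.searchsorted(np.cumsum(explained), var_fraction)) + 1
    n_keep = min(n_keep, len(s))
    scores = z @ vt[:n_keep].T            # cells projected on the kept PCs
    return scores, explained, n_keep

# Synthetic example: 40 cells, 11 features driven by 3 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 3))
features = latent @ rng.normal(size=(3, 11)) + 0.1 * rng.normal(size=(40, 11))

z = standardise(features)
scores, explained, n_keep = pca_retain(z)
```

Each row of `scores` is the PC vector of one cell, which is the kind of representation a subsequent clustering step would operate on.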
To go beyond this visual inspection of the first two PCs, we have developed an unsupervised classification method based on hierarchical cluster analysis (HCA). We use the first five PCs, which capture the relevant information as discussed above, and a Euclidean metric to calculate the distance between pairs of cells in the PC space. Each cell is represented by the PC vector describing its statistical features, and the HCA clusters the cells by linking them, starting with the pair of shortest distance. The obtained dendrogram is shown in Figure 3. A dendrogram consists of U-shaped lines that connect clusters in a hierarchical tree. The height of each connection is given by the minimum distance between the corresponding cluster pair.
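The linking procedure described above can be illustrated with a naive agglomerative clustering in plain Python. Reading "minimum distance between the corresponding cluster pair" as single linkage is an assumption, and the function name and the five example points are hypothetical.

```python
import math

def single_linkage(points):
    """Naive agglomerative clustering with a Euclidean metric.

    Returns the list of merges as (members_a, members_b, height), where
    height is the minimum distance between the two merged clusters.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges

# Five made-up "cells" in a two-dimensional PC space.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 20.0)]
merges = single_linkage(points)
heights = [h for _, _, h in merges]
```

In this toy example the isolated point at (20, 20) is linked last and at a much larger height than the other merges, which is how an outlying cell shows up in a dendrogram.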
The dendrogram shows that the majority of cells have similar features and can be grouped in a single cluster. A few cells show distinctive properties and are progressively separated in individual clusters. The estimation of the optimal number of clusters is still a challenge in cluster analysis. Several models have been proposed, either based on measuring the intra- and inter-cluster distance [34,35] or by estimating the stability of the clustering method against random sampling of the population [36,37]. Considering the limited number of cells in our investigation and the tendency of HCA to isolate cells, the stability-based validation methods were not found suitable for our data.
An indication of the number of clusters present in the ensemble can be taken from the histogram of the distances. The minima in this histogram identify distances for which the clustering is most stable against perturbations of the features, that is, a slight change in the statistical quantities extracted from the cells will not affect the clustering. The corresponding histogram of cluster distances in Figure 3 shows three minima. The first one corresponds to the green arrows in the top of Figure 3 and in the bottom left of [...]. Determining the clusters by drawing a horizontal line in the dendrogram corresponds to setting a lower limit for the distance of the clusters, but in general it does not give information on how well the clusters are separated. We therefore introduce a new figure of merit (FOM) which quantifies how well all clusters are separated from each other. This FOM is calculated using the separation of each pair of clusters by the hyperplane determined by a support vector machine (SVM) classifier [38], see section S5 in File S1 for details. For a given number of clusters, we evaluate the FOM for each cluster selection from the HCA dendrogram which contains all cells. Figure 3 bottom right shows the resulting FOM as a function of the number of clusters and the specific combination of clusters, which is identified by an index which increases with decreasing FOM. Pixels above the maximum index are given in grey, while the black pixels indicate cluster combinations which could not be separated, resulting in a FOM of zero. The method showed that the best classification occurs at 3 clusters ([...] Figure 3). The corresponding clustering is shown as coloured frames of the images used as the labels of the dendrogram. In this configuration, the majority of the cells at an early stage of differentiation have been grouped in a single cluster, while cells at later stage of development (including the 2 committed white pre-adipocytes) form mostly individual clusters. 
We have tested our FOM against common distance-based indexes and we found some similarity in the results (see section S7 in File S1 for details). We also verified that the defined FOM is robust against increasing the number of PCs included in the analysis (see section S8 in File S1 for details). A third maximum is found for 22 clusters. The corresponding configuration is shown by the blue horizontal dashed line in Figure 3. In this case, the majority of cells occupy individual clusters, but still a large part of cells in an early stage of differentiation is grouped in a single cluster.
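The histogram argument used before introducing the FOM (a wide empty range of linking distances marks a cut that is stable against small changes of the features) can be illustrated with a simple heuristic. This sketch is not the SVM-based FOM, whose definition is given in File S1; it only picks the widest gap between consecutive dendrogram heights. The function name and the example heights are hypothetical.

```python
def widest_gap_cut(heights):
    """Place a dendrogram cut in the widest gap between consecutive
    merge heights and return (threshold, number_of_clusters)."""
    hs = sorted(heights)
    gaps = [(hs[k + 1] - hs[k], k) for k in range(len(hs) - 1)]
    _, k = max(gaps)
    threshold = 0.5 * (hs[k] + hs[k + 1])
    # Merges above the threshold are undone: len(hs) - (k + 1) of them,
    # and each undone merge adds one cluster to the single root cluster.
    n_clusters = len(hs) - k
    return threshold, n_clusters

# Made-up merge heights for a 5-cell dendrogram.
threshold, n_clusters = widest_gap_cut([1.0, 1.5, 5.0, 24.0])
```

With these heights the widest gap lies between 5.0 and 24.0, so the cut separates a single outlying cell from the remaining four.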
To put these results into context, we assigned the symbol colour in Figure 2 according to the cluster membership as given in Figure 3. A plot using as colour coding the 22 cluster configuration, indicated by the blue dashed line in Figure 3, is shown in Figure S8 in File S1.
Summarising, we find that the unsupervised classification on the basis of the spatial distribution of lipid and protein chemical components separates the cell ensemble into undifferentiated cells of low protein and lipid content, and differentiating cells of different types, with committed white pre-adipocytes with large lipid droplets and low protein content being most separated, followed by cells exhibiting large protein globular structures.

| FSC 3 lipid-protein correlation
In order to quantify in a statistically significant way the observation of lipid droplets forming next to protein-rich globules in differentiating cells, we calculated the average protein distribution as a function of the distance $r = |\mathbf{r} - \mathbf{r}_0|$ from the centre of lipid droplets $\mathbf{r}_0 = (x_0, y_0)$ (zooms of the regions corresponding to the areas indicated in Figure 1b,d are presented in Figure S15 in File S1 and Figure 4, respectively). For each droplet, the distribution of the protein concentration around the LD is analysed (see sketch in Figure 4) by evaluating the mean value of the protein concentration $C_P(r)$ and its SD $\hat{C}_P(r)$ over the contour of the quadrant $s(r)$ of radius $r$ centred at $\mathbf{r}_0$ (for details see section S9 in File S1). The LDs are then sorted according to their effective radius $r_d = \sqrt{A_L/\pi}$, where the lipid area $A_L$ was obtained by spatially integrating the lipid concentration over the LD mask. $r_d$ corresponds to the radius of an LD with the same total lipid area but made of pure lipid. Lastly, the mean values $C_P(r)$ and $\hat{C}_P(r)$ outside of the droplet, that is, for $r > r_d$, were determined over the droplet ensemble in logarithmically spaced ranges of $r_d$, and are given in Figure 4.
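The two quantities defined above, the effective droplet radius and the radial protein profile, can be sketched on a synthetic image. For simplicity the sketch averages over full rings rather than over the directional quadrant, it works in pixel units, and the function names and the synthetic maps are hypothetical.

```python
import numpy as np

def effective_radius(c_lipid, mask, pixel_area=1.0):
    """Radius of a pure-lipid disc holding the same integrated lipid
    concentration as the droplet mask."""
    lipid_area = np.sum(c_lipid[mask]) * pixel_area
    return np.sqrt(lipid_area / np.pi)

def radial_profile(c_protein, centre, r_edges):
    """Mean and SD of the protein concentration in rings around `centre`
    (x0, y0); full rings are used instead of a quadrant (assumption)."""
    yy, xx = np.indices(c_protein.shape)
    r = np.hypot(xx - centre[0], yy - centre[1])
    means, sds = [], []
    for lo, hi in zip(r_edges[:-1], r_edges[1:]):
        values = c_protein[(r >= lo) & (r < hi)]
        means.append(values.mean())
        sds.append(values.std())
    return np.array(means), np.array(sds)

# Synthetic maps: a pure-lipid disc of radius 3 pixels and a protein
# concentration decaying with distance from the droplet centre.
yy, xx = np.indices((41, 41))
r = np.hypot(xx - 20, yy - 20)
c_lipid = np.where(r <= 3, 1.0, 0.0)
c_protein = np.exp(-r / 10.0)

r_d = effective_radius(c_lipid, c_lipid > 0)
means, sds = radial_profile(c_protein, (20, 20),
                            np.array([3.0, 6.0, 9.0, 12.0, 15.0]))
```

Dividing `sds` by `means` would give a relative SD per ring.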
The protein concentration around lipid droplets is larger for short distances to the droplet centre and reduces as the distance increases, which confirms the hypothesis that the LDs are localised in the proximity of the protein-rich globular structure. Interestingly, the concentration of protein in the nearby globules increases with the effective size (i.e., the mass) of the lipid droplet. The average normalised SD increases with the distance from the lipid droplet, showing a more homogeneous protein environment in proximity of the droplet than further away (similar results were obtained by performing a full circular integration of the protein concentration around the LDs instead of using a directional quadrant, see section S10 and Figure S15 in File S1). The results thus indicate that the position of LDs correlates with the boundaries of protein-rich structures.

FIGURE 4 Distribution of the protein concentration around lipid droplets. Left: sketch of the integration contours used to calculate $C_P(r)$ and $\hat{C}_P(r)$ around the lipid droplets, with the quadrant $s(r)$ covering $\pm\pi/4$ around the direction $d$ (see the supplementary information for details on the estimation of $d$). The image is a zoom of the areas indicated in Figure 1d. The colour scale is shown on the left, with the brightness proportional to $C_S$ and the saturation of the green hue equal to $\eta_L$. The scale bar represents 5 μm. Center and right: radial distribution of the protein concentration $C_P(r)$ (center) and its relative SD $\hat{C}_P(r)$ (right) around LDs as function of the effective LD radius $r_d$. The red lines show $r_d = |\mathbf{r} - \mathbf{r}_0|$. Linear grayscale from minimum m to maximum M as indicated.
To understand the possible role of these proteic globules, we speculate in Figure 5 that they rise and decay during differentiation and might be responsible for LD formation. Starting with undifferentiated cells (Figure 5a), which show a rather homogeneous lipid-protein composition of low protein concentration, protein-rich regions develop outside of the nucleus, as exemplified in Figure 5b. As the differentiation progresses, the protein structures are organised in globules (Figure 5c), which are rich in protein. Notably, the lipid concentration inside these globules increases, a sign of lipid assimilation and/or synthesis. Eventually a sufficiently high lipid concentration is reached such that lipid droplets emerge from these globules (Figure 5d,e). The lipid droplets then fuse together and the cell acquires the committed white pre-adipocyte morphology (Figure 5f) showing large lipid droplets. These observations suggest that protein-rich globules might be responsible for the initial formation of lipid droplets during differentiation into adipocytes, and that their presence, alongside that of adjacent lipid droplets, might be used as a label-free marker of differentiation. Our interpretation suggests a model of lipid droplet formation specific to the differentiation phase towards adipocytes. It is, however, different from the prevailing model [39,40] that LDs form in the endoplasmic reticulum. In the absence of a time course study on the same live cell, this interpretation remains a hypothesis.

| CONCLUSIONS
We investigated mESc undergoing heterogeneous differentiation towards adipocytes using label-free chemically specific hyperspectral CARS micro-spectroscopy and our latest advances in quantitative data analysis. Chemical decomposition into protein and lipid components provided spatially resolved concentration maps. These maps were analysed on a cell-by-cell basis to provide 11 descriptors for each cell. PCA of the normalised descriptors captured 90% of the variance in 5 PCs. Visual inspection of the first 2 PCs allowed us to identify undifferentiated cells, cells differentiated into committed white pre-adipocytes, and a second type of differentiating cells exhibiting significant protein globular structures. Small lipid droplets were found to colocalize with the protein structures in these cells. The first 5 PCs were used for HCA. To select the depth along the hierarchical tree, we introduced a novel FOM for the cluster separability based on the distance between clusters determined by an SVM analysis. The resulting clusters split the undifferentiated cells from the differentiated ones, first separating the pre-adipocytes, and then cells with protein globular structures. These results suggest an analysis pipeline for automated cell sorting, generally applicable to heterogeneous samples in the absence of a priori knowledge of their cell types. Furthermore, research in adipogenesis has reached worldwide biomedical importance now that obesity and associated diseases have become a modern epidemic. Thus, our observation of a distinct chemical and morphological phenotype of differentiating cells exhibiting small lipid droplets next to large proteic globular structures should stimulate further studies into the quantitative understanding of such a complex process from embryonic stem cells. The data presented in this work are available from the Cardiff University data archive [41].