Interpreting Image-based Profiles using Similarity Clustering and Single-Cell Visualization

Image-based profiling quantitatively assesses the effects of perturbations on cells by capturing a breadth of changes via microscopy. Here, we provide two complementary protocols to help explore and interpret data from image-based profiling experiments. In the first protocol, we examine the similarity among perturbed cell samples using data from compounds that cluster by their mechanisms of action. The protocol includes steps to examine feature-driving differences between samples and to visualize correlations between features and treatments to create interpretable heatmaps using the open-source web tool Morpheus. In the second protocol, we show how to interactively explore images together with the numerical data, and we provide scripts to create visualizations of representative single cells and image sites to understand how changes in features are reflected in the images. Together, these two tutorials help researchers interpret image-based data to speed up research. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
Automated microscopy allows biologists to acquire thousands of images from cells perturbed with drugs, small interfering RNA (siRNA), CRISPR-Cas9, and more. In a typical quantitative microscopy experiment, biologists select fluorescent biomarkers (such as antibodies or dyes for specific proteins or cell compartments) and measure only the features they hypothesize will be perturbed in the experiment. By contrast, in image-based profiling, the aim is to let the cells speak for themselves. Diverse stains are used (as in the Cell Painting assay, which stains eight cell components; Bray et al., 2016; Cimini et al., 2022) and then image analysis software segments the cells and measures all possible morphological features from single cells. The collection of features for a cell is called a profile (sometimes described as a morphological profile or image-based profile), and typically a thousand or more features are measured per cell. It is then possible to analyze whether features are modified in a treated sample of cells compared to controls. Afterward, samples can be grouped into clusters based on their image-based profiles (Fig. 1). However, the biological meaning behind clusters is difficult to interpret because there are thousands of features in the profile. This leads to a common bottleneck: given a sample or cluster of samples, how do you interpret what a given profile means biologically?
Figure 1 In Basic Protocol 1, based on sample clustering, biologists can understand the underlying morphology that makes certain samples cluster in a certain way. In Basic Protocol 2, biologists can examine representative cells from each sample.
Here, we present two protocols: exploratory analysis using Morpheus software (Basic Protocol 1) and image and single-cell visualization following profile interpretation (Basic Protocol 2). In Basic Protocol 1, we show how to explore the overall large-scale associations of the data (after feature extraction and cleaning) using the free web-based software Morpheus. Using Morpheus, the data can be grouped in different ways, revealing how features and samples are correlated. Exploring the data is essential to gain insights into the biological interpretation of the profiles. In Basic Protocol 2, the goal is to help biologists build intuition about differences between treatments by examining example cells. The accompanying notebook contains Python scripts to crop representative or random single cells from each treatment and group the cropped images based on correlations of interest. In addition, representative images of each sample can be retrieved to understand how the cells are distributed across representative fields of view (e.g., those captured from different sites [locations] within a sample well), which can give insights into treatment toxicity and/or growth-stimulating effects. In Understanding Results, we discuss how visualizing example cells from the samples and linking them to the correlations between samples can provide information for formulating new hypotheses and interpretations from the data. While these approaches are powerful, they require high-dimensional image measurements, so the user must first use CellProfiler or a similar tool to identify objects and generate large numbers of measurements; they also do not always lead to easily interpretable conclusions (see Understanding Results for further discussion).
The protocols described here yield a similarity matrix, hierarchical clustering for the samples, and representative example cells from the data. These outputs can easily be used for reports and publications. For the input data for both protocols, we use a dataset of images processed by CellProfiler to identify cells and extract features (Stirling et al., 2021) and by pycytominer to normalize and aggregate single-cell profiles into population-averaged profiles. Extensive documentation is available online for feature extraction with CellProfiler (https://github.com/CellProfiler/tutorials) and for data aggregation, normalization, and feature selection with pycytominer (https://github.com/cytomining/pipeline-examples). In addition, we provide an example dataset in our GitHub repository, including comma-separated value (CSV) spreadsheets to be processed on Morpheus (https://github.com/ciminilab/2023_Garcia-Fossa_Cruz_CurrentProtocols). In our example dataset, each compound is annotated with its mechanism of action (MOA). However, these protocols can be used without having the MOA for every compound in the dataset, and instead by comparing treated cells with negative and/or positive controls, or comparing multiple perturbed samples with each other.

EXPLORATORY ANALYSIS OF PROFILE SIMILARITIES AND DRIVING FEATURES
The main goal of this tutorial is to examine the correlations between samples to check for their replicability, to explore correlations among them, to discern how features drive differences between samples or groups, and to interpret the biology behind the data.
After cell treatment, imaging, and feature extraction, some profiles are dramatic in only one or a few features and the feature names have obvious meanings (e.g., nucleus area, or integrated intensity of the mitochondria channel in the cytoplasm, which corresponds to the total amount of staining in that channel); in these cases, looking at feature names will help to discern their connection to biological meaning. Other individual features have meanings that are more difficult to translate into plain language. Furthermore, the challenge is even greater to interpret a collection of feature names that all contribute strongly to a more complex morphological phenotype. For example, a collection of features from a channel stained for actin and wheat germ agglutinin together with DNA granularity was particularly important to predict 70 specific cell health phenotypes from Cell Painting data (Way et al., 2021). Even phenotypes that are visually obvious and distinctive by eye, such as cells stalled in a particular stage of the cell cycle, are often difficult to predict just by examining a list of distinctive features; the problem is even more acute for samples without a visually discernible phenotype that are nonetheless quite distinguishable using image metrics.
To help us in the exploration and interpretation process, we often use Morpheus (available at https://software.broadinstitute.org/morpheus/), a free web-based open-source software that allows matrix visualization, analysis, clustering, filtering, and displaying of charts. The tool can be readily used without extensive computational or statistical experience. It allows for quick visualization of an entire dataset in different ways, so you can identify patterns in your data that could lead to new biological insights, or even use it as a data quality control step by examining replicability. Morpheus was originally designed at the Broad Institute for exploration of mRNA profiling data, but accepts matrix files in multiple formats (CSV, GCT, GMT, text file). Although raw CellProfiler output tables can be input into Morpheus, here we provide notebooks to preprocess the outputs from CellProfiler so the data can undergo aggregation and normalization (both of which can also be performed in Morpheus) followed by multiple feature reduction steps (some of which are not available in Morpheus).
More information can be found in the Morpheus documentation (https://software.broadinstitute.org/morpheus/documentation.html), as well as a two-part series of video tutorials on the Center for Open Bioimage Analysis (COBA) YouTube channel: "The beginner's guide to morphological profiling (Morphological profiling, part 1)" and "Practical exploration of morphological profiling data (Morphological profiling, part 2)".
During this tutorial, we start by examining how similar each sample is to the other samples using per-well similarity matrices, sorting the data in a way that allows for interpretation. We provide a sample dataset in which drugs with known mechanisms of action (MOAs) have been added at various dose points prior to Cell Painting. To observe how MOAs are grouped, and if technical artifacts such as batch or plate-layout effects are playing a role in the distribution of the groups, we use hierarchical clustering. In the end, you will be able to identify whether drugs with similar MOAs have similar morphological profiles and the positive and negative connections between various MOA profiles. You will also learn how to determine what features drive the differences between the groups. We emphasize that this is just one of the data-exploration approaches that can be used to interpret image-based profiles, and produces comparative results rather than hard distinctions between similar and not.

Materials
Laptop or desktop computer with at least 2 GB RAM and a suitable web browser such as Google Chrome
Internet access to use Morpheus (https://software.broadinstitute.org/morpheus/)
Data and Jupyter Notebooks (Kluyver et al., 2016), available at https://github.com/ciminilab/2023_Garcia-Fossa_Cruz_CurrentProtocols
The data are in a GCT format, a tab-separated value table containing the extracted features aggregated by well in a Cell Painting assay. In this assay, 1571 compounds were tested across six doses in A549 cells (Way, Natoli, et al., 2022). We randomly selected a plate map from this experiment (C-7161-01-LM6-011 plate map) and downloaded the CSV files for five of its replicate plates (SQ00015195, SQ00015218, SQ00015219, SQ00015220, SQ00015221) from the cpg0004-lincs dataset (Way, Natoli, et al., 2022) available from the Cell Painting Gallery on the Registry of Open Data on AWS (cellpainting-gallery). We then added annotations to the data (labels for each MOA, compound, and concentration) and normalized the features to the negative control (DMSO) in a Jupyter Notebook (Kluyver et al., 2016) using the pandas library (Reback et al., 2020) and pycytominer. Next, we performed feature selection to exclude features with low variance (frequency cut = 0.05), high correlation to another feature in the profile (threshold = 0.9), features that have >5% NA (not available) values, blocklisted features, and outliers (features with minimum or maximum absolute values greater than threshold = 500). These parameters serve as useful starting values but may be adjusted as needed; for more details, see the data preparation notebook and pycytominer documentation (https://pycytominer.readthedocs.io/en/latest/). These steps are available in the basic_protocol_1/notebooks/data_processing folder using the Data_preparation.ipynb notebook in our GitHub repository (https://github.com/ciminilab/2023_Garcia-Fossa_Cruz_CurrentProtocols/blob/main/basic_protocol_1/notebooks/data_processing/Data_preparation.ipynb). We opened the CSV file obtained using Data_preparation.ipynb in Morpheus and clicked on Tools > Transpose, allowing the CSV table to be better visualized in Morpheus. To apply the protocol to your own data, we recommend using CellProfiler to extract features and pycytominer for data preparation.
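The feature-selection filters listed above can be illustrated with a minimal, standard-library sketch. It is not the pycytominer implementation: the function names and toy feature values are hypothetical, the low-variance filter is simplified to dropping constant features (pycytominer's frequency-cut rule is more involved), the correlation filter greedily keeps the first feature of each correlated pair, and blocklist filtering is omitted.

```python
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def select_features(table, na_cutoff=0.05, corr_threshold=0.9, outlier_cutoff=500):
    """Filter a {feature_name: [values per well]} table; None marks an NA value."""
    kept = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        if len(values) - len(present) > na_cutoff * len(values):
            continue  # more than 5% NA values
        if len(set(present)) < 2:
            continue  # constant feature (simplified low-variance filter)
        if max(abs(v) for v in present) > outlier_cutoff:
            continue  # extreme minimum or maximum value
        redundant = False
        for other in kept.values():
            # Correlate only over wells where both features have a value.
            pairs = [(a, b) for a, b in zip(values, other)
                     if a is not None and b is not None]
            xs, ys = zip(*pairs)
            if abs(pearson(xs, ys)) > corr_threshold:
                redundant = True
                break
        if not redundant:
            kept[name] = values
    return kept


# Hypothetical toy table: six features measured in five wells.
toy = {
    "Cells_AreaShape_Area": [1.0, 2.0, 3.5, 0.5, 2.2],
    "Cells_AreaShape_Perimeter": [2.0, 4.0, 7.0, 1.0, 4.4],    # redundant with Area
    "Nuclei_Intensity_Mean_DNA": [0.3, -1.2, 0.8, 0.1, -0.4],
    "Cytoplasm_Texture_Flat": [1.0, 1.0, 1.0, 1.0, 1.0],        # constant
    "Cells_Intensity_Outlier": [0.1, 900.0, 0.2, 0.3, 0.1],     # exceeds 500
    "Nuclei_Granularity_Sparse": [None, None, 0.2, 0.4, 0.1],   # 40% NA
}
selected = select_features(toy)
print(sorted(selected))
```

For real data, the Data_preparation.ipynb notebook and the pycytominer documentation remain the reference; the sketch only shows why each filter removes a feature.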
We calculated the average precision based on https://github.com/niranjchandrasekaran/profiling-workflow-demo/blob/master/analysis/0.calculateap.ipynb to enable us to remove weakly correlated pairs (defined as < 0 mean average precision between replicates) before analysis; no such profiles were found for this dataset.
10. Go back to the first tab "Morpheus_Example_FeatureSelected" and select Tools > Similarity Matrix > Pearson correlation on the columns. This will calculate the correlation between features for all pairs of samples in the dataset and generate a similarity matrix for them.
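To make the similarity-matrix idea concrete, the following sketch computes pairwise Pearson correlations between a few sample profiles in plain Python. It illustrates the concept rather than Morpheus's own code; the sample names and feature values are hypothetical.

```python
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def similarity_matrix(profiles):
    """Return sample names and the matrix of pairwise Pearson correlations."""
    names = list(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical per-well profiles: four normalized feature values per sample.
profiles = {
    "DMSO_rep1": [0.1, -0.2, 0.0, 0.3],
    "DMSO_rep2": [0.2, -0.1, -0.1, 0.2],
    "drugA_rep1": [2.0, -1.5, 0.5, -2.2],
    "drugA_rep2": [1.8, -1.7, 0.7, -2.0],
}
names, sim = similarity_matrix(profiles)
for name, row in zip(names, sim):
    print(f"{name:12s}", " ".join(f"{value:+.2f}" for value in row))
```

Each row and column corresponds to one sample, the diagonal is 1, and replicates of the same treatment should show higher values with each other than with unrelated samples.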

and across doses, though a subtle recurring pattern within this block (highlighted by yellow arrows) indicates that one of the five replicates shows a somewhat different profile than the other four, indicating a possible batch effect or technical anomaly. The effective concentration of a drug is highlighted by the lowest dose of ixabepilone clustering together (black box) but having weak correlations with the highest doses of ixabepilone. The higher doses of the microtubule-stabilizing agent are extremely similar to low concentrations of microtubule inhibitor (blue box) but less similar to higher concentrations of microtubule inhibitor (purple box). (B) Negative control (DMSO) correlation pattern, zoom out view of the similarity matrix. Black arrows highlight artifacts from plate-layout effects; treatments plated in the same or very similar well positions still can show significant similarity even after normalization. This can be alleviated at the experimental level by scrambling positions across plates and/or plating the same treatment in multiple positions spread across an individual plate.
Can you see how there is not much correlation between the different compounds? Each compound, even belonging to the same MOA, seems to have a different morphological profile.
13. Using the same configuration as in the previous step (columns sorted by MOA > Compound > Concentration), continue to explore the similarity matrix and observe whether there are different MOAs with similar morphological profiles.
Go to the microtubule inhibitor and microtubule-stabilizing agent MOAs (Fig. 2A)
17. Zoom out (using the - key) to see a broader view of the clustering. Scroll through and find large squares of red color in the matrix to observe which MOAs are clustering.
18. Return to the tab containing the feature values (rather than similarity matrices) and go to Tools > Marker Selection. Choose T-test as the metric, MOA as the field, class A as DMSO, and class B as the tubulin polymerization inhibitor. Leave the default values for Number of Markers and Permutations. This step reveals which features are driving the differences between these two groups (Fig. 3).
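As a rough illustration of what a t-test-based marker selection does, the sketch below ranks features by the absolute value of a Welch-style t statistic between two classes. Whether Morpheus uses exactly this variant of the statistic is an assumption, and the feature names and values are hypothetical.

```python
import statistics


def welch_t(group_a, group_b):
    """Welch-style t statistic: difference in means over the combined standard error."""
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = (var_a / len(group_a) + var_b / len(group_b)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / standard_error


# Hypothetical normalized feature values per well for two classes.
dmso = {
    "Cells_AreaShape_Area": [0.1, -0.2, 0.0, 0.2],
    "Nuclei_Intensity_MeanIntensity_DNA": [0.0, 0.1, -0.1, 0.05],
}
treated = {
    "Cells_AreaShape_Area": [2.1, 1.8, 2.4, 2.0],
    "Nuclei_Intensity_MeanIntensity_DNA": [0.1, 0.0, -0.05, 0.1],
}

# Score every feature, then rank by the size of the difference between classes.
scores = {feature: welch_t(treated[feature], dmso[feature]) for feature in dmso}
ranked = sorted(scores, key=lambda feature: abs(scores[feature]), reverse=True)
for feature in ranked:
    print(f"{feature}: t = {scores[feature]:+.2f}")
```

Features at the top of the ranking differ most between the two classes relative to their within-class variability; the sign indicates the direction of the change in the treated class.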

IMAGE AND SINGLE-CELL VISUALIZATION FOLLOWING PROFILE INTERPRETATION
With large datasets, it often becomes challenging to retrieve images of sites or single cells for visualization to perform quality control, validate a pipeline, and, most importantly, interpret any morphological changes detected in the profiles explored during the data analysis and exploration (visualized with heatmaps, UMAPs, etc.). Along with visualizing sample and feature correlations as in Basic Protocol 1, it is also important to think biologically about organelle distribution, morphological characteristics such as cell and nucleus shape, and intensities of each stain. Connecting the numbers (Pearson coefficients, T-tests, morphological feature values in profiles, etc.) with how the cells look in the images can help the user decipher a complex profile.
In this protocol, we describe how to use a script we created to retrieve random or representative images from the dataset and plot them together, allowing the user to choose which samples to observe and how to group and display them. While random images are often helpful, especially in cases of high heterogeneity, it can also be helpful to computationally determine which cells' phenotypes are the most representative in a sample and compare them to control cells. This is not a trivial step, but can sometimes provide critical insight into morphological changes. In this protocol, we use Jupyter Notebook to derive representative cells by performing a clustering analysis on the morphological space of the population of single cells and sampling from the subpopulation closest to the center of the sample(s) of interest. This notebook can also be used to compute similarity matrices as in Morpheus; however, for large-scale experiments, we recommend examining the experiment using the per-well aggregated information as in Basic Protocol 1. Once a few treatments of interest are identified, single cells can be visualized using this protocol.
From the Jupyter Notebook, the user will obtain representative or random image sites and single cells, enabling comparison of the images with the correlation coefficient values obtained in the similarity matrix. By establishing the relationship between the images and heatmaps, the user can start hypothesizing about biological processes and morphological profiles that are significant, which could lead to more specific biological questions and assays. As in Basic Protocol 1, we provide some hints and interpretations for each step; for more detailed discussions of biological interpretations, see Understanding Results.

Our dataset table is in a CSV format and contains the extracted features for single cells in a Cell Painting assay. In this assay, 1571 compounds were tested across six doses in A549 cells. Here, we use the same dataset from Basic Protocol 1, but we require information about single cells, and each row of the table must have cell features and x-y locations within the image to enable single-cell image retrieval. We also provide all the images in which these single cells are located. For this purpose, we selected only a subset of samples within the dataset to minimize the memory requirements needed for users to explore the data. We performed normalization and feature selection with this dataset using pycytominer. The Jupyter Notebooks required to create this dataset from publicly available datasets (1_Samples_retrieval.ipynb and 2_Generate_Profiles.ipynb) are available on our GitHub under the basic_protocol_2/notebook folder. We also provide alternate code in the sample retrieval notebook to allow the loading of entire plates when experiment size and memory permit. The Jupyter Notebook functions were written using Python 3.9 (Van Rossum & Drake, 2009). Data processing was performed using pycytominer tools for normalization, feature selection, and data annotation. Check pycytominer documentation (https://pycytominer.readthedocs.io/en/latest/) for details on how to change parameters and inputs depending on your dataset.
The GitHub repository contains the following files relevant to Basic Protocol 2:
util folder with .py files containing functions written to be used in this notebook. These functions are installed onto the notebook using pip install and then imported using the statement from utils.correlations import *.
basic_protocol_2/Images folder, which contains the subset of images downloaded from https://github.com/broadinstitute/cellpainting-gallery. We provide PNG images that were compressed from the original TIFF images; PNG is a lossless format that requires less storage space.
basic_protocol_2/data folder, which contains the BasicProtocols2_Example.zip with a CSV file.
To use this notebook with your data, you could extract the features using CellProfiler and export the information to a spreadsheet that can be read in the Jupyter Notebook. Alternatively, if using a database file, you could transform it into a CSV file using our available Samples_retrieval.ipynb Jupyter Notebook. The notebook will perform annotation, normalization, and feature selection if you have not already run those steps. These steps can be bypassed if they have already been done (e.g., by notebook 2_Generate_Profiles.ipynb).

1. Open the Google Colab notebook Basic Protocol 2_Visualize cells and images.ipynb available at https://github.com/ciminilab/2023_Garcia-Fossa_Cruz_CurrentProtocols/blob/main/basic_protocol_2/notebook/Basic_Protocol_2.ipynb. Be sure to access the notebook from our GitHub repository, allowing you to check for any recent updates.

2. Click the Copy to Drive button and the notebook will be available on your Google Drive in the Colab Notebooks folder.
This step allows you to have your own copy of the notebook and, if you wish, make any modifications needed to run the notebook using your own data.
3. Run the first three cells in the notebook Section 1 - Import Libraries by clicking on the start button at the top left (or hit Ctrl + Enter). The first line will clone the GitHub repository and install the functions; the second line will install the required libraries to run this notebook (this process takes ∼5 min) and import the libraries to allow their use inside the notebook. Run the lines of code in the order that they appear in the notebook.
The Python packages required to run this notebook are also available on GitHub under the requirements.txt file. This file can be used to install packages via pip or to generate an environment using Anaconda or miniconda to run this Jupyter Notebook locally.

4. Run only the first cell inside Section 2 - Define Inputs. This will define the inputs required to run the cells in the notebook. The script requires the filename and pathname to access the CSV table and read it as a DataFrame. It also needs the pathname for the images directory.
The pathnames are all based on the ones available in the GitHub repository for this project. If you clone the repository in the first step, there is no need to change these inputs. To run this notebook with new data, mount the notebook inside Google Drive and provide the inputs for the variables (running the second code cell inside Section 2 instead of the first cell).
5. Run the cells inside Section 3a, which will import the dataset and perform annotation, normalization, and feature selection.
6. Run the first three cells in Section 4 (through cell 4.1.1) and choose Metadata_Compound_Concentration for this demonstration. These options were generated based on the names of columns with the "Metadata_" prefix. This choice will impact the information visualized on the plots for the next steps. If the choice is Metadata_Compound_Concentration, you will see values such as DMSO 0.0, etc. When using new data, add the "Metadata_" prefix to any such columns before loading it into the notebook, as it will appear under this dropdown and be used for aggregation (Fig. 4A).
We use dropdown interaction to allow users different choices based on the DataFrame, because users may be interested in looking at the data based on MOAs or compound names. When using new data, be aware that the tables must have columns containing metadata information with the "Metadata_" prefix.
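The "Metadata_" naming convention can be checked before loading a table into the notebook. The sketch below, using only the standard library, splits column names into metadata and feature columns and shows one way to add the prefix to annotation columns that lack it. The CSV content, the helper name with_metadata_prefix, and the "moa" column are hypothetical.

```python
import csv
import io

# Hypothetical two-row table; real tables contain many more columns and rows.
csv_text = (
    "Metadata_Well,Metadata_Compound_Concentration,"
    "Cells_AreaShape_Area,Nuclei_Intensity_MeanIntensity_DNA\n"
    "A01,DMSO 0.0,0.12,-0.03\n"
    "A02,ixabepilone 1.0,1.95,0.40\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
columns = list(rows[0].keys())

# Columns with the prefix are treated as annotations; the rest as features.
metadata_cols = [c for c in columns if c.startswith("Metadata_")]
feature_cols = [c for c in columns if not c.startswith("Metadata_")]


def with_metadata_prefix(column, annotation_columns):
    """Add the 'Metadata_' prefix to a known annotation column that lacks it."""
    if column in annotation_columns and not column.startswith("Metadata_"):
        return "Metadata_" + column
    return column


print(metadata_cols)
print(feature_cols)
print(with_metadata_prefix("moa", {"moa"}))
```

Only columns listed in metadata_cols would be expected to appear in the dropdown, so an annotation column missing the prefix would otherwise be treated as a feature.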
7. Run the cell in Section 4.2 to choose all the compounds available on the dataset to visualize. This step will select all the compounds in the dataset.
To select just a few compounds of interest to be visualized, run Section 4.3. This piece of code will create an interactive checkbox with the compound names for you to choose only a few options (Fig. 4A).
8. Run the cells in Section 5 to generate and graph the correlation between the compounds. Choose a column to be the labels for the correlation matrix using the dropdown, then use pycytominer to return a per-well aggregated DataFrame. A correlation matrix will be generated. There is an option to export the matrix as an image (type the name and press Enter/return).
In Section 5, the function that applies pycytominer operations aggregates the data and then performs a Pearson correlation analysis on the dataset. To visualize the matrix with different labels, choose a different column and rerun the notebook from that cell onward; the dataset will then be re-aggregated and a new correlation matrix will be calculated based on the new column.
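The aggregation that precedes the correlation analysis can be pictured as collapsing single-cell rows into one profile per group. The sketch below does this with a per-feature median in plain Python; it is a conceptual stand-in, not the notebook's function, and the choice of the median, the function name, and the toy values are assumptions for illustration.

```python
import statistics
from collections import defaultdict


def aggregate_median(rows, strata, features):
    """Collapse single-cell rows into one median profile per value of `strata`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata]].append(row)
    return {
        key: {f: statistics.median(r[f] for r in members) for f in features}
        for key, members in groups.items()
    }


# Hypothetical single-cell rows for two treatment groups.
cells = [
    {"Metadata_Compound_Concentration": "DMSO 0.0", "Cells_AreaShape_Area": 0.1},
    {"Metadata_Compound_Concentration": "DMSO 0.0", "Cells_AreaShape_Area": 0.3},
    {"Metadata_Compound_Concentration": "DMSO 0.0", "Cells_AreaShape_Area": -0.2},
    {"Metadata_Compound_Concentration": "drugA 1.0", "Cells_AreaShape_Area": 2.0},
    {"Metadata_Compound_Concentration": "drugA 1.0", "Cells_AreaShape_Area": 1.6},
]

aggregated = aggregate_median(
    cells, "Metadata_Compound_Concentration", ["Cells_AreaShape_Area"]
)
for group, profile in aggregated.items():
    print(group, profile)
```

Choosing a different column for `strata` regroups the same cells, which is why the correlation matrix has to be recomputed whenever the label column changes.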

9. Run the three cells in Section 5.1 to insert the correlation values calculated in the previous step inside the initial DataFrame as a new column. This function will get the chosen compound and find the correlation values for every other compound related to the first. Choose "DMSO 0.0" for comparison, because the aim for this dataset is to evaluate which compounds have morphological profiles more similar to the control.
Choose whichever compound is desired as a point of reference to be added to the DataFrame. This choice will depend on the biological question being asked.
10. Run all of the cells inside Section 5.2 and choose "DMSO 0.0". This choice reflects the biological question of which compounds are closely correlated to the negative control (DMSO). However, this is a dynamic Jupyter Notebook where the user could be interested in other compounds or MOAs.
11. In Section 6 -Visualize Cells, run the first cell to choose whether to visualize randomly selected or representative single cells. Choose the random method to select random samples for each treatment/group you have; choose the representative method to select the most representative cell within each subgroup. Many cells in this section rely on correlation to the reference compound selected in Section 5.1; if you want to change reference compounds, rerun those cells before returning to Section 6 and running all cells here.
The representative method uses the KMeans algorithm with the scikit-learn package (Pedregosa et al., 2011) to cluster data and find the most representative cell(s) (i.e., closest to the mean of the subgroup) within each subgroup. The random method will return a random sample of one cell for each subgroup (Reback et al., 2020). The representative method allows you to evaluate average change, while the random method is often helpful for quality control to check for out-of-focus or unusual cells.
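The difference between the two sampling modes can be sketched without scikit-learn. The example below is a deliberate simplification: instead of running KMeans, it picks the single cell closest (by Euclidean distance) to the mean feature vector of a subgroup, and compares that with a random pick. Function names and feature values are hypothetical.

```python
import math
import random
import statistics


def representative_index(cells):
    """Index of the cell whose feature vector is closest to the subgroup mean."""
    centroid = [statistics.fmean(column) for column in zip(*cells)]
    return min(range(len(cells)), key=lambda i: math.dist(cells[i], centroid))


def random_index(cells, seed=None):
    """Index of a randomly chosen cell (seeded for reproducibility if desired)."""
    return random.Random(seed).randrange(len(cells))


# Hypothetical subgroup: five cells described by two normalized features.
subgroup = [
    [0.0, 0.0],
    [2.0, 2.0],
    [1.0, 1.1],   # nearest to the subgroup mean of (1.0, 1.02)
    [0.0, 2.0],
    [2.0, 0.0],
]

print("representative cell:", representative_index(subgroup))
print("random cell:", random_index(subgroup, seed=0))
```

With heterogeneous populations, a single mean can fall between distinct subpopulations, which is one reason a clustering step such as KMeans may give more meaningful representatives than this one-centroid shortcut.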
12. Run the next cell and select how many cells you would like to display from each subgroup and whether or not you would like the images shown in order of subgroup correlation to the reference compound.
Answering "Yes" to "Would you like to use the correlations to order your image plot?" will order the dataset based on the correlation values to the reference compound selected in step 9, starting at 1.0 and descending; answering "No" will keep the DataFrame in the original order. The second question is about how many cells (c) to plot for each group.
The generated image will have (c × the number of subgroups) rows. Looking at one cell per subgroup creates a compact visualization, especially for many subgroups; looking at several per subgroup can increase confidence in the overall visual appearance of each subgroup, especially when displaying random cells.
13. Choose whether (a) each image should be rescaled to the minimum and maximum before being displayed or (b) the raw intensity values should be plotted. Raw intensities are typically more comparable across conditions (see below for caveats), but may be harder to see when the signal is dim and thus may require external rescaling after saving.
While raw images are generally more comparable than individually rescaled images, caution should be taken especially in comparing images from treatments imaged on different plates or different plate batches. Each plate is independently stained, imaged, and feature-normalized, and plates from different batches may have other differences such as reagent lots used. Thus, a treatment that induces "2× negative-control-mean-intensity" in channel X from plate 1 may be overall dimmer in raw pixel intensity values than a different treatment that induces "0.5× negative-control-mean-intensity" in channel X from plate 2 if the plate mean intensities in channel X are quite different. Any conclusions drawn based on looking at images should be subsequently checked against normalized feature data.
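The trade-off between per-image rescaling and raw intensities can be seen in a small sketch. Here two hypothetical 2 × 2 "images" differ tenfold in brightness, yet look identical after each is stretched to its own minimum and maximum; the function name and pixel values are assumptions for illustration.

```python
def rescale_min_max(image):
    """Stretch an image (list of rows) so its own minimum maps to 0 and maximum to 1."""
    flat = [pixel for row in image for pixel in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in image]  # flat image: nothing to stretch
    return [[(pixel - low) / (high - low) for pixel in row] for row in image]


# Hypothetical crops: the second is ten times brighter in raw intensity.
dim_crop = [[10, 20], [30, 40]]
bright_crop = [[100, 200], [300, 400]]

rescaled_dim = rescale_min_max(dim_crop)
rescaled_bright = rescale_min_max(bright_crop)

print(rescaled_dim)
print(rescaled_bright)
print("identical after rescaling:", rescaled_dim == rescaled_bright)
```

Per-image rescaling therefore makes dim signal visible but discards the absolute intensity difference between treatments, whereas raw values keep that difference at the cost of visibility.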
14. Insert the pixel size value, which is needed to add a scale bar to your images. For this example dataset, type "0.29898" (the pixel size in μm/pixel). Each microscope and lens will have its own configuration.
Some microscopes (such as the Opera Phenix microscope used in this experiment; Way, Natoli, et al., 2022) record the effective pixel size in a file such as an XML (eXtensible Markup Language). Other microscopes record this information in the file metadata; one easy way to check this is by opening the image in a tool such as Fiji (Schindelin et al., 2012) and looking at the Properties menu. Embedded metadata is sometimes missing or unreliable; when in doubt, consult the local expert on the microscope in question and/or calculate the effective pixel size based on the camera specifications and magnifications used.
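The role of the pixel size in drawing a scale bar comes down to one division. The sketch below converts a physical length into pixels using the 0.29898 μm/pixel value of the example dataset; the 10 μm bar length, the 100-pixel crop width, and the function name are hypothetical choices for illustration.

```python
PIXEL_SIZE_UM = 0.29898  # μm per pixel, as given for the example dataset


def scale_bar_pixels(length_um, pixel_size_um=PIXEL_SIZE_UM):
    """Number of pixels (rounded) spanned by a scale bar of the given length in μm."""
    if pixel_size_um <= 0:
        raise ValueError("pixel size must be positive")
    return round(length_um / pixel_size_um)


bar_px = scale_bar_pixels(10)          # 10 μm bar
crop_width_um = 100 * PIXEL_SIZE_UM    # physical width of a 100-pixel crop

print(f"10 μm scale bar spans {bar_px} pixels")
print(f"a 100-pixel crop is {crop_width_um:.1f} μm wide")
```

An incorrect pixel size scales the bar by the same factor as the error, so it is worth confirming the value for each microscope and objective before publishing images.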
15. Plot the selected single cells in random order by running the first cell of Section 6.1. This step allows a first view of the cells without the labels, so you can explore the images before knowing to which group the cells belong. Once you have explored the data, run the rest of the cells in Section 6.1 to append labels to see if your hypotheses were correct, to create an unshuffled version of the image, and to save the image to disk.

Looking at cells without labels allows users to formulate new hypotheses without bias about how they believe each treatment should look. Parameters to examine might include the organelle distribution within the cells; how mitochondria, endoplasmic reticulum, or Golgi apparatus are organized; changes in overall intensity of individual stains; or overall cell structure changes. This can be quite valuable for unbiased hypothesis generation!
16. Run Section 6.2 to display the full images from which the single-cell crops have been pulled (Fig. 5B). Looking at the entire field of view (FOV) may provide insights into additional biological aspects.

COMMENTARY
Background Information
Image-based profiling typically starts with using fluorescent markers to stain different targets and/or compartments of the cell. In our example data for both protocols, we used Cell Painting data. Cell Painting is a morphological profiling assay that multiplexes six fluorescent dyes, imaged in five channels, to reveal eight relevant cellular components. The experiment's aim was to characterize chemical perturbations in cells by measuring morphological changes after cells were exposed to various treatments. Briefly, cells were plated in multiwell plates, perturbed with treatments to be tested, then stained, fixed, and imaged on a high-throughput microscope. Images were acquired for DNA, RNA, endoplasmic reticulum, mitochondria, and AGP (actin, Golgi, and plasma membrane).
Software such as CellProfiler (Stirling et al., 2021) makes it easy to obtain and extract information from these images, extracting thousands of morphological features distributed into categories relating to the compartment measured (nucleus, cell, cytoplasm) and types of metrics (size, shape, texture, intensity, granularity, and more) to produce a feature profile that enables the detection of subtle phenotypes. To facilitate understanding of the features, CellProfiler feature name outputs are organized as follows: [Compartment] from one particular analysis of a Cell Painting assay at https://github.com/carpenterlab/2022_Cimini_NatureProtocols/blob/main/CellProfiler_features.csv. Note that the names of the features will vary based on the parameters used to analyze the assay.
The essential steps after extraction of the features are aggregation, normalization, and feature selection. These are the steps we describe in our Jupyter Notebooks using pycytominer (Basic Protocol 1 support notebook and in the main notebook used for Basic Protocol 2). Profiles of cells treated with different experimental perturbations are then compared to identify the phenotypic impact of chemical or genetic perturbations, grouping compounds and/or genes into functional pathways and identifying signatures of disease. We demonstrate these last two steps using Morpheus software and scripts on Jupyter Notebooks in the protocols above.
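The aggregation step mentioned above can be illustrated with a minimal sketch. It assumes the median is used to collapse single-cell measurements into one profile per well. The well names and values are hypothetical, and the code uses plain Python rather than the pycytominer API used in our notebooks.

```python
from collections import defaultdict
from statistics import median

# Hypothetical single-cell measurements: one (well, features) pair per cell.
single_cells = [
    ("A01", {"Nuclei_AreaShape_Area": 100.0, "Nuclei_AreaShape_FormFactor": 0.90}),
    ("A01", {"Nuclei_AreaShape_Area": 120.0, "Nuclei_AreaShape_FormFactor": 0.80}),
    ("A01", {"Nuclei_AreaShape_Area": 500.0, "Nuclei_AreaShape_FormFactor": 0.85}),
    ("A02", {"Nuclei_AreaShape_Area": 200.0, "Nuclei_AreaShape_FormFactor": 0.40}),
    ("A02", {"Nuclei_AreaShape_Area": 220.0, "Nuclei_AreaShape_FormFactor": 0.50}),
]


def aggregate_by_well(cells):
    """Collapse single-cell features into one median profile per well."""
    grouped = defaultdict(lambda: defaultdict(list))
    for well, features in cells:
        for name, value in features.items():
            grouped[well][name].append(value)
    return {
        well: {name: median(values) for name, values in feats.items()}
        for well, feats in grouped.items()
    }


well_profiles = aggregate_by_well(single_cells)
for well, profile in sorted(well_profiles.items()):
    print(well, profile)
```

The median is chosen here because it is less sensitive than the mean to a few outlier cells, such as the unusually large nucleus in well A01 of this toy table.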
Understanding the correlation coefficients calculated for the samples is important for interpreting the results of both protocols. A Pearson correlation coefficient is a measure of similarity: it quantifies the strength of the linear relationship between two variables (in our case, between two wells across a large set of features or between two features across a large set of wells). A Pearson coefficient of 1 means a perfect positive correlation, 0 means no correlation, and -1 means a perfect negative correlation (Pearson & Galton, 1895). A similarity matrix is a way to assess the covariance in features between all pairs of columns or rows. For each square of the matrix, a Pearson correlation coefficient is calculated across all features in the dataset for one pair of samples, and the square at the intersection of those two samples is set to the value of that coefficient; this is repeated for every pair of wells. This allows us to see at a high level how similar the overall phenotype is between any pair of samples in our experiment, and therefore how phenotypically distinct our treatments are.
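The construction of a similarity matrix described above can be sketched in a few lines. This is an illustrative example only: the well names and feature values are hypothetical, and in practice Morpheus computes the matrix for you.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles.

    Raises ZeroDivisionError if either profile has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def similarity_matrix(profiles):
    """Return sample names and the matrix of pairwise Pearson correlations."""
    names = list(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical normalized profiles (four features per well).
profiles = {
    "well_A": [1.0, 2.0, 3.0, 4.0],
    "well_B": [2.0, 4.0, 6.0, 8.0],  # proportional to well_A
    "well_C": [4.0, 3.0, 2.0, 1.0],  # reversed relative to well_A
}

names, matrix = similarity_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

In this toy example, well_A and well_B are perfectly correlated even though their absolute values differ, whereas well_A and well_C are perfectly anticorrelated.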

Critical Parameters and Troubleshooting
We reiterate that normalizing the features is fundamental before executing the steps in this paper. Normalization is usually performed on all of the features to fix range issues and allow comparison between features (Caicedo et al., 2017). Normalization is also recommended to increase the signal-to-noise ratio (Chandrasekaran, Ceulemans, Boyd, & Carpenter, 2021). Normalization performed on a plate level is recommended because this also corrects to some degree for plate-to-plate batch effects. Where sufficient negative controls exist, we recommend normalizing the features to the negative control. Check the profiling recipe for more information on how to process single-cell morphological profiles and how to normalize Cell Painting data.
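As a rough illustration of plate-level normalization to the negative control, the sketch below standardizes each feature using the mean and standard deviation of the control wells on the same plate. This is a simplified stand-in under stated assumptions, not the profiling recipe itself: the plate names, the feature name, and the choice of a plain z-score are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize_to_controls(wells, control="DMSO"):
    """Z-score every feature against the negative controls of the same plate.

    Needs at least two control wells per plate with non-zero spread.
    """
    control_values = defaultdict(list)
    for well in wells:
        if well["treatment"] == control:
            for name, value in well["features"].items():
                control_values[(well["plate"], name)].append(value)

    normalized = []
    for well in wells:
        scaled = {}
        for name, value in well["features"].items():
            reference = control_values[(well["plate"], name)]
            scaled[name] = (value - mean(reference)) / stdev(reference)
        normalized.append({**well, "features": scaled})
    return normalized


# Hypothetical data: plate P2 is systematically brighter than plate P1.
wells = [
    {"plate": "P1", "treatment": "DMSO", "features": {"intensity": 10.0}},
    {"plate": "P1", "treatment": "DMSO", "features": {"intensity": 12.0}},
    {"plate": "P1", "treatment": "DMSO", "features": {"intensity": 14.0}},
    {"plate": "P1", "treatment": "drug", "features": {"intensity": 18.0}},
    {"plate": "P2", "treatment": "DMSO", "features": {"intensity": 20.0}},
    {"plate": "P2", "treatment": "DMSO", "features": {"intensity": 24.0}},
    {"plate": "P2", "treatment": "DMSO", "features": {"intensity": 28.0}},
    {"plate": "P2", "treatment": "drug", "features": {"intensity": 36.0}},
]

normalized = normalize_to_controls(wells)
treated = [w["features"]["intensity"] for w in normalized if w["treatment"] == "drug"]
print(treated)
```

After normalization, the treated wells on both plates sit the same number of control standard deviations away from their own controls, which is the sense in which plate-level normalization reduces plate-to-plate batch effects.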
In data normalized to the negative controls, the negative control samples (or samples with otherwise weak phenotypes, here defined as a mean average precision across replicates of <0) will show limited similarity to one another and thus will show minimal clustering after step 16 of Basic Protocol 1 (hierarchical clustering). Somewhat unintuitively, this means that these samples will be spread across the entire dataset post-clustering. It is therefore expected, after hierarchical clustering and exploration (step 17 of Basic Protocol 1), to see one or a small number of "random" negative controls or weak perturbations clustering with a strong, consistent perturbation; this should not be taken as a sign that the strong perturbation in question is weak or similar to negative controls. Weak replicate correlation for any given sample can be checked in step 12 of Basic Protocol 1; if the replicate inconsistency looks possibly driven by technical issues (e.g., well position, Fig. 2B), one may consider performing another experiment to confirm whether a profile is truly weak. In general, profiles with weak replicate correlation should not be used to draw biological conclusions, and hierarchical clustering results should always be checked for spurious inclusion of weak profiles.
Proper reduction of the feature space is also an essential step to perform before analyzing new data in our protocols; this step will be automatically performed when following the profiling recipe (Chandrasekaran, Weisbart, Way, Carpenter, & Singh, 2022). If performing these steps on your own, a common starting point is to look for correlated features: when two features are too highly correlated, only one should be kept for further analysis. Since Pearson correlations are sensitive to large absolute feature values, we also recommend screening for unusual feature values; we provide guidance on performing this in Morpheus (see Basic Protocol 1, steps 3-6). Some feature reduction algorithms, such as support vector machines, give weights for each feature and remove the ones with lower weights (Caicedo et al., 2017). We typically perform feature reduction in pycytominer, which provides six

Profiles should be assessed for their quality before data interpretation, to remove treatments with no apparent phenotype and, in some applications, to exclude compounds that are too toxic to the cells (Rezvani, Bigverdi, & Rohban, 2022). One method to perform profile quality assessment is to measure the precision with which one can correctly retrieve replicate wells. This approach was used in the example data we provide to check for the replicability of the profiles (for details see Way, Natoli, et al., 2022).
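The idea of measuring how precisely replicate wells can be retrieved can be sketched as follows. This is a simplified illustration of average precision for a single query well, not necessarily the exact procedure used for the example data; the well names, treatments, and similarity values are hypothetical.

```python
def average_precision(ranked_is_replicate):
    """Average of the precision values at each rank where a replicate is found."""
    hits = 0
    precisions = []
    for rank, is_replicate in enumerate(ranked_is_replicate, start=1):
        if is_replicate:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def replicate_average_precision(query, similarities, treatment_of):
    """Rank all other wells by similarity to the query and score replicate retrieval."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    relevance = [treatment_of[well] == treatment_of[query] for well in ranked]
    return average_precision(relevance)


# Hypothetical plate layout: three replicates of one drug and two DMSO wells.
treatment_of = {"A01": "drug", "A02": "drug", "A03": "drug", "B01": "DMSO", "B02": "DMSO"}

# Hypothetical similarities of every other well to query well A01.
strong_profile = {"A02": 0.9, "A03": 0.7, "B01": 0.5, "B02": 0.1}
weak_profile = {"A02": 0.2, "A03": 0.1, "B01": 0.9, "B02": 0.5}

ap_strong = replicate_average_precision("A01", strong_profile, treatment_of)
ap_weak = replicate_average_precision("A01", weak_profile, treatment_of)
print(round(ap_strong, 3), round(ap_weak, 3))
```

A profile whose replicates rank above all other wells scores 1, whereas a profile whose replicates are scattered among unrelated wells scores lower. Averaging this score over all replicates of a treatment gives one way to flag weak profiles before interpretation.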
For troubleshooting of this method, problems, possible causes, and solutions are outlined in Table 1.

Understanding Results
When analyzing results, you may find that a profile of interest shows a dramatic difference from controls or other samples based on only a small number of similarly named features (such as a large number of features that fall within the nucleus or many changes in the texture of a particular stain), and the feature names have obvious meanings (e.g., nucleus area or integrated intensity of the mitochondria channel in the cytoplasm). In this scenario, interpretation may be straightforward, though you may need to look up the meaning of the feature names in the CellProfiler manual (https://broad.io/cellprofilermanual) to understand them better and discern their connection to the biological meaning. Some caution is warranted here; for example, DNA-damaging drugs could affect actin features because F-actin plays a role in DNA repair. Damage induced to the DNA induces nuclear actin formation (Belin, Lee, & Mullins, 2015), and these nuclear actin structures play a role in double-stranded break (DSB) repair, such as recruitment of proteins to enable repair of the heterochromatin through homologous recombination and assisting DSB movement in euchromatin repair (Caridi, Plessner, Grosse, & Chiolo, 2019). There may not be a straight line from a feature name to the biological function because cells are deeply interconnected systems and changes that start in a single genetic pathway can ripple throughout other pathways in the cell. Nevertheless, feature names can often create insights.
Instead of a few, easily interpretable features, you may find there are many dominant features in the profile and their collective meaning is not obvious. In such cases, an expert might be able to stare at the list and derive some meaning. For example, an expert might realize that many different stains showing increased correlation may actually be related to a decreased x-y cell size (because in a rounded cell, organelles are more likely to overlap one another on the x-y plane and may be either truly colocalized or merely spread across the z dimension). If you've looked at your feature list but need some backup, consider sharing your data on forum.image.sc so that experts can weigh in. An example of this can be found in the morphological profile induced by the microtubule inhibitor and microtubule-stabilizing agent in this dataset (cabazitaxel and ixabepilone, respectively). To understand the features that differentiate between our negative control (DMSO) and the microtubule perturbations, we performed marker selection using a t-test. Marker selection comes from genome analysis but can also be regarded as a form of feature selection. The model takes the features belonging to two classes as input, and a t-test is calculated to identify marker features that discriminate between the two classes (DMSO vs. microtubule) (Gould et al., 2006). While Morpheus does attempt to correct individual t-tests for sample number with a false discovery rate, it does not and cannot control for how many tests the user runs; these tests are therefore appropriate for gaining qualitative insight into the relative importance of various stains and/or feature classes in distinguishing a phenotype, but the values returned should not be directly reported, and any attempt to quantify these differences should be performed through standard statistical approaches. Our results show that many important features (Fig. 6A) belong to Granularity and Texture feature groups across a number of different stains, which makes sense in the context of induction of massive cytoskeletal disruption. Since microtubule disruption perturbs cell division, the presence of Nuclei_AreaShape_FormFactor (a measure of shape uniformity in which linear and/or irregular shapes have values near 0 and a perfect circle is 1) helps indicate that we are not looking at general cytoskeletal disruption, but specific disruption of the microtubules. This result highlights that the aggregate of different features is important for connecting profiles to perturbations.
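The marker selection described above can be approximated outside Morpheus with a per-feature two-sample t-statistic. The sketch below assumes Welch's unequal-variance form, which may differ from the exact test Morpheus applies, and it omits any false discovery rate correction. Apart from Nuclei_AreaShape_FormFactor, the feature name and all values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups (positive when mean(a) > mean(b))."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_marker_features(treated, control):
    """Rank features by the absolute t-statistic between treated and control wells."""
    scores = {name: welch_t(treated[name], control[name]) for name in treated}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical normalized well-level values, three wells per class.
control = {
    "Nuclei_AreaShape_FormFactor": [0.90, 0.88, 0.92],
    "hypothetical_uninformative_feature": [1.0, 1.2, 0.8],
}
treated = {
    "Nuclei_AreaShape_FormFactor": [0.60, 0.62, 0.58],
    "hypothetical_uninformative_feature": [1.1, 0.9, 1.0],
}

ranked = rank_marker_features(treated, control)
for name, t_value in ranked:
    print(f"{name}: t = {t_value:.2f}")
```

In this toy example, the form factor ranks first with a strongly negative statistic (nuclei are less circular in treated wells), while the second feature has a statistic near zero. As noted above, such rankings are best treated as qualitative guidance rather than reportable statistics.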
Examining example images directly alongside a list of important features can also help decipher a complex profile. An example where looking at features and images could help uncover the biological meaning of an event is an assay to identify cells in different phases of the cell cycle using fluorescent markers such as DAPI to measure DNA content (Ferro et al., 2017). Based on significant changes in the feature space where the minor axis of the Nuclei and Cell area are low and DNA staining intensity is high, the user could look at single cells and realize these feature changes relate to cells that are going through metaphase. Basic Protocol 2 facilitates displaying single cells and images, which can otherwise be challenging to locate and access in large-scale experiments. In our example images of cells treated with two microtubule-related drugs, we observe that both drugs interfere with the cell cycle to produce similar morphologies, disrupting the overall appearance of every channel. As seen in Figure 6B, both treatments induce multinucleation (Fig. 6B, DNA column), as has been previously described for microtubule inhibitors (Azarenko, Smiyun, Mah, Wilson, & Jordan, 2014). Disruption of the cell cycle is also likely apparent in the lower overall cell count in treated vs. control cells (Fig. 6C). The Golgi localization and distribution are visually quite distinct compared to DMSO (Fig. 6B, AGP column), which could be related to the role of microtubules in vesicular trafficking and to their role in modeling the shape of organelles, including Golgi (Fourriere, Jimenez, Perez, & Boncompain, 2020; Thyberg & Moskalewski, 1985). We can therefore relate these morphological features and observations to the mechanisms of action of these drugs, providing a useful pattern to follow for investigators examining their own data and formulating their hypotheses.
Sometimes, however, the most important differences are not visible to humans, and image-based profiling approaches have sometimes outperformed human expert image analysis for precisely such reasons (Gibson et al., 2015;Zhou et al., 2021).
Finally, we should note that, in some situations, following the procedures provided still does not allow you to make much headway in truly understanding the induced phenotype. If so, profile data can be used in other ways, e.g., by simply using the profile as a signature of the sample and trying to use drugs to revert this disease phenotype to a healthy-associated phenotype. If one has access to computational experts, one can also try to query one's data against publicly available datasets, though these approaches are currently still experimental. The interpretation of complex profiles is a challenge, but when successful can propel research in new directions to uncover exciting new mechanisms.

Time Considerations
For Basic Protocol 1, supposing that data tables were pre-processed for normalization and feature selection before input into Morpheus, the total time to explore the data is ∼1 hr. Basic Protocol 2 could take up to 2.5-3 hr if running the protocol with different settings and taking time to evaluate the images and create hypotheses.