Cross‐Modal Graph Contrastive Learning with Cellular Images

Abstract Constructing discriminative representations of molecules lies at the core of a number of domains such as drug discovery, chemistry, and medicine. State-of-the-art methods employ graph neural networks and self-supervised learning (SSL) on unlabeled data to learn structural representations, which can then be fine-tuned for downstream tasks. Albeit powerful, these methods are pre-trained solely on molecular structures and thus often struggle with tasks involving intricate biological processes. Here, we propose to assist molecular representation learning with perturbed high-content cell microscopy images at the phenotypic level. To incorporate cross-modal pre-training, we construct a unified framework that aligns the two modalities through multiple types of contrastive loss functions, which proves effective on the newly formulated tasks of mutually retrieving molecules and their corresponding images. More importantly, the model can infer functional molecules from cellular images generated by genetic perturbations. In parallel, the proposed model transfers non-trivially to molecular property prediction and shows substantial improvement on clinical outcome prediction. These results suggest that such cross-modal learning can bridge molecules and phenotypes and play an important role in drug discovery.


A CIL-750K details
The original CIL dataset includes 919,265 five-channel fields of view covering 30,616 test compounds. It also includes metadata files that record morphological features for each cell in each image, both at the single-cell level and at the population-average level (i.e., per well); a workflow for image analysis to generate morphological features is also provided. Quality-control indicators are provided as metadata, flagging fields of view that are out of focus or contain highly fluorescent material or debris. Chemical annotations are also provided for the compound treatments. Figure S1 shows the molecular data distribution and the number of views per molecule in the CIL dataset. In CIL, each molecular intervention is imaged from multiple views in an experimental well and each experiment was repeated several times, resulting in an average of 30 views per molecule. To keep the data balanced, we restricted each molecule to a maximum of 30 images, resulting in a cross-modal graph-image benchmark containing 750K views. Each view has a resolution of 692×520 pixels and 5 channels. These images were acquired with the ImageXpress Micro XLS automated microscope at 20× magnification. We resize the images to 128×128 without any cropping to fit the CNN models' input format. Figure S2 shows examples of molecules and corresponding images from the CIL dataset, and Figure S3 shows multiple views of a randomly selected molecule.
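The per-molecule cap of 30 views described above can be sketched as follows; `view_paths` (a mapping from molecule ID to its image file paths) is a hypothetical input, since the source does not describe its data structures:

```python
def cap_views(view_paths, max_views=30):
    """Limit each molecule to at most `max_views` images, as done to keep
    the CIL-750K graph-image benchmark balanced."""
    capped = {}
    for mol_id, paths in view_paths.items():
        # The selection rule is not specified in the text, so this sketch
        # simply keeps the first `max_views` views.
        capped[mol_id] = paths[:max_views]
    return capped
```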

B Implementation details and hyperparameters
Here we describe the implementation details for the pre-training and fine-tuning stages.

B.1 Generative graph-image matching
We employ Variational Auto-Encoders (VAE) as generative agents, which are asked to recover the representation of one modality given the parallel representation from the other modality. For example, we need to model the conditional likelihood p(z_I | z_G) when generating the cellular image from its corresponding molecular graph. The reparameterized variable can be defined as z̃ = μ(z_G) + σ(z_G) ⊙ ζ with ζ ∼ N(0, 1). Therefore, we have the following lower bound:

log p(z_I | z_G) ≥ E_{q(z̃ | z_I, z_G)} [log p(z_I | z̃)] − KL( q(z̃ | z_I, z_G) ‖ p(z̃ | z_G) )

Similarly, when generating the molecular graph from its corresponding cellular image, we have:

log p(z_G | z_I) ≥ E_{q(z̃ | z_G, z_I)} [log p(z_G | z̃)] − KL( q(z̃ | z_G, z_I) ‖ p(z̃ | z_I) )

Both objectives are composed of a conditional log-likelihood and a KL-divergence. Following the variational representation reconstruction (VRR) of [1], we replace the conditional log-likelihood with a mean-squared error (MSE) on the representation space:

L_rec = ‖z_I − ẑ_I‖² + ‖z_G − ẑ_G‖²,

where ẑ_I and ẑ_G denote the representations reconstructed from the respective other modality. Thus, combining the reconstruction term with the two KL regularizers mentioned above, the final GM loss function can be formulated as:

L_GM = ‖z_I − ẑ_I‖² + ‖z_G − ẑ_G‖² + KL( q(z̃ | z_I, z_G) ‖ p(z̃ | z_G) ) + KL( q(z̃ | z_G, z_I) ‖ p(z̃ | z_I) )
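As a minimal numerical sketch of the GM loss (not the authors' implementation), assuming diagonal-Gaussian posteriors and, for simplicity, a standard-normal prior, the MSE reconstruction and KL regularizers can be combined as:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def gm_loss(z_i, z_i_hat, mu_i, log_var_i,
            z_g, z_g_hat, mu_g, log_var_g, beta=1.0):
    """Generative matching loss: MSE reconstruction on the representation
    space (VRR-style) plus KL regularizers for both generation directions.
    All argument names are ours, chosen for illustration."""
    rec = np.sum((z_i - z_i_hat) ** 2) + np.sum((z_g - z_g_hat) ** 2)
    kl = kl_diag_gaussian(mu_i, log_var_i) + kl_diag_gaussian(mu_g, log_var_g)
    return rec + beta * kl
```

The `beta` weight is an assumption; balancing reconstruction against the KL terms is a standard VAE design choice rather than something specified in the text.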

B.2.1 Dataset
To standardize clinical-trial-outcome prediction, we use the Trial Outcome Prediction (TOP) benchmark constructed by HINT, which incorporates rich data components including drug molecule information, disease information, trial eligibility criteria and trial outcome information. Herein, we consider phase-level evaluation of the trial outcome, where we predict the outcome of a single-phase study. Since each phase has different goals (e.g., phase I is for safety, whereas phases II and III are for efficacy), we evaluate phases I, II, and III separately. We follow the data splitting proposed by HINT; data statistics are shown in Table S1. We first include three machine-learning methods (RF, LR, XGBoost) and a knowledge-aware GNN model, HINT, as our baselines. Random Forest (RF) is a bagging algorithm for classification or regression that obtains its prediction by voting or averaging over the base learners (decision trees). Logistic regression (LR) is a simple, parallelizable classification method that uses maximum-likelihood estimation for parameter fitting. XGBoost, also called extreme gradient boosting, uses a CART regression tree or a linear classifier as the base learner to ensemble model predictions. These machine-learning baselines use 1024-dimensional Morgan fingerprint features for trial outcome prediction. HINT is a hierarchical interaction network designed for clinical-trial-outcome prediction. It uses (1) 1024-dimensional Morgan fingerprint features, (2) a pre-trained BERT model to encode eligibility criteria into sentence embeddings, and (3) a graph-based attention model, GRAM, to encode disease information. Furthermore, we also include self-supervised learning methods among our baselines: ContextPred, GraphLoG, GROVER, GraphCL and JOAO. For this downstream task, we use the molecule encoders over input molecular graphs and fine-tune them for clinical outcome prediction.
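Since several baselines consume 1024-dimensional Morgan fingerprints, a toy illustration of the bit-folding idea behind such fingerprints is given below; real pipelines would instead call RDKit's `AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)`, and the integer identifiers here are hypothetical stand-ins for hashed atom environments:

```python
def fold_to_bitvector(substructure_ids, n_bits=1024):
    """Fold integer substructure identifiers (e.g. hashed Morgan/ECFP
    atom environments) into a fixed-length binary fingerprint."""
    fp = [0] * n_bits
    for ident in substructure_ids:
        fp[ident % n_bits] = 1  # modulo folding onto a fixed bit length
    return fp
```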

B.2.3 Fine-tuning hyperparameters
For fine-tuning, an extra linear classifier is appended to the pre-trained GNN. We fine-tune the model for 100 epochs using a batch size of 32 with a dropout rate of 50%. We use the Adam optimizer with an initial learning rate of 1e-3. Experiments are performed 5 times, and the mean and standard deviation of ROC-AUC and PR-AUC are reported. The feature extraction consists of three parts: 1) node feature extraction; 2) bond feature extraction; 3) the topology connection matrix. We use RDKit to extract all features as the input of the GNN. Table S3 and Table S4 show the atom and bond features used in MIGA.
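As a minimal sketch of part 3), the topology connection matrix can be built from a bond list as a symmetric adjacency matrix (the function and argument names are ours; in practice RDKit's `Chem.GetAdjacencyMatrix` provides this directly):

```python
def topology_matrix(num_atoms, bonds):
    """Build a symmetric adjacency (topology connection) matrix from a
    list of (atom_i, atom_j) bond index pairs."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1  # molecular graphs are undirected
    return adj
```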

B.3.3 Fine-tuning hyperparameters
For fine-tuning, we follow GraphCL's [2] settings. An extra linear layer is appended to the pre-trained GNN to perform classification or regression, respectively. We fine-tune the model for 100 epochs using a batch size of 32 with a dropout rate of 50%. We use the Adam optimizer with an initial learning rate of 1e-3. Experiments are performed 5 times, and the means and standard deviations of AUC and RMSE are reported.

F Ablation study

F.1 Loss Modules
Table S6 shows the effect of the weight of each loss in our proposed framework, highlighting the importance of each component in the overall system. The results provide insight into the contribution of individual loss weights to the framework's efficacy. The variations in performance can be attributed to the interplay between the loss functions, suggesting that the synergy between the different components is critical. We also conducted comparative experiments with the ablation variant GIC+GIM and a CLIP pre-trained encoder on the molecular property prediction task (Table S7).
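The weighted combination varied in this ablation can be sketched as a simple weighted sum of the loss modules (the weight parameter names and defaults are ours; the source reports only the effect of varying the weights):

```python
def miga_total_loss(l_gic, l_gim, l_gm, w_gic=1.0, w_gim=1.0, w_gm=1.0):
    """Weighted sum of the contrastive (GIC), matching (GIM) and
    generative (GM) loss terms studied in the ablation; setting a
    weight to zero recovers an ablation variant such as GIC+GIM."""
    return w_gic * l_gic + w_gim * l_gim + w_gm * l_gm
```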

F.2 Cellular Images
Figure S8 (a) shows the effect of the number of views per molecule on the graph retrieval task. We notice that with fewer than 10 views, the more views involved in pre-training, the better the results; however, using all views (on average 25 views per molecule) does not yield further gains. We attribute this phenomenon to the batch effect of the cellular images: images from different wells can vary considerably because of experimental errors and thus mislead the model.

F.3 Model Architecture
Figure S8 (b) studies the impact of CNN architecture choices on the graph retrieval task. Due to the relatively small amount of data, we use small CNN variants (ResNet34 [5], EfficientNet-B1 [6], DenseNet121 [7] and ViT-tiny [8]) for evaluation. We note that small CNN models such as ResNet, EfficientNet and DenseNet achieve comparable performance, while larger models like ViT do not show superiority. We assume this is because these models are pre-trained on non-cellular images, so heavier models do not necessarily lead to gains on cellular tasks.

Figure S2: A random selection of 10 molecules and corresponding cellular images (1 view).

Figure S3: Five different views of the same molecule.

B.3.1 Dataset
BBBP: The blood-brain barrier penetration dataset includes binary labels for 2035 compounds on their permeability properties. Tox21: The Tox21 dataset was created in the Tox21 data challenge and contains qualitative toxicity measurements for 7821 compounds on 12 different targets, including nuclear receptors and stress-response pathways. HIV: 41K compounds with binary labels for HIV virus replication inhibition. ToxCast: includes 8576 drug compounds with binary labels of toxicity experiment outcomes across 617 targets. ESOL: a small dataset consisting of water solubility data for 1128 compounds. Lipophilicity: experimental data for the octanol/water distribution coefficient of 4200 molecules.

Figure S5: Additional examples for the graph retrieval task. The top-ranked molecules retrieved by our method and the baseline are shown. Molecules that hit the ground truth are flagged.

Figure S6: Additional examples for the image retrieval task. The images retrieved by our method and the baseline are shown.

Figure S7: Additional examples for the zero-shot graph retrieval task. The figure shows cells induced by cDNA interventions for specific genes (HIF1A, HSPA5, TP53, STAT3); our model can identify diverse molecules that have similar functions to these cDNA interventions (ticked).

Figure S8: (a) Effect of the number of views per molecule and (b) effect of CNN architecture.

Table S1: Statistics of clinical outcome datasets.

Table S2: Statistics of datasets. GC for Graph Classification, GR for Graph Regression.

Table S6: Ablation study on graph-image retrieval tasks.