Meta‐analysis of massively parallel reporter assays enables prediction of regulatory function across cell types

Abstract Deciphering the potential of noncoding loci to influence gene regulation has been the subject of intense research, with important implications in understanding genetic underpinnings of human diseases. Massively parallel reporter assays (MPRAs) can measure regulatory activity of thousands of DNA sequences and their variants in a single experiment. With increasing number of publically available MPRA data sets, one can now develop data‐driven models which, given a DNA sequence, predict its regulatory activity. Here, we performed a comprehensive meta‐analysis of several MPRA data sets in a variety of cellular contexts. We first applied an ensemble of methods to predict MPRA output in each context and observed that the most predictive features are consistent across data sets. We then demonstrate that predictive models trained in one cellular context can be used to predict MPRA output in another, with loss of accuracy attributed to cell‐type‐specific features. Finally, we show that our approach achieves top performance in the Fifth Critical Assessment of Genome Interpretation “Regulation Saturation” Challenge for predicting effects of single‐nucleotide variants. Overall, our analysis provides insights into how MPRA data can be leveraged to highlight functional regulatory regions throughout the genome and can guide effective design of future experiments by better prioritizing regions of interest.

regulatory sequences, and increase our understanding of the regulatory code and how its alteration can lead to phenotypic consequences.
Previous work have used single MPRA data sets to better identify functional DNA sequences and then study the features that make a sequence regulatory active (Grossman et al., 2017;Lee et al., 2015;Sharon et al., 2012). For example, in the expression quantitative trait loci (eQTL) causal SNP challenge of the Fourth Critical Assessment of Genome Interpretation (CAGI4) community experiment, participants developed methods for predicting regulatory activity of candidate genomic regions and the effect of minor variants on their regulatory potential in MPRA (Beer, 2017;Kreimer et al., 2017;Zeng, Edwards, Guo, & Gifford, 2017). The main lessons learned from this community effort highlighted the effectiveness of ensembles of nonlinear methods, especially when used on features related to transcription factor (TF) binding and chromatin accessibility. Interestingly, epigenetic properties predicted from DNA sequence (Alipanahi, Delong, Weirauch, & Frey, 2015;Zeng, Hashimoto, Kang, & Gifford, 2016;J. Zhou & Troyanskaya, 2015) were shown to be more predictive features than experimentally measured epigenetic properties.
Although these efforts provided an important first step, each of them focused on a single MPRA data set in a specific cellular context.
Critical questions, therefore, remain as to how generalizable the insights from MPRA experiments are to other data sets or other cellular contexts. Here, we present a first comprehensive analysis of several MPRA data sets collected by different labs and in various cellular systems; these data sets explore the effect of endogenous loci in several different cell-types. We derive a large set of properties to characterize each putative regulatory region and compare the performance of different methods and features for predicting MPRA output. We show that MPRA activity is predictable and that prediction methods tend to perform consistently well when tested on different data sets, with better performance for nonlinear methods and favorable results when using an ensemble approach.
Consistently, the predictive capacity of individual features is comparable across data sets, with TF binding and epigenetic properties being the top predictors.
We next turned to investigate the generalizability of our models across data sets, which allowed us to distinguish between determinants of MPRA activity that are dependent on the cellular context (e.g., protein milieu in the cell) versus ones that are intrinsic to the DNA sequence. Here, we demonstrate that predictive models trained in one cellular context can be used to predict the MPRA output in another with reduced prediction power and that, as expected, regions whose activity is cell-type specific are harder to predict in this cross-data set setting. We also observe that gene expression of TFs is overall consistent with the predictive ability of their binding instances, with highly expressed TFs being generally more predictive of MPRA activity. When comparing pairs of data sets for TFs that are predictive of MPRA activity, we notice that in some cases, TFs with cell-type-specific functionality are better predictors in that cell-type.
In addition, we wanted to evaluate the applicability of our predictive models in studying the function of naturally occurring mutations. We, therefore, tested the ability of our framework to detect the effects of small variants-single-nucleotide variants (SNV) or short insertions or deletions (indels)-on MPRA activity and achieved similar accuracy to the state of the art methods . Finally, we applied our approach to the Regulation Saturation challenge of the Fifth Critical Assessment of Genome Interpretation (CAGI5), and demonstrate that it achieves top performance in identifying functional effects of SNVs in supervised settings.

| MPRA data sets
We used five publicly available MPRA data sets and one unpublished data set. (a) K562-putative regulatory regions (Kwasnieski et al., 2014) selected from ENCODE-based annotated regions in K562 cells (Encode-Project-Consortium, 2012;Ernst & Kellis, 2010;Hoffman et al., 2013). corresponding to the first test group in the CAGI4 challenge (Kreimer et al., 2017) were used for the remaining analyses. (d) HepG2-chr-2,236 candidate liver enhancers (Fumitaka Inoue et al., 2017) and 102 positive and 102 negative control sequences. Each sequence is 171 base pairs and tested in chromosomal context. (e) HepG2-epi-the same set of elements (Inoue et al., 2017) as above, tested in episomal context. For both data sets 4 and 5, all regions were used to fit MPRAnalyze and the 2,236 candidate enhancer regions were used for the remaining analyses. (f) hESC-2,464 putative enhancer regions  and 200 negative controls. Each region is 171 base pairs and tested in chromosomal context in hESC cell line. All regions were used to fit MPRAnalyze, whereas only the 2,268 candidate enhancer regions were used for the remaining analyses.

| Quantifying activity of regions using MPRAnalyze
For each data set, we obtain the RNA and DNA raw counts for each barcode. We obtain a quantitative measure of enhancer-induced transcription using MPRAnalyze (Ashuach et al., 2019). MPRAnalyze assumes a linear relationship between the RNA and DNA counts, with the scaling parameter, denoted ɑ, as the transcription rate. The method uses a parametric graphical model to incorporate external covariates and dispersion estimates into quantifying ɑ.
The MPRAnalyze model assumes the DNA counts are Gamma-distributed and that given the latent plasmid count, the RNA counts are Poisson-distributed centered around the product of the plasmid count and ɑ. This results in a closed-form negativebinomial likelihood function for the RNA counts. External covariates such as barcode effect, batch effects and conditions of interest are then incorporated into the model by constructing a pair of nested generalized linear models: One using the DNA counts to estimate the latent plasmid counts, and the other using these latent plasmid counts along with the RNA raw counts to estimate ɑ.
Classification of active/inactive enhancers is done by using the fitted ɑ values. If a data set has control regions (K562 and hESC), we first calculate a robust version of the standard score from the ɑ values by subtracting the median over the control regions and dividing by the median absolute deviation (MAD) of the control regions. If no control region exists for the data set, we perform the previous step with the median and MAD over all regions instead of just the control regions. We then compute the survival function for each standard score and apply the Benjamini-Hochberg (BH) correction. The active regions are then defined as regions with a false discovery rate (FDR) of less than 0.05.

| Features
We assessed the correlation of 56 single features (Table S1) (Siepel et al., 2005). (e) Closest Gene Expressionexpression (TPM) of the closest gene from RNA-seq data in the corresponding cell-type. (f) Promoter, Exon, Intron, Distal -binary features indicating whether the element intersects a promoter, exon, and intron.
Distal is defined to be 1 if the element does not intersect with either promoter, exon, or intron annotations. (g) #motifs, Motif Density-number of significant DNA-binding ENCODE motifs (Encode-Project-Consortium, 2012) from simple DNA-binding motif scoring (Grant et al., 2011)

| Statistical tests
We examine the predictivity of features and accuracy of prediction models using several statistical tests. For regression task-for example, predicting quantitative activity-we applied several correlation measures (Pearson, Spearman, and Kendall) considering either the entire test data or regions at the top 25% of quantitative activity; we also applied another Spearman's correlation test after first binning quantitative activity by quintiles. We refer to these seven tests as the regression tests. For classification task-for example, predicting active or not active-we record the area under receiver operating characteristic curve (AUROC) and area under precision-recall curve (AUPRC); we refer to these two tests as the classification tests. The significance of each regression task was evaluated by the respective statistical test q-values, which are obtained from p-values via the Benjamini-Hochberg correction. The significance of classification was evaluated by the q-values of the Kolmogorov-Smirnov test on the predictions with positive ground truth labels.

| Training and testing
We deterministically divide each data set into 10 sections; data sets with the same regions (LCL-eQTL and HepG2-eQTL, and HepG2-chr and HepG2-epi) are divided consistently. For the supervised case, we perform 10-fold cross-validation where each fold trains the model on nine training sections then evaluating on the remaining section. For the cross-data set case, we perform 10-fold cross-validation where each fold trains the model on nine sections from the training data set, then evaluating on the corresponding remaining section in the last data set. We use the statistics from each fold to calculate the overall mean and standard deviation statistics.
When comparing cross-data set learning performance between training on chromosomal MPRA data (HepG2-chr) versus training on episomal MPRA data (HepG2-epi), we observe that training on HepG2-chr showed better results than HepG2-epi 37 out of 40 times (comparing results across different statistical tests; Figure 3 and Table S4). Same regions were used for training and testing was done on the other four data sets.

| Prediction models
We predict the quantitative activity from element features with four regression models and their ensemble. The four models are a linear regressor with ElasticNet regularization (Zou & Hastie, 2005) with 0.5 as the L1 and L2 regularization coefficients and a RandomForest regressor (Breiman, 2001), an ExtraTrees regressor (Geurts et al., 2006), and a GradientBoosting regressor (Zhu et al., 2009), each with 1,000 KREIMER ET AL.

| 1301
estimators. The ensemble method is implemented by taking the average prediction of all four regression models.
For the classification task, we use a RandomForest classifier (Breiman, 2001) and an ExtraTrees classifier (Geurts et al., 2006), each with 1,000 estimators, as well as their ensemble. The ensemble method averages the predicted probability from each classifier.
For both regression and classification, we define a shuffle model with the same composition as an ensemble model but shuffles the labels of the training set before training. This allows us to quantify the probability of producing our ensemble results by chance.

| CAGI5 data processing
We predicted the variant impacts (positive, zero, negative) of 13,186 SNVs from five enhancers and nine promoters after training on 4,650 different SNVs from the same enhancers and promoters. For each SNV, we obtained the variant and wild-type sequence, each of length 187-600, then featurized both variant and wild-type with the 4,535 features that differ between variant and wild-type: Predicted epigenetic properties, DNA k-mer frequencies, #GC, #polyA/T, DNA shape features, and conservation; we collectively referred to these features as Sequence features.
For the discrete challenge, we concatenate the features from variant and wild-type into a feature vector of size 9,070. We trained one multiclass classifier to predict the discrete impact for all promoter variants at once, and another classifier for the enhancer variants. For submitting to the continuous challenge, we used the same feature processing steps as in the discrete challenge, but trained regressors to predict the continuous impact.
We retrospectively discovered that concatenating wild-type/variants features performs identically to taking their difference; we also retrospectively discovered that training a classifier for all promoters (enhancers) together performs better in classification, whereas training a separate regression per element performs better in regression. As such, for all analyses of the continuous challenge besides the original submission, F I G U R E 1 Individual feature correlation with massively parallel reporter assay output. The within-data set ranking is calculated by first ranking each feature by each test, then taking the median of the regression and classification test rankings. The comprehensive ranking is the median across all the data set rankings. The heatmaps are ordered according to the comprehensive ranking and colored according to (a) the within-data set rank, (b) the Spearman correlation coefficient for regression task, and (c) the area under receiver operating characteristic curve value for classification task we subtracted the corresponding wild-type features from each of the 4,535 variant features and used this difference in features to predict the continuous impact via regression, separately for each element (enhancer/promoter).

| RESULTS
We used five publicly available MPRA data sets and one unpublished data set collected at several labs using a range of experimental methodologies and cell-types (Section 2). In all cases, the MPRA constructs were designed to test endogenous human DNA sequences, and not in-silico designed synthetic sequences (Smith et al., 2013). Thus, each element tested in each data set is associated with a source genomic region. Each data set consists of approximately 2,000 sequences with length that varies between 121 and 171 base pairs (Section 2). Unless otherwise noted, the MPRA experiment was performed in an episomal context. The first data set (Kwasnieski, Fiore, Chaudhari, & Cohen, 2014), which we refer to as K562, consists of putative regulatory regions selected from ENCODE-based annotated regions in K562 cells (Encode-Project-Consortium, 2012; Ernst & Kellis, 2010;Hoffman et al., 2013). The second and third data sets, which we refer to as LCL-eQTL and HepG2-eQTL (Tewhey et al., 2016), consist of sequences that contain F I G U R E 2 Performance of (a) regression models and (b) classification models with different feature combinations. The within-data set ranking is calculated for each cell by taking the median of the rankings for all the (a) regression or (b) classification tests within a data set. Each heatmap is colored according to the within-data set rankings. The statistics are mean ± standard deviation for (a) Spearman and Kendall tests or (b) area under receiver operating characteristic curve and area under precision-recall curve tests F I G U R E 3 Performance of cross-data set learning for (a) regression task and (b) classification task between cell-types. All cross-data set learning models are ensemble models with full features. Each cell is colored according to the median over the ranks of all (a) regression tests or (b) classification tests. The statistics are mean ± standard deviation of (a) Spearman and Kendall tests or (b) area under receiver operating characteristic curve and area under precision-recall curve tests an eQTL in lymphoblastoid cell lines (LCLs). The same sequences were tested in LCL and HepG2 cells, thus forming the two data sets.
Notably, the LCL-eQTL data set was used as the primary source for the CAGI4 eQTL causal challenge (Kreimer et al., 2017). The fourth and fifth data sets (Inoue et al., 2017) include candidate liver enhancers, tested in either episomal or chromosomal context. We refer to these data sets as HepG2-epi (for MPRA plasmids) and HepG2-chr (for MPRA integrated in the genome). The sixth data set includes putative enhancer regions (Inoue, Kreimer, Ashuach, Ahituv, & Yosef, 2018) tested in chromosomal context in human embryonic stem cells (hESC). We refer to this data set as hESC.
Separately, for each data set, we applied MPRAnalyze (Ashuach,

| Predictive features for MPRA activity are consistent across data sets
We first defined a set of features that characterize each MPRA sequence and inspected each feature individually (Section 2; see Table S1 for a complete description of all features). Overall, we examined 56 features that can be divided into four categories (similarly to Kreimer et al. (2017)). (a) Experimentally measured epigenetic properties: To define these, we mapped each assayed region to its corresponding position in the reference human genome, and then queried this position against tracks of epigenetic properties from ENCODE (Encode-Project-Consortium, 2012). These properties were measured in multiple cell lines and include the overall number of observed TF binding sites (TFBS), histone marks, binding by chromatin structure-associated proteins (e.g., P300), chromatin accessibility (primarily by identifying DNase-hypersensitivity sites; henceforth abbreviated as DHS), and DNA-methylation. For all these features we either aggregate over all available cell-types, or restrict the analysis to the same cell-type in which the MPRA was conducted.
(b) Predicted epigenetic properties: This set of features covers similar properties as the experimentally derived ones (e.g., TFBS or histone marks). However, instead of being directly measured, the properties are inferred based on the DNA sequence of the respective MPRA construct, using models trained on experimental data (e.g., protein-binding microarrays for TFBS [Newburger & Bulyk, 2009] or ChIP-seq for histone marks [Encode-Project-Consortium, 2012]).
We use three models for this purpose: scoring of protein-DNAbinding motifs (Grant, Bailey, & Noble, 2011), the more recent supervised methods DeepBind (Alipanahi et al., 2015), and DeepSEA (J. Zhou & Troyanskaya, 2015). In all three cases (motif scoring, DeepBind, and DeepSEA), we do not retrain the models with additional data, but only use the pre-trained models to score each MPRA sequence.
Another We use these 56 features (Section 2; Table S1) individually in two ways: (a) we test how well each feature correlates with the quantitative MPRA output of each data set using seven regression tests (Section 2) and (b) we test how well each feature discriminates between active and inactive regions using two classification tests (Section 2). We rank each feature for each of the nine tests and then take the median of these ranks to obtain a data set-specific feature ranking. We then take the median across all data set-specific ranking to obtain a global ranking of the features and sort them according to their global rank ( Figure 1). Notably, the different statistical tests are largely consistent with the global rank ( Figure 1 and 80% of the loci in the data, and report the mean and standard deviation of the resulting accuracy (Table S1).
To further explore cell-type specificity in the context of TF binding, we stratified the TFs into three groups according to their expression level in the cell-type of interest (low/intermediate/high) and sum over the number of binding sites in each group. Although these three features #tf-high, #tf-med, #tf-low had a strong correlation (especially #tf-high) with MPRA activity (Figure 1), they are still less predictive than TFBS Mean (the simple mean across all TFBS-related features). Consistently, we found several cell-type agnostic features such as GC content and #motifs that are predictive of MPRA activity as well (Figures 1, S4, and S5).
Furthermore, we found that limiting the set of TFs in a manner specific to the cell-type under investigation (e.g., for the K562 data set, TFBS Cell Mean only considers TF ChIP-seq experiments conducted in K562 cells) does not improve accuracy (Table S1) tested regions comparing with LCL-eQTL regions ( Figure S6). classification models (Random Forest [Breiman, 2001], Extra Trees [Geurts et al., 2006], ensemble), which we applied separately for each data set. We trained these models using a set of features that extends the one investigated in Figure (Table S1).

|
We evaluate the accuracy of prediction in each combination of data set × prediction method × feature category using 10-fold crossvalidation. We report the mean and standard deviation of the resulting scores ( Figure 2). Importantly, we do not use our evaluation  specific does not increase accuracy ( Figure S7 and Table S3). Finally, it is interesting to note that the accuracy achieved with a chromosomal MPRA library in HepG2 cells (HepG2-chr) tends to be slightly higher than the one obtained with an episomal library (HepG2-epi; regression: 0.59 vs. 0.45 Spearman correlation and 0.41 versus 0.31 Kendal correlation; Figure 2). These results are consistent with a recent comparison between these two experimental approaches (Inoue et al., 2017) that found chromosomal MPRA to be more reproducible, have higher correlation with epigenetic marks and work in variety of cell-types that are harder to transfect (e.g., hESCs); however, more data sets are required to substantiate this finding.

| Transferring knowledge between cell-types
Using existing MPRA data to build models that can be applied across different cellular backgrounds and for genome-wide predictions of regulatory elements can be useful for prioritizing functional regulatory regions, which can guide the design of new MPRA panels and used for analysis purposes. To evaluate how well our models generalize to a new cellular context where MPRA data is not available, we tested the extent to which models trained in each data set can be used to predict the outcome in the remaining data sets.
Based on the results in Figure 2, we take the Full set of features (i.e., all feature categories) and use the ensemble model for both the regression and classification tasks. We avoid training on any genomic region from one data set (e.g., LCL-eQTL) that is already in the test set from another data set (e.g., HepG2-eQTL).
We observe that performance in the regression task is reduced in this cross-data set setting compared with the supervised setting. For example, for the K562 regression task, the best model trained on K562 data achieves a cross-validation Spearman of 0.58 ( Figure 2 and Table S2), whereas the best models trained on LCL-eQTL, HepG2-eQTL, HepG2-chr, HepG2-epi, hESC data set only achieve Spearman's correlations of 0.23, 0.21, 0.44, 0.3, 0.33, respectively (Figure 3 and   Table S4). However, performance in the classification task is generally robust. For example, for the K562 classification task, the best model trained on K562 data achieves an AUROC of 0.85 ( Figure   2), whereas the best models trained on LCL-eQTL, HepG2-eQTL, HepG2-chr, HepG2-epi, hESC achieve AUROCs of 0.7, 0.67, 0.75, 0.74, 0.68, respectively ( Figure 3 and Table S4). These results suggest that MPRA data in one cellular context can be leveraged to distinguish between regions of regulatory importance in another.
We hypothesized that genomic regions that are uniquely active in a certain cell-type would be harder to predict in a cross-data set setting. To explore this, we took advantage of the LCL-eQTL and HepG2-eQTL data sets, which include the same set of genomic regions. We first examined the distribution of three region categories in these two data sets (Figure 4a): common regions (i.e., active regions in both data sets), cell-type-specific regions (i.e. regions active in one of the data sets), and inactive regions (i.e., regions not active in both data sets). We then examined prediction performance for each of the region categories (Figure 4b) in cross-data set analysis where we apply the classifier built on one data set to annotate regions in the other data set as active or not. To assess this, we defined the "hardness" of the region based on the difference between the predicted score (in range [0, 1]) and the class label (1 for active and 0 for not-active region). Reassuringly, we observe that cell-typespecific regions are harder to predict in cross-data set learning ( Figure 4b). These results suggest that while the MPRA signal can be predicted to some extent using cell-type agnostic components, it also depends on cell-type-specific ones. Interestingly, and consistent with our cross-validation (i.e., per-data set) analysis, we observe that the cross-data set accuracy achieved with models trained on chromoso-  Figure 3 and Table S4).
Finally, we wanted to examine the predicted MPRA signal of the 14 endogenous regions included in the CAGI5 data set, which were shown to induce transcription by MPRA when tested in their corresponding cell-types (Kircher et al., 2018). To this end, we trained a regression model on LCL-eQTL regions with the full set of features and used it to predict the MPRA activity of CAGI5 regions.
As the sequence lengths in the LCL-eQTL were 150 base pairs, we refeaturized the center 150 base pairs of the CAGI5 regions. We find that when comparing the predicted MPRA activity of CAGI5 regions with the distribution of LCL-eQTL regions activity, most CAGI5 regions were in the >90th percentile ( Figure S4). This is consistent with our expectation that the CAGI5 regions should be recognized as having a significant transcriptional activity.

| Contributions of individual TFs to the accuracy of predicting MPRA outcome
We wanted to explore which factors in different cells drive the activity of regulatory regions, and hypothesized that the protein milieu in the cell might act as one. To this end, we examined the contribution of individual TFs to MPRA activity. We recorded the correlation between each TF binding signal (DeepBind prediction) and the activity of each MPRA region (Alipanahi et al., 2015). Similarly to our analysis in Figure 1, we then ranked the TFs based on their predictive ability across data sets, thus revealing several TFs whose binding is generally informative of regulatory activity of MPRA constructs in all cellular contexts in this study (Figures 5,S8,and S9 and Table S5). For instance, two TF families with a data set-wide high predictive capacity, that is also supported by experimentallyevaluated binding from ChIP-seq ( Figure S9 and Table S5) sites are JUN and FOS. Proteins of the FOS family dimerize with proteins of the JUN family, thereby forming the TF complex AP-1, which has been implicated in a wide range of cellular processes, including cell growth, differentiation, and apoptosis across different cell-types (Ameyar, Wisniewska, & Weitzman, 2003). More generally, we find that TFs whose binding is commonly predictive of MPRA activity across data sets are also highly expressed across all the three cell-types, as indicated by RNA-seq data ( Figure 5). Indeed, the gene expression of

| Exploring common and distinct TF binding between data sets
We next proceeded to explore TFs whose binding is predictive of MPRA activity only in specific cell-types. To this end, we defined, for each data set, a set of predictive TFs, as the set of bound TFs (predicted by DeepBind) that is significantly correlated with MPRA output (Spearman's FDR corrected p < 0.05). We then compare across pairs of data sets to determine if there is significant overlap in predictive TFs. To this end, for each pair of data sets, we calculated the fold enrichment of the overlap between the predictive TF set, and evaluated the significance of this overlap using a hypergeometric p-value (Figure 6a). Overall, we see that there is significant overlap across every pair of data sets. Interestingly, the similarity between data sets seem to be dominated by the similarity between the MPRA sequences and less so by the similarity in cellular context. Specifically, the HepG2-chr and HepG2-epi pair and LCL-eQTL and HepG2-eQTL pair had the strongest overlap, with higher similarity between experimental versions tested in the same cell-type (HepG2-chr/HepG2-epi) than same elements tested in different cell-types (LCL-eQTL/HepG2-eQTL), suggesting that the same genomic regions tested in different conditions have correlated signals in MPRA.
However, this result may depend on the specific sequences studied, and further data needs to be collected to substantiate it.
We further examine the predictive TFs that differ between pairs of data sets (Figure 6b and Table S6), and provide a list of top predictive TFs in at least one data set. In some cases, we find proteins whose function is related to the cell-type under investigation. For instance, when comparing the two data sets with the lowest similarity score for predictive TFs, K562 to HepG2-eQTL, we find that RARG (a retinoic acid receptor which belongs to the nuclear hormone receptor family and is associated with liver risk phenotype [Roberts et al., 2010]) is predictive in HepG2-eQTL but not K562. When comparing K562 to LCL-eQTL, we observed that the genes in the ETS family (ELF1, ELF5, ELF3, ETV6, ELK3) are predictive only in K562. These genes are known to be expressed in hematopoietic tissues and cell lines, and play a role in hematopoietic cell development (Clausen et al., 1997).
When comparing hESC to the other data sets, we observe a known pluripotent factor-POU5F1 (Boyer et al., 2005) to be predictive only in hESC for most of the comparisons (Table S6).
F I G U R E 4 (a) LCL-eQTL versus HepG2-eQTL MPRA activity by log2 ɑ values. The points are colored according to activity in each of the data sets (active/inactive is defined as above/below 1.5 cutoff, respectively). (b) We define hardness as the rank-normalized absolute difference between the ground truth binary activity label (0 or 1) and predicted probability. The cumulative distribution function of the hardness for each of the four activity groups when training the ensemble, full feature cross-data set model on HepG2-eQTL (Left subfigure) and LCL-eQTL (Right subfigure), and testing on LCL-eQTL and HepG2-eQTL, respectively Overall, these results support the notion that both sequenceintrinsic (i.e. derived solely from the sequence tested in MPRA) and cell-type-specific (i.e., derived from the cell-type where MPRA was tested) properties are determining MPRA activity. We also find that the cell-type-specific component may be captured by the activity of TFs whose function is associated with the cell-type under investigation.

| Studying the effects of small genetic variants on MPRA output in CAGI5
MPRA can be used to study the transcriptional effects of small variants that commonly occur in regulatory regions, namely SNVs and small indels (Tewhey et al., 2016). We wanted to examine if we can predict these effects using our framework. To test this, we used the data from the CAGI5 challenge, which consisted of saturation mutagenesis analysis of 14 regions, overall testing the effects of 17,500 SNVs. The training data consisted of 25% of the saturation mutagenesis results in each of the 14 regions.
The participants in the CAGI5 challenge were asked to use the training data to predict the impact of SNVs on the MPRA signal in the held-out parts of the experiment. The challenge was divided in two.
In the discrete challenge part, the goal was to predict the category of impact (negative effect, no effect, positive effect). The  (Table 1).

| Post submission analysis on CAGI5 training data
We examined the correlation between the continuous variant impact and the difference between variant and wild-type Sequence features (Figure 7 and Table S8). We observe that some of the strongest features are the ones highly correlated with MPRA activity as found in previous data sets ( Figure 1). We conclude that predictive feature differences for variant impact are consistent with predictive features for MPRA activity.
Following the assessors recommendations (Shigaki et al., 2019), we focused the rest of our analysis on the continuous challenge, as the discrete challenge did not provide enough data for more in-depth analysis. The reason for this is that several (3 out of the 14 elements) of the mutagenesis experiments contained as few as three variants exhibiting what was deemed by the organizers as significantly positive or negative impact.
We retrospectively discovered that training one regression model per reference genomic region (14 altogether) to predict variant impact outperforms training one regression model per type (promoter or enhancer), so we proceeded with the former strategy. We observed that the full set of Sequence features resulted in the best performance in all of the enhancer and promoter elements, compared with specific feature sets ( Figure 8 and To examine whether predictions of SNV effect can be generalized from one data set (here, perturbations of a single DNA element) to F I G U R E 6 Similarities and differences in transcription factors (TFs) whose binding is predictive of Massively parallel reporter assay (MPRA) activity between data sets. For each data set, we define a set of predictive TFs, as the set of bound TFs (predicted by DeepBind) that is significantly correlated with MPRA output (Spearman's q < 0.05). (a) Similarity of predictive TFs between data sets. For each pair of data sets, a hypergeometric test is performed on the sets of predictive TFs of both data sets, resulting in a q-value indicating the likelihood of the overlap occurring by chance (color scale). We also calculate the enrichment ratios of predictive TFs for each pair of data sets (cell text). (b) Differences in predictive TFs between data sets. For each data set, we only plot high confidence predictive TFs (i.e., significant TFs) that have Spearman's q-values of less than 0.01, and nonpredictive TFs (i.e., nonsignificant TFs) have q-values of greater than 0.1 the next, we applied the same procedure as before-namely training a regression model on one data set and testing it on the remaining ones (keeping promoters and distal elements separate). We found that the accuracy of the predictions from cross-data set models were mostly weaker than the predictions from the supervised models ( Figure 9 and Table S10). We note that cross-data set prediction between TERT-GBM and TERT-HEK293T, which was trained on the same region but tested in different cell-types, exhibited moderate strength compared with the supervised predictions. We observe poor performance for cross-element predictions in the same cell-type, for example, training on LDLR and testing on F9 (both promoters tested in HepG2 cells) or training on GP1BB and testing on HBG1 (both promoters tested in HEL 92.1.7 cells). These results suggest that for saturation mutagenesis data, supervised prediction models using both the same element and same cell-type work generally better than cross-data sets models, highlighting the specificity of local variation presented in this data. However, more data sets are required to substantiate this claim. Generally, the absolute performance for predicting variant effects is substantially lower than that achieved in the task of predicting the transcription of individual sequences, which can be expected as this task relates to a much more nuanced signal.

| DISCUSSION
MPRA holds a great promise to be a key functional tool that will increase our understanding of gene regulatory elements and the F I G U R E 7 Pearson correlation between variant effect and the difference between variant and wild-type feature across the relevant individual features for the nine promoters and five enhancers tested in CAGI5. *Cells whose correlation's p < 0.05 enhancer, in overlapping windows. Activity measurements were highly variable between windows of the same enhancer, while many of the features we use were shared among the overlapping windows.
We explore the extent by which knowledge of regulatory activity in one cellular context can be used to make predictions in a held out cellular context. Finally, we examine the ability of our framework to detect the effects of small variants on MPRA activity. Our results represent, to the best of our knowledge, the first such comprehensive analysis.
Our work highlights genome accessibility and TF binding as the strongest predictors of regulatory activity, with no observed advantage to cell-type-specific features. When applying prediction models, we observe that performance is improved when using an ensemble of all features, with no significant prediction improvement when using cell-type-specific features. These results imply that part of the signal observed in MPRA studies is not cell-type specific. Interestingly, models trained with chromosomal MPRA data yield better predictions across data sets than those trained on episomal MPRA data, stressing the importance of this experimental approach that conveys a more reliable representation of the endogenous settings.
When training on one cell-type and predicting on another celltype, we observe overall lower but robust results, with regions enriched in cell-type-specific signal being harder to predict. Notably, we detect a communal component across data sets with a group of TFs being top predictors, as well as some cell-specific factors that seem to be involved in phenotypes associated with the corresponding cell-type. In the MPRA setting the cis environment (e.g., chromatin) is altered, thus generally not cell-type specific, and the trans environment (e.g., TF binding) remains similar, hence we can still observe predictive factors that are cell-type specific.
As seen through its performance in the CAGI5 Regulation Saturation challenge, our approach is competitive in the high-resolution task of predicting the functional effects of SNVs in a supervised setting.
Our work provides a comprehensive resource of annotation for thousands of endogenous sequences across the genome. 1R01HG006768 and 1UM1HG009408) and National Institute of Mental F I G U R E 8 Performance of the regression model with different feature combinations, using Pearson's correlation for nine promoters and five enhancers tested in CAGI5. *Next to correlations with p < 0.05 F I G U R E 9 Performance of cross-data set learning of the regression task for CAGI5 data (a) nine promoters (b) five enhancers. All cross-data set learning models are ensemble models with full features. Each cell is colored according to the correlation coefficient of the Pearson test. *Next to correlations with p < 0.05