Overcoming the Model‐Data‐Fit Problem in Porous Media: A Quantitative Method to Compare Invasion‐Percolation Models to High‐Resolution Data

Invasion percolation (IP) models offer a computationally inexpensive way to simulate multiphase flow in porous media, but only very few studies have compared their results to actual laboratory experimental image data. One reason might be the difficulty in quantitative assessment: IP models do not have a notion of experimental time but only have an integer counter for simulation steps that imply a time order. Previous experiments‐to‐model comparison studies have either used perceptual similarity or spatial moments as measures of comparison. In this work, we present an objective and quantitative comparison method that overcomes the limitations of the traditional approaches. First, we perform a volume‐based time matching between real‐time experiments and IP model results. Then, we evaluate the quality of fit using a diffused version of the so‐called Jaccard coefficient, which is known from image recognition. We demonstrate our method's applicability on a laboratory‐scale experimental video of gas injection in homogeneous, saturated sand, comparing it to a Macro‐IP model's simulation results. We consider random realizations of the initial entry pressure field to capture the sand's inherent pore‐scale heterogeneity. We find that our proposed method is reliable and intuitive in identifying realistic model realizations. The “strictness”of the method can be adjusted to relevant scales of interest via the blurring (diffusion) radius of the compared images. Beyond the application presented here, our comparison method can be used to compare any high‐resolution space‐time model output to experimental data given as raster images, thus providing valuable insights for model development in many research areas.

Both continuum modeling and stochastic rule-based modeling approaches are used to describe such multiphase flows in porous media systems. Continuum models are a near-accurate representation of the physical, real-world system, but they are complex, time-consuming and computationally intensive. On the other hand, stochastic rule-based models are a simplified representation of reality, based on many assumptions. They are simple, fast, and computationally cheap.
Perturbations at the pore scale highly influence gas migration processes. The different length scales involved in these processes increase the difficulty to use the already computationally expensive continuum models for such systems (Glass et al., 2001). Undoubtedly, a fine-mesh discretization or multi-scale schemes are required to incorporate pore-scale heterogeneities correctly for these models, but this increases the computation time even further. If coarser mesh discretizations are used to decrease computational effort, artifacts of numerical dispersion smooth out the pore-scale heterogeneities (Geistlinger et al., 2009;Glass et al., 2001). Thus, the prediction capability of such models for gas migration processes is reduced, and it contradicts their claim to be "physically correct".
Several stochastic rule-based models for the different regimes of immiscible multiphase flow in porous media were suggested in the works of Lenormand et al. (1988). Out of those, Invasion Percolation (IP) models with a representation of gravity forces have been used for upward migration of a low-density fluid (like gas) displacing a high-density fluid (like water) in a porous medium (Frette et al., 1992). Many studies have used IP models with variations to model different types of gas flow systems in porous media (e.g., Ewing & Berkowitz, 2001;Glass et al., 2001;Mumford et al., 2010Mumford et al., , 2015Wagner et al., 1997). Out of these variations, Macro-IP models (e.g., Glass et al., 2001;Mumford et al., 2015) conceptualize the porous medium as a group of internally homogeneous isotropic blocks consisting of sub-networks of pores instead of individual pores as in pore-scale IP models that consider individual pores (e.g., Wagner et al., 1997;Wilkinson & Willemsen, 1983). Macro-IP models are a reasonable choice for scales greater than or equal to the experimental data used in this study because using a pore-scale IP model would require a very high computational effort for typical sandbox experiments. Hereafter, we will use the term "IP-type models" to refer to the entire class of IP model variations. These models' simplicity and low computational cost make them an ideal choice for describing such systems stochastically, which allows quantifying uncertainties at bearable computing times.
While IP-type models have the advantages mentioned above, it is challenging to calibrate, test or validate them for two reasons: 1. There is no time parameter in IP-type models but only a model loop counter. This unawareness of the model towards time makes it hard to compare the models to real-time experimental or field data. Also, the missing time parameter makes it challenging to use IP-type models along with other models like reactive transport models, which are important for subsurface carbon dioxide storage application (Oldenburg et al., 2016). 2. Metrics for quantitative comparison of IP-type models to actual experimental or field data are lacking.
Various authors have presented many IP-type models (e.g., Berkowitz & Ewing, 1998;Glass et al., 2001;Kueper & McWhorter, 1992 Willemsen, 1983;Wilkinson, 1984). However, only a few studies have compared their models to actual (field or experimental) data. In the few studies of model-to-experimental data comparison, the comparison has been made either based on perceptual similarity (like the shape of gas clusters, channels or pools) (Glass et al., 2001;Mumford et al., 2010;Trevisan et al., 2017) or by comparing spatial moments (Mumford et al., 2015). A perceptual comparison is subjective, very time consuming due to its manual character, and maybe deceitful, and when comparing moments, there is a loss of information from the data due to the aggregation of detailed images to just a few summary pictures. Also, when using several moments, it is unclear how to combine them into a unique metric.
Our research aims to develop and demonstrate a comparison method for IP-type models and experimental data that solves the time-matching problem and is automated, objective, and quantitative. We propose a volume-based time matching followed by evaluating a metric: the Jaccard coefficient from image recognition, to quantify the degree of similarity between experimental and model images. While having IP-type models with their specific characteristics in mind, the proposed metric, in fact, allows for quantitative comparison of arbitrary model results to high-resolution data. Hence, our contribution generalizes to any highly spatially (and temporally) resolved model predictions in various disciplines.
The presentation of our method is structured as follows. Section 2 discusses the traditional approaches of comparison for IP-type models to experiments (Sections 2.1 and 2.2) and describes our comparison method in detail (Sections 2.3 and 2.4). To demonstrate the applicability of our method, we use it to compare Macro-IP model simulation data to gas-injection experiments in homogeneous sand in Section 3. This demonstration helps us assess our method's strengths and weaknesses, discussed in Section 4. Section 5 summarizes our research findings and provides a future scope of research.

Methods of Comparison
It is often difficult to obtain direct measurements in the subsurface to ascertain the behavior of the phases. So, indirect measurements using laboratory experiments in a representative porous medium are often used as an alternative. Non-invasive imaging methods like optical imaging using UV or visible light, dual-energy gamma radiation, X-ray microtomography, and magnetic resonance imaging are a few popular choices for such laboratory experiments (Oostrom et al., 2007;Werth et al., 2010). The time problem of IP-type models, as discussed above, is traditionally handled in one of the following ways: 1. The experimental image and the model output image are compared at characteristic/specific time points (like breakthrough time) (Birovljev et al., 1991;Glass et al., 2001;Mumford et al., 2015;Trevisan et al., 2017). 2. When an IP-type model is used in combination with a continuum model for transport (heat or mass), that transport controls the time scale, and the continuum model's time steps are used to compare to the real-time experimental data (Molnar et al., 2019;Mumford et al., 2010).
This section introduces and discusses existing methods for comparing IP-type models to experimental data via imaging methods (like perceptual comparison and spatial moments comparison). After that, we describe our proposed comparison method in detail, with an optional time-matching step aiming to solve the time problem associated with IP-type models. The proposed metric for quantitative comparison of model results to high-resolution data itself is general and could also be applied to arbitrary other types of models.

Perceptual Comparison
Perceptual comparison is the method of visually comparing experimental and model data for similarities. For example, Birovljev et al. (1991) compared the width of the fronts between the two phases from their experiments and IP simulations. In the work of Glass et al. (2001), the length of gas clusters, pool height, as well as saturation distribution images from injection experiments and IP model results were perceptually compared. Trevisan et al. (2017)  This method's primary advantage is that no pixel-based information is lost from the highly resolved data because the images are visually compared on a pixel-to-pixel basis. Also, multiple global attributes of an image (width of a finger, tortuosity and so on) from simulation and the experiment can be compared simultaneously. A valuable (but immeasurable) strength of the method is that it intuitively applies the user's expert knowledge to the judgment. However, the method of perceptual comparison is non-quantitative. Although it includes no computational effort, with the increase in the number of images to compare visually, the task may require an enormous effort from the user. Thus, it can become time-consuming, cumbersome, and non-objective.
Consider a situation where a user compares the breakthrough image (image when the invading phase has percolated across the entire defending phase-saturated porous medium) from the experiment and an IP type model (e.g., Macro-IP) visually. Initially, the user compares only two images. Now, since IP type models rely on stochastic simulation, the user runs 1,000 entry pressure realizations for the model to determine the near-accurate entry pressure field as in the experiment. The number of images to compare increases from 2 to 1,001. Next, for this Macro-IP model, the user wishes to calibrate model parameters like the saturation of the defending phase and porosity and so decides to use saturation values and porosity values from 0 to 1 in increments of 0.1 excluding 0 and 1. Thus, for each entry pressure field for the model, the user runs the model 18 times to fit these two parameters. Now, the user has to compare 1.8 × 10 4 model images to 1 experimental image. The user is further interested in comparing various versions of the IP model (say four different versions) including the calibration of the two parameters mentioned above for each model version. Now, the user visually compares 7.2 × 10 4 model images against one experimental image. Worse still, the user intends to visually inspect all images from the time of injection until the breakthrough. We realize: the effort of comparison amplifies extremely. In short, perceptual comparison can be too tedious and subjective for stochastic analysis and also where the experimental data is spatially and temporally intensive.

Comparison of Spatial Moments
Spatial moment methods have often been used to quantitatively describe both experimental data and numerical simulation output for multiphase flow in porous media (e.g., Essaid & Hess, 1993;González-Nicolás et al., 2017;Jawitz et al., 2003;Kueper & Gerhard, 1995;Pantazidou & Liu, 2008;Trevisan et al., 2015, to name just a few). This method was used to compare Macro-IP models and gas-injection experiments in the work of Mumford et al. (2015). In a 2D water-saturated porous medium invaded by a gas phase, the method involves the calculation of zeroth (M 00 ), first (M 10 , M 01 ) and second moments (M 20 , M 02 ) to describe the spatial distribution of the gas: Here, ϕ is the porosity, ρ g is the density of the gas, S g is the gas saturation value, x and z are horizontal and vertical dimensions in the 2D space. These moments represent the mass, position and spread of the gas, respectively. Further, the centroid position of gas (X c , Z c ) and its spatial extent as variance   Theoretically, the spatial moments of experimental gas saturation data should be reproducible by the model simulation output.
The method of spatial moments comparison is quantitative and objective. Also, it requires less time and human effort than the perceptual comparison method. This method's main disadvantage is the loss of information from the data (for images: pixel information) by the aggregation to a handful of summary statistics before comparison, which can be relevant for process understanding.
BANERJEE ET AL.

10.1029/2021WR029986
4 of 20 Also, in 2D space, we evaluate five spatial moment values, which individually comment on the quality of fit. However, there exists no standard method to combine the information from all of them in one unique metric to identify the best-fit realization. One reason for this is the difference in magnitude of the different spatial moment values. Moreover, for any arbitrary choice of combination of the spatial moment values, the conclusions obtained would be ambiguous. For example, any combined-moment metric would produce the best possible value for an exact mirrored copy of the experimental image (same first and second moments). When interested in predicting preferential flow pathways of a phase, identifying a mirrored image might be completely misleading. Due to this reason, the comparison of spatial moments does not enable any unique decision between competing models or in the calibration of models. In Section 4, we will further show how the loss of information of pixel data in the moments method can yield misleading results.

Proposed Metric of Comparison: (Diffused) Jaccard Coefficient
We propose to compute the Jaccard coefficient (Tan et al., 2005) to quantify the similarity between experimental and simulated images. The metric is widely used in the context of object identification or image recognition. Per set theory, it is defined as the size of the intersection between two sets A and B divided by the size of the union of these sets.
The Jaccard coefficient ranges between zero and one, with zero meaning no similarity and one meaning complete similarity. To understand how we compute the Jaccard coefficient for binary (black/white) images, consider Figure  ( Figure 1b) without taking into account the experimental setup at the moment (this will be explained in Section 3.1). The intersected blocks (|A ∩ B|) are colored purple ( Figure 1c). We count the number of purple blocks and divide by the total number of colored blocks (|A ∪ B|) in the images of Figures 1a and 1b without double counting the pixels that agree to be colored in both of them (purple blocks of Figure 1c), to compute the Jaccard coefficient. We can automate this to be calculated for a time-series of images.
A pixel-by-pixel comparison would reject even a perfect model in a scenario where the point of gas injection in the experiment is not precisely known, leading to an offset between experiment and model. In many real-world applications, this offset would be of no concern; on the contrary, one would wish to identify such a perfect model run. Thus, we propose to blur the images from both the experiment and the model before computing the Jaccard coefficient. Since the blurring produces non-binary values, we use a slightly adapted implementation of the Jaccard coefficient for sets A = {a i : a ∈ R, i = 1, 2, …n} and B = {b i : b ∈ R, i = 1, 2, …n}, which is also known as the Ruzicka similarity coefficient (Deza & Deza, 2016): We call this metric J d the Diffused Jaccard coefficient. We illustrate its evaluation in the bottom row of Figure 1.
We blur the images using the 2D-Gaussian blur function: Mathematically, the blurred images are produced by convolution with a Gaussian kernel of specified width (standard deviation σ). Hence, the radius of blurring is increased or decreased by altering the σ value in Equation 5. We use the kernel size relative to the original domain size as a unit for the images' blur-radius. The blurring results in non-binary images (see Figures 1d-1f) with values between 0 and 1. These represent a spatial tolerance in matching and do not refer to intermediate values of gas saturation.

Volume-Based Time Matching for IP-Type Models
Before using our proposed metric on IP-type models, we need an additional step of time matching because of the lack of real-time notion. That means we need to know which images of the experimental time-series and the model iterations to compare with each other. So, we perform a volume-based time-matching between the experiments and the IP-type models as a first step. We use Equations 6 and 7 to compute the gas volume in the experiment and the model, respectively.
Here, Q exp is the rate of injection of gas in the experiment [volume/time], t exp is the time step in between the capture of two successive images in the experiment and t end is the end time of the experiment.
Here, n blocks is the number of blocks invaded per loop counter n in the Macro-IP model described in Section 3.2, n top is the model loop counter when the gas reaches the top of the domain, and V block is the volume of each discretized block of the Macro-IP model. The porosity ϕ of the sand used in the experiment is assumed to be uniform, and S g is the gas saturation value assigned to the entire gas cluster.
BANERJEE ET AL.

10.1029/2021WR029986
6 of 20 After that, for all the time-wise elements in the V exp and V model data vectors, the Euclidean distance is computed. Based on this distance value, each element of the V exp vector gets a nearest neighboring element (minimum Euclidean distance) in the V model vector. For all those nearest-neighbor pairs, we assign the experimental time t to the corresponding model loop counter n. We may now compare the respective time-matched images based on our proposed Diffused Jaccard Metric (Section 2.3). In our specific implementation, we use the MATLAB inbuilt function knnsearch with the exhaustive search algorithm to efficiently conduct the nearest neighbor search.
Depending on the type of model, the model loop counter may increase without an increase in volume because of a fragmentation process. A fragmentation process refers to a combination of an imbibition (of water) step and an invasion (of gas) step. If this imbibition occurs at the bottom of a gas cluster, mobilization of the gas cluster occurs. If the imbibition occurs in any other block along the gas cluster, it can lead to snap-off events, dividing the cluster into two discrete clusters. Hence, this process does not add gas to the system but is actually a re-arrangement of the gas-occupied blocks (Birovljev et al., 1995;Mumford et al., 2009;Wagner et al., 1997). In such a situation, all these model loop counters will match the same single experimental time (because each experimental image by definition shows an increase in volume due to the fixed gas injection rate Q exp ). We recommend assigning the experimental time to the last matching model loop counter in that case because this model state represents the completed processes at the matched volume.

Demonstration of the Method
In this section, we present a demonstration of our comparison method on a real experiment of gas injection into homogeneous sand. First, we describe the experiment (Section 3.1) and the model (Section 3.2) chosen for our comparative study. After that, we list the steps we conducted in this experiment-to-model comparison (Section 3.3).

Experimental Data
We use experimental data obtained from an air injection experiment in homogeneous water-saturated sand of 0.7 mm average grain size. The experimental setup and the data processing method are described in detail in Van De Ven and . Here, we present a brief description of the experiment (Experiment 0.1-A) used for this study.
Air was injected in a thin acrylic glass cell of size 250 mm × 250 mm×10 mm, filled with homogeneous water-saturated sand (porosity = 0.366) at an injection rate of 0.1 ml min −1 . At this injection rate, the gas flow is observed as discontinuous clusters that fragment and coalesce while migrating vertically. The occurrence of discontinuous gas flow was further verified by continuous gas pressure measurements at the gas injection point (Van De Ven & Mumford, 2019). For more details on the gas flow regimes, we recommend referring to the work of Geistlinger et al. (2006). The data is obtained as a time series of 2D images collected at the rate of 30 frames per second for a total experimental time of 330 s, using the light transmission technique (Kechavarzi et al., 2000;Niemet & Selker, 2001;Tidwell & Glass, 1994).
Individual pixel intensity values of these raw images are averaged over a block size of 1 × 1 mm, and the block is assigned the average intensity value. Then, optical density values are computed per block, and a detection limit of 0.02 is set (Van De Ven & Mumford, 2019). Optical density is defined as the negative logarithm transform of the ratio of the transmitted to incident light intensity (Kechavarzi et al., 2000). Wherever the block intensity exceeds the detection limit, gas is considered to be present. This results in a time series of binary images (Van De Ven & Mumford, 2019).

Model
In this work, we use the Macro-IP conceptual model for simulation of gas (air) flow in water-saturated sand, as in Mumford et al. (2015), Initially, the defending phase (here: water) occupies all the blocks. Then, we place the invading phase (here: air) in the block that contains the gas injection point. The model operates at a block-based spatial discretization of 1 × 1 mm in accordance with the processed experimental conditions. We evaluate the blocks connected to the invaded block for their invasion thresholds (T e ) using:   e e w T P P Here, P e is the local entry pressure, which is the capillary pressure required by air to percolate a water-occupied block, and P w is the pressure of the water phase. Then, we invade the neighboring block with the lowest T e .
The term P w in Equation 8 incorporates the buoyancy forces to more accurately simulate gas migration in a water-saturated porous medium. Hydrostatic water pressure is assumed, and therefore P w is calculated using: Here, ρ w is the water density, g is the acceleration due to gravity, and z is the height from the top of the acrylic glass cell (along Z axis).
The P e field of a porous medium depends on the pore-scale arrangement of the solid and its interaction with the fluids. Thus, a precise measurement of the P e field at our experimental scale is difficult and highly computationally expensive. Since P e is a point on the capillary pressure (P c )-saturation (S) curve, we sample the P e values that we assign individually to all model blocks, using the Brooks-Corey P c − S relationship (Brooks & Corey, 1964) for our material of interest: The term   1 w r r S S S is called the effective wetting phase saturation (S e ), where S w is the wetting phase saturation and S r is the residual wetting saturation. Also, P c is the capillary pressure, P d is the macroscopic displacement pressure which we used to generate the cumulative distribution function of P c , and λ is the pore-size distribution index. We sample the P e values from the inverse of the cumulative distribution function of P c (using Equation 10): Here, u is a random number from the standard uniform distribution in the interval [0, 1]. This sampling method is called the Inverse Transform sampling method, which has been used in the works of Glass et al. (2001) and Mumford et al. (2015). Note that there exists no spatial correlation between the P e values assigned to the blocks, but this extension could easily be achieved via geostatistical simulation.
The Macro-IP model from Mumford et al. (2015) also includes an additional rule for fragmentation and mobilization of gas clusters based on the work of Wagner et al. (1997). In Wagner et al. (1997), the re-invasion of water (causing fragmentation and mobilization) into the gas-filled pores is realized by a withdrawal pressure threshold at the pore scale. At the scale of the Macro-IP model, the threshold for re-invasion, also known as the terminal threshold (T t ), is calculated as the summation of the terminal pressure (P t ) and the hydrostatic pressure (P w ).
The values of P t are calculated based on a P e − to − P t ratio obtained from the characteristic drainage and imbibition curves for the porous medium that accounts for the capillary-pressure hysteresis (Gerhard & Kueper, 2003 A fragmentation or mobilization event of a gas cluster occurs when: where g and w stand for gas-and water-occupied blocks, respectively (Mumford et al., 2015). Re-invasion of water into a gas-occupied block causes a complete expulsion of gas from that block. If the re-invasion of water occurs at sites adjacent to a gas cluster, the gas cluster's mobilization occurs. If the re-invasion causes a disconnection in the gas cluster, fragmentation occurs. After each fragmentation or mobilization step, clusters are re-identified to account for coalescence events. A gas cluster's growth occurs only when (re-) connected to the injection block. Table 1 lists the parameter values taken for the Macro-IP model. We run the model 1,000 times, each one differing in the random spatial arrangement of the P e values, thus also differing in the terminal pressure's spatial arrangement. These model runs represent the uncertainty in these parameter values, which is difficult (if not even impossible) to determine with experimentation. For each of these 1,000 model runs, the simulation ends when the gas reaches any block on the top of the model domain.
In our Macro-IP version, we assume a constant gas saturation value of 0.2, based on Van De Ven et al. (2020), for all the gas clusters (Mumford et al., 2015). This approach may not be an accurate representation of reality. However, we make this assumption to retain the model's simplicity and hence reduce its computational cost, and this assumption does not affect the discussion of our comparison method.

Steps of the Model-Experiment-Comparison
First, we carry out the experiment-to-model time matching (discussed in Section 2.4) using Equations 6 and 7. The model simulation starts from the time gas first percolates into the water-saturated sand in the experiment. Then, we compute the Jaccard coefficient between the time series of experimental images and the corresponding model images across all the model realizations (see Section 2.3). Further, we compute the Diffused Jaccard coefficient. We alter the radius of blurring from 0% to 50% of the domain size in steps of 1% by changing the standard deviation value.
For comparing our method to the existing method of spatial moments, we also calculate the difference between centroid and variance values (Section 2.2) of the final experimental image and of the corresponding time-matched model output images. We normalize this difference by the centroid and variance values of the experimental image and then identify the minimum of the normalized differences of each of the centroid and variance values in X -Z axes. Comparing zeroth moments M 00 is unnecessary because our time matching is based on a volume matching, so M 00 is always accurate.  (Schroth et al., 1996) Pore-Size distribution index λ 5.57 - (Schroth et al., 1996) Model domain size x-z 25 × 25 cm 2 Block discretization -1 × 1 mm 2 we calculate the Jaccard coefficient, the Diffused Jaccard coefficient and the spatial moments. We present and discuss the results of this application in Section 4.

Results and Discussion
In this section, we present the results obtained from our case-study in Section 3. Section 4.1 discusses results based on the experimental image at the end of the experiment and corresponding time-matched images of all model realizations. Section 4.2 focuses on using the proposed metric on the time-series of the timematched images from the experiment and model.

Comparison Based on Final Experimental Image
We can compute the (Diffused) Jaccard coefficient for any of the time-matched images of the experimental data time-series. Here, we begin our discussion by picking the experimental image at the end of the experiment (at t = 330s) and the corresponding time-matched model image per model realization for clarity. The quality-of-fit (no matter if assessed perceptually, by spatial moments or by our proposed metric) varies significantly across the 1,000 model realizations. For example, the values for J vary between 0.003 and 0.17, and those of J d vary between 0.01 and 0.79. Therefore, we select a set of model realizations for discussion that show a reasonable agreement with the experimental image. We will motivate our choice of individual realizations in Section 4.1.1.

Spatial Moments Can Be Misleading
We start by comparing the spatial moments of seven arbitrary (time-matched) images to the final experimental image (Figure 2).
By comparing the height of the gray bars (model realizations) against the height of the solid dark blue bars (experimental data) in Figure 2, we see that none of the six model realizations clearly outperforms the others in all spatial moments; rather, performance varies significantly across the four measures. Indeed, we have chosen four of the realizations (numbers 3, 4, 5, and 6) to exactly match X c , Z c ,  2 xx and  2 zz , respectively (see Figure 2). Recall that all of our time-matched realizations by definition match the zeroth moment, which is equivalent to the volume in the domain. Table 2 Table 2; numbers in the bars denote respective realization number. injection point is not precisely known. Consistently, this realization meets the second spatial moments (spread) in X and Z but fails in reproducing the exact position of the gas (first moments). Hence, if we relied on a spatial moments comparison, we might fail in identifying this realization as an (almost) perfect fit. In real applications, identification of this realization as the true one is essential because it would tell us that the model we use is, in fact, correct, but only the injection point is assumed at the wrong position. We cannot afford to not detect a perfect model realization. Therefore, we strongly recommend using our proposed (Diffused) Jaccard coefficient as a metric of comparison, as discussed in the following sections.

Jaccard Coefficient Yields More Conclusive Ranking
Realization 1 is chosen as the realization that has scored the best Jaccard coefficient (cf. Table 2). When assessed through the eyes of spatial moments, this realization does not look especially convincing (the gray bar for realization 1 is of the same height as the solid dark blue experimental bar for X c , but shorter for Z c and  2 xx and taller for  2 zz in Figure 2). However, from Figure 3b we see that, perceptually, it does achieve a good fit, which confirms that comparison on a pixel-to-pixel basis as done with the Jaccard coefficient yields more conclusive results than the spatially-aggregated moments method. We observe that the gas finger of realization 1 with the highest J value extends farther to the top than the original experimental image (Figure 3a). This is not a systematic result, but this particular randomized entry pressure field yielded the highest pixel-by-pixel agreement with the experimental image. Suppose the distance of the gas finger to the top of the glass cell was of particular interest to the modeler. In that case, one could modify the Jaccard coefficient calculation such that it only compares, for example, the blocks of the outer gas finger boundary.
Realizations 3-6 (Figures 3d-3g) have a poor perceptual fit to the experimental image, and this is reflected in their Jaccard coefficient values. However, a combination of spatial moments of these realizations cannot lead us to this conclusion. For example, in Figure 2, compared to the height of the experimental dark blue bars, the height of realization 3 gray bars vary slightly for Z c value but highly for  2 xx and  2 zz values. Simultaneously, the height of realization 5 gray bars vary slightly for X c and Z c values but highly for  2 zz value. Thus, when we compare the height of the gray bars for realizations 3-6 to the height of the dark blue experimental bars in Figure 2, we cannot identify any pattern in their variations. In contrast, the Jaccard coefficient provides a single value and a clear, precise ranking of the realizations. Experimental image translated in X-Z space ("shifted copy of experimental image") 7 5 Note. The individual spatial moment values for the realizations can be read off from Figure 2. The values for J are provided in Figure 3, and those for J d are provided in Figure 6.  Table 2, with box colors and numbers in figures representing the respective realization.

Table 2 Characteristics of the Seven Realizations Selected for Detailed Analysis Along With Their Ranking Obtained From Their Jaccard Coefficient (J) and Diffused Jaccard Coefficient (J d ) Values at 4% of Domain Size Blur Radius
Yet, we observe that Realization 7 has a very low J value of 0.03, although it is just a translation of the original experimental image. Therefore, we recommend using the diffused version of the Jaccard coefficient (see next Section).

Diffused Jaccard Coefficient Provides Most Insightful Ranking
We have already schematically illustrated the evaluation of the Diffused Jaccard coefficient with Figure 1 in Section 2.3. In fact, the experimental image shown in Figure 1 corresponds to the final experimental image (Figure 3a), and the shown model realization corresponds to realization 2 ( Figure 3c) from Table 2. When we blur the images as in Figures 1d and 1e (here, with 4% domain size blurring), we compare their similarity (using J d ) on a scale larger than that of the individual pixels. Figure 4 summarizes the performance of the seven realizations of Table 2 as a function of the Gaussian blur radius varied from a detailed pixel level of 0% of the domain size (non-diffused Jaccard coefficient value) to a very summarized level of 50% of the domain size.
In Figure 4, upon increasing the blurring radius, the J d values for the shifted copy of the experimental image (realization 7) keep increasing and go up to 0.9 (cyan blue plot line). Thus, the problem with using the Jaccard coefficient mentioned in the previous section for realization 7 is resolved using the Diffused Jaccard coefficient. So, for cases like the shifted copy of the experimental image, while spatial moments lead to inconclusive results and the non-diffused Jaccard coefficient evaluates it as a poor fit, the use of the Diffused Jaccard coefficient is a safer choice. Also, we picked realization 2 (see Table 2) for having the steepest increase of J d values upon increasing blur radius in Figure 4 and we see that it is a good perceptual match to the experimental image (see Figure 3c). Recall that Figure 1 shows the intersection of realization 2 with the experimental image in a blurred and non-blurred state. Indeed we find that both realizations look very similar "in nature," although the pixel-by-pixel comparison would not see that because they differ slightly in position and tortuosity.
Further, note that in Figure 4, the J d values of realizations 3-6 do not improve as fast as realizations 1 and 2 with increased blurring. This slower rate of increase implies that the Diffused Jaccard metric does not favor perceptually different realizations as much as it favors realizations with small offsets to the original experimental image. To further demonstrate that blurring does not favor all realizations equally, we show seven arbitrary model realizations at a blur radius of 4% of the domain size in Figure 5. At this blur radius, the value of J d amongst the 1,000 model realization images varies between 0.05 and 0.54. From Figure 5, it is evident that blurring does fade out the information in the pixels; however, for a given blur radius, differences between good and bad realizations are revealed reliably.
BANERJEE ET AL.

10.1029/2021WR029986
13 of 20  To investigate to what extent we can blur the images for a meaningful comparison, we show our selection of seven realizations from Table 2 along with the experimental image at different blur radii in Figure 6. It is evident from this figure that the images lose more and more pixel details with increasing blur radius. We notice that, at blur-radius values of 20% and 40% of domain size, the images from the experiment and realizations 1-7 look almost the same. This explains why the J d values of the different realizations do not improve significantly after approximately 20% of domain size blur radius in Figure 4. Thus, the upper limit of the blur radius would depend on the application for which the experiments and modeling are done. However, in general, it seems safe to say that any radius of blurring exceeding 50% of the domain size is questionable.

Comparison Based on Experimental Time Series
This section explores the possibility of computing the (Diffused) Jaccard coefficient for the time-series of time-matched images. Figure 7 summarizes the achieved diffused Jaccard values J d of the six model realizations of Table 2 as a function of experimental time at a blur radius of 4% of the domain size. This plot helps evaluate the model performance over the entire process of evolution of the gas finger instead of only assessing the final gas finger at the end of the experiment. Note that we refrain from computing the spatial moments over time because that would give us four time-series per model realization; we would not know how to combine them into a single meaningful measure for model realization ranking. Remember that realization 7 is just the final experimental image shifted in space, so we do not show its temporal evolution here.
All the time-series plotted in Figure 7 show some common trends. As examples, we pick two instances of time for further investigation. First, the value of J d for all realizations drops at a time of around 15s. Second, all the realizations show an abrupt increase at around 240s.
When we zoom into the gas cluster at 15s (column 1 of Figure 8), we see that the experiment gas cluster has far more gas-occupied blocks than the corresponding model realizations. This is why, even upon diffusion, the J d value decreases. The smaller number of gas-occupied blocks in the model images can be due to an over-estimation in the gas saturation value used for volume-based time matching (Section 2.4). Alternative-BANERJEE ET AL.

10.1029/2021WR029986
14 of 20   ly, the higher number of gas-occupied blocks in the experimental image can be due to noisy pixels wrongly identified as gas-occupied pixels.
To investigate the abrupt rise in the J d value at around 240s, we zoom into the gas finger images at time 239 and 241s in the second and third column of Figure 8, respectively. At both times, when we compare the experimental finger to the model fingers, perceptually, they are different. But we notice that the experimental image (dark blue gas finger) at 239s, with more gas accumulated toward the bottom, is much shorter in height than at 241s. The model tends to expand toward the top, as is seen in all the model realization images of columns 2 and 3 of Figure 8. So, when the gas finger in the experimental image at 241s suddenly moves up, the J d value increases abruptly for all the model realizations. Note here that the increase in J d value at around 240s is not the same across all model realizations in Figure 7. Realization 2 (third row, red box in Figure 8) has the closest resemblance to the experimental image at 241s. Accordingly, the maximum increase in J d value is seen for realization 2 (red plot line) in Figure 7.
On the one hand, looking at the temporal evolution of J d for specific realizations could help us improve the model structure (identify "wrong decisions", common errors to all realizations, and so on). On the other hand, the experimental team can use this temporal evolution of J d as a diagnostic tool to detect unexpected behavior in the experimental data set. Such behavior detection could help them identify and resolve problems with experimental setup conditions or data processing techniques.

Summary of Findings
We summarize that spatial moments are hard to aggregate into a single meaningful measure; they are inconclusive and can even be misleading (Section 4.1.1). They would be a poor choice for time-series evaluations. Perceptual comparison takes into account our expert intuition about the quality-of-fit, but it becomes an impossible task for the number of realizations (and potentially even timesteps) that we would like to investigate (Section 2.1). The Jaccard coefficient quantifies pixel-by-pixel agreement and is hence close to a "measure of perception", but it evaluates the perfect-but-shifted model realization only as mediocre (Section 4.1.2). The Diffused Jaccard solves this problem: with increasing blur radius, this image emerges as the best one. We can understand the blur radius as a switch that helps the user control the extent of details to be retained in the images. So, the most meaningful blur radius for a specific application has to be decided by the user; it can reflect the intended purpose of identification (Section 4.1.3). For example, suppose the user wants to know the radius of a gas contamination zone in the subsurface to protect the groundwater table.
In that case, the images can have a relatively high blur radius. Our recommendation is to try increasing blur BANERJEE ET AL.

10.1029/2021WR029986
16 of 20  radii to observe such effects as shown here synthetically with the translated experimental image and then decide its value. We recommend to always report Diffused Jaccard coefficient value together with its blur radius (with an appropriate unit; here, we have used the percentage of the domain size) for transparency.

Summary, Conclusions and Outlook
Our proposed volume-based time-matching method enables us to compare IP-type models with no sense of real-world time to real-time experimental data. To avoid subjective and tedious perceptual comparison of images, we propose to compute the Jaccard coefficient as a quantitative, automated and therefore objective metric for comparison. Applying the proposed method to a real experiment of gas injection in homogeneous saturated sand and Macro-IP model has shown that the Jaccard coefficient provides intuitive results, while spatial moments (as an alternative for an objective method) may be inconclusive or misleading. We also designed a synthetic test case by slightly translating the original final experimental image in space -a situation that might be realistic in many cases. The binary Jaccard coefficient is not able to identify this almost-perfect model due to its pixel-by-pixel strictness. We, therefore, improve on that by using the Diffused Jaccard coefficient. The Diffused Jaccard coefficient can be seen as sliding between rigorous pixel-by-pixel assessment to aggregated (blurred) assessment over space as done in spatial moments. The advantage is that we have a single insightful number for ranking and that we have this slider that can be used for model investigation and improvement.
Our comparison method can be used to compare models other than IP models, and space-time discretized experimental data-sets not limited to light-transmission images. The method can be used to compare any high-resolution model output and experimental data given as raster images.
Beyond the scope of the present research, we anticipate that our proposed metric will be handy for model calibration (identification of best-fit parameter values) and for comparison of alternative model types, for example, to predict under different multiphase flow regimes in porous media, but also, again, in any other discipline that works with raster model output and data.

Data Availability Statement
The experimental data set used in this study is made available by Van De Ven et al. (2021) at Scholars Portal Dataverse: https://doi.org/10.5683/SP2/RQKOCN. The modeling data and codes used for this study are made available in the DaRUS dataverse: https://doi.org/10.18419/darus-1776 .