A high-performance computing framework for Monte Carlo ocean color simulations

SUMMARY This paper presents a high-performance computing (HPC) framework for Monte Carlo (MC) simulations in the ocean color (OC) application domain. The objective is to optimize a parallel MC radiative transfer code named MOX, developed by the authors to create a virtual marine environment for investigating the quality of OC data products derived from in situ measurements of in-water radiometric quantities. A consolidated set of solutions for performance modeling, prediction, and optimization is implemented to enhance the efficiency of MC OC simulations on HPC run-time infrastructures. HPC, machine learning, and adaptive computing techniques are applied taking into account a clear separation and systematic treatment of accuracy and precision requirements for large-scale MC OC simulations. The added value of the work is the integration of computational methods and tools for MC OC simulations in the form of an HPC-oriented problem-solving environment specifically tailored to investigate data acquisition and reduction methods for OC field measurements. Study results highlight the benefit of close collaboration between HPC and application domain researchers to improve the efficiency and flexibility of computer simulations in the marine optics application domain. © 2016 The Authors. Concurrency and Computation: Practice and Experience Published by John Wiley & Sons Ltd.


INTRODUCTION
Research at the interface of Geosciences and Computer Science increasingly demands large-scale high-performance computing (HPC) applications. Simulation-driven experimentation in Geosciences generally involves complex mathematical models and large volumes of data. HPC solutions tailored to specific application areas can help improve the quality of numerical results through a comprehensive exploitation of computational resources. Scientific studies relying on HPC facilities need to meet different performance and efficiency requirements including the following: (1) optimization of time and space costs for running large-scale computer simulations; (2) comprehensive exploitation of consolidated and emerging computer architectures in different geographical settings; and (3) ease of access to computing resources and supporting HPC techniques. These requirements have been the rationale of numerous HPC applications in Geosciences (e.g., [1][2][3][4][5][6]), stimulating a variety of Computer Science studies on HPC-oriented problem-solving environments (PSE). Key PSE components include parallel and grid computing infrastructures [7,8], middleware and programming tools [9][10][11][12], and computational methods for Geoscience applications [13][14][15][16].
Within the scope of the Geo-Info project [17] aiming to promote joint research between computer scientists and Geoscience experts, this paper presents an HPC framework for performance prediction and optimization of Monte Carlo (MC) simulations for ocean color (OC) applications. This framework concerns a parallel MC radiative transfer code [17][18][19][20], hereafter referred to as MOX, developed by the authors to support investigations on OC data products derived from in situ radiometric measurements perturbed by uncertainties due to environmental factors [21][22][23]. The MOX implementation is underpinned by the increasing importance of accurate long-term in situ OC measurements for the development of bio-optical models, vicarious calibration of satellite sensors, and validation of remote sensing (RS) radiometric products [24][25][26][27][28][29][30]. Earth observation programs of relevance include the MERIS and OLCI projects of the European Space Agency and the SeaWiFS, MODIS, and VIIRS projects of the US National Aeronautics and Space Administration. MOX computes the light distribution in natural waters by tracing a number of photons and tracking their trajectories. The output of the photon tracing is a two-dimensional virtual representation of in-water radiometric fields, which can then be used as a controlled environment to derive OC data products with statistical features analogous to those in reality. MOX can hence contribute to theoretical assessments of uncertainty budgets in OC field measurements and provide guidelines to refine data reduction methods for in situ marine radiometry [31][32][33].
An integrated set of HPC techniques for MOX performance prediction and optimization is implemented in this study to enhance the efficiency of MC OC simulations on large-scale parallel computers. This objective is challenged by heterogeneous uncertainty characterizing OC applications, MC simulations, and HPC systems. These uncertainty budgets, hereafter respectively referred to as OC, MC, and HPC uncertainty, are addressed in a component-wise manner (Figure 1). The OC uncertainty represents perturbations affecting the quality of OC data products, specifically the following: (1) the precision (i.e., reproducibility) of in situ measurements and (2) the accuracy of RS observations with respect to reference in situ data. The former refers to those perturbations due to environmental conditions and measurement settings that MOX is intended to analyze. The latter uncertainty is a matter of satellite OC investigations, which set up the encompassing context of the present work. The MC precision accounts for statistical fluctuations intrinsic to the randomness of MC methods [34][35][36] affecting the reproducibility of simulation results. The MC accuracy determines the physical correctness of simulated radiometric fields [18]. The HPC uncertainty finally refers to performance variability observed in parallel computing systems [37][38][39][40]. This variability influences the precision of execution time assessments based on which performance prediction is attempted, and hence also the accuracy of prediction results with respect to reference execution time.
Large-scale MOX simulations are motivated by (1) the necessity to satisfy MC precision requirements and (2) the need to explore various environmental cases of interest in OC investigations. Specifically, the present HPC framework aims to address two goals: (1) Modeling radiometric fields at the required precision level. In order to provide meaningful insights for OC applications, the virtual radiometric environment produced by MOX has to meet the precision requirement of real field measurements. From a computational perspective, this means tracing a sufficiently large number of photons. Simulation configurations (i.e., MOX input parameters) are given by OC experts in compliance with environmental settings of interest and the technical specification of radiometric measurement systems. The required photon population size N_p to reduce MC-intrinsic statistical noise to a negligible level is an unknown function of these input parameters. In addition, MOX execution time depends on the performance variability in HPC architectures. It is thus beneficial to determine N_p at run time while taking account of these uncertainty budgets, in order to improve execution efficiency.
(2) Efficient exploitation of HPC systems. MOX is used by domain experts to explore a wide multi-dimensional parameter space defined by case study specifications. This may require hundreds of simulation jobs. It is hence desirable to distribute the simulation jobs to the available computing environments in a cost-effective manner based on an automated job scheduling mechanism. The foremost objective of job scheduling is to minimize time costs because of expensive computation time in HPC facilities. The job-environment mapping can be automated by means of execution time estimates. However, predictive performance modeling of the MOX code is challenged by the aforementioned issue of photon population size, as well as by a nonlinear dependence of the execution time upon multivariate simulation settings, including illumination conditions, seawater optical properties, and irregular boundary conditions imposed by sea-surface waves (e.g., [18]).
The main achievements of the present work are summarized hereafter: (1) From a Computer Science perspective, this study makes a case for an effective combination of HPC, machine learning, and adaptive computing techniques to address the performance prediction and optimization problems posed by large-scale MC simulations in the presence of OC, MC, and HPC uncertainty budgets. (2) From an application point of view, it provides OC scientists with a suite of computational methods that improve the efficiency and flexibility of MOX experiments, benefiting application case studies conducted on supercomputers.

Related work
The present work advances reference research achievements [19,20] by integrating new results and methods in a unifying framework of performance optimization techniques. Numerical results cover an extended range of input simulation parameters to investigate the effects of environmental factors on run-time performance optimization. In addition, the offline performance tuning scheme outlined in [20], revised here from a methodological perspective, is evaluated on the basis of novel numerical results. The study belongs to a broad body of performance modeling solutions for optimizing time and space costs in simulation-driven Geoscience investigations, as reviewed next. Performance modeling and prediction in the literature can be roughly divided into three classes: fully analytical modeling based on in-depth knowledge of target computer programs and environments [41][42][43]; semi-analytical modeling based on a priori knowledge of the programs and environments to identify appropriate model expressions, with least-squares fitting of model coefficients to observed performance data [44][45][46][47][48][49][50]; and empirical performance modeling relying on very general model expressions whose coefficients are learned from data [48,51]. The key element exploited for MOX performance modeling is the identification of linear and nonlinear components in the photon tracing time. The linear dependence is treated analytically, while an empirical multi-layer perceptron (MLP) scheme is used to model the nonlinear relationship.
MLP neural networks have been widely used for a number of applications including performance modeling [48,51], job scheduling [52], load balancing [53], design space exploration [54,55], and OC bio-optical inversion [29,30,[56][57][58][59][60][61][62][63][64]. A common finding among these studies is that a key to successful MLP applications is empirical tuning of MLP algorithms in terms of network architecture, data pre-processing, and application-specific feature selection. Developing MLP algorithms usually requires a thorough adaptation of the generic neural computing framework to specific application cases. Performance analyses of MC particle transport codes have been addressed by several researchers. Main differences are in whether the analyses are qualitative [65,66] or intended for performance prediction, as well as in target aspects of the modeling (e.g., parallel performance in [43] vs. serial photon tracing time in this study). The reported dependence of the execution time on input parameters defining the simulated media and the degree of parallelism [65,66] is in agreement with the dependence of the MOX performance on seawater inherent optical properties (IOPs) and the number of CPU cores. Adaptive computing applications commonly exploit domain-specific elements in order to enable automated dynamic optimization. The online accuracy evaluation scheme applied in this study is an example of a widely used procedure where computation continues (either iteratively or recursively) until predefined accuracy conditions are met. Many other examples of domain-specific adaptive computing techniques are found in the literature, for instance in self-adapting linear algebra [67][68][69], adaptive quadrature [70,71], and adaptive mesh refinement [42,72], just to name a few. These previous studies and the present work differ in how domain-specific elements are used to control adaptive iteration/recursion procedures.
Our adaptive performance tuning shares strategies with adaptive control theory [73]. In both cases, the objective is the adaptive scheduling of computations upon run-time events. Canonical control theory relies on formal expressions, through either an analytical (e.g., [74]) or a machine learning approach (e.g., [75]), to model the distribution of the system state. In the present study, decision making does not require learning the occurrence of different events (hence there is no need for probability density estimation). Instead, adaptive strategies are decided based on prior knowledge of run-time events.

Structure of the work
The paper is organized as follows. Section 2 presents a general architecture of HPC-oriented problem-solving environments on which the present work is based. Section 3 examines MOX performance characteristics. On the basis of this analysis, the components of the proposed HPC framework are formulated in Section 4. The system components are then verified by experimental results in Section 5 and further discussed in Section 6. Finally, Section 7 concludes the work with a summary of the study and directions for future work.

HPC-ORIENTED PROBLEM-SOLVING ENVIRONMENTS
Scientific and engineering applications increasingly rely on virtual experiments by means of numerical simulations to better understand the behavior of real complex systems under different configuration scenarios. The growing demand for systematic support of simulation-based experiments has pushed forward the development of integrated solutions commonly called PSEs. This concept, which appeared in the 1990s [76], is now a recognized approach to help scientists and engineers manage the complexity of problem solving. The scope is to provide a transparent and easy-to-use interface to state-of-the-art algorithms and problem-solving strategies. Large-scale and data-intensive applications such as those addressed here require dedicated support and easy access to underlying run-time HPC systems, as well as efficient and scalable means for job and resource management, data visualization, and analysis.
The schematic in Figure 2 identifies the following four main PSE functional layers for HPC applications.
(1) Application set-up. This layer accommodates domain-specific abstractions, mathematical models and solvers, as well as experiment scenarios in terms of parameters and objective functions for optimization. In the present study, this layer includes the MOX code and case study specifications.

(2) Experiment management. This layer comprises generic components for managing the problem-solving process, that is, (a) experiment specification, which allows users to describe structural or template plans for computer experiments; (b) design space exploration, optimization, and tuning, based on a set of design alternatives corresponding to various configurations of input parameters; and (c) data interpretation, analysis, and visualization, by means of tools supporting the evaluation of computer simulations, on the basis of a specific interpretation and analysis process for each application problem. In this regard, the present study addresses performance enhancement of MOX simulations through online accuracy evaluation and offline threshold parameter tuning.

(3) Job and resource management. This layer offers functionalities to automate the instantiation of simulation jobs and the management of their execution in underlying run-time infrastructures. In the case of the present work, this layer includes techniques for predictive performance modeling and job scheduling on the basis of past MOX execution records.

(4) Computing infrastructure. This layer represents execution platforms for HPC applications including clusters, grids, and clouds. The target computing environments of this work are distributed-memory and shared-memory parallel machines.

MOX PERFORMANCE CHARACTERIZATION
This section addresses the performance characterization of the MOX code. MOX features relevant to this study are also presented for the readers' convenience (see [18] for more detail). MOX applications start with MC simulations of in-water radiometric fields, followed by 'virtual' optical profiling where the simulated radiometric fields are used to derive data products relevant for OC studies.
MC simulations constitute the most compute-intensive part of MOX applications. The simulation domain is an x–z cross section of a seawater column (from surface to bottom) modeled as vertically stacked layers of horizontal photon-collecting bins (Figure 3). The output of photon tracing provides the spatial distributions of different radiometric quantities ℜ_k (k = 1, 2, 3, ...). Each radiometric field, expressed by a matrix whose (i, j) entry accumulates the photons that hit the j-th bin of the i-th layer, is then used for virtual optical profiling complying with measurement protocols of in situ marine optics.
For example, Figure 4(a) shows a computed distribution of a radiometric quantity (namely, the downward irradiance E_d [18]) in the presence of sea-surface waves, illustrating the light focusing and defocusing effects of the sea surface. White diagonal lines indicate a virtual deployment path of a profiler system below the sea surface (black line). Note that in reality, waves propagate horizontally, while the optical system is deployed vertically. Figure 4(b) shows the radiometric values (circles) sampled at different depths z along the deployment path in the radiometric field. A data reduction method is then applied to these sampled values to compute the radiometric data products of interest.
Only the MC computation has been parallelized, because the time spent for virtual optical profiling is negligible with respect to the photon tracing time.

Parallelization
MC radiative transfer simulations belong to the class of embarrassingly parallel problems, where each photon transport can be computed independently of the others. MOX exploits data parallelism through full replication of radiometric field matrices as follows. The total number of photons N_p is evenly divided among N_c CPU cores. Each CPU core computes the trajectories of N_ppc = N_p/N_c photons using a different random number seed. For each radiometric quantity, the matrices on the N_c CPU cores are gathered and summed up at the end of the simulation. This summation is carried out by a parallel reduction operation, which involves N_c − 1 matrix additions.
Two versions of the parallel MOX code were implemented, one with MPI for distributed-memory parallel computers and the other with OpenMP for shared-memory machines. Both implementations show similar parallel performance because of the embarrassingly parallel workload. This paper focuses on the MPI version of the code, since it enables numerical experiments with a larger number of CPU cores.
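The data-parallel scheme described above can be sketched as follows. This is a minimal serial emulation in Python, not the actual MPI/OpenMP MOX code; the `trace_photons` kernel is a toy stand-in that merely deposits unit weights into random bins.

```python
import numpy as np

def trace_photons(n_photons, shape, seed):
    # Toy stand-in for the photon tracing kernel: each photon deposits a
    # unit weight into one randomly chosen bin of a radiometric field matrix.
    rng = np.random.default_rng(seed)
    field = np.zeros(shape)
    rows = rng.integers(0, shape[0], size=n_photons)
    cols = rng.integers(0, shape[1], size=n_photons)
    np.add.at(field, (rows, cols), 1.0)
    return field

def parallel_mox_sketch(n_p, n_c, shape=(8, 8)):
    # Emulate MOX's data parallelism: N_p photons evenly divided among
    # N_c cores, each with its own seed and a full matrix replica,
    # followed by a reduction (sum) over the N_c partial matrices.
    n_ppc = n_p // n_c                                    # N_ppc = N_p / N_c
    partials = [trace_photons(n_ppc, shape, seed=i) for i in range(n_c)]
    return sum(partials)                                  # N_c - 1 matrix additions
```

In the MPI version, the final summation would be a collective reduction over the per-rank matrix replicas rather than a serial sum.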

Execution time components
The execution time of the parallel MOX code consists of three components: initialization, photon tracing, and matrix summations. The initialization time, spent reading a simulation configuration file and preparing zero-filled radiometric field matrices, is negligible. The photon tracing time is dominant (more than 99%) when considering the number of photons per CPU core (10^6 or larger) typically required for application studies. The time spent for matrix summations can also be neglected with respect to the photon tracing time. Only the photon tracing time is henceforth referred to as the execution time of MOX.
The photon tracing time may vary significantly depending on various input simulation parameters, including the seawater IOPs (i.e., attenuation and absorption coefficients, and volume scattering function), illumination conditions (i.e., the sun position, sky radiance distribution, and diffuse-to-total irradiance ratio), sea-surface geometry (expressed by the lengths and heights of harmonic waves), dimensions and spatial resolution of the simulation domain, and photon weight threshold. Clearly, the photon tracing time also depends on performance factors of execution environments such as CPU clock frequency, memory bandwidth, and memory latency.

Parallel performance and uncertainty
This section describes MOX performance characteristics through experiments using the Milipeia cluster (University of Coimbra, Portugal). The cluster consists of identical compute nodes on a Gigabit Ethernet interconnect. Each node has two dual-core AMD Opteron 275 processors at 2.2 GHz and 8 GB RAM. Table I shows the photon tracing time T (in seconds; averaged over five samples) as a function of N_ppc in the case of N_c = 4 and N_ppc set equal to a power of 10. The ratio r = T(N_ppc)/T(N_ppc/10) is approximately 10 when N_ppc > 10^5, which indicates that T is proportional to N_ppc in operational MOX applications.
Even though the same number of photons N_ppc is traced by each of the N_c CPU cores, the time T_i spent for photon tracing on the i-th CPU core can vary due to non-deterministic MC radiative transfer computations and performance variations (i.e., jitter) in computing environments. A barrier synchronization is thus performed at the end of photon tracing to ensure that all matrices on the N_c CPU cores are ready for parallel reduction. Therefore, the photon tracing time T of a parallel MOX run is the maximum per-core photon tracing time T_max = max{T_i} among the N_c CPU cores (i = 1, 2, ..., N_c). Table II shows the maximum, mean, standard deviation, and coefficient of variation (CV) of per-core photon tracing time T_i as a function of N_c (set equal to a power of 2). Statistical figures report mean results of five tests. For each test, a MOX simulation was performed with fixed parameter settings (including a unique random number seed for each CPU core). It is observed that (1) the mean values are almost the same for all N_c values, and (2) the maximum per-core photon tracing time T_max slightly grows with an increase in N_c. The first point is an expected consequence of the embarrassingly parallel photon tracing with an equal number of photons per CPU core. The second point can be explained by performance variations in the per-core photon tracing time. Figure 5 shows the distribution of observed T_i values in the case of N_c = 128 (Table II). Each of the five histograms shows the frequency distribution of 128 per-core photon tracing time observations during a parallel run. With the same simulation setting, the five runs produced identical simulation results, but the T_i values are approximately normally distributed.
The increasing trend of T_max at larger N_c can be explained by theoretical results of extreme value statistics: the expected largest element in a set of n independent and identically distributed (i.i.d.) samples drawn from a normal distribution grows as n increases [77,78]. Figure 6 gives additional confirmation by numerical results, comparing the measured T_max values (circles), normalized by the z-score transformation (T_max − μ)/σ, where μ and σ are the mean and standard deviation of the measured T_i values, with the theoretical expected maximum among N_c i.i.d. samples drawn from the standard normal distribution N(0, 1) (solid line, based on Eq. (9) in [78]). The theoretical expectation is in agreement with MC results with 1000 trials (pluses). Estimates of expected maxima are less accurate with a smaller MC trial count, as seen in the MC results with five trials (crosses) as well as in the measured T_max values.
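The extreme-value argument can be reproduced in a few lines of Python: a Monte Carlo estimate of the expected maximum of n i.i.d. N(0, 1) samples (trial count chosen here for illustration) grows with n, mirroring the observed growth of T_max with N_c.

```python
import numpy as np

def expected_max(n, trials=1000, seed=0):
    # Monte Carlo estimate of the expected maximum of n i.i.d. samples
    # drawn from the standard normal distribution N(0, 1).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((trials, n)).max(axis=1).mean()
```

For example, `expected_max(4)` is close to 1, while `expected_max(128)` is roughly 2.5, consistent with the growth of T_max from 4 to 128 cores.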
In summary, the MOX parallel performance is quantified by the maximum per-core photon tracing time, which is proportional to N ppc (Table I) and weakly depends on N c (Table II). These characteristics define MOX performance modeling and job scheduling methods (Sections 4.1 and 4.2).

Performance tuning
MOX challenges conventional code optimization techniques because of sparse computations due to a high frequency of conditional branching addressing complex boundary conditions for photon tracing, as well as poor data locality resulting from irregular stride access to radiometric field matrices for photon trajectory tracking [20]. These code characteristics prevent improving the MOX performance with conventional code optimization techniques such as register blocking (loop unrolling), cache blocking (different data matrix decomposition schemes), and instruction-level vectorization (x86 SIMD instructions). Nevertheless, this study demonstrates that careful exploitation of application-specific knowledge can still lead to significant performance gains (Sections 4.3 and 4.4).

Figure 7 shows the interaction of the four components, which are presented in the subsequent sections.

Hybrid approach for execution time prediction
MOX execution time is predicted jointly using an analytical model and an empirical MLP regression scheme. The rationales of this hybrid approach are the following. First, the MOX execution time is proportional to the number of photons per CPU core and approximately inversely proportional to the number of CPU cores (Section 3.3). These properties are easy to capture by a linear expression. Second, the photon tracing time shows a nonlinear dependence on input simulation parameters, which is analytically intractable. Hence this component is modeled using MLP neural networks (see [19] for more detail). The expected execution time T of a production MOX run in a particular parallel computing environment is modeled as a function of a set of input simulation parameters p, the number of CPU cores N_c, and the number of photons N_p as follows:

T(p, N_c, N_p) = (N_p / N_c) · τ(p), (1)

where τ(p) is the per-photon tracing time estimated by the MLP. Note that a small prediction error is expected because of increased time variability with a large N_c (Section 3.3). This error is, however, much smaller than the time variability due to multivariate simulation settings, and is thus neglected in Eq. (1).
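As an illustration, the hybrid scheme might be sketched as below. The `per_photon_model` argument is a hypothetical stand-in for the trained MLP regressor, and the parameter name `attenuation` is illustrative only, not a MOX input name.

```python
def predict_execution_time(params, n_p, n_c, per_photon_model):
    # Hybrid prediction: the linear dependence on N_p and N_c is handled
    # analytically, while the nonlinear per-photon tracing time is
    # delegated to a trained regressor (e.g., an MLP).
    tau = per_photon_model(params)   # seconds per photon (nonlinear part)
    return (n_p / n_c) * tau         # analytical linear scaling

# Hypothetical stand-in for the trained MLP.
stub_mlp = lambda p: 2e-5 * (1.0 + p["attenuation"])
t = predict_execution_time({"attenuation": 0.5}, n_p=10**8, n_c=64,
                           per_photon_model=stub_mlp)
```

Separating the two factors this way means the learned model needs to capture only the per-photon cost, while scaling over N_p and N_c is exact by construction.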

Job-environment mapping algorithm
This section presents the job-environment mapping (JEM) algorithm to minimize the total time for running MOX simulation jobs distributed to multiple execution environments [19]. Let S = {s_1, ..., s_Ns} be a set of N_s simulation jobs, E = {e_1, ..., e_Ne} be a set of N_e execution environments, and f_ij = f(s_i, e_j) represent the cost of running s_i on e_j. The cost function f is defined by means of the MLP algorithms for execution time prediction. Figure 8 shows the JEM scheme. It employs a greedy strategy to assign jobs to environments in descending order of time costs. This mapping algorithm complements the MLP performance prediction, and together they provide the complete HPC job scheduling tool set foreseen in Figure 7.
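The greedy strategy can be illustrated with the following sketch, a longest-cost-first variant; the exact ordering and tie-breaking of the algorithm in Figure 8 may differ, and `cost` stands in for the MLP-based predictor f.

```python
def greedy_jem(jobs, envs, cost):
    # Greedy job-environment mapping sketch: jobs are considered in
    # descending order of time cost, and each is assigned to the
    # environment that currently finishes it earliest.
    load = {e: 0.0 for e in envs}          # accumulated time per environment
    mapping = {}
    for job in sorted(jobs, key=lambda j: max(cost(j, e) for e in envs),
                      reverse=True):
        best = min(envs, key=lambda e: load[e] + cost(job, e))
        mapping[job] = best
        load[best] += cost(job, best)
    return mapping, max(load.values())     # job assignment and makespan
```

Scheduling the most expensive jobs first leaves the cheaper jobs to even out the residual load imbalance, a standard heuristic for minimizing makespan.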

Adaptive termination of photon trajectory tracking
The execution time of a MOX simulation job is decomposed into two parts. One is the time spent computing photon trajectories. The other is the time spent tracking the photon trajectories by adding the photon weight to the entries of the radiometric field matrices along each trajectory. Some matrices are updated, while others are left unchanged, depending on the photon traveling direction and the field-of-view angles of the radiometers being simulated [18]. For example, the radiometric field shown in Figure 4(a) is computed by accumulating only those photons traveling downwards.
Light in a homogeneous seawater column decreases approximately exponentially with depth z [79]. Denoting radiometric quantities by ℜ_k (k = 1, 2, 3, ...) and omitting the spectral dependence for brevity, the decrease of ℜ_k with z is expressed as follows:

ℜ_k(z) = ℜ_k(0^-) · exp(−K_ℜk · z), (2)

where ℜ_k(0^-) are subsurface radiometric values just below the sea surface, K_ℜk are diffuse attenuation coefficients, and z is positive downward. In field marine radiometry, ℜ_k(0^-) and K_ℜk are the OC quantities of interest, derived by applying a regression scheme to a set of ℜ_k(z) values measured at different depths z in the water column. In MOX application scenarios, simulated radiometric fields are used for virtual optical profiling to evaluate the precision of OC data products under controlled simulation conditions that could not be attained in the natural environment. Data reduction results of virtual optical profiling are affected by two sources of variability. One is due to statistical noise intrinsic to the MC approach. This uncertainty must be decreased to a negligible level by tracing a sufficiently large number of photons. The other depends on environmental effects such as the light focusing and defocusing induced by sea-surface waves, seen in Figure 4(b) as periodical large values (spikes, also known as caustics in computer graphics). This variability can be assessed only after the MC noise has been reduced to a negligible level. To evaluate these types of uncertainty budgets, virtual optical profiling is repeated multiple times.
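In this spirit, the data reduction step can be sketched as a log-linear regression recovering the subsurface value and the diffuse attenuation coefficient from sampled profile values. This is a simplified, noise-free illustration; real data reduction protocols [31] involve additional corrections.

```python
import numpy as np

def fit_profile(z, r):
    # Recover the subsurface value R0 and the diffuse attenuation
    # coefficient K from radiometric samples r(z) ~ R0 * exp(-K * z)
    # via log-linear regression.
    slope, intercept = np.polyfit(z, np.log(r), 1)
    return np.exp(intercept), -slope   # R0, K

z = np.linspace(0.5, 10.0, 20)         # depths [m], positive downward
r = 1.8 * np.exp(-0.12 * z)            # noise-free synthetic profile
r0, k = fit_profile(z, r)              # recovers R0 and K
```

With noisy samples (as produced by MC statistics or wave-induced spikes), repeating the fit over multiple profiles yields the spread of data products whose CV is evaluated below.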
The precision of a data product ℑ is quantified by its coefficient of variation

CV_ℑ = (σ_ℑ / μ_ℑ) · 100, (3)

where μ_ℑ and σ_ℑ are the mean and standard deviation of the N data product values ℑ_j (j = 1, 2, ..., N) obtained from repeated virtual optical profiling. MOX simulations concern two general types of environmental conditions, hereafter referred to as Types I and II. Type I simulations are those with an ideal flat sea surface, whereas Type II simulations acknowledge the presence of sea-surface waves. Experimental results with MOX simulations have shown that (1) CV_ℑ decreases steadily with an increasing number of photons in Type I simulations, and (2) CV_ℑ reaches a plateau at a certain number of photons in Type II simulations [18]. In Type I simulations, tracing more photons does not significantly improve the reproducibility of OC data products after the MC noise is reduced to an acceptable level. In Type II simulations, the variability cannot be reduced below a certain level no matter how many extra photons are traced, due to genuine light variability caused by the focusing effects of sea-surface waves. Note that the number of photons necessary to neglect the MC noise is unknown in advance in both cases.
The MOX performance is optimized based on online quality evaluation of radiometric field matrices to stop tracking photon trajectories when predefined precision criteria are met [20]. The rationale behind this early termination of photon trajectory tracking is that it is a waste of time and computing resources to keep updating the matrices that are already free of MC-intrinsic statistical noise.
The adaptive termination of photon trajectory tracking is performed as follows. Every time a certain number of photons on different orders of magnitude (e.g., 10^3, 10^4, and 10^5) have been traced, the statistical quality of radiometric field matrices is quantified by the CV_ℑ values of Eq. (3). Specifically, for each radiometric quantity ℜ_k and data product ℑ, the following stopping criteria are tested:

CV_ℑ ≤ 0.1%, (4)
200 · |CV_ℑ − CV′_ℑ| / (CV_ℑ + CV′_ℑ) ≤ 5%, (5)

where CV′_ℑ refers to the CV_ℑ value in the previous evaluation. The first condition imposes a target precision of 0.1% to take into account that the CV of data reduction results keeps decreasing as photon tracing continues in Type I simulations. The second condition relies on the unbiased percent difference between the present and previous CV values to detect a saturation pattern of the CV reduction rate in Type II simulations. The unbiased percent difference threshold is here set to 5%. If one of the stopping criteria is met, then the updates of the ℜ_k data matrix are terminated. Photon tracing is, however, continued until photon trajectory tracking has been stopped for all radiometric quantities of interest.
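A minimal check of the two stopping criteria, using the 0.1% and 5% thresholds given in the text, might look like:

```python
def should_stop(cv, cv_prev, cv_target=0.1, diff_threshold=5.0):
    # Stop tracking a radiometric field when its CV [%] reaches the target
    # precision (Type I), or when the unbiased percent difference between
    # successive CV evaluations indicates a plateau (Type II).
    if cv <= cv_target:
        return True
    unbiased_diff = 200.0 * abs(cv - cv_prev) / (cv + cv_prev)
    return unbiased_diff <= diff_threshold
```

The unbiased percent difference normalizes by the mean of the two CV values, so the plateau test behaves symmetrically regardless of which evaluation is larger.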

Offline optimization of photon weight threshold
Tracing of a photon continues until either it escapes into the sky from the top boundary of the simulation domain (Figure 3) or its weight (starting from one) becomes lower than a threshold w (e.g., set to 10^-6) after a series of scattering and absorption events. How many photons exit into the sky mostly depends on seawater IOPs (including attenuation and absorption coefficients) and boundary conditions (e.g., sea-surface geometry and sea-bottom reflectance). These environmental parameters, defined as part of case study specifications by OC experts, are not tunable from the computational perspective. In contrast, thresholding the photon weight is suggested by a cost-effectiveness consideration, because a photon weight close to zero does not significantly contribute to the statistical quality of simulated light fields.
A thorough examination of OC and MC application requirements thus permits performance optimization based on quality-speed trade-offs. The trade-off considered here is that a smaller w improves the precision of the simulated radiometric fields, whereas a larger w shortens the photon lifetime, which translates into a shorter execution time.
Seawater IOPs and the photon weight threshold jointly determine the average depth that photons can reach within the water column. Simulated light fields below this average depth are not suitable for virtual optical profiling due to increased MC statistical noise. Note also that virtual optical profiling is conducted by sampling radiometric values within an upper layer of the water column, namely, between just below the surface and a predefined depth referred to as the regression layer depth [31]. Hence the average photon depth must be deeper than the regression layer depth.
The objective is then to determine the largest value of w such that data reduction results are not affected by MC statistical noise. To this end, the photon weight w is parameterized as a function of the depth z [m] (positive downward) in a seawater medium having attenuation c and absorption a coefficients [m^−1]. To simplify this parameterization, an ideal flat surface illuminated only by the sun at the zenith is considered. By the same token, forward scattering is adopted for an analytical approximation of the problem.
The initial photon weight is set to one. Every time the photon is scattered by the seawater, the weight is scaled by the single scattering albedo ω = (c − a)/c to account for absorption by the seawater. Therefore, the number of scattering events m after which the weight becomes smaller than w is implicitly expressed as

    ω^m = w.    (7)

By solving for m,

    m = log w / log ω.    (8)

This gives the maximum number of scattering events that a photon may undergo before tracing of the photon is terminated. Equation (8) indicates that, for a fixed ω (lower than 1 by definition), an exponential increase of w (e.g., from 10^−6 to 10^−3) leads to a linear decrease of m and thus of the per-photon and total execution time.
The optical distance τ that a photon travels before a scattering event is given by an exponential probability density function

    p(τ) = exp(−τ), τ > 0,    (9)

whose expectation value is E(τ) = 1. The geometrical distance r [m] is defined as r = τ/c, so that the photon travels E(r) = 1/c on average between scattering events. The average depth z that the photon can reach after n scattering events is at most

    z = n · E(r) = n/c,    (10)

and hence

    n = cz.    (11)

The maximum photon weight after n scattering events is finally expressed as a function of z as

    w = ω^(cz).    (12)

This analysis relates the photon weight w to the depth parameter z and a given set of IOP values, guiding a proper definition of the weight threshold w that satisfies the quality requirements for virtual optical profiling. An estimate of w that guarantees a sufficiently low MC uncertainty in the upper layer of the seawater column is obtained by letting m = n and w = ω^(cz), and setting z to a value greater than the regression layer depth chosen by OC experts.

The simulation configurations used for validation (Table III) consider four sets of seawater optical property values, two sun elevations, and three sea-surface states. Sea-surface geometry is modeled as the sum of harmonic waves. The attenuation and single scattering albedo values indicate that the considered simulations concern various seawater types ranging from clear to turbid waters (c ∈ [0.2, 1]), as well as from less to more scattering waters (ω ∈ [0.17, 0.9]).
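The threshold estimate of Eq. (12) and the scattering-event bound of Eq. (8) reduce to two one-line formulas; the sketch below (function names are illustrative, not from MOX) shows how they follow from c, a, and a target depth z.

```python
import math

def weight_threshold(c, a, z):
    """Largest photon weight threshold w for which simulated light fields
    down to depth z [m] remain usable, per Eq. (12): w = omega**(c*z).

    c, a -- attenuation and absorption coefficients [1/m]
    z    -- target depth, chosen deeper than the regression layer depth
    """
    omega = (c - a) / c          # single scattering albedo, Section 4.4
    return omega ** (c * z)

def max_scattering_events(c, a, w):
    """Number of scattering events m before the photon weight drops
    below the threshold w, per Eq. (8): m = log(w) / log(omega)."""
    omega = (c - a) / c
    return math.log(w) / math.log(omega)
```

For example, for c = 1 m^−1 and a = 0.5 m^−1 (ω = 0.5), protecting the light field down to z = 10 m yields w = 0.5^10 ≈ 10^−3, at which point m = n = cz = 10.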

MLP algorithms for execution time prediction
This section concerns the validation of the MLP performance modeling method (Section 4.1). MLPs for predicting the execution time of the 20 simulation configurations in Table III are developed on the basis of execution time measurements collected by running training simulations. The selected MLP input parameters are the attenuation c and absorption a coefficients and the sun zenith angle θ_s. To account for different sea-surface states, the 20 test simulation cases were split into three groups W1, W2, and F, and a separate MLP was trained for each group. Each training dataset combines each of five a values at log-scale regular intervals with three c values between a and a_max at linear uniform intervals, where a_max was set equal to 1.1. In all cases, a total of 15 pairs of c and a values as well as the same set of three θ_s values (i.e., 20, 45, and 70 degrees) were used, so that each training dataset includes 45 samples.
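The construction of one such 45-sample training grid can be sketched as follows. The exact range of the five a values is not stated in the text, so the default values here are assumptions for illustration only; the spacing rules (log-regular a, linear c between a and a_max, fixed θ_s set) follow the description above.

```python
def training_grid(a_values=None, a_max=1.1, n_c=3,
                  theta_s=(20.0, 45.0, 70.0)):
    """Build an MLP training grid of (c, a, theta_s) samples:
    5 absorption values x 3 attenuation values x 3 sun zenith angles.
    """
    if a_values is None:
        # five a values at log-scale regular intervals
        # (range 0.01..1.0 is an assumed placeholder)
        a_values = [0.01 * 10 ** (i / 2) for i in range(5)]
    samples = []
    for a in a_values:
        for j in range(n_c):
            # three c values at linear uniform intervals in (a, a_max]
            c = a + (a_max - a) * (j + 1) / n_c
            for th in theta_s:
                samples.append((c, a, th))
    return samples
```

Each (c, a, θ_s) triple is then run once to obtain the execution time target used for MLP training.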
MLP training and validation were repeated M = 10 times to assess the average MLP performance, taking into account variations in prediction results due to the nonlinear optimization performed to train the MLP. The relative prediction error in percent is defined as

    ε_i^k = 100 · (y_i^k − t_i) / t_i,    (14)

where t_i denotes the observed execution time of the i-th simulation configuration and y_i^k is the corresponding execution time prediction from the k-th MLP. The average MLP performance is assessed by the mean μ_ε and standard deviation σ_ε of the ε_i^k values computed over the N simulation configurations and the M predictions for each configuration. (15)

Figure 9 shows MLP prediction results in the right column panels. Circles (blue), triangles (green), and squares (red) indicate results for Groups W1, W2, and F, respectively. The markers show the mean of the M execution time predictions, with error bars indicating ±1 standard deviation. The MLPs trained with the dataset in Panel 9(a) gave smaller μ_ε values but larger prediction uncertainties, especially at longer execution times. This is likely explained as an extrapolation problem: the training data do not completely cover the parameter space containing the validation cases. The training data in Panel 9(c) represent a more regular sampling but also an uneven distribution with respect to the validation samples, which may explain the large prediction errors due to MLP learning with sparse training data. The best-performing MLPs (i.e., those with a moderate μ_ε and the smallest σ_ε on average) were those trained with the dataset in Panel 9(e). This case study highlights the importance of a careful selection of training samples to avoid MLP extrapolation and learning from sparse data.
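The error statistics of Eqs. (14) and (15) amount to pooling the M × N relative errors and taking their mean and standard deviation; a minimal sketch (function name is illustrative):

```python
import math

def prediction_error_stats(t_obs, y_pred):
    """Mean and standard deviation of relative prediction errors (percent).

    t_obs  -- observed execution times t_i, one per configuration (N values)
    y_pred -- y_pred[k][i]: prediction of the k-th trained MLP for
              configuration i (M networks x N configurations)
    """
    # Eq. (14): relative error in percent, pooled over all (i, k) pairs
    errors = [100.0 * (y[i] - t) / t
              for y in y_pred
              for i, t in enumerate(t_obs)]
    # Eq. (15): mean and standard deviation over N*M values
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(var)
```

A symmetric over/under-prediction thus yields μ_ε ≈ 0 with a nonzero σ_ε, matching the role of σ_ε as the prediction-uncertainty indicator discussed above.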

Job-environment mapping algorithm
The relevance of the proposed JEM algorithm is evaluated by considering the scheduling performance of random job allocation (RND) as benchmark. When no prior information is available about the execution time of individual MOX simulations, the most general scheduling approach is random allocation assuming the same running cost for all jobs (homogeneous cases). If relative running costs of different computing environments are known, then more jobs can be assigned to less costly computing environments (heterogeneous cases). These two scheduling scenarios are used for evaluation, considering as a case study the problem of allocating the 20 simulation jobs of Table III to a different number of computing environments ranging from 2 to 5. Results are presented in Table IV through the quantities defined hereafter.
The columns labeled e_i (i = 1, …, 5) indicate the relative costs for running jobs in the five computing environments (blank entries mean unused). Concerning the JEM approach, MLP performance models are trained adopting the dataset of Figure 9(e); T_JEM^mod and T_JEM^obs denote the modeled and observed total execution times obtained with the JEM allocation, and δ_JEM is their relative difference in percent. T_RND^obs is the mean total execution time observed upon 10000 random job allocations; by the same token, σ_RND and CV_RND are the corresponding standard deviation and CV, respectively. Finally, Δ = 100 · (T_RND^obs − T_JEM^obs) / T_RND^obs reports the relative performance gain of the JEM method with respect to the benchmark results.
In the homogeneous cases, the relative costs e_j were set to 1 for all environments, and only the number of available computing systems was changed. In the case of two environments e_1 and e_2, for example, the JEM algorithm allocated 9 and 11 jobs to e_1 and e_2, respectively. The observed (modeled) total execution time in e_1 and e_2 was 2178.6 (2082.1) and 2151.9 (2080.0) seconds, respectively. With the RND scheme, the 20 jobs were evenly assigned to the two environments at random. The result is T_RND^obs = 2495.1 s, which corresponds to a gain of Δ = 12.6%. In the heterogeneous cases, relative differences in job execution costs were modeled by a linear scaling factor indicated as e_i in the table. Both the observed and modeled execution times of the 20 MOX simulation jobs were scaled by e_i to account for the heterogeneous performance of the execution platforms. In the case of two environments with e_1 = 1 and e_2 = 1.5, the JEM method allocated 12 and 8 jobs to e_1 and e_2, respectively. The observed (modeled) total execution time in e_1 and e_2 was 2577.7 (2488.0) and 2629.2 (2511.3) seconds, respectively. With the RND approach, 12 and 8 jobs were also allocated to e_1 and e_2, respectively, due to the different relative running costs. The benchmark result is T_RND^obs = 2996.7 s, corresponding to Δ = 12.2%.

The experimental results of Table IV show an increase of CV_RND with the number of available environments in both the homogeneous and heterogeneous cases. In contrast, the δ_JEM values document an accurate estimate of the total execution time within the range of prediction errors reported in Figure 9(f). Note that this is the case even when the T_JEM^mod values are computed by summing up the execution time estimates of individual jobs. Consequently, the performance gain by JEM increases with the number of computing environments in both the homogeneous and heterogeneous cases.
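A job-to-environment mapping of this kind can be sketched with a greedy longest-processing-time heuristic: schedule the longest predicted jobs first, each on the environment that would finish it earliest given the current loads and relative costs. This is a plausible sketch of the JEM idea under these assumptions, not the exact algorithm of the paper.

```python
def jem_allocate(pred_times, env_costs):
    """Greedily allocate jobs to computing environments.

    pred_times -- predicted execution time of each job (e.g., MLP output)
    env_costs  -- relative running cost e_i of each environment
    Returns (assignment, loads): job index -> environment index, and the
    per-environment total scaled execution times.
    """
    loads = [0.0] * len(env_costs)
    assignment = {}
    # longest predicted jobs first (LPT order)
    for job in sorted(range(len(pred_times)),
                      key=lambda j: pred_times[j], reverse=True):
        # pick the environment with the earliest finish time for this job,
        # scaling the job duration by the environment's relative cost
        env = min(range(len(env_costs)),
                  key=lambda i: loads[i] + pred_times[job] * env_costs[i])
        assignment[job] = env
        loads[env] += pred_times[job] * env_costs[env]
    return assignment, loads
```

The makespan is `max(loads)`; a random or round-robin allocation provides the RND-style benchmark against which the gain Δ is computed.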
Figure 10 shows the CV trends of derived data products as a function of the number of traced photons, considering selected Type I (Conf. 5a–5d) and Type II (Conf. 1a–1d) simulations of Table III and computing simultaneously three radiometric quantities (the downward irradiance E_d, the upward irradiance E_u, and the upwelling radiance L_u [18]). The CV values in the left column panels show a steady decrease with an increasing number of photons, as expected in Type I simulations with the ideal flat surface. In contrast, all CV plots in the right column panels exhibit the saturation patterns foreseen in Type II simulations performed in the presence of sea-surface waves. Thick markers indicate the number of traced photons at which an adaptive termination condition [Eq. (5) or (6)] was satisfied and photon trajectory tracking was discontinued. The dashed horizontal line shows the 0.1% target precision of Eq. (5). These results confirm that the termination conditions for Type I and II simulations are duly met in the flat and rough sea-surface cases, respectively.

Adaptive termination of photon trajectory tracking
Performance improvements by means of the adaptive termination of photon trajectory tracking are assessed as follows. First, the 20 simulation cases of Table III were considered, where ℜ_k generically denotes the radiometric quantities of interest (i.e., the downward irradiance E_d, the upward irradiance E_u, and the upwelling radiance L_u [18]). Table V reports the performance gains based on the adaptive termination of photon trajectory tracking. Varying processor counts were required to address specific constraints of the job execution platform Milipeia employed for the numerical experiments. Speed-ups ranged from 2.16 to 11.5, demonstrating significant run-time performance improvements. These results were obtained using extrapolated references (due to time resource constraints in the computing system). Note that the extrapolated execution time T and the performance gain S are likely underestimates, since the MOX execution time might slightly increase for larger processor counts (Section 3.3).

Figure 11 shows the performance of selected simulation configurations as a function of the photon weight threshold w. Time measurements were performed on Milipeia using 16 CPU cores. As anticipated by Eq. (8), an exponential increase of the weight threshold leads to a linear reduction of the execution time. This result clearly endorses the photon weight optimization based on quality-speed trade-offs (Section 4.4).

Optimization of photon weight threshold
A case study is henceforth addressed to the assessment of the photon weight model [Eq. (12)]. A set of photons was traced with the initial weight set to one, and photon tracing was continued until the photon went below the depth z. When photon tracing was stopped, the following values were recorded: (1) the number of scattering events n_k that the photon underwent; (2) the depth z_k that the photon reached; and (3) the weight value w_k at that time, where the subscript k is the photon index. Figure 12 shows the frequency distributions of the n_k, z_k, and log10(w_k) values. The distribution of z_k values is in agreement with the exponential probability density function of the optical distance [Eq. (9)]. Instead, both n_k and log10(w_k) approximately follow a Gaussian distribution. The mean values of n_k and log10(w_k) are also close to those predicted by Eqs. (11) and (12), respectively.

Figure 13 highlights the effects of the photon weight threshold w on the variability of the derived data products ρ [i.e., subsurface values ℜ_k(0⁻) and diffuse attenuation coefficients K_ℜk], considering three radiometric quantities of interest. The left and right column panels show CV_ρ values [Eq. (3)] with the regression layer depth set to 5 and 15 m, respectively. The CV values were computed from simulated light fields accounting for different w values and photon counts. The result is that, when the regression depth interval is fixed, the overall CV trends remain the same (except for missing CV values when considering a larger photon weight threshold). For instance, the CV trends in Panels (a) and (c) are essentially equivalent; the only differences in Panel (c) are missing CV values of L_u(0⁻) and K_Lu at 10^7 traced photons. Analogous differences can also be observed between Panels (b) and (d).
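The validation experiment can be mimicked with a toy photon tracer under the same simplifying assumptions as the analytical model (forward scattering only, sun at the zenith, flat surface). This is a sketch, not the MOX tracer; it records (n_k, z_k, w_k) per photon, so that w_k = ω^(n_k) holds by construction and the mean of n_k approaches c·z as predicted by Eq. (11).

```python
import random

def trace_photons(c, a, z_stop, n_photons=500, seed=42):
    """Trace photons straight down until they pass depth z_stop [m],
    recording (scattering events n_k, final depth z_k, final weight w_k).

    c, a -- attenuation and absorption coefficients [1/m]
    """
    omega = (c - a) / c                 # single scattering albedo
    rng = random.Random(seed)
    records = []
    for _ in range(n_photons):
        depth, n_scat, w = 0.0, 0, 1.0
        while depth < z_stop:
            # optical distance tau ~ Exp(1), Eq. (9);
            # geometric step r = tau / c
            depth += rng.expovariate(1.0) / c
            n_scat += 1
            w *= omega                  # weight scaled at each scattering
        records.append((n_scat, depth, w))
    return records
```

With c = 1 m^−1 and z_stop = 10 m, the average n_k clusters near c·z = 10 (slightly above, since the last step overshoots z_stop), consistent with the Gaussian-like n_k histogram of Figure 12.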
Missing data points depend on the MC-intrinsic statistical noise in the simulated radiometric fields, because a larger photon weight threshold leads to a shorter photon lifetime and a less populated regression layer.

Figure 13. Effects of the photon weight threshold w on the coefficient of variation (CV) trends of derived subsurface values ℜ_k(0⁻) and diffuse attenuation coefficients K_ℜk, considering the same radiometric quantities as in Figure 10.

The selection of MLP training data is largely application-dependent. Hence, the documented MLP results report a specific case study of MLP algorithm development considering the problem of performance prediction for MOX simulations. The set of MOX input parameters for constructing the training datasets was chosen based on the knowledge of application domain researchers of the radiative transfer processes in natural seawater. Given a set of MLP input parameters, the prediction capability of trained MLPs further depends on the following: (1) the ranges of parameter values; (2) the intervals between values of each parameter; and (3) the total number of training samples. A previous publication [19] examined the effect of the size of the training data, which is confirmed as a major MLP performance factor. The case study reported in the present work follows this research venue by taking the ranges and intervals of parameter values into consideration.
MLP performance models are built upon execution time samples. Hence, the presented method is suitable when (1) the number of simulation cases at hand is larger than the number of training samples, and/or (2) historical performance records from previously executed simulations are available. If the number of simulation cases is small (e.g., up to a few tens), the total execution time can be reliably estimated by running all of them with a reduced photon count N̂_p using a small number of CPU cores N̂_c, and extrapolating the measured execution time T̂ by Eq. (1). If the number of simulation cases is large, then the MLP method can reduce the total cost of execution time prediction. In addition, application studies may accumulate a number of historical performance records by addressing many different simulation cases over time. Such performance records effectively provide a dataset for training an MLP execution time model, which can then be applied to future simulations. Hence, the MLP prediction method and the 'run all' approach are complementary.

Speed-ups by means of the adaptive termination of photon trajectory tracking may vary significantly depending on the convergence properties of individual simulation cases and modeled radiometric quantities. Seawater optical properties, sea-surface geometry, and illumination conditions jointly determine how quickly the variability due to MC statistical noise is reduced by tracing an increasing number of photons. In addition, multiple radiometric quantities are simultaneously computed in a single simulation case, and they display distinct variability convergence trends because of the anisotropic light distribution in the water column. Hence, the adaptive stopping criteria [Eqs. (5) and (6)] are satisfied at a different number of photons for each simulation case and radiometric quantity, as documented by the numerical results in Figure 10.
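The 'run all' extrapolation can be sketched as follows. Since Eq. (1) is not reproduced in this section, the sketch assumes the simplest plausible form: execution time scales linearly with the photon count and inversely with the number of CPU cores (ideal parallel speed-up).

```python
def extrapolate_time(t_hat, np_hat, nc_hat, n_photons, n_cores):
    """Extrapolate a measured execution time t_hat, obtained with a
    reduced photon count np_hat on nc_hat CPU cores, to a full run with
    n_photons photons on n_cores cores.

    Assumes linear scaling in the photon count and ideal parallel
    speed-up; as noted in Section 3.3, this tends to underestimate the
    time at larger core counts.
    """
    return t_hat * (n_photons / np_hat) * (nc_hat / n_cores)
```

For example, a 10-second probe run with 10^5 photons on 4 cores extrapolates to 250 seconds for 10^7 photons on 16 cores.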
Smaller photon counts at which photon trajectory tracking is terminated lead to larger speed-ups.

CONCLUSION
This paper presented an HPC framework for MC radiative transfer simulations in the OC application domain. Specific solutions for performance modeling, prediction, and optimization were developed to support efficient and flexible execution of computer experiments for OC studies on in situ marine radiometry by means of the MOX simulation code.
The present study contributes to a clear separation and systematic treatment of HPC, MC, and OC application requirements through a combination of HPC, machine learning, and adaptive computing techniques. The added value of the work is in fact the improvement and assessment of computational methods and tools for large-scale MC simulations integrated in the form of an HPC-oriented problem-solving environment specifically tailored to OC applications.
The study underlines the importance of close collaboration between HPC researchers and application domain scientists to identify domain-specific knowledge of relevance to the performance characterization of target applications and exploit it for the development of supporting HPC tools and techniques. In the case of MLP performance modeling and job scheduling, the use of domain knowledge is essential for feature selection during MLP algorithm development. By the same token, both the adaptive termination of photon trajectory tracking and the offline tuning of photon weight threshold are realized based on in-depth understanding of marine radiometry requirements. Note that some of the results documented in Section 5 refer to real application studies [31][32][33].
The novelty of the predictive performance modeling and scheduling is the decoupling and separate handling of linear and nonlinear execution time components. Modeling the former component depends on the a priori characterization of MOX parallel performance, whereas predicting the latter relies on a learning-from-data approach. The original contribution of the online performance tuning is the combined use of the general convergence property of MC precision (i.e., at the rate of σ/√n, where σ² is the variance and n is the sample count) and an application-specific property (i.e., a saturation of noise reduction due to rough sea-surface geometry). The general MC noise convergence usually applies only to the ideal flat sea surface. In contrast, the proposed online tuning method exploits the new finding that the variance of radiometric data reduction results is constrained by the presence of sea-surface waves. The OC application relevance is to quantify the uncertainty due to rough sea-surface geometry over data reduction methods for field radiometric measurements, with the flat sea surface as a benchmark case. MOX simulations have then shown that the latter convergence property can offer major performance gains. The novel aspect of the offline threshold parameter tuning is instead the identification of specific MC application requirements to set up simulation configurations accounting for quality-speed trade-offs. Remarkably, both the online and offline tuning methods are orthogonal to parallelization, so the reported performance gains can be further amplified by parallel speed-ups.
The validity of the developed methods and the relevance of documented insights are general because (1) MLP performance modeling and job scheduling solely rely on data samples (i.e., input simulation parameters and corresponding execution time observations) and (2) the online and offline tuning methods only deal with quality indices, no matter what application problem and computing environment are considered.
Future work includes the application of the presented HPC framework to three-dimensional extensions of the MOX code to investigate OC above-water radiometry [32,33,80]. Dynamic changes in the execution time of simulation jobs (due to the online performance tuning) will challenge efficient resource utilization based on job scheduling. Hence a future study is planned to address run-time monitoring of simulation jobs and dynamic rescheduling of waiting jobs to improve the utilization of available HPC systems. Extending the target execution platforms to cloud computing environments is also foreseen on the basis of scientific workflow systems for describing application scenarios and controlling the distributed execution of MC OC simulations.