Statistical analysis of multi-day solar irradiance using a threshold time series model

The analysis of solar irradiance has important applications in predicting solar energy production from solar power plants. Although the sun provides every day more energy than we need, the variability caused by environmental conditions affects electricity production. Recently, new statistical models have been proposed to provide stochastic simulations of high-resolution data to downscale and forecast solar irradiance measurements. Most of the existing models are linear and highly depend on normality assumptions. However, solar irradiance shows strong nonlinearity and is only measured during the day time. Thus, we propose a new multi-day threshold autoregressive model to quantify the variability of the daily irradiance time series. We establish the sufficient conditions for our model to be stationary, and we develop an inferential procedure to estimate the model parameters. When we apply our model to study the statistical properties of observed irradiance data in Guadeloupe island group, a French overseas region located in the Southern Caribbean Sea, we are able to characterize two states of the irradiance series. These states represent the clear-sky and non-clear sky regimes. Using our model, we are able to simulate irradiance series that behave similarly to the real data in mean and variability, and more accurate forecasts compared to linear models. days is from 7:00 h to 17:00 h, and data are collected every second. We have then a total of 345 daily time series, each consisting of 36,000 measurements. We divide the complete data set into a training set and a testing set. We fit our multi-day model (3) to the first 300 days to describe the stochastic variability of the irradiance time series. Then, we compare the prediction using our fitted model for the F I G U R E 14 Comparison of the clear sky and non-clear sky regimes. The autocorrelation function suggests a stronger time dependence for the non-clear sky regime. The log-spectrum shows that the non-clear regime has a higher amplitude than the clear regime

F I G U R E 1 (a) Solar radiation measurements (figure from slidesshare.net). (b) GHI measured by a pyranometer every 30 s. Daily irradiance variability can be very different from day to day and the drops of the irradiance values can be observed in any time of the day and a more extended time scale change can affect the storage system (Kleissl, 2013). Thus, it is essential to understand such variability for the management of solar power production.
The amount of electricity generated by PV systems depends on the intensity and wavelength of solar radiation available to the PV device. In this article, we focus on the analysis of the global horizontal irradiance (GHI) that measures the total hemispheric down-welling solar radiation on a horizontal surface ( Figure 1a). There is a variability (deterministic) in the GHI daily time series that corresponds to the Sun's movement during the day, and that can be precisely predicted. For example, Figure 1b shows two examples of daily GHI time series that raise to a positive value around 6:30 a.m. and drop after 6:00 p.m. (period depending on location and season of the year). The main interest in GHI data is to characterize the variability that could be caused by clouds, temperature, or any other atmospheric conditions. Figure 1b (left) shows high variability between 11:00 a.m. and 1:00 p.m. followed by a long period of low irradiance and Figure 1b (right) shows a high variability in the morning but a very clear sky afternoon with low variability.
One of the main goals for modeling solar irradiance is to quantify the volatility/variability at a specific location. PV integrated systems rely on this accurate variability estimation, which is relevant to the grid manager to efficiently distribute the power or select a new site for new system installation. In practice, estimating the variability means a precise estimation of the probability distribution that generates the data. The most common strategy is to propose a model that generates synthetic irradiance time series which reproduce similar expected values and variability. For example, Zhang et al. (2019) proposed a model based on a non-Gaussian distribution to generate high-frequency solar irradiance scenarios. Recent developments for analyzing GHI data focus on stochastic simulation of high-resolution data and forecasting of spatiotemporal GHI (Zhang et al., 2021).
Although this method performs extremely well, they need to first classify cloudy and clear days, based on a fixed threshold rule of low and high variability observed in the increments of GHI. A more flexible approach where the days are classified by clustering algorithms was recently introduced by Frimane et al. (2019). Under this model, a posterior inference using Markov chain Monte Carlo algorithm is applied to estimate daily distributions of the solar irradiance.  proposed an alternative model with a more robust classification procedure between clear and non-clear days. Previous research efforts in this area also include models with covariates or cloud motion models (Boland et al., 2001;Bright et al., 2015Bright et al., , 2017Fernández-Peruchena et al., 2015). However, the classes vary among these models and the number of classes may be subjectively selected.
Popular statistical models that allowed regime switching between time-dependent data are the hidden Markov model (HMM) and the mixture chain model. The HMM assumes the presence of a hidden random process that determines the changes in the dynamics of the observed data. An HMM with three hidden states can have an interpretation linked to three categories of cloud cover (Shepero et al., 2019). However, the goodness of fit for this model to generate GHI data can be very poor in some cases. More hidden states need to be considered to improve the fitting, but this increases the computational cost. The use of multiple time series (Fox et al., 2014) could improve these models, but this option has not been explored for irradiance time series. Barbič et al. (2004) propose a mixture model to detect the distinct behavior of motion data. For irradiance data,  investigated these models in the case where observations are from different spatial locations. In this article, we model irradiance data without an a priori classification. We allow our model to change the state (clear or non-clear sky) based on its immediate past values. By considering a two-state model, our proposal is interpretable in terms of clear and non-clear sky regimes, which can be observed partially across days. Although a model with more than two regimes could result in a better fitting of the data, the resulting model might lack interpretability.
A feasible approach for modeling GHI is to consider a time series model. Time series models are efficient in forecasting. Accurate predictions are relevant for efficient energy management, where efficiency means a precise balance between electricity production and consumption at any moment. The grid management requires short, medium, and long-term predictions to feed the electricity networks. Some of the desirable characteristics of such time series model are seasonality and nonlinearity. Grantham et al. (2018) proposed a time series model that represents the seasonality with as a cosine and sine functions of the main Fourier frequencies observed on the data and an autoregressive component. To increase the flexibility of their model, a nonstationary white noise component is considered. However, the nonstationary assumption may cause inconsistent estimators. Similarly, Al-Awadhi and El-Nashar (2002) also consider a Fourier expansion to model the seasonality of the irradiance data, and a bilinear time series component to model the stochastic component of daily global radiation. However, a priori division of the data set in clear and non-clear days is also required. Voyant et al. (2020) proposed a prediction model based on the periodic autoregressive (PAR) model coupled with a Box-Cox transform. This model considers a set of coefficients per hour that uses all-day information for estimating, which gives flexibility to the model. Das et al. (2021) proposed a cyclostationary model for short term prediction of hourly solar irradiance. Although, for high temporal resolution data, this might result in a large number of coefficients for PAR models or non-identifiable cycles due to the high level of noise variation. To avoid these issues, we propose a nonlinear time series approach to model solar irradiance data. Within the existing nonlinear time series, threshold models have shown to be generally sufficient enough to model cyclical econometric data (Tong & Lim, 1980). In the literature, we can find some examples of econometric models applied to environmental data (Grillenzoni & Carraro, 2021;Magnus et al., 2011;Samanta & Ghosh, 2011).
The GHI series can be transformed to clear-sky index, K(t), as where the time unit t are seconds, I(t) is the GHI observed at time t, and I CS is the horizontal extraterrestrial irradiance that can be computed using an atmospheric model (see Kleissl, 2013). Then, K(t) isolates the variability caused by random environmental conditions. Generally, researchers model the logarithm of clear-sky index to avoid the restriction of being strictly positive. In this article, we propose the use of threshold autoregressive (TAR) model on the logarithm of clear-sky index series to forecast and describe solar irradiance data. Unlike the typical models developed for econometric time series, we treat daily irradiance time series as independent replicates and propose a new multi-day model, where daily irradiance time series share a common cyclical component but with different intra-day variability. Our model identifies clear and non-clear day periods without any prior classification of the data by introducing a threshold effect. We propose an estimation procedure that takes into account the single day variability and all days common information. The estimated variability per day also serves as a measure of weather variability across days. Additionally, we show that our model can forecast the irradiance time series more accurately than the persistence ensemble model (commonly used on energy forecasting), some machine learning algorithms, PAR models, and ARIMA models.
This article remaining is organized as follows: Section 2 introduces the real irradiance data set from the Guadeloupe island group and presents the preliminaries about the existing nonlinear time series model and its theoretical properties. Section 3 presents our proposed model and the statistical inference procedure for the model parameters. Lastly, Section 4 presents a detailed data analysis of the Guadeloupe island group irradiance series by applying our multi-day model. This section also compares our model with different competitors and proposes an algorithm for a probabilistic forecast.

Data description
Here, we study an irradiance time series, {I(t)}, from January 21, 2011, to December 31, 2011. Data are collected in Guadeloupe island group, a French overseas region located in the Southern Caribbean Sea (Lat 16.22 to Long −61.53 approximately, see Figure 2). Data were collected every second, and the global horizontal irradiance is reported. Then, we use the horizontal extraterrestrial irradiance reported at this location by the National Solar Radiation Database (NSRDB) (Sengupta et al., 2018) to compute the clear-sky index, K(t).  Figure 3 shows a subset of the I(t) series that corresponds to March 2011 from 7:00 a.m. to 5:00 p.m. On the top of the GHI series, we plot a dashed line that represents the extraterrestrial irradiance, I CS . On 1-s data, a clustering of the I(t) series will result either in one big cluster (non-clear days) or several small clusters (different degrees of non-clearness). Therefore, we do not consider a prior classification of the data. We now compute the logarithm of the clear-sky index defined in (1). Figure 4 shows the log K(t) series corresponding to March 2011. We observe that March 9, March 18, and March 31 are days on which the irradiance is close to the horizontal extraterrestrial irradiance, whereas March 5, March 14, and March 25 are days on which we observe a high variability on the horizontal irradiance. In general, irradiance profiles per day are different across the year.

Nonlinearity and non-Gaussianity
Linear models are the most commonly used statistical models. However, for complex real data applications, they suffer from certain limitations. For example, under a linear representation of a time series, if a Gaussian white noise is assumed, all the finite-dimensional distributions of the process are expected to be Gaussian. To visualize if the distribution of the F I G U R E 4 Log K(t) time series for 1-month data; the dashed line at 0 represents the irradiance equal to the horizontal extraterrestrial irradiance. Drops on the irradiance value can be observed at any time of the day log K(t) series may be Gaussian, we plot the histogram of the series per day in March (see Figure 5). Some days, such as March 14 and March 23, show strong evidence of a bimodal density, which is visually against Gaussianity. The bimodal density strongly suggests a non-Gaussian distribution of noise, even more, the feasible presence of regimes. Additionally, we plot the bivariate density contour plots of (log K(t), log K(t + 1)), where we observe empirical evidence of a two-state correlation structure on days like March 13, 14, 21, and 23. These visual tools together are evidence against a linear model. In the literature, there are several tests to revise whether or not a linear time series model is reasonable. Two of the most commonly used are the nonlinearity test proposed by Keenan (1985) and Tsay (1986). Keenan's test is motivated by the approximation of a nonlinear stationary time series using a Volterra expansion. Assume the following model, where t are iid (0,1). Keenan's test is equivalent to the F-test for = 0. It is computationally simple and handy on small sample sizes. Tsay's test extends Keenan's test with a more general nonlinear alternative by replacing the term Both tests are implemented in the TSA R package (Chan & Ripley, 2018). We apply both tests to each of the daily log clear-sky index time series from Jan 21, 2011 to Dec 31, 2011, and find strong evidence against linearity (p-value <0.01).

TAR model
In general, linear models have been shown to be very useful to predict weakly stationary processes. However, under the evidence of non-normality or nonlinearity, they can perform poorly. In nonlinear time series models, one possibility F I G U R E 5 Empirical evidence against linearity. Top: Daily histogram of logarithm of clear-sky index time series. A bimodal histogram is a strong evidence against Gaussianity. Bottom: Bivariate density contour plots of (log K(t), log K(t + 1)). The presence of two modes with different correlation structure supports the motivation on a two-state time series model is an autoregressive polynomial model of higher order degrees. However, such models are not useful for extrapolating, since they quickly blowup to infinity. In this article, we propose to use the TAR model to describe the dynamics of irradiance time series. We briefly introduce here the TAR model; more details can be found in Cryer and Chan (2008) and Tong (2010). Tong and Lim (1980) proposed the TAR model as a piecewise linearization through the introduction of a switching mechanism. A general form of the TAR model assumes that where t are iid (0,1) and {J t } is an indicator time series that takes values between {1, 2, … , J}. To simplify this form, the most commonly used model is the two regime self-exciting threshold autoregressive (SETAR) model With this model, the parameter is the threshold parameter, and d is the delay parameter. Then, we refer to the SETAR model of order p to be where { i,j , i = 1, 2, j = 0, 1, … , p} are real constants and t are iid (0,1). When dealing with time series models, we need to question the existence of the stationary distributions for the process. Chan and Tong (1985) showed the sufficient conditions for a SETAR model to be ergodic and therefore the existence of a stationary distribution. However, these Lyapunov conditions are difficult to verify. In practice, a handy tool is to run the skeleton model (noise-free model) with different initial conditions (Tong, 2010). Amendola et al. (2009) showed that the SETAR model with two regimes (2) is weakly stationary if the matrices Φ (1) and Φ (2) have both dominant eigenvalues less than one, where ) .
denote the identity matrix of dimension p − 1 and a matrix of zeros of dimension p − 1 × 1, respectively. SETAR models have been widely used in financial time series analysis (Tong, 2010). The threshold model has also been applied to study epidemiological time series (Watier & Richardson, 1995) and cyclical fish landings (Samanta et al., 2011). These studies showed the versatility of SETAR nonlinear time-series models, which is capable of better describing cyclical fluctuations when there is evidence of nonlinearity.

PROBABILISTIC MODELING OF THE CLEAR-SKY INDEX
We propose a new approach to model the irradiance time series. Our model does not require a priori classification of days among cloudy or bright days, which makes the statistical analysis more robust.

Model
Let log K r (t) be the logarithm of clear-sky time series defined in (1) for day r at time t, where r = 1, … , n and t = 1, … , T.
Remark 1. Usually, time series analysis is performed with one single long (preferable) time series. However, in some applications, the available data consist of replicated series (Silva et al., 2005). Our model assumes that each day is a replicate of a similar SETAR model. The model has the same autoregressive coefficients each day, although they may have different noise variations, { i,r , i = 1, 2}.
Since irradiance data can only be recorded during the day time with sunlight (e.g., from 7 a.m. to 5 p.m.), it is then not reasonable to join the daily time series as a uniquely long time series. Instead, our model assumes that, for each day, the observed time series is a replicate of a similar nonlinear cyclical time series, which we model as a SETAR model. To allow for different stochastic variations on irradiance data across days, which occurs due to different environmental conditions, we allow ( 1,r , 2,r ) to be day-specific.
Remark 2. Note that the multi-day SETAR model has the same skeleton model (noise free) for all days. Then, the conditions shown by Amendola et al. (2009) for the SETAR model to be stationary are also sufficient conditions for the multi-day SETAR model to be stationary per day.

Estimation procedure
Although we know the conditions for the stationarity of the multi-day SETAR model that can be used to compute the full likelihood, the stationary distribution does not have a closed form as in the SETAR model and it involves computing multiple integrals. Similar to the estimation of a SETAR model, we propose a two-step inference procedure. First, we estimate the threshold and delay parameters, and d, by using log-likelihood criteria. We then consider a weighted least squares method to estimate the autoregressive parameters,

Estimation of d, , and p
We temporarily assume that p is known and assume normality on all r (t). Then, for each day r and value (d, ), let Denote by m i,r the cardinality of M i,r , and m 1,r + m 2,r = T − p for all r. For day r, we regress the series of log K 1,r (t) on its lags 1 to p to find estimates of̂r 1,0 , … ,̂r 1,p and compute the maximum likelihood noise variance estimate,̂2 1,r , by the sum of squared residuals divided by m 1,r . Similarly, we estimatê2 2,r using the series of log K 2,r (t). Then, we estimate (d, ) by maximizing the profile conditional log-likelihood function In practice, the autoregressive order p can be different for the two regimes. If this is the case, let p 1 and p 2 be the autoregressive orders for the low and high regimes, respectively, and estimate the parameters (d, , p 1 , p 2 ) by minimizing the AIC, where AIC(d, , p 1 , p 2 ) = −2l(d, ) + 2(p 1 + p 2 + 2).

3.2.2
Estimation of { i,j , i = 1, 2, j = 0, 1, … , p} To estimate the autoregressive coefficients, we consider a conditional least squares (CLS) approach using the information across all days. The CLS estimator is obtained by minimizing where denotes the vector or autoregressive coefficients for both regimes. Note that we assume different variances 2 1,r and 2 2,r within each day r. The CLS estimator will then be unbiased but not necessarily the one with the least variance, and therefore not optimal. Our setting is very similar to the analysis of linear models with m different groups where variances are constant within each group. Then, weighted least squares (WLS) is the natural estimation method to apply. Hooper (1993) studied the problem of optimal weighted least squares for the estimation procedure. If 2 1,r and 2 2,r are known for each r, a feasible solution is to apply WLS using weights 1∕ 2 i,r , since these weights correspond to the optimal solution when all noise terms are normally distributed. However, we do not know the true variability within each day. Empirically, the effect of only using the MSE as estimators for the weights produces a poor estimation of the AR coefficients by shrinking too much of the data. This issue was already observed on Hooper (1993) paper and we also observe the poor estimation in our model and propose a similar strategy. Hooper (1993) shows that a WLS estimator with a set of weights that combines a model-free and model-based estimators for the variances is asymptotically optimal (unbiased and lowest variance). We adjust this technique to our model setting as follows.
The idea of weighted CLS (WCLS) estimation is to transform the variable of interest by multiplying each term with a weight w i,r and then apply the CLS to estimate the parameters . Our proposed weights combine two elements: (1) the maximum likelihood noise variance estimate for day r,̂2 1,r and̂2 2,r , and (2) the Bayesian estimate of noise variance using a normal model (O'Hagan & Forster, 2004). Our model-based estimator for the variance is the Bayesian estimator formulated as follows. For each i and r, the prior distribution for 1∕ 2 i,r is Gamma(a, b) and the prior distribution for , where a > 0, b > 0, I p+1 denotes the identity matrix of dimension p + 1 and ′ denotes the transpose. Then, the corresponding posterior distribution is , where X, Y denotes all the data information, past and present (analogous to linear regression models), and IG denotes the inverse-gamma distribution. The posterior parameters are where Λ = (X ′ X + I p+1 ) −1 and = ΛX ′ Y.
By considering the previous model and applying Theorem 3 in Hooper (1993), we obtain the optimal weights as Lastly, the WCLS estimator for is the one that minimizes: where Z r (t) = w 1,r log K r (t) if t ∈ M 1,r and Z r (t) = w 2,r log K r (t) if t ∈ M 2,r . Compared to the one day estimator, the WCLS estimator̂W CLS = argminQ( ) has a reduced variance by a factor 1∕n because of available replicates.

STATISTICAL ANALYSIS OF GUADELOUPE ISLAND'S DATA
As mentioned in Section 2, we analyze the log clear-sky index, log K(t), from January 21, 2011, to December 31, 2011, corresponding to the Guadeloupe island group. The considered period for all days is from 7:00 h to 17:00 h, and data are collected every second. We have then a total of 345 daily time series, each consisting of 36,000 measurements. We divide the complete data set into a training set and a testing set. We fit our multi-day model (3) to the first 300 days to describe the stochastic variability of the irradiance time series. Then, we compare the prediction using our fitted model for the last 45 days.

F I G U R E 6
Estimation of (d, , p 1 , p 2 ) via minimum AIC. To reach the minimum AIC the smaller value of d = 1 is preferable

Estimation of (d, , p 1 , p 2 )
First, we choose the threshold value , delay parameter d, and autoregressive orders p 1 and p 2 as described in the previous section. Figure 6 shows the AIC values for different combinations of the parameter space. We search for the parameter within the 0.15% and 0.85% quantile of the observed values. The smallest value for d gives lower values for the AIC function and a threshold value within (−0.1, 0) for all different combinations of the autoregressive orders. By increasing the autoregressive order of the high regime, p 2 , we also decrease the AIC values. In contrast, there is no significant decrement when increasing the autoregressive order of the low regime, p 1 . Then, we getp 1 = 4,p 2 = 5, and by minimizing the AIC we choosed = 1 and̂= −0.02. If log K(t) = 0, then the observed irradiance is equal to the expectation that corresponds to a clear sky period. Then, the estimated threshold valuê= −0.02 can be interpreted as a split between a clear sky regime (high regime, log K(t) > −0.02) and non-clear sky regime (low regime, log K(t) ≤ −0.02) due to environmental conditions.

Estimation of noise variability and autoregressive parameters
We fit a SETAR model per each day r with (d,̂,p 1 ,p 2 ) and estimate the across days variabilities 2 i,r i = 1, 2 for the non-clear sky regime and clear sky regime, respectively. Now, we use these values together with the model-based estimators to compute the weighted least squares estimator of the autoregressive parameters { j,i , j = 1, 2, 3, 6, i = 1, 2}. For the Bayesian normal-mixture model, we fix the hyper parameters a = 0.01 and b = 0.01. We update the estimated varianceŝ2 i,r values by computing the squared residuals with the WCLS estimators. Figure 7 shows the distribution of the estimated variability across days, 2 1,r and 2 2,r . The noise variability can be higher in the non-clear sky regime than in the clear-sky regime. Lastly, our estimated multi-day model is: where { i,r , i = 1, 2, r = 1, … , n} are positive constants, r (t) are iid (0,1) and 1{A} is equal to 1 if A is true.

Uncertainty of parameter estimates
Based on our 2-step estimation procedure, quantifying the uncertainty propagation is a challenge. We investigate if the small changes in the selection of the threshold/delay parameter could produce significant changes on the autoregressive F I G U R E 7 Estimated variability across days for non-clear sky regime (log K r (t − 1) ≤ −0.02) and clear sky regime parameter estimates. We perform a small numeric experiment, where we select threshold values, , between −0.01 and 0.04, and d = 1, 2, 3. Then, we repeat the estimation of the autoregressive parameters as in Section 4.2. Figure 8 shows that there is not a strong influence. We observe variability on the estimated parameters but the effect is very small in most cases. It might be because we have a large data set, so the autoregressive estimated parameters are consistent. Now, we assume our estimated model is the ground truth and generate bootstrap series as follows.
(1) Sample a paired value of ( 1 , r, 2,r ). (2) Start K(t) = 0 for t = 1, … , 5. (3) Using Gaussian noise, we generate a series of lengths T = 36,000 following model 6. (4) Repeat steps 1-3, 300 times. The results of 1-4 are bootstrapped data series of the same length and days as the original data. (5) Use the bootstrapped data set to estimatêB i,j . We repeat this procedure M = 500 times. Figure 9 shows the boxplots of the estimated parameters. Overall, the uncertainty is low, being the non-clear sky regime of higher estimation uncertainty compare to the clear-sky regime.

Prediction accuracy of the estimated multi-day SETAR model
We use our estimated model (6) to obtain one step forecasted values. Assume for day r that we have information up to time t. Then, the one-step-ahead forecasted value islog K r (t + 1) = 0.00024 + 1.538 log K r (t) − 0.614 log K r (t − 1) + 0.120 log K r (t − 2) − 0.045 log K r (t − 3)if the latest value of log K r (t) belongs to the non-clear sky regime (≤ −0.02)

F I G U R E 9
Bootstrap results for parameters estimates { i,j }. Each point corresponds to an estimated autoregressive parameter using a bootstrapped sample and given = −0.02, d = 1, p 1 = 4, and p 2 = 5 F I G U R E 10 Residuals computed using the one-step ahead forecasted and observed values. Residuals are very small in both the training and testing data sets orlog K r (t + 1) = −0.00051 + 1.601 log K r (t) − 0.790 log K r (t − 1) + 0.240 log K r (t − 2) − 0.092K r (t − 3) + 0.040 log K r (t − 4) in the other case. For h > 1, the h-step-ahead prediction follows the same rule but replaces the log K r (t + h − 1) by its predicted valuelog K r (t + h − 1). We apply this rule to the training and testing data sets. Figure 10 shows the histogram of the residuals computed between the observed and forecasted values, log K r (t + 1) −log K r (t + 1). In both samples, training and testing the residuals are close to zero, suggesting a highly accurate fitting of the model to the real data. Figure 11 shows a 6-day subsample (three on the training and three on the testing sets) of the observed and predicted series. The fitting of our model is highly accurate for both clear sky and non-clear sky periods.
Additionally, we explore the uncertainty in prediction using the bootstrapped samples obtained in Section 4.3. We compute 10-s-ahead predictions on some of the testing days using the bootstrapped valueŝB i,j . Then, we built prediction bands using the 0.025 and 0.0975 empirical quantiles. Figure 12 shows segments of randomly selected irradiance series to illustrate better the uncertainty bands. Observed data are indicated using black lines and dots, blue dots indicate predictions, and the uncertainty prediction bands are red. The forecast is close to the actual values, even though the observed data is sometimes outside the uncertainty bands.
An interesting question is whether our method is more accurate than other prediction focus methods for solar irradiance data. We compare our results with those predictions obtained by the PAR model, ARIMA model, and two machine learning algorithms, the multilayer perceptron (MLP) and the bagged regression trees (RT). Fouilloy et al. (2018) show a set and second row corresponds to dates on the testing set. Observed and forecasted data are shown in black and red, respectively. The prediction performance is very accurate in both cases (for training and testing dates) F I G U R E 12 10-Step-ahead predictions illustration. Blue points are the 10-s-ahead predictions and the black line with points is the observed data. Prediction bands (red) are computed using 500 bootstrap samples comparative study of machine learning forecast methods under different weather variability factors. Among these methods, the MLP and bagged RT were the most accurate methods. Below, we describe briefly the competitors methods applied to log K r (t).

PAR model assumes that log
where r (t) is a periodic white noise with variance 2 (t). In other words, the PAR model assumes an autoregressive model where the coefficients depend on the day-hour, t. To fit the PAR model, we use least squares for each t using different days' information. Finally, the one-step-ahead forecast isl 2. ARIMA (p, d, q) model is the most general time series model which assumes that where B is the backshift operator and (z) and (z) are polynomials of degrees p and q respectively. Using AIC criteria we fit to the data an ARIMA(1, 1, 2) model to the irradiance series. More details and the h-step-ahead predictor see Brockwell and Davis (2006). 3. Bagged RT is an improvement on RT, the model creates B bootstrapped samples from the training set and finds the best (simple) regression tree within each sample. Then, we average all the predictions. This procedure reduces prediction variance. We use 25 added regression trees. 4. MLP is a type of artificial neural networks with one hidden layer and one output layer used. The prediction of the MLP using m neurons, one output neuron, and t input variables iŝ where i,j are the weights between the input i and the neuron m, g is the activation function, and w j is the weight between the output and the hidden neuron. Similar to Fouilloy et al. (2018), we select m = 5 hidden neurons and the sigmoid activation function.
For h-step-ahead forecast where h > 1, we apply the same rule as with the SETAR model. We compare the five methods using the mean squared error (MSE), We use the first 300 days to train the models and compare the forecasts values on the last 45 days. To fit the ARIMA, bagged RT and MLP we use the R packages forecast, ipred, and neuralnet (Fritsch et al., 2019;Hyndman & Khandakar, 2008;Peters & Hothorn, 2019). Table 1 shows the MSE values for 1-s-ahead and 5-s-ahead prediction forecast. We include the standard deviation of the SE (squared error) in brackets. In both cases, the multi-day SETAR model has the smallest MSE. For the 1-s prediction, the closest competitor is the PAR model while for the 5-s prediction the closest competitors are the ARIMA and MLP. Notice that PAR 5-s predictions are considerably high compared to the other competitors. Based on data results, the PAR coefficients are not robust to large drops on the irradiance series and highly affect the long-term prediction. Overall, the results agree with what we expected due to the highly nonlinearity observed on the data. We conclude that our model is preferable since the prediction is as good as other competitors and our model has a stronger interpretability in terms of clear-sky and non-clear sky regimes.

Stochastic features of the estimated multi-day SETAR model
In this section, we investigate some probabilistic features of our fitted model. Our goal is to identify characteristics of the clear-sky and non-clear sky regimes that could be used for the grid manager to determine in which regime the grid is working. First, we verify if the estimated model corresponds to a stationary model for each day. To do this, we compute the eigenvalues of the matrices 1.538 −0.614 0.120 −0.045 Then, the maximum eigenvalue for Φ 1 is 0.9975 and the maximum eigenvalue for Φ 2 is 0.9969. Following Amendola et al.'s, (2009) criteria, the estimated model corresponds to a stationary model. We also consider the skeleton plot as suggested by Tong (2010). The skeleton corresponds to the deterministic part of the model, that is, We start the skeleton with different initial values and follow the trajectory. If the skeleton is bounded when t increases, then it is visual proof of ergodicity of the process. Figure 13 plots some paths of the skeleton, and it also suggests that our model is stationary. Then, we have enough evidence to claim that our estimated multi-day SETAR model is stable. Therefore, we can compute a stationary form of the autocovariance and spectral density Now, we can compare the characteristics of the clear sky and non-clear sky regimes. Figure 14 (left) plots the autocorrelation function (ACF) computed with the coefficients for each regime. Here, a lag corresponds to 1 s. While the non-clear sky regime is more dependent on its past values, the clear sky regime (if we stay on it) values are almost independent after 10 min (600 s). Additionally, we visualize the spectral density for each regime computed as The spectrum gives information about the oscillatory behavior of the series. If the spectrum is concentrated in the low frequencies, it represents a slower oscillation with a longer period and an amplitude larger than that of a signal with a spectrum concentrated in high frequencies. Figure 14 (right) shows the spectrum for each regime computed using the mean variance per regime. The non-clear sky regime shows a higher power on the lowest frequencies in comparison with the clear sky regime, suggesting slower but stronger oscillations in the Remark 4. Note that the spectrum explains the variability of each regime, assuming a common constant variance for the noise term that represents the dominating oscillatory behavior. When the noise for each day has a different variability, the oscillatory behavior will change accordingly.
Remark 5. Regime models can also be estimated using the oscillatory properties. For example Hadj-Amar et al. (2021) fitted a HMM assuming that periodicity characterized each regime. We could use a combination between these ideas and Whittle likelihood to develop an alternative inference procedure.
As mentioned in Section 1, it is useful to generate synthetic irradiance time series to quantify the variability of the data. We simulate a 1-h time series from our estimated multi-day model (6) to verify if we can replicate some of the commonly observed paths in the data. To simulate from model (6), we need a pair value of ( 2 1,r , 2 2,r ). We randomly select an estimated pair value plotted in Figure 7 and then we run the model. Figure 15 shows four different simulated scenarios; the right plot corresponds to the different paired noise variability options, and the red dot corresponds to the selection. The first scenario simulates an hour with a very noisy first half hour but a bright sky on the last half hour. The third scenario can also reproduce a clear-sky period, although the non-clear sky period is not as strong (smaller drops) as the one obtained with the first scenario. In general, our model can capture different scenarios depending on the noise variability of the day. Another interesting question is whether these scenarios are related to the seasons of the year. We consider the estimated values for the noise variability and split them into four groups: spring, summer, autumn, and winter. We then plot the boxplot per group, as shown in Figure 16. This figure suggests that longer periods of clear sky can be observed during spring since the noise variability is smaller for both regimes during this season. In contrast, summer and winter are periods with a higher variability, and as a consequence, higher drops on irradiance values might be observed. This is a reasonable hypothesis since summer and winter are usually rainy seasons in the Caribbean sea.

4.6
Probabilistic forecasts using the multi-day SETAR model Although one step prediction are very informative for the mean behave of the irradiance series, we can obtain more information about the irradiance variability by using probabilistic forecasts. We consider the following algorithm to generate a probabilistic forecast of log clear-sky index series. First, we sample a pair of value for the intraday variability (̂2 1,r ,̂2 2,r ). Then, given the series up to time t we simulate the future irradiance value as log K r (t + 1) = (0.00024 + 1.538 log K r (t) − 0.614 log K r (t − 1) + 0.120 log K r (t − 2) − 0.045 log K r (t − 3) +̂1 ,r r (t + 1))1{log K r (t) ≤ −0.02} + (−0.00051 + 1.601 log K r (t) − 0.790 log K r (t − 1) + 0.240 log K r (t − 2)

F I G U R E 15
One-hour simulations of the multi-day model with different noise variabilities (shown in red). Depending on the noise variability, we observed long or short periods of clear sky regime where r (t + 1) ∼ N(0, 1) for all r and t. For values of more than one step ahead, log K t+h,r h > 1, we use the previous simulated values log K t+h−1,r as if they are observed values and apply the same random generator.
To evaluate probabilistic forecasts, we use scoring rules. Scoring rules provide summary measures based on the predictive distribution and the event or observed value. Gneiting and Raftery (2007) defined the continuous ranked probability score (CRPS) as where F is the probability distribution corresponding to the probabilistic forecasts, and y is an observed value. In practice, we may not be able to know the forecasts probability distribution F. Then, we can use an estimator of the CRPS based on a F I G U R E 16 Boxplot of noise variability values per season: Spring, summer, autumn, and winter F I G U R E 17 Computed CRPS. We compare our probabilistic forecast, which is based on the multi-day SETAR model (black continuous line), with the PeEn (red dashed line) method. Our method can forecast drops on irradiance series more accurately than the PeEn generated sample and a true observed value. We compute the CRPS using the scoringRules R package (Jordan et al., 2017). There are no direct competitors to our model in the literature of application of time series models to irradiance data and perform probabilistic forecast. Then, we consider the persistence ensemble (PeEn) as the benchmark for the probabilistic forecast of irradiance data . The idea of the PeEn method involves forecasting a future irradiance value based on the empirical distribution of the last s past values, that is, we generate log K r (t + 1) by randomly select a value from the set {log K r (t), … , log K t−s,r }.
In this experimental setting, we evaluate the performance for both methods, using the testing data set. For each day, we use the series until time t = 12, 13, 14, and 15 h and we simulate 1000 forecasted 5-min series (t + 300). For the PeEn method, we use the 10-min previous data. We then estimate the CRPS for each method and calculate the average per day. Figure 17 shows three different examples of irradiance data scenarios. The red dashed line corresponds to the PeEn method, and the continuous line corresponds to our multi-day SETAR model. On a day where the irradiance series can be very variable, our model can characterize the variability better than its competitor. When abrupt changes do not occur, the multi-day SETAR model can describe the irradiance series as well as the PeEn method. The improved CRPS suggests that our model improves the density estimation of the irradiance time series. Finally, we conclude that our model can reproduce the random variability of irradiance series more accurately than its competitor.

DISCUSSION
In this article, we propose a multi-day SETAR model to forecast the irradiance series, and provide a clear estimation procedure for our model parameters that guarantees the consistency of the estimators. We also provide a detailed description of the interpretation of our model for irradiance series. Our model naturally classifies the two regimes as a clear-sky regime and non-clear sky regime, without any prior classification of the data. Then, we can study the stochastic features of each regime in detail. Our model can accurately forecast irradiance values, not only on the average but its variability as well. We did not face any computational difficulty to fit our model; however, the complexity is similar to fitting a linear model. It could be a trouble for larger data sets. Accurate quantification of solar variability in a particular site is crucial to evaluate the solar-resource risk. These physical requirements will influence both technology development and financial performance. Our proposed model improves the variability quantification and reduces the bias of subjective classification for clear and non-clear days. Then we achieved a more robust and accurate quantification of resource-data uncertainty. Although we do not explore the online prediction, our model can achieve this purpose. The critical point is then the selection of noise variation. For a new day, the online prediction algorithm could start with a prior distribution of the noise variance. After 1 h, an update of the noise variance can be done using the observed data and draw an update value for the noise variation needed by the probabilistic forecasts.
SETAR models are versatile models for nonlinear time series. Our proposal, a multi-day SETAR model, uses the advantage of this family and the information provided across different days in a specific location. One unresolved challenge task is to forecast long periods of clear-sky state, but this can be improved if more information is available (e.g., external covariates or spatial neighbors). For example, covariates such as environmental conditions could be incorporate to model the threshold parameter or the noise variances. The estimation in such a case will be more challenging, and the proposed procedure here needs adaptations.
A future extension might be proposed using replicates across days and borrow information from different areas to forecast values in not observed nearby regions. A spatial SETAR model could use nearby location values to propose a regime indicator variable. We should consider some specific features related to the data application: (1) Distances dependence; if we take two close locations, the irradiance is likely to be the same values, and two very far could be completely independent. Therefore a study of accurate location of sensors or selecting sites needs to be done. (2) Alternative, other spatial covariates might be more informative. The proposal of a regime indicator based on spatial covariates such as temperature or cloud movements might significantly impact prediction accuracy.

ACKNOWLEDGMENTS
This publication is based upon work supported by King Abdullah University of Science and Technology (KAUST), Office of Sponsored Research (OSR) under Award No. OSR-2019-CRG7-3800.