Journal of Geophysical Research: Oceans

Non-stationary wave height climate modeling and simulation

Authors


Abstract

[1] The most popular methods of simulating time series for wave heights and other meteorological and oceanic variables are based on the use of autoregressive models and the transformation of variables to make them normal and stationary. Generally, when these models are used, attention is centered on their capacity to represent the autocorrelation of the series. In this article, a simulation model is proposed that is based on the following: (i) a non-stationary parametric mixture model for the marginal distribution of the variable, which combines a log-normal distribution for the main-mass regime and generalized Pareto distributions for the upper and lower tail regimes, and (ii) the use of copulas to model the time dependency of the variable. The model has been evaluated by comparing the original series and the simulated series in terms of the autocorrelation function, the mean, the annual maxima and peaks-over-threshold regimes, and the persistence regime. It has also been compared to an ARMA model and found to yield more satisfactory results.

1. Introduction

[2] The verification of coastal and harbor structures may require the use of Level III verification methods. These methods are usually complex and require the use of numerical simulation techniques (e.g., Monte Carlo techniques) [Losada, 2002].

[3] In coastal engineering, the main variables to be simulated are sea-state variables such as significant wave height, wind, and sea level, which characterize the sea state in a time domain in which processes are assumed to be stationary. For this purpose, generally speaking, the duration should not exceed O(1hr). This research focuses on the evolutionary behavior of the sea-state variables, i.e., on long-term analysis.

[4] From a physical point of view, the temporal evolution of sea-state variables is conditioned by phenomena operating on different time scales.

[5] Processes with a time scale of O(day)-O(weeks), such as synoptic phenomena and the cycles of spring and neap tides, produce dependence among the variables that they generate and autocorrelation in each variable. The clearest example related to sea states is the passage of a storm. The storm will generate wind speeds and wave heights that are larger than average, and therefore, it is expected that these variables will be correlated during a storm. At the same time, the evolution of these variables (and others) over time is determined by the intensity and path of the storm, so there are physical reasons to expect that these variables will present significant autocorrelation within the time scale of the storm.

[6] O(year) scale processes, such as seasons, produce variations in the intensity and frequency of the O(day)-O(week) scale phenomena and thus cause temporal variations in sea-state variables. In the same way, O(>year) scale processes, such as interannual variability, influence the characteristics of each year (e.g., they create drier or wetter years and years with more or less wave action) and also produce temporal variations in sea-state variables.

[7] Regarding the statistical tools used in the long-term analysis of sea-state variables, it is important to note that such studies can be univariate or multivariate, may or may not include auto-correlation, and can be stationary or non-stationary. Table 1 summarizes the characteristics of a study: whether the variables are dependent on other variables (i.e., whether they are correlated with other variables), whether the variables are self-dependent (i.e., exhibit autocorrelation or time dependence), or whether they are dependent on time (i.e., whether their distribution is non-stationary). The long-term (climate) behavior of sea-state variables includes such characteristics and, consequently, should be studied using non-stationary multivariate models that represent the time dependence (or auto-correlation) of the variables.

Table 1. Outline of the Relationships of Dependence
Connection With    NO                          YES
Other variables    Univariate                  Multivariate
Same variable      Without auto-correlation    With auto-correlation
Time               Stationary                  Non-stationary

[8] In Figure 1, various physical phenomena evolving in different time scales are associated with statistical models that have been used in this study to appropriately model the sea-state variables for these time scales.

Figure 1.

Physical phenomena evolving in different time scales, and statistical models for the appropriate modeling of the sea-state variables.

[9] The maximum time scale that the simulation must take into account to be applied to engineering is the period used to verify the system. This period is generally the useful life of the system, which is 10–50 years, although it can be a shorter duration when the aim is to verify construction processes or evaluate other short-term phenomena.

[10] With regard to the simulation of time series for significant wave heights (Hs or Hm0), there are currently two lines of research: one that focuses on simulating storms and another that simulates complete series of values.

[11] The method most widely used to simulate storms involves developing joint or conditioned distributions for the random variables of storm occurrence, intensity, and duration. Based on these distributions, new time series are simulated assuming a standard shape for the storm.

[12] In general, storm occurrence is modeled using a Poisson distribution and storm intensity using a generalized Pareto distribution (GPD). It is common to condition the duration of a storm to its intensity. Some examples of this type of approximation are presented by DeMichele et al. [2007], Payo et al. [2008], and Callaghan et al. [2008]. Although stationary functions are generally used for this purpose, non-stationary functions can also be employed, such as those proposed by Luceño et al. [2006], Méndez et al. [2006, 2008], and Izaguirre et al. [2010]. A less frequent alternative in storm simulation is to assume that it is a Markov process and to use a multivariate distribution of extremes to model the time dependence of the variable while the storm lasts [Coles, 2001, chap. 8]. This technique is used by Smith et al. [1997], Fawcett and Walshaw [2006], and Ribatet et al. [2009].

[13] Monbet et al. [2007] review simulation methods for complete time series applied to wind and waves. The methods currently used can be classified as parametric and non-parametric.

[14] The Translated Gaussian Process (TGP) method [Walton and Borgman, 1990; Borgman and Scheffner, 1991; Scheffner and Borgman, 1992] is the most widely used non-parametric method. This method uses the spectrum of the normalized variable. According to Monbet et al. [2007], non-parametric methods such as those based on resampling (called resampling methods) are less frequently used and are not discussed in this article.

[15] The most frequently used parametric methods are based on autoregressive models. Studies employing such methods include Guedes Soares and Ferreira [1996], Guedes Soares et al. [1996], Scotto and Guedes Soares [2000], Stefanakos [1999], Stefanakos and Athanassoulis [2001], and Cai et al. [2007] for univariate series; for multivariate series, relevant studies include Guedes Soares and Cunha [2000], Stefanakos and Athanassoulis [2003], Stefanakos and Belibassakis [2005], and Cai et al. [2008]. As in the TGP, before autoregressive models can be used, the series must be normalized. For this purpose, non-stationary models of the mean and the standard deviation, like those proposed by Athanassoulis and Stefanakos [1995], Stefanakos [1999], and Stefanakos et al. [2006], are used.

[16] The current methods present the following limitations:

[17] (a) Methods of normalizing variables are either stationary [e.g., Cai et al., 2007, 2008] or non-stationary. However, they focus on the center of the data distribution, generally using the non-stationary mean and standard deviation for normalization [e.g., Guedes Soares et al., 1996; Athanassoulis and Stefanakos, 1995].

[18] (b) Parametric time dependence models are linear [e.g., Guedes Soares et al., 1996], piecewise linear [e.g., Scotto and Guedes Soares, 2000], or non-linear but are limited to the extremes [e.g., Smith et al., 1997].

[19] (c) Generally speaking, the simulation is only evaluated using the mean, the standard deviation and the autocorrelation.

[20] This article proposes a simulation method for non-stationary univariate series with time dependence. This method involves the use of a non-stationary parametric mixture distribution to model the univariate distribution of the variable and of copulas to model their time dependence.

[21] The rest of this paper is structured in three sections and seven appendices. In section 2, the proposed model is presented together with the procedure for simulating new time series. In section 3, the model parameters are fitted to a data series of significant wave heights, new series are simulated and the results obtained are discussed. Finally, in section 4, the conclusions are summarized. The derivation of the equations associated with the presented model is illustrated in the appendices at the end of the paper, along with a list of the abbreviations used throughout the paper (Appendix G).

2. Methodology

[22] The non-stationary model (section 2.1) includes variations of the order of months to years. Because it is a mixture distribution, it can be used to model both medium and extreme generation processes; i.e. this distribution is able to accurately model medium (or main-mass) states and extreme (or tails) states. The time dependence model (section 2.2) models processes whose time scale is composed of various states. Because it is copula-based, this model makes it possible to use various non-linear dependence structures that can be either symmetrical or asymmetrical.

[23] This section also describes the method used to simulate new data series (section 2.3) and the structure of the ARMA models (section 2.4), which are used to compare the results obtained with those obtained using the copula-based time-dependence model.

2.1. Non-Stationary Distribution Function

[24] S. Solari (Simulation of time series of geophysical variables; application to harbor engineering (in Spanish), doctoral thesis, University of Granada, Spain, submitted 2011) presents a mixture model

equation image

where Fc is the log-normal distribution (LN), Fm is the GPD of minima, and FM is the GPD of maxima. When continuity is imposed on the probability density function and the lower bound of the GPD of minima has a value of zero, the GPD distributions are

equation image
equation image

with

equation image

[25] This model is similar to that proposed by Cai et al. [2007] for ARMA models with the exception that in equation (1), the continuity of the probability density function is assured by the conditions presented in equation (3). Furthermore, Cai et al. [2007] do not provide a method of threshold estimation, whereas Solari (submitted thesis, 2011) shows that the threshold can be estimated simultaneously with the other parameters.
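The constraints stated above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the mixture density is the log-normal density between the thresholds and that each tail is a GPD density weighted by the log-normal probability mass beyond its threshold. Under that assumption, continuity of the density gives the tail scales σ1 = Fc(u1)/fc(u1) and σ2 = [1 − Fc(u2)]/fc(u2), and a zero lower bound gives ξ1 = −σ1/u1, so that only (μLN, σLN, ξ2, u1, u2) remain free. The function names and the threshold values are hypothetical; the values of μLN, σLN and ξ2 are the order-0 terms of Table 3.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used to build the log-normal


def ln_pdf(x, mu, sigma):
    """Log-normal density f_c(x)."""
    return _STD.pdf((math.log(x) - mu) / sigma) / (x * sigma)


def ln_cdf(x, mu, sigma):
    """Log-normal distribution function F_c(x)."""
    return _STD.cdf((math.log(x) - mu) / sigma)


def tail_scales(mu, sigma, u1, u2):
    """GPD scales implied by continuity of the density at u1 and u2 (assumed form)."""
    sigma1 = ln_cdf(u1, mu, sigma) / ln_pdf(u1, mu, sigma)
    sigma2 = (1.0 - ln_cdf(u2, mu, sigma)) / ln_pdf(u2, mu, sigma)
    return sigma1, sigma2


def mixture_pdf(x, mu, sigma, xi2, u1, u2):
    """Density of the LN-GPD mixture under the assumptions stated in the lead-in."""
    if x <= 0.0:
        return 0.0
    sigma1, sigma2 = tail_scales(mu, sigma, u1, u2)
    if x < u1:
        # With xi1 = -sigma1/u1 the GPD of minima reduces to a power law on (0, u1).
        return ln_cdf(u1, mu, sigma) / sigma1 * (x / u1) ** (u1 / sigma1 - 1.0)
    if x <= u2:
        return ln_pdf(x, mu, sigma)
    base = 1.0 + xi2 * (x - u2) / sigma2
    if base <= 0.0:
        return 0.0  # beyond the upper bound that exists when xi2 < 0
    weight = 1.0 - ln_cdf(u2, mu, sigma)
    if abs(xi2) < 1e-12:
        return weight / sigma2 * math.exp(-(x - u2) / sigma2)
    return weight / sigma2 * base ** (-1.0 / xi2 - 1.0)


def mixture_cdf(x, mu, sigma, xi2, u1, u2):
    """Distribution function matching mixture_pdf."""
    if x <= 0.0:
        return 0.0
    sigma1, sigma2 = tail_scales(mu, sigma, u1, u2)
    p1, p2 = ln_cdf(u1, mu, sigma), ln_cdf(u2, mu, sigma)
    if x < u1:
        return p1 * (x / u1) ** (u1 / sigma1)
    if x <= u2:
        return ln_cdf(x, mu, sigma)
    base = 1.0 + xi2 * (x - u2) / sigma2
    if base <= 0.0:
        return 1.0
    if abs(xi2) < 1e-12:
        survival = math.exp(-(x - u2) / sigma2)
    else:
        survival = base ** (-1.0 / xi2)
    return 1.0 - (1.0 - p2) * survival


# mu, sigma, xi2: order-0 terms of Table 3; u1, u2: illustrative thresholds (assumption).
PARAMS = dict(mu=-0.094, sigma=0.520, xi2=-0.006, u1=0.62, u2=1.59)
```

Because the tail scales are derived rather than fitted, the density has no jump at either threshold, which is the property that distinguishes this construction from a mixture with independently fitted tails.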

[26] The five parameters of the model are (μLN, σLN, ξ2, u1, u2). To represent annual variations or those of a shorter duration, the parameters (μLN, σLN, ξ2) are approximated using a Fourier series whose main time period is the year:

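\[ \theta(t) = \theta_{a0} + \sum_{k=1}^{n} \left[ \theta_{ak} \cos(2\pi k t) + \theta_{bk} \sin(2\pi k t) \right] \qquad (4) \]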

where t is the time measured in years [see, e.g., Coles, 2001; Méndez et al., 2006].
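As a minimal sketch of how such a parameter is evaluated, the following Python function assumes the usual truncated Fourier form θ(t) = θa0 + Σ [θak cos(2πkt) + θbk sin(2πkt)], with t in years. The function name is hypothetical, the coefficients are the μLN values listed in Table 3, and taking t = 0 as the start of the year is an assumption.

```python
import math


def fourier_param(t, a0, a, b):
    """Evaluate a0 + sum_k [a_k cos(2 pi k t) + b_k sin(2 pi k t)], t in years."""
    value = a0
    for k, (ak, bk) in enumerate(zip(a, b), start=1):
        angle = 2.0 * math.pi * k * t
        value += ak * math.cos(angle) + bk * math.sin(angle)
    return value


# mu_LN coefficients of the NS-LN-GPD [4 2 2] model (Table 3)
MU_A0 = -0.094
MU_A = [0.322, -0.019, 0.004, 0.045]
MU_B = [0.199, -0.073, -0.011, 0.004]

mu_start = fourier_param(0.0, MU_A0, MU_A, MU_B)  # start of the year
mu_mid = fourier_param(0.5, MU_A0, MU_A, MU_B)    # half a year later
```

With these coefficients the parameter is higher at the start of the year than at mid-year, and the function repeats with a period of one year by construction.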

[27] The parameters u1 and u2 are replaced by Z1 and Z2, using Fc(u1) = Φ(Z1) and Fc(u2) = Φ(Z2), where Φ is the standard normal distribution and Z1 and Z2 are stationary parameters. However, because the parameters μLN and σLN of the central distribution Fc are non-stationary, the thresholds u1 and u2 are non-stationary as well.
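Because Fc is log-normal, Fc(u) = Φ((ln u − μLN)/σLN), so the condition Fc(u) = Φ(Z) implies u = exp(μLN + σLN Z). The short Python sketch below illustrates this relationship; the function names are hypothetical, and the numerical values are the order-0 terms and the Z1, Z2 estimates of Table 3, used only as an example (the seasonal harmonics are ignored here).

```python
import math
from statistics import NormalDist


def threshold_from_z(z, mu_ln, sigma_ln):
    """Threshold u such that the log-normal CDF at u equals Phi(z)."""
    return math.exp(mu_ln + sigma_ln * z)


def lognormal_cdf(x, mu_ln, sigma_ln):
    """Log-normal distribution function F_c(x)."""
    return NormalDist().cdf((math.log(x) - mu_ln) / sigma_ln)


MU0, SIGMA0 = -0.094, 0.520  # order-0 terms of mu_LN and sigma_LN (Table 3)
Z1, Z2 = -0.734, 1.078       # stationary parameters (Table 3)

u1 = threshold_from_z(Z1, MU0, SIGMA0)  # about 0.62 m
u2 = threshold_from_z(Z2, MU0, SIGMA0)  # about 1.59 m
```

When μLN and σLN vary through the year, applying the same function at each time t gives time-varying thresholds, while the non-exceedance probabilities Φ(Z1) and Φ(Z2) of the thresholds stay constant.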

[28] The distribution parameters are derived using maximum likelihood estimation, minimizing the negative log-likelihood function (NLLF) after the redistribution of the data (Solari, submitted thesis, 2011). Redistribution involves taking the original data, truncated with precision 0.1 m, and distributing them uniformly at symmetrical intervals (X − 0.05, X + 0.05).
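The redistribution step can be sketched as follows. This is only an illustration of the idea described above, with a hypothetical function name; the handling of values at or near zero (where the jittered value could become negative) is not specified in the text and is left out.

```python
import random


def redistribute(values, precision=0.1, seed=None):
    """Spread each recorded value X uniformly over (X - precision/2, X + precision/2)."""
    rng = random.Random(seed)
    half = precision / 2.0
    return [x + rng.uniform(-half, half) for x in values]


recorded = [0.8, 0.8, 0.9, 1.2, 1.2, 1.2]  # hypothetical records with 0.1 m resolution
smoothed = redistribute(recorded, precision=0.1, seed=1)
```

The effect is to remove the ties created by the 0.1 m resolution, so that the likelihood is evaluated on a sample that behaves like one drawn from a continuous distribution.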

[29] The parameters are estimated by progressively increasing the order of approximation of the Fourier series. The parameters obtained for order n, (θa0, θa1, θb1, …, θan, θbn), are used as the first approximation to estimate those of order n + 1, with zero used as the first approximation of the new parameters, (θan+1, θbn+1) = (0, 0).

[30] To evaluate the significance of the improvement in fit obtained when the order of the Fourier series is increased, the Bayesian Information Criterion BIC = −2log(L) + log(Nd)p is used [see, e.g., Fan and Yao, 2005] where L is the likelihood function, Nd is the number of available observations, and p is the number of model parameters.
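The criterion is simple to compute. The sketch below uses a hypothetical function name and assumes natural logarithms, which is the usual convention for the BIC; the number of observations is the length of the series used later in the application.

```python
import math


def bic(log_likelihood, n_obs, n_params):
    """Bayesian Information Criterion: -2 log(L) + log(Nd) p."""
    return -2.0 * log_likelihood + math.log(n_obs) * n_params


N_OBS = 36496                       # number of sea states in the series
PENALTY_PER_PARAM = math.log(N_OBS)  # about 10.5 BIC units per parameter
```

With this sample size, adding one harmonic to one parameter (two extra coefficients) lowers the BIC only if the log-likelihood increases by more than about 10.5.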

[31] Interannual variation (i.e., long-term cycles of over a year) and variation due to covariables (e.g., climatic indices) are incorporated in the distribution function in a manner similar to the way in which seasonal variation is incorporated [see, e.g., Coles, 2001; Izaguirre et al., 2010]. For parameter θ, a series of covariables Ci(t), and interannual variation of period Tj,

equation image

where long-term trends and other non-cyclic components are included as particular cases of the functions f(Cj(t), t) in which there is no dependence on any covariable.

[32] Once these parameters are estimated, the accumulated probability function for the time period (t, t + T) is calculated as

equation image

where P(H ≤ H*∣t) is the non-stationary LN-GPD model (1) (NS-LN-GPD):

equation image

[33] Goodness-of-fit is evaluated using PP and QQ graphs constructed by standardizing the variable xt following the procedure described in Appendix A.

2.2. Temporal Dependence

[34] The NS-LN-GPD model (6) can be used to transform the non-stationary series of significant wave heights {Hs(t)} into the uniformly distributed stationary series {P(t)} ∼ U(0, 1) using P(t) = Prob[H ≤ Hs(t)∣t]. Next, copula theory is used to model the joint distribution of k successive states (Pt, Pt−1, …, Pt−k+1). For an introduction to copula theory, see Joe [1997], Nelsen [2006], and Salvadori et al. [2007]. The use of copulas to model Markov chains is demonstrated by Abegaz and Naik-Nimbalkar [2008a, 2008b]. Stefanakos [1999], Serinaldi and Grimaldi [2007], DeMichele et al. [2007], Nai et al. [2004], and de Waal et al. [2007] apply copula theory to marine climate and other met-ocean variables.

[35] First, the time dependence between two consecutive states is studied. The joint probability Prob(Pt, Pt−1) is represented by copula C12 such that

\[ \mathrm{Prob}[P_t \le p_t,\; P_{t-1} \le p_{t-1}] = C_{12}(p_t, p_{t-1}) \qquad (7) \]

On this basis, the conditioned probability function is obtained. This function defines the distribution of Pt given Pt−1 (or vice versa) and thus defines the first-order Markov process:

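\[ F_{1|2}(p_t \mid p_{t-1}) = \mathrm{Prob}[P_t \le p_t \mid P_{t-1} = p_{t-1}] = \frac{\partial C_{12}(p_t, p_{t-1})}{\partial p_{t-1}} \qquad (8) \]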

[36] To define a model of a higher order than 1, a copula construction process is used [Joe, 1997, chap. 4.5].

[37] Given copula C1…k (which defines the joint probability of k successive states) and, consequently, given the Markov model of order k − 1, variables F1∣2…k = Prob[Pt ∣ Pt−1, …, Pt−k+1] and Fk+1∣2…k = Prob[Pt−k ∣ Pt−1, …, Pt−k+1] are constructed. The dependence between these two variables is measured using Kendall's τk or Spearman's ρs statistic (see Appendix C). If this dependence is significant, then there is a relationship of dependence between Pt and Pt−k that cannot be explained by the Markov model of order k − 1. In this case, it is necessary to construct a k-order Markov model. This can be accomplished using copula C1…k+1

equation image

where C1k+1 is a bivariate copula fit to the variables F1∣2…k and Fk+1∣2…k. This procedure is repeated until a value of k is reached at which the dependence between variables F1∣2…k and Fk+1∣2…k is not significant.

[38] The procedure described is used to define multivariate copulas (i.e., those higher than the second order) based on a set of bivariate (i.e., second-order) copulas. Appendix D describes how this procedure is used to construct copula C1234, which defines a third-order Markov process.

[39] An alternative procedure that has not been implemented in this study involves using the autocorrelation function of the variable xt to set the order of the process k as the maximum time lag for which the autocorrelation is significant. Then, the copula construction method described above can be used to construct the multivariate copula C1…k.

[40] This research tested different copula families for the data used. The families selected were those that had the best goodness-of-fit based on the value of their likelihood functions and based on a visual evaluation. The two copula families used in this study were an asymmetric version of the Gumbel-Hougaard family and the Fréchet family (Appendix E). A list of copula families, their characteristics, and the different ways to fit them to the data can be found in the works of Joe [1997], Nelsen [2006], Salvadori et al. [2007], and Jaworski et al. [2010]. For a summary of methods and goodness-of-fit tests, see Genest and Favre [2007] and references therein.

2.3. Simulation Methodology

[41] The simulation process consists of two parts. First, the time-dependence model of copulas (9) is used to obtain the series of probabilities {Pt}; then, the non-stationary model (1) is used to transform the probabilities into wave heights. To simulate the realization Pt of the Markov process of order k − 1, once the previous realizations Pt−1 to Pt−k+1 are known, ut ∼ U(0, 1) is simulated and Pt is obtained by solving the following equation

equation image

where C1k is the bivariate copula fit to F1∣2…k−1 and Fk∣2…k−1 to construct C1…k and where F1∣2…k−1 and Fk∣2…k−1 are calculated using the set of bivariate copulas C1k−1, C1k−2, …, C12.

[42] When this procedure is used, it is not necessary to use equation (9) to perform the simulations because equation (10) can be solved using the bivariate copulas alone. To obtain Pt, equation (10) can be solved numerically using the bisection method. The simulation process for a third-order Markov model is described in Appendix F.
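The simulation step can be sketched in Python for the simplest case, a first-order Markov process. This is an illustration under stated assumptions, not the model fitted in this study: it uses the symmetric Gumbel-Hougaard copula as a stand-in for the asymmetric version of Appendix E, it omits the higher-order copulas, and all function names are hypothetical. The value of θ is taken from Table 4 only to give a realistic strength of dependence.

```python
import math
import random


def gh_conditional(u, v, theta):
    """Prob[P_t <= u | P_{t-1} = v], i.e. dC(u, v)/dv, for the symmetric
    Gumbel-Hougaard copula C(u, v) = exp(-[(-ln u)^theta + (-ln v)^theta]^(1/theta))."""
    x, y = -math.log(u), -math.log(v)
    a = x ** theta + y ** theta
    c = math.exp(-a ** (1.0 / theta))
    return c * a ** (1.0 / theta - 1.0) * y ** (theta - 1.0) / v


def bisect_increasing(func, target, lo=1e-12, hi=1.0 - 1e-12, tol=1e-10):
    """Solve func(x) = target by bisection for a function increasing on (lo, hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_first_order(n, theta, seed=0):
    """Simulate n probabilities P_t from a first-order copula-based Markov chain."""
    rng = random.Random(seed)
    eps = 1e-12

    def clip(p):
        # Keep probabilities strictly inside (0, 1) so the logarithms are finite.
        return min(max(p, eps), 1.0 - eps)

    chain = [clip(rng.random())]
    for _ in range(n - 1):
        u_t = clip(rng.random())  # uniform draw u_t ~ U(0, 1)
        prev = chain[-1]
        # Invert the conditional distribution: find P_t with F(P_t | P_{t-1}) = u_t.
        chain.append(bisect_increasing(lambda p: gh_conditional(p, prev, theta), u_t))
    return chain


chain = simulate_first_order(2000, theta=5.462)
```

Bisection is a natural choice here because the conditional distribution is monotonic in Pt and bounded between 0 and 1, so a bracketing interval is always available. The resulting probabilities would then be converted to wave heights with the inverse of the marginal model.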

2.4. ARMA Models

[43] An ARMA(p,q) model is given by

\[ Z_t = \sum_{i=1}^{p} \phi_i Z_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \qquad (11) \]

where ϕ and θ are the coefficients of the autoregressive component and of the moving average, respectively, and ɛt stands for the independent, identically distributed realizations with a null mean and variance σɛ2 (a normal distribution is generally assumed). The AR(p) model corresponds to the ARMA(p,0) case.

[44] To estimate the parameters of the ARMA model, the probability series {Pt}, obtained using the NS-LN-GPD model (6), is transformed into a series {Zt} via the inverse of the standard normal distribution. Once {Zt} has been obtained, the parameters ϕ, θ and σɛ2 can be estimated using maximum likelihood estimation.

[45] Once the model (11) is fitted, white noise is generated with variance σɛ2, and a new series {Zt} is simulated using parameters ϕ and θ. After the series {Zt} has been simulated, it is transformed into {Pt} using a standard normal distribution and afterwards into {Hs} using the inverse of the NS-LN-GPD model (6).
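The first two steps of this procedure can be sketched in Python as follows, assuming the standard ARMA recursion Zt = Σ ϕi Zt−i + ɛt + Σ θj ɛt−j. The function name, the burn-in length and the coefficients are hypothetical; an AR(1) process with unit marginal variance is used instead of the fitted high-order model so that the example stays short. The final transformation into Hs requires the inverse of the marginal model and is not shown.

```python
import math
import random
from statistics import NormalDist


def simulate_arma(n, phi, theta, sigma_eps, seed=0, burn_in=500):
    """Simulate n values of an ARMA(p, q) process driven by Gaussian white noise."""
    rng = random.Random(seed)
    z, eps = [], []
    for _ in range(n + burn_in):
        e = rng.gauss(0.0, sigma_eps)
        # Autoregressive part: weighted sum of the most recent simulated values.
        ar = sum(c * z[-1 - i] for i, c in enumerate(phi) if i < len(z))
        # Moving-average part: weighted sum of the most recent noise terms.
        ma = sum(c * eps[-1 - j] for j, c in enumerate(theta) if j < len(eps))
        z.append(ar + e + ma)
        eps.append(e)
    return z[burn_in:]  # discard the start-up transient


# Illustrative AR(1) coefficients (not fitted values); sigma_eps gives Var(Z) = 1.
PHI, THETA = [0.9], []
SIGMA_EPS = math.sqrt(1.0 - 0.9 ** 2)

z_sim = simulate_arma(20000, PHI, THETA, SIGMA_EPS)
std_normal = NormalDist()
p_sim = [std_normal.cdf(v) for v in z_sim]  # probabilities P_t = Phi(Z_t)
```

The transformation Pt = Φ(Zt) yields uniformly distributed probabilities only if Zt has a standard normal marginal distribution, which is why the noise variance in the example is tied to the autoregressive coefficient.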

3. Application

[46] The research study described in this article used a series of 36,496 data records of spectral significant wave height from 13 years and 3 months of sea states with a duration of 3 hours (although there were some gaps in the record). The data were obtained using the WAM numerical model, provided by Puertos del Estado, Spain (www.puertos.es), corresponding to WANA point number 1054046 (36.5°N, 6.5°W, Gulf of Cádiz, Spain). This is the same data series used by Solari (submitted thesis, 2011).

3.1. Non-Stationary Seasonal Distribution

[47] In this section, the NS-LN-GPD parameters are estimated. A non-stationary LN distribution (NS-LN) is also fitted (corresponding to the NS-LN-GPD with Z1 and Z2 parameters approaching infinity) for use in testing the goodness of fit obtained using the NS-LN-GPD model.

[48] In the first instance, the parameters are only allowed to have seasonal variations (i.e., variations with periods less than or equal to a year; equation (4)); interannual variation, covariables, and trends are not considered.

[49] Fourier series are evaluated (equation (4)) with a maximum order of approximation n between 1 and 12. The order 1 represents annual variation, 2 represents semiannual variation, and so on. For each fit distribution, the BIC is estimated.

[50] The models are identified using three digits [abc]; a is the order of approximation of the Fourier series used for μLN, b is the order of approximation of the series used for σLN, and c is the order of approximation of the series used for ξ2. When a maximum approximation n is allowed, a, b, c ≤ n should hold. The total number of parameters of the model [abc] is 2(a + b + c) + 5; i.e., there are 2a + 1 parameters to be used in the Fourier series representation of μLN, 2b + 1 parameters to be used in the Fourier series representation of σLN, 2c + 1 parameters to be used in the Fourier series representation of ξ2, and the two stationary parameters Z1 and Z2.
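The parameter count can be checked with a few lines of Python; the function name is hypothetical.

```python
def n_params(a, b, c):
    """Number of parameters of the NS-LN-GPD model [a b c]."""
    fourier = (2 * a + 1) + (2 * b + 1) + (2 * c + 1)  # mu_LN, sigma_LN, xi_2
    stationary = 2                                      # Z1 and Z2
    return fourier + stationary


P_STATIONARY = n_params(0, 0, 0)  # fully stationary model
P_422 = n_params(4, 2, 2)         # model [4 2 2]
```

A fully stationary model therefore has five parameters, and the model [4 2 2] has 21.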

[51] Figure 2 shows the value of the BIC, depending on the total number of parameters when maximum approximations are permitted of order n = 4, 6, 9. For each number, only the minimum BIC model is included. The minimum BIC models are identified for each n–order maximum approximation. Although each curve has a relative minimum, the minimum decreases as the maximum allowed order n increases. This finding implies that to use the BIC as a selection criterion for the model, one must first define the maximum allowed order of approximation n.

Figure 2.

Minimum Bayesian Information Criterion obtained for different numbers of parameters in the NS-LN-GPD model, with maximum approximation of the fourth order (equation image), 6th order (equation image) and 9th order (□).

[52] In this study, the minimum variation period for the parameters has been limited to 3 months (the maximum allowed order of approximation n is limited to 4). The minimum BIC model in this case is [4 2 2]: i.e., a Fourier series of order 4 for μLN and of order 2 for σLN and ξ2. Figure 3 shows the annual temporal evolution of parameters μLN, σLN and ξ2 from model NS-LN-GPD [4 2 2]. As can be observed, the principal component is the annual period, and the other components provide non-negligible corrections of a lesser order. The only exception is parameter ξ2, for which the semi-annual component is of the same order of magnitude as the annual one. The fit of the [4 2 2] model obtained using the NS-LN-GPD parameters is compared with that of the model obtained using the NS-LN (also using n = 4). Tables 2 and 3 show the estimated NS-LN and NS-LN-GPD parameters, respectively.

Figure 3.

Time evolution of μLN, σLN and ξ2 for the NS-LN-GPD [4 2 2] model.

Table 2. NS-LN Parameters
Ord. (k)    μ: θak    μ: θbk    σ: θak    σ: θbk
0           −0.116    -         0.561     -
1           0.318     0.203     0.100     −0.016
2           −0.024    −0.070    0.021     −0.019
3           0.010     −0.009    −0.004    −0.008
4           0.051     0.001     0.008     0.014

Table 3. NS-LN-GPD Parameters
Ord. (k)    μLN: θak    μLN: θbk    σLN: θak    σLN: θbk    ξ2: θak    ξ2: θbk
0           −0.094      -           0.520       -           −0.006     -
1           0.322       0.199       0.097       −0.019      −0.014     0.076
2           −0.019      −0.073      0.023       −0.012      −0.063     −0.037
3           0.004       −0.011      -           -           -          -
4           0.045       0.004       -           -           -          -
Z1 = −0.734 (23%)    Z2 = 1.078 (86%)

[53] Figure 4 shows the quantiles corresponding to the empirical accumulated probability values and those obtained when the NS-LN and NS-LN-GPD models are used. The empirical quantiles have been obtained using a moving window of one month. Generally speaking, the quantiles calculated using the NS-LN-GPD distribution coincide with the empirical quantiles. As compared with the NS-LN model, the NS-LN-GPD model exhibits superior fit at the tails.

Figure 4.

Iso-probability quantiles for non-exceeding probability P[xt] equal to 0.01, 0.1, 0.25, 0.5, 0.75, 0.9 and 0.99; empirical (grey continuous line), NS-LN model (red dashed line) and NS-LN-GPD model (black continuous line).

[54] Figure 5 (top) shows the annual CDF on log-normal paper. As can be observed, the NS-LN-GPD model exhibits a better fit at the tails than the NS-LN model. Figure 5 (bottom) shows the annual PDF. The NS-LN-GPD model fits the mode better than the NS-LN model.

Figure 5.

(top) Accumulated probability on log-normal paper and (bottom) probability density. Empirical (dots), data from the NS-LN model (dashed line), and data from the NS-LN-GPD model (continuous line).

[55] Finally, Figure 6 shows the Q-Q and P-P graphs for the two models. These graphs confirm the goodness-of-fit obtained using the NS-LN-GPD model.

Figure 6.

Q-Q graph of the (a) non-stationary log-normal (NS-LN) model and (b) non-stationary LN-GPD (NS-LN-GPD) model. P-P graph of the (c) NS-LN model and (d) NS-LN-GPD model.

3.2. Interannual Variations

[56] The purpose here is to show how the proposed model can include the interannual variations observed in the series and examine how these interannual variations affect the simulation of new series. The physical basis of the observed interannual variations is not under study here. Moreover, the observed trends are assumed to be cyclical so that the mean value of the long-term simulations is not affected. This also makes it easier to compare the original and simulated series.

[57] It is not our aim to perform an in-depth analysis of the interannual variation in the data series being used; this would mean studying covariables of interest such as the NAO and considering long-term trends and climate cycles, which require longer series than the one available as well as series of covariables [see, e.g., Ruggiero et al., 2010; Izaguirre et al., 2010].

[58] When the moving average of the data is displayed on a graph (Figure 7), two trends are observed: (i) a cyclical component with a period of approximately 5 years and (ii) a decreasing trend. To analyze both, the following cyclical components are included in the mean:

equation image

This is an ad hoc model for long-term trends that assumes that the downward trend in the 13 years of data is part of a 26-year pattern of cyclical variation.

Figure 7.

Ninety-day Moving Average of Hs and the μLN(t) parameter of interannual model.

[59] These four parameters and the other parameters of the model are estimated using maximum likelihood estimation with n = 4 as the maximum order of approximation for the Fourier series and using the BIC to select the model. The model obtained in this case is [4 2 2 2], where the first three numbers refer to the order of approximation of μLN, σLN and ξ2 and the last refers to the two interannual cyclical components included in μLN (equation (12)).

[60] Figure 7 shows the moving average of the logarithm of the data obtained using a moving window of 90 days and the mean of NS-LN-GPD model [4 2 2 2]. As can be observed, the μLN parameter with interannual variation adequately captures the trend in the mean of the logarithm of the data.

[61] Model [4 2 2 2] exhibits a goodness of fit similar to that of model [4 2 2] (as given in Figures 5 and 6 and therefore not shown here).

3.3. Time Dependency: Copulas

[62] To fit the time dependency, different copula families can be tested. In this study, the families with the best fit are selected based on the log-likelihood function (LLF) and a visual evaluation. The following paragraphs describe the data fitting processes, which are conducted based on the probability series {Pt} obtained using NS-LN-GPD model [4 2 2 2]. Figure 8 shows the mean and standard deviation of Pt as well as their smoothed values on an annual scale. As can be observed, the series may be treated as stationary.

Figure 8.

Mean and standard deviation of Pt, estimated on an annual scale for each state, and their moving average smooth curves.

[63] The asymmetric Gumbel-Hougaard copula (E1) provides a good fit for the time-dependence between Pt and Pt−1. The parameters estimated for this copula are θ = 5.462, θ1 = 0.994 and θ2 = 0.969. This shows that Pt and Pt−1 are significantly dependent on each other (high θ) and that the distribution is slightly asymmetrical (θ1 ≈ θ2).

[64] Figure 9 depicts the empirical function C(Pt, Pt−1) and that obtained using the asymmetric Gumbel-Hougaard function. It is clear that the modeled and empirical iso-probability curves overlap, except around Pt ≈ Pt−1 ≈ 0.1–0.4, where the data reflect a more marked dependence than that exhibited by the model. In general, the fit is good.

Figure 9.

Empirical copula C(Pt, Pt−1) (thick line) and asymmetric Gumbel-Hougaard copula (thin line).

[65] We then estimated the dependence between Pt and Pt−2, which was not explained by C(Pt, Pt−1). For this purpose, the C12 copula was used to estimate F1∣2 and F3∣2. The dependence between F1∣2 and F3∣2 is significant (τk = −0.133 and ρs = −0.192), and thus, the trivariate copula C123 was constructed.

[66] To obtain the trivariate copula (D2), the bivariate copula C13(F1∣2,F3∣2) was fitted. In this case, a good fit was obtained using the Fréchet family. The parameters were fitted using (E8) and assuming that α = 0. A good fit was obtained, although there was some asymmetry in the data that was not captured by the copula.

[67] The copula C123 was used to estimate F1∣23 and F4∣23. The dependence between these variables was found to be τk = −1.4 × 10−3 and ρs = −1.3 × 10−4. Consequently, the variables F1∣23 and F4∣23 can be regarded as independent.
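One common way to judge whether such a value of Kendall's τ differs from zero is the large-sample normal approximation, under which τ has variance 2(2n + 5)/[9n(n − 1)] when the two variables are independent. The Python sketch below applies it to the values reported above; it is not necessarily the test used in this study, the function names are hypothetical, and the sample size is approximated by the length of the series (n ≈ 36,496). Because the approximation assumes independent pairs, the resulting scores are indicative only for a serially dependent series.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau (tau-a) by direct pair counting; O(n^2), for short samples only."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return 2.0 * s / (n * (n - 1))


def tau_z_score(tau, n):
    """Standardized tau under independence (large-sample normal approximation)."""
    var0 = 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))
    return tau / math.sqrt(var0)


N_APPROX = 36496                            # assumed number of pairs
Z_LAG2 = tau_z_score(-0.133, N_APPROX)      # dependence between F1|2 and F3|2
Z_LAG3 = tau_z_score(-1.4e-3, N_APPROX)     # dependence between F1|23 and F4|23
```

Under this approximation the first value lies far outside the usual ±1.96 band and the second lies well inside it, which agrees with treating the first dependence as significant and the second as negligible.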

[68] Table 4 summarizes the parameters of the copulas fitted using the probability series {Pt} obtained with the NS-LN-GPD [4 2 2] and [4 2 2 2] models (i.e., the seasonal model (SM) and interannual model (IM)). For the SM, the influence of considering the C14 copula was not found to be very significant.

Table 4. Copula Parameters Fitted Using Pt Series Obtained With the NS-LN-GPD [4 2 2] (SM) and NS-LN-GPD [4 2 2 2] (IM) Models
      C12 G-H Asym.              C13 Fréchet      C14 Fréchet
      θ        θ1       θ2       α      β         α        β
SM    5.697    0.995    0.971    0      0.194     0.005    0
IM    5.462    0.994    0.969    0      0.192     -        -

3.4. Time Dependency: ARMA Models

[69] High-order AR(p) and ARMA(p,q) models were estimated to compare the results obtained. An optimal number of parameters was not selected; rather a sufficiently high number (p = q = 23) was used to take advantage of the capacities of these models. We decided to work with ARMA models because they provided slightly better results than the AR models.

3.5. Simulation

[70] A simulation was conducted of 500 years of significant wave height Hs with each of the models fitted to the data: (a) the SM and the dependence model based on copulas (SM-C); (b) the IM and the dependence model based on copulas (IM-C); (c) the SM and the ARMA(23,23) model (SM-A); and (d) the IM and the ARMA(23,23) model (IM-A).

[71] Figure 10 shows a five-year data series and another five-year series simulated using the IM-C model. The next step was to evaluate the results obtained using the different models, differentiating between the medium or main-mass regime and the extreme or upper-tail regime.

Figure 10.

Five years of (top) measured significant wave heights and (bottom) simulated significant wave heights.

3.5.1. Medium or Main-Mass Regime

[72] The medium regimes obtained using the four simulated series are very similar. In fact, it is practically impossible to differentiate between the four series in the PDF and CDF plots. Therefore, Figure 11 presents the results only for model SM-C. By comparing Figure 11 with Figure 5, it is clear that the distribution of the simulated data series (Figure 11) is equal to the theoretical distribution (Figure 5). This is because the simulated series is very long (500 years).

Figure 11.

(top) Accumulated probability on log-normal paper and (bottom) probability density. Original (dots) and simulated (green line) data series.

[73] Table 5 shows the values of the statistics derived from the first four moments of the distribution: mean, variance, skewness, and kurtosis. As can be observed, all of the models properly represent the mean and the variance. Regarding skewness and kurtosis, the best approximations were obtained using the SM-C and SM-A models. The IM-C and IM-A models yielded overestimated figures for kurtosis, particularly when the ARMA model was used for time dependence.

Table 5. Statistics Obtained From the First Four Central Moments
            Data      SM-C      IM-C      SM-A      IM-A
Mean        1.088     1.077     1.086     1.090     1.093
Variance    0.548     0.521     0.538     0.539     0.556
Skewness    2.127     2.106     2.275     2.159     2.410
Kurtosis    10.006    10.468    12.290    10.846    14.326
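The statistics in Table 5 can be computed from a series along the following lines. This is a minimal sketch, not the authors' code: the function name `moment_statistics` is hypothetical, and it assumes population moments (divisor n) and non-excess kurtosis, neither of which is stated in the article.

```python
def moment_statistics(values):
    """Return (mean, variance, skewness, kurtosis) of a sample.

    Assumptions (not stated in the article): central moments use the
    divisor n, and kurtosis is non-excess (3 for a normal sample).
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return mean, m2, skewness, kurtosis
```

Applying the same function to the measured series and to each simulated series gives one column of the table per series.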

[74] Figure 12 shows the autocorrelation function (ACF) for the data and the four simulated series. For time lags of less than three days, the SM-C and IM-C models fit the data better than the SM-A and IM-A models. In contrast, for longer time lags, the SM-A and IM-A models provide a better fit. The main reason for this is that the ARMA model is a 23rd-order model, whereas the copula-based models correspond to second-order and third-order Markov models for the IM-C and SM-C, respectively. When third-order ARMA models are used (as indicated by the red dashed line referred to as IM-ARMA(3,3) in Figure 12), the long-term fit of the ACF is equivalent to that obtained using copula-based models, whereas the short-term fit is roughly the same as that obtained using a 23rd-order ARMA model.

Figure 12.

Autocorrelation function (ACF) for the four dependence models used and for a simulation run using an ARMA(3,3) model.
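For reference, a sample ACF of the kind compared in Figure 12 can be sketched as follows. The function name `autocorrelation` is hypothetical, the lag is counted in time steps of the series, and the normalization by the lag-zero sum is one common convention assumed here.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (in time steps).

    Assumption: the biased estimator is used, i.e. every lag is
    normalized by the lag-0 sum of squared deviations.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]
```

A three-day lag corresponds to `3 * 24 / dt` steps, where `dt` is the sampling interval of the series in hours.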

[75] Figure 13 shows the PDF of the persistences over thresholds (0.5 m, 1.0 m, 1.5 m, 2.0 m, 2.5 m, 3.0 m). In many cases, there are discrepancies between the persistence regimes for the original and simulated data series. For a threshold of 0.5 m, the simulated series show a lower than observed frequency of persistence of short duration (6 hours); i.e., the simulations overestimate persistence over 0.5 m. For thresholds greater than 2 m, the simulations (particularly those obtained using ARMA-based models) show a higher than observed frequency of persistence of short duration (6 hours); i.e., both the copula-based and the ARMA models underestimate persistence, but the extent of the underestimation by the ARMA model is greater. Nevertheless, for thresholds greater than 1.5 m, the series obtained using the copula-based models (SM-C and IM-C) show a better fit with regard to the persistence than that obtained using the ARMA model. In contrast, for the thresholds 0.5 m and 1 m, the data series simulated using the ARMA model exhibits a better fit with regard to the persistence than the series simulated using the copula model.

Figure 13.

Persistence over thresholds 0.5, 1, 1.5, 2, 2.5 and 3 m.
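The persistence over a threshold can be extracted from a regularly sampled series as in the sketch below. The name `persistence_durations` and the parameter `dt_hours` are hypothetical, and the treatment of a run cut off by the end of the record (kept here) is an assumption.

```python
def persistence_durations(series, threshold, dt_hours):
    """Durations (hours) of consecutive runs with value > threshold."""
    durations = []
    run = 0  # number of consecutive samples above the threshold
    for value in series:
        if value > threshold:
            run += 1
        elif run:
            durations.append(run * dt_hours)
            run = 0
    if run:
        # Run still open at the end of the record (kept by assumption).
        durations.append(run * dt_hours)
    return durations
```

A histogram of the returned durations, computed for each threshold, gives an empirical counterpart of the curves compared in Figure 13.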

3.5.2. Extreme or Upper-Tail Regime

[76] This study has analyzed two aspects of the extreme regime: (i) annual maxima and (ii) storms and peaks over the threshold (POT regime).

3.5.2.1. Annual Maxima

[77] Figure 14 shows the annual maxima of the empirical data and of the simulated series for different return periods. Wide dispersion can be observed for high return periods: e.g., for a 50-year return period, the values of Hs obtained from the simulated series range from 7.5 m for the SM-C model to more than 10 m for the IM-A model. Generally speaking, the ARMA models overestimate the annual maxima, whereas the copula-based models underestimate them. Nevertheless, the series simulated using the IM-C model appropriately fits the empirical regime of annual maxima.

Figure 14.

Annual maxima Hs: empirical data (dots), data from the copula models (green lines) and data from the ARMA models (blue lines).
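An empirical annual-maxima regime such as the one plotted in Figure 14 can be sketched as below. The article does not state which plotting position was used; the Weibull position i/(n + 1) is assumed here, and the function name is hypothetical.

```python
def empirical_return_periods(annual_maxima):
    """Pair each annual maximum with an empirical return period (years).

    Assumption: Weibull plotting position P_i = i / (n + 1), so that
    T_i = 1 / (1 - P_i) = (n + 1) / (n + 1 - i).
    """
    n = len(annual_maxima)
    ranked = sorted(annual_maxima)
    return [((n + 1) / (n + 1 - i), h) for i, h in enumerate(ranked, start=1)]
```

With a 500-year simulated series, the largest value is assigned a return period of about 501 years under this assumption, which is why long simulations are needed to resolve high return periods.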

[78] Additionally, the effect of including interannual variations (via the IM-C and IM-A models) was to increase the value of Hs for a given return period. This finding occurred independent of the time-dependence model used.

3.5.2.2. Storms and Peaks Over Threshold (POT)

[79] This study focused on the mean number of storms per year, their distribution throughout an average year, their duration, and the maximum significant wave height reached during the storm (i.e., the POT regime). Storms were identified following Solari (submitted thesis, 2011); the value of the threshold was u = 3.58 m, and the minimum time between the storms was Tmin = 2 days; this minimum time assured that the peaks came from different storms or independent events. The mean number of storms per year based on these data was ν = 3.08. The mean numbers of storms based on the simulated series were νSM−C = 3.15, νIM−C = 3.46, νSM−A = 6.16, and νIM−A = 6.66.
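The storm identification described above can be illustrated with the following sketch. It is not the procedure of Solari (submitted thesis, 2011), whose details are not reproduced here; it assumes that exceedances of u separated by less than Tmin belong to the same storm, and all names are hypothetical.

```python
def storm_peaks(times_days, hs, threshold, t_min_days):
    """Return a list of (time, peak Hs), one entry per independent storm.

    Assumption: two exceedances of the threshold closer than
    t_min_days are treated as part of the same storm.
    """
    storms = []
    last_exceedance = None
    for t, h in zip(times_days, hs):
        if h <= threshold:
            continue
        if last_exceedance is not None and t - last_exceedance < t_min_days:
            # Same storm: keep the larger peak.
            if h > storms[-1][1]:
                storms[-1] = (t, h)
        else:
            storms.append((t, h))
        last_exceedance = t
    return storms
```

The mean number of storms per year then follows as `len(storms) / n_years`, with u = 3.58 m and Tmin = 2 days in the article.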

[80] Figure 15 shows the variation in parameter ν throughout the year. The values were obtained by dividing the year into 24 subsets of 1/2 month each, calculating the mean number of storms in each subset, and multiplying them by 24 so that the unit used would be the number of storms per year. (This two-week time scale corresponds to the variation between spring and neap tides. Even though this was not previously considered, it is another of the variation scales of the system, forced in this case by astronomical phenomena. One might ask if these variations have any effect on the occurrence or intensity of the storms.) The integral of the curve in the year is the mean number of storms per year. The results obtained via the SM-C and IM-C models are within the confidence limits obtained from the original data. In contrast, the results obtained using the SM-A and IM-A models include a significantly greater number of storms than was actually recorded, particularly in the winter.

Figure 15.

Storm occurrence: empirical data with 90% confidence intervals (black lines with dots), data from the copula models (green lines), and data from the ARMA models (blue lines).
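The half-month storm rates of Figure 15 can be sketched as follows. The input format (storm dates expressed as a fraction of the year in [0, 1)) and the function name are assumptions made for illustration.

```python
def seasonal_storm_rate(storm_year_fractions, n_years, n_bins=24):
    """Storm rate (storms per year) in each of n_bins subsets of the year.

    storm_year_fractions: time of each storm as a fraction of its year.
    """
    counts = [0] * n_bins
    for frac in storm_year_fractions:
        counts[min(int(frac * n_bins), n_bins - 1)] += 1
    # Mean count per subset per year, scaled by n_bins so that the
    # unit is storms per year.
    return [c / n_years * n_bins for c in counts]
```

Averaging the returned rates over the 24 subsets recovers the mean number of storms per year, which is the integral property mentioned in the text.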

[81] Figure 16 shows the distribution of storm durations (i.e., persistence exceeding the threshold u). The results obtained via the SM-C and IM-C models were found to provide a slightly better fit to the data than the SM-A and IM-A models, although all four models tended to overestimate the frequency of short durations (approx. 5 hours) and underestimate the frequency of long durations (approx. 30 hours).

Figure 16.

Persistence of the storms above 3.58 m in days: empirical data (dots), data from the copula models (green lines) and data from the ARMA models (blue lines).

[82] Finally, Figure 17 shows the values of Hs corresponding to different return periods as obtained from the POT regime. It also displays the fit of the GPD obtained by Solari (submitted thesis, 2011) for that regime. In this case, the simulated series that best fit the data is that obtained via the SM-A model. In contrast, the series obtained using the IM-A model contains significant overestimates and reflects a long-term tendency that is very different from the tendency indicated by the GPD. On the other hand, although the IM-C model underestimated the data for return periods of less than 10 years, the series obtained exhibit a long-term trend that lies within the GPD confidence limits.

Figure 17.

POT regime for Hs: empirical data (dots), annual GPD with confidence intervals (grey line), data from the copula models (green lines) and data from the ARMA models (blue lines).

3.6. Discussion

[83] With regard to the marginal distribution, all of the simulated series have approximated the original data quite well. The differences between the models become evident when the autocorrelation and persistence regimes are analyzed. As compared to the ARMA model, the copula-based time-dependency model provides a better fit to persistence data for thresholds higher than 1 m.

[84] With respect to autocorrelation, it appears that in the long term (with time lags longer than 3 days), the high-order (23) ARMA models provide a better fit to the data than do the models based on copulas. However, when low-order ARMA models (of order 3) are used, the long-term behavior of the autocorrelation is similar to that obtained using copula-based models (which are also low-order models). If only short-term behavior is considered (with a time lag of less than 3 days), the copula-based models show a slightly better fit in terms of autocorrelation than that obtained using the ARMA models.

[85] For the extreme regime, the IM-C model provided the best fit overall, except for the POT regime, for which it provided the second-best fit. The analysis of the extreme values in terms of the return period clearly indicated the effect of including interannual variations in the model. For a given return period, the series obtained using the IM model include greater values of Hs than those obtained using the SM model. The ARMA-based models yield a much larger mean number of storms per year than was actually recorded. These models also underestimate the duration of the storms. In contrast, the results derived using the copula-based models appropriately fit the recorded data regarding the mean number of storms per year, their distribution throughout the year and their duration.

[86] Based on these findings, copula-based models can be deemed more suitable than ARMA-based models for representing the frequency and persistence of storms, which are important parameters to consider when studying systems such as beaches or ports. Even though the copula-based model yielded simulated series with characteristics that are very similar to those of the original series, there are certain differences between the series with regard to the POT regime.

[87] The effect of interannual variability is especially evident in the values for the upper tail even though it was only included in the parameters for the mean of the distribution. This is one of the advantages of using an integral model that covers the entire range of values of the variable. Performing a more in-depth analysis of interannual variation by taking into account the effect of covariables could improve the results obtained. Furthermore, it would provide more information regarding the long-term behavior of the variable.

4. Conclusions

[88] This article has described a non-stationary univariate model for the long-term distribution of sea-state variables that is valid for the entire range of values of the variable. The model includes seasonal variation using a Fourier-series approximation of the parameters and can also take into account climate cycles, trends, and covariables.

[89] The results of this study indicate that this non-stationary model can be used to transform the original non-stationary variable (Hs(t) in this article) into a stationary one, P(t) = Prob[Hs < Hs(t)∣t]. Using this variable P(t), it is possible to study the time dependence or autocorrelation of the original variable Hs. For this purpose, in this research, a copula-based model was developed based on the assumption that the process being examined was a Markov process.

[90] The application of the models to a data series for hindcast significant wave height indicated that the simulations obtained via the copula-based time-dependence model were better than those obtained using an ARMA model. However, some related considerations require further study. The long-term autocorrelation (time lags longer than 3 days) reproduced by the copula-based models is poorer than that obtained using the high-order ARMA models. The possibility of improving these results by using other families of copulas should be investigated. It will also be necessary to study more rigorously how including long-period variation and covariables in the non-stationary model influences the simulated series.

[91] This study has shown that from an engineering viewpoint, it is not appropriate to evaluate simulation methods exclusively in terms of the ACF of the simulated series. A good ACF fit does not ensure that the model will behave suitably in representing persistence regimes, storm regimes and annual maxima.

Appendix A: Data Standardization

[92] To build the PP and QQ plots of the NS-LN-GPD model, the standardized variable Ze is used.

equation image

where Z1 and Z2 are the parameters of the model; u1 and u2 are the thresholds calculated with the model; and ZLN, Zmin and Zmax are calculated as

equation image
equation image
equation image

This takes into account that when H(t) has a log-normal distribution, ZLN has a standard normal distribution; and when H(t) has a GPD distribution of minima (maxima), Zmin (Zmax) has a unit-parameter exponential distribution.

[93] After the standardized variable Ze was calculated, it was used to calculate the empirical probability Pe. The modeled quantiles Zm and the modeled probability Pm were calculated from Ze and Pe as

equation image
equation image

[94] Finally, the QQ plot was built with (Ze, Zm) and the PP plot was built with (Pe, Pm).

Appendix B: Copula Definition

[95] A copula is a function C:[0, 1] × [0, 1] → [0, 1] such that for all u, v ∈ [0, 1], it holds that C(u, 0) = 0, C(u, 1) = u, C(0, v) = 0 and C(1, v) = v; and for all u1 ≤ u2, v1 ≤ v2 ∈ [0, 1] it holds that

C(u2, v2) − C(u2, v1) − C(u1, v2) + C(u1, v1) ≥ 0

[96] The use of copulas to define multivariate distribution functions is based on Sklar's theorem: when FXY is a two-dimensional distribution function with marginal distribution functions FX and FY, there is then a copula C such that FXY(x, y) = Prob[X ≤ x, Y ≤ y] = C(FX(x), FY(y)).
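The defining conditions of a copula can be checked numerically on a grid, as in the sketch below. The function names are hypothetical and the check is only a necessary test at the grid resolution, not a proof. It is applied to the three elementary copulas M2, Π2 and W2 that appear in Appendix E.

```python
def frechet_upper(u, v):
    return min(u, v)  # M2

def independence(u, v):
    return u * v  # Pi2

def frechet_lower(u, v):
    return max(u + v - 1.0, 0.0)  # W2

def is_copula_on_grid(C, n=20, tol=1e-12):
    """Check boundary conditions and the 2-increasing property on a grid."""
    grid = [i / n for i in range(n + 1)]
    for u in grid:
        if abs(C(u, 0.0)) > tol or abs(C(0.0, u)) > tol:
            return False
        if abs(C(u, 1.0) - u) > tol or abs(C(1.0, u) - u) > tol:
            return False
    for i in range(n):
        for j in range(n):
            u1, u2, v1, v2 = grid[i], grid[i + 1], grid[j], grid[j + 1]
            volume = C(u2, v2) - C(u2, v1) - C(u1, v2) + C(u1, v1)
            if volume < -tol:
                return False
    return True
```

Only adjacent grid cells are tested because the volume of any larger grid rectangle is the sum of the volumes of the cells it contains.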

Appendix C: Measures of Association

[97] For a bivariate series (x, y), the most widely used measures of association are Kendall's τk and Spearman's ρs [Salvadori et al., 2007]. Sample versions of these parameters are

equation image
equation image

where c (d) is the number of concordant (discordant) pairs (xi, yi), (xj, yj), defined as pairs for which (xi − xj)(yi − yj) > 0 (< 0); Ri = Rank(xi); Si = Rank(yi); and n is the sample size.
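Both measures can be computed directly from the quantities defined above. The sketch below assumes the common sample forms τk = (c − d)/(c + d) and ρs = 1 − 6 Σ(Ri − Si)² / (n(n² − 1)), which may differ in presentation from the expressions in the article, and it assumes a sample without ties.

```python
def kendall_tau(x, y):
    """Kendall's tau from counts of concordant and discordant pairs."""
    c = d = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1  # concordant pair
            elif s < 0:
                d += 1  # discordant pair
    return (c - d) / (c + d)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho from rank differences (assumes no ties)."""
    n = len(x)
    r, s = ranks(x), ranks(y)
    return 1.0 - 6.0 * sum((a - b) ** 2 for a, b in zip(r, s)) / (n * (n * n - 1))
```

The pairwise loop is O(n²), which is adequate for illustration but slow for multi-decadal series sampled every few hours.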

Appendix D: Copula-Based Second-Order and Third-Order Markov Models

[98] Variables F1∣2 and F3∣2 are calculated using the bivariate copula C12 that defines the first-order Markov process:

equation image
equation image

where it is assumed that the time-dependence structure is stationary, and thus C12 = C23.

[99] If these variables are dependent on each other (a dependence measured with τk or ρs), a trivariate copula C123 is then built that contemplates this dependence and which defines the second-order Markov process

equation image

where marginal distributions C12 and C23 are given by the copula C12 = C23, and where marginal C13 represents the dependence of Pt and Pt−2 that is not explained by C12. A copula of this type can be found in Joe [1997, chap. 4.5]

equation image

where C13 is fit based on the sample of F1∣2 and F3∣2.

[100] Similarly, F1∣23 and F4∣23 are calculated using C123

equation image
equation image

where C12 = C23 = C34 and C123 = C234.

[101] If the dependence between F1∣23 and F4∣23, measured with Kendall's τk or Spearman's ρs, is significant, there is a significant degree of dependence between Pt and Pt−3 that is not explained by C123, and copula C1234 is built, which defines the third-order Markov process

equation image

where copula C14 is fit based on the sample of variables F1∣23 and F4∣23.

[102] The distribution of Pt conditioned on Pt−1 = v, Pt−2 = w and Pt−3 = y is then obtained by differentiating (D4)

equation image

Appendix E: Copula Families

[103] The Gumbel-Hougaard family is the same as the logistic family used in the multivariate theory of extremes [see, e.g., Coles, 2001, chap. 8; Salvadori et al., 2007, Appendix C]. This study used an asymmetric version of this family [see, e.g., Ribatet et al., 2009].

equation image

with

equation image

where equation image = −log(u) and equation image = −log(v), θ ≥ 1, 0 ≤ θ1, θ2 ≤ 1.

[104] The conditioned distributions are given by

equation image
equation image

whereas the density is

equation image

[105] The parameters of this copula are estimated by means of maximum likelihood using (E5).

[106] The Fréchet copula family is given by

equation image

where M2(u, v) = min(u, v) is the Fréchet-Hoeffding upper bound; Π2 (u, v) = uv is the independent copula; and W2(u, v) = max(u + v − 1, 0) is the Fréchet-Hoeffding lower bound. The following relations are used to fit the parameters of the Fréchet family [Salvadori et al., 2007]

equation image
equation image
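As an illustration of how a two-parameter Fréchet copula can be fit from the measures of association, the sketch below assumes the usual convex-combination form C = α M2 + (1 − α − β) Π2 + β W2 and the textbook relations ρs = α − β and τk = (α − β)(α + β + 2)/3. These forms are assumptions and may not match the expressions used in the article; the function names are hypothetical.

```python
def frechet_copula(u, v, alpha, beta):
    """Two-parameter Frechet family (assumed convex-combination form)."""
    m2 = min(u, v)                # Frechet-Hoeffding upper bound
    pi2 = u * v                   # independence copula
    w2 = max(u + v - 1.0, 0.0)    # Frechet-Hoeffding lower bound
    return alpha * m2 + (1.0 - alpha - beta) * pi2 + beta * w2

def frechet_parameters(tau, rho, tol=1e-9):
    """Solve rho = a - b and tau = (a - b)(a + b + 2) / 3 for (a, b)."""
    if rho == 0.0:
        raise ValueError("rho = 0 leaves alpha + beta undetermined")
    total = 3.0 * tau / rho - 2.0  # alpha + beta
    alpha = 0.5 * (total + rho)
    beta = 0.5 * (total - rho)
    if alpha < -tol or beta < -tol or alpha + beta > 1.0 + tol:
        raise ValueError("tau and rho are not attainable by this family")
    return alpha, beta
```

Because the two relations are solved in closed form, no numerical optimization is needed, but not every (τk, ρs) pair is attainable, hence the validity check.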

Appendix F: Simulation Procedure of the Third-Order Markov Process

[107] For the third-order Markov process, the simulation procedure is:

[108] (i) At t = 1, u1 ∼ U(0, 1) is simulated, and P1 = u1 is taken.

[109] (ii) For t = 2, u2 ∼ U(0, 1) is simulated, and P2 is calculated conditioned on P1, solving the following equation

equation image

[110] (iii) For t = 3, u3 ∼ U(0, 1) is simulated, and P3 is calculated conditioned on P1 and P2, solving the following equation

equation image

[111] (iv) For t ≥ 4, ut ∼ U(0, 1) is simulated, and Pt is calculated conditioned on Pt−1, Pt−2 and Pt−3, solving the following equation

equation image

[112] (v) Once the series {Pt} is simulated, the series {Ht} is constructed, using the inverse of the NS-LN-GPD (equation (6)).

[113] In steps (ii) to (iv), the expressions of the conditioned copulas are analytically resolved, whereas equations (F1), (F2), and (F3) are numerically solved with the bisection method.
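The structure of this procedure can be illustrated with a deliberately simplified sketch: a first-order Markov process driven by the symmetric Gumbel-Hougaard copula, C(u, v) = exp{−[(−log u)^θ + (−log v)^θ]^(1/θ)}, rather than the asymmetric third-order model used in the article. The conditional copula is evaluated analytically and inverted by bisection, as in steps (ii) to (iv). All function names and parameter values are hypothetical, and step (v) is omitted because it requires the fitted NS-LN-GPD model.

```python
import math
import random

def gumbel_conditional(v, u, theta):
    """C(v | u) = dC(u, v)/du for the symmetric Gumbel-Hougaard copula."""
    a, b = -math.log(u), -math.log(v)
    s = a ** theta + b ** theta
    c = math.exp(-s ** (1.0 / theta))
    return c * s ** (1.0 / theta - 1.0) * a ** (theta - 1.0) / u

def solve_by_bisection(func, target, lo=1e-12, hi=1.0 - 1e-12, iterations=60):
    """Solve func(x) = target for an increasing func on (lo, hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_first_order(n, theta, seed=0):
    """Simulate P_1..P_n with first-order copula-based dependence."""
    rng = random.Random(seed)
    eps = 1e-12
    # Step (i): P_1 is a uniform variate (clipped away from 0 and 1).
    p = [min(max(rng.random(), eps), 1.0 - eps)]
    for _ in range(n - 1):
        u_t = rng.random()
        prev = p[-1]
        # Solve C(P_t | P_{t-1}) = u_t for P_t.
        p.append(solve_by_bisection(
            lambda v: gumbel_conditional(v, prev, theta), u_t))
    return p
```

For example, `simulate_first_order(1000, 2.0)` returns a positively autocorrelated series on (0, 1); with θ = 1 the copula reduces to independence and the conditional distribution is simply v.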

Appendix G: List of Abbreviations

[114] Table G1 lists the abbreviations used throughout the article.

Table G1. List of Abbreviations
Abbreviation    Description
BIC             Bayesian Information Criterion
GPD             Generalized Pareto distribution
IM              NS-LN-GPD model fitted to the data allowing the parameters to have interannual variations
IM-A            Combination of IM model for marginal distribution and ARMA model for time dependency
IM-C            Combination of IM model for marginal distribution and copula-based model for time dependency
LLF             Log-likelihood function
LN              Log-normal distribution
NLLF            Negative log-likelihood function
NS-LN           Non-stationary log-normal distribution
NS-LN-GPD       Non-stationary mixture model composed of a log-normal distribution for the main-mass regime and two generalized Pareto distributions for the tail regimes
SM              NS-LN-GPD model fitted to the data without allowing for interannual variations of the parameters
SM-A            Combination of SM model for marginal distribution and ARMA model for time dependency
SM-C            Combination of SM model for marginal distribution and copula-based model for time dependency

Acknowledgments

[115] This research was funded by the Spanish Ministry of Education through its postgraduate fellowship program, grant AP2009-03235. Partial funding was also received from the Spanish Ministry of Science and Innovation (research project CTM2009-10520) and the Andalusian Regional Government (research project P09-TEP-4630). The authors also wish to thank Puertos del Estado for providing the wave record data.
