Sig‐Wasserstein GANs for conditional time series generation

Generative adversarial networks (GANs) have been extremely successful in generating samples from seemingly high-dimensional probability measures. However, these methods struggle to capture the temporal dependence of joint probability distributions induced by time-series data. Furthermore, long time-series data streams hugely increase the dimension of the target space, which may render generative modeling infeasible. To overcome these challenges, motivated by autoregressive models in econometrics, we are interested in the conditional distribution of future time series given the past information. We propose the generic conditional Sig-WGAN framework by integrating Wasserstein GANs (WGANs) with a mathematically principled and efficient path feature extraction called the signature of a path. The signature of a path is a graded sequence of statistics that provides a universal description for a stream of data, and its expected value characterizes the law of the time-series model. In particular, we develop the conditional Sig-$W_1$ metric that captures the conditional joint law of time series models and use it as a discriminator. The signature feature space enables the explicit representation of the proposed discriminators, which alleviates the need for expensive training. We validate our method on both synthetic and empirical datasets and observe that our method consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability.


INTRODUCTION
The ability to generate high-fidelity synthetic time-series datasets can facilitate testing and validation of data-driven products and enable data sharing while respecting the demand for privacy (Assefa et al., 2020; Bellovin et al., 2019; Tucker et al., 2020). Until recently, time-series models were mostly conceived by handcrafting a parsimonious parametric model, which would best capture the desired statistical and structural properties or the so-called stylized facts of the time series data. Typical examples are discrete time autoregressive econometric models (Tsay, 2005), or continuous time stochastic differential equations (SDEs) (Karatzas & Shreve, 1998). In many applications, such as finance and economics, one cannot base models on well-established "physical laws" and the risk of handcrafting inappropriate models might be significant. It is, therefore, tempting to build upon the success of nonparametric unsupervised learning methods such as deep generative modeling (DGM) to enable data-driven model selection mechanisms for dynamically evolving data sets such as time series. However, off-the-shelf DGMs perform poorly on the task of learning the temporal dynamics of multivariate time series data $x_{1:T} := (x_1, \ldots, x_T) \in \mathbb{R}^{d \times T}$ due to (1) the complex interaction between temporal and spatial features, and (2) the potentially high dimension of the joint distribution of $x_{1:T}$ (e.g., when $T \gg 1$); see, for example, Mescheder et al. (2018).
In this work, we are interested in developing a data-driven nonparametric model for the conditional distribution $\mathrm{Law}(X_{\text{future}} \mid X_{\text{past}})$ of future time series given the past $X_{\text{past}} := x_{t-p+1:t}$. This setting includes classical autoregressive processes. Learning conditional distributions is particularly important in the cases of (1) predictive modeling: it can be directly used to forecast the distribution of future time series given past information; (2) causal modeling: a conditional generator can be used to produce counterfactual statements; and (3) building the joint law through conditional laws, which enables one to incorporate a prior into the learning process, as is necessary for building high-fidelity generators.
Learning the conditional distribution is often more desirable than learning the joint law and can lead to more efficient learning with a smaller amount of data (Buehler et al., 2020; Ng & Jordan, 2002). To see that, consider the following example.
As a consequence, the problem of learning a distribution over $\mathbb{R}^{d \times T}$ can be reduced to learning a conditional distribution over $\mathbb{R}^d$.
In our setting, the conditional law is time invariant and hence having only one data trajectory $x_{1:T}$ gives $T - p - 1$ samples. This should be contrasted with having one sample when trying to learn $\mathrm{Law}(x_{1:T})$ directly.

Structure
The problem of calibrating a generative model in the time series domain is formulated in Section 2, where we also overview the key results of this work against the available literature. In Section 3, we introduce the signature of a path formally. In Section 4, we establish the key theoretical results of this work. In Section 5, we present the algorithm, while in Section 6, we present extensive numerical experiments.

PROBLEM FORMULATION
Fix $T > 0$ and let $x := (x_1, \ldots, x_T) \in \mathbb{R}^{d \times T}$ be a $d$-dimensional time series of length $T$. Let $w$ be the window size (typically $w \ll T$). Suppose that we have access to one realization of $x$, that is, $(x_1, \ldots, x_T)$, and then obtain copies of time series segments of window size $w$ by a sliding window. We assume that, for each $t$, the time series segment $(x_{t+1}, \ldots, x_{t+w})$ is sampled from the same but unknown distribution $\mu \in \mathcal{P}(\mathbb{R}^{d \times w})$ on the time series (path) space. The objective of the unconditional generative model is to train a generator so as to produce an $\mathbb{R}^{d \times w}$-valued random variable whose law is close to $\mu$ using the time series data $x$.¹ In contrast, this paper focuses on the task of the conditional generative model of future time series when conditioning on past time series. Let $p$ and $q$ denote the window sizes of the past time series $x_{\text{past},t} := (x_{t-p+1}, \ldots, x_t) \in \mathbb{R}^{d \times p} =: \mathcal{X}$ and the future time series $x_{\text{future},t} := (x_{t+1}, \ldots, x_{t+q}) \in \mathbb{R}^{d \times q} =: \mathcal{Y}$, respectively. Assume that the joint distribution of $(x_{\text{past},t}, x_{\text{future},t}) = x_{t-p+1:t+q}$ does not depend on the time $t$. Given a realization of the time series $(x_1, \ldots, x_T)$, at each time $t$ the pair of past path $x_{\text{past},t}$ and future path $x_{\text{future},t}$ is a sample from the same but unknown distribution of the $\mathcal{X} \times \mathcal{Y}$-valued random variable $(X_{\text{past}}, X_{\text{future}})$. We aim to train a generator to produce the conditional law $\mu_t(x) := \mathrm{Law}(X_{\text{future},t} \mid X_{\text{past},t} = x)$. As $\mu_t(x)$ does not depend on $t$, we write $\mu(x)$ for simplicity. Of course, the methodology developed here also applies if one can access a collection $(x^{(i)}_{\text{past}}, x^{(i)}_{\text{future}})_{i=1}^N$ of $N$ independent copies of the past and future time series for $N \geq 1$.
More specifically, the aim of the conditional generative model is to map samples from some basic distribution $\mu_Z$ supported on $\mathcal{Z} \subseteq \mathbb{R}^{d_Z}$, together with the data $x_{\text{past},t}$, into samples from the conditional law $\mu(x_{\text{past},t})$. Given latent $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$, conditioning $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ and target $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ measurable spaces, one considers a map $G: \Theta^{(g)} \times \mathcal{Z} \times \mathcal{X} \to \mathcal{Y}$, with $\Theta^{(g)}$ being a parameter space. Given parameters $\theta^{(g)} \in \Theta^{(g)}$ and $x_{\text{past},t}$, the map $G(\theta^{(g)}, \cdot, x_{\text{past},t})$ transports $\mu_Z$ into $\nu(\theta^{(g)}, x_{\text{past},t}) := G(\theta^{(g)}, \cdot, x_{\text{past},t})_{\#}\mu_Z = \mu_Z\big(G(\theta^{(g)}, \cdot, x_{\text{past},t})^{-1}(B)\big)$, $B \in \mathcal{B}(\mathcal{Y})$. The aim is to find $\theta^{(g)}$ such that $\nu(\theta^{(g)}, x_{\text{past},t})$ is a good approximation of $\mu(x_{\text{past},t})$ with respect to a suitable metric. Often the metric of choice is a Wasserstein distance, which leads to
$$W_1\big(\mu(x_{\text{past}}), \nu(\theta^{(g)}, x_{\text{past}})\big) = \sup_{|f|_{\mathrm{Lip}} \leq 1} \mathbb{E}_{\mu(x_{\text{past}})}[f(X_{\text{future}})] - \mathbb{E}_{\nu(\theta^{(g)}, x_{\text{past}})}[f(X_{\text{future}})]. \quad (1)$$
Optimal transport metrics, such as the Wasserstein distance, are attractive due to their ability to capture meaningful geometric features between measures even when their supports do not overlap, but they are expensive to compute (Genevay et al., 2019). Furthermore, when computing the Wasserstein distance for conditional laws, one needs to compute the conditional expectation $\mathbb{E}_{\mu(x_{\text{past}})}[f(X_{\text{future}})]$ from input data. In the continuous setting studied in this paper, this is computationally heavy and typically introduces additional bias (e.g., due to employing least squares regression to compute an approximation of the conditional expectation).
Since our aim is to learn the conditional law for all possible conditioning values, we consider the averaged loss $\mathbb{E}\big[W_1\big(\mu(X_{\text{past}}), \nu(\theta, X_{\text{past}})\big)\big]$. Note that, since $W_1$ is non-negative, $\mathbb{E}[W_1(\mu(X_{\text{past}}), \nu(\theta, X_{\text{past}}))] = 0$ implies that $\mu(X_{\text{past}}) = \nu(\theta, X_{\text{past}})$ almost surely.

Challenges in implementing $W_1$-GAN for conditional laws
There are two key challenges when one aims to implement the $W_1$-GAN for conditional laws.
Challenge 1: Min-max problem. A typical implementation of a $W_1$-GAN requires the introduction of a parametric function approximation $\Theta^{(d)} \times \mathcal{Y} \ni (\theta^{(d)}, y) \mapsto f(\theta^{(d)}, y)$ such that $y \mapsto f(\theta^{(d)}, y)$ is 1-Lipschitz. In the case of a neural network approximation, this can be achieved by clipping the weights or adding a penalty that ensures $\nabla_y f(\theta^{(d)}, y)$ has norm at most 1, see Gulrajani et al. (2017). Recalling the definition of $W_1$ in Equation (1) and writing $\Phi(\theta^{(g)}, \theta^{(d)})$ for the resulting objective, training the conditional $W_1$-GAN constitutes solving the min-max problem
$$\min_{\theta^{(g)}} \max_{\theta^{(d)}} \Phi\big(\theta^{(g)}, \theta^{(d)}\big). \quad (2)$$
In practice, the min-max problem is solved by iterating gradient descent-ascent algorithms, and its convergence can be studied using tools from game theory (Lin et al., 2020; Mazumdar et al., 2019). However, it is well known that first order methods, which are typically used in practice, might not converge even in the convex-concave case (Daskalakis et al., 2017; Daskalakis & Panageas, 2018; Mertikopoulos et al., 2018). Consequently, adversarial training is notoriously difficult to tune (Farnia & Ozdaglar, 2020; Mazumdar et al., 2019), and the generalization error is very sensitive to the choice of discriminator and hyper-parameters, as demonstrated in the large scale study of Lucic et al. (2018).
Challenge 2: Computation of the conditional expectation. In addition to the challenge of solving a min-max problem for each new parameter $\theta^{(g)}$, one needs to compute the conditional expectation $\mathbb{E}_{\mu(x_{\text{past}})}[f(\theta^{(d)}, X_{\text{future}})]$ (or $\mathbb{E}_{\mu(x_{\text{past}})}[\nabla_{\theta^{(d)}} f(\theta^{(d)}, X_{\text{future}})]$ if one can interchange differentiation and integration). From the Doob-Dynkin lemma, we know that this conditional expectation is a measurable function of $x_{\text{past}}$; approximating it is computationally heavy and can be recast as a mean-square optimization problem. A practical solution of this problem requires an additional function approximation, which may introduce additional bias and makes the overall algorithm much harder to tune.
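To make Challenge 1 concrete, here is a minimal PyTorch sketch of conditional $W_1$-GAN training with weight clipping; the generator, critic, noise sampler, and loader interfaces are hypothetical placeholders rather than the paper's implementation. Note that averaging the critic over observed (past, future) pairs is exactly the one-sample estimator of the conditional expectation discussed in Challenge 2.

```python
import torch

def cwgan_epoch(generator, critic, g_opt, d_opt, loader, noise, n_critic=5, clip=0.01):
    """One epoch of conditional W1-GAN training (the min-max in Equation (2)).
    generator(x_past, z) and critic(x_past, x_future) are hypothetical modules;
    weight clipping crudely enforces the 1-Lipschitz constraint on the critic."""
    for x_past, x_future in loader:
        for _ in range(n_critic):  # inner ascent over theta^(d)
            x_fake = generator(x_past, noise(x_past.shape[0])).detach()
            d_loss = critic(x_past, x_fake).mean() - critic(x_past, x_future).mean()
            d_opt.zero_grad()
            d_loss.backward()
            d_opt.step()
            for p in critic.parameters():      # crude Lipschitz constraint
                p.data.clamp_(-clip, clip)
        # outer descent over theta^(g)
        x_fake = generator(x_past, noise(x_past.shape[0]))
        g_loss = -critic(x_past, x_fake).mean()
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
```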

Summary of the key results
Discrete time econometric models can be viewed as discretizations of certain SDE-type models (Klüppelberg et al., 2004). The continuous time perspective, obtained by embedding discrete time series into a path space, which we follow in this paper, is particularly useful when learning from irregularly sampled data sets and when designing efficient training methods that naturally scale when working with high and ultra-high frequency data (Cuchiero et al., 2020; Gierjatowicz et al., 2022; Liu et al., 2019). Our approach utilizes the signature of a path, a mathematical object that emerges from rough path theory and provides a highly abstract and universal description of complex multimodal data streams; it has recently demonstrated great success in several machine learning tasks (Kidger et al., 2019; Xie et al., 2017; Yang et al., 2022). To be more precise, we add a time dimension to the $d$-dimensional time series $(x_i)_{i=1}^T$ and embed it into a path $X: [0, T] \to E := \mathbb{R}^e$ with $e = d + 1$. For example, this is easily done by linearly interpolating discrete time data points. We assume that $X$ is regular (c.f. Section 3.2) and denote the space of all such regular paths by $\Omega_0([0, T], E)$. The signature of a path determines the path up to tree-like equivalence (Boedihardjo & Geng, 2015; Hambly & Lyons, 2010). Roughly speaking, there is an almost one-to-one correspondence between the signature and the path; when restricting to the path space $\Omega_0([0, T], E)$, the signature (feature) map $S: X \mapsto S(X)$, $X \in \Omega_0([0, T], E)$, is bijective onto its range. In other words, the signature of a path in $\Omega_0([0, T], E)$ determines the path completely (Levin et al., 2016). Let $S(\Omega_0([0, T], E))$ denote the range of the signature over all possible paths in $\Omega_0([0, T], E)$. Note that the signature map $S$, defined on $\Omega_0([0, T], E)$, is continuous with respect to the 1-variation topology (Lyons et al., 2007). A remarkable property of the signature is the following universal approximation property.
Theorem 2.1 (Universality of signature, Levin et al., 2016). Consider a compact set $K \subset S(\Omega_0([0, T], E))$. Let $f: K \to \mathbb{R}$ be any continuous function. Then, for any $\epsilon > 0$, there exists a linear functional $L \in (T((E)))^*$ acting on the signature such that
$$\sup_{s \in K} \big| f(s) - \langle L, s \rangle \big| \leq \epsilon. \quad (3)$$
Theorem 2.1 applies to any subspace topology on $S(\Omega_0([0, T], E))$, inherited from a Hausdorff topology on $T((E))$, that is finer than the weak topology. The theorem tells us that any continuous functional on the signature space can be approximated arbitrarily well by a linear combination of coordinate signatures.
Since the signature map $S$ is bijective and continuous when restricting the path space to $\Omega_0([0, T], E)$, the pushforward of a measure $\mu$ on the path space, $\hat{\mu}(A) := (S_{\#}\mu)(A) = \mu(S^{-1}(A))$ for $A$ in the $\sigma$-algebra of $S(\Omega_0(J, E))$, induces a measure on the signature space. With this in mind, the $W_1$ metric on the signature space is given by
$$W_1^{\mathrm{Sig}}(\hat{\mu}, \hat{\nu}) = \sup_{|f|_{\mathrm{Lip}} \leq 1} \mathbb{E}_{s \sim \hat{\mu}}[f(s)] - \mathbb{E}_{s \sim \hat{\nu}}[f(s)].$$
Motivated by the universality of the signature, we consider the following Sig-$W_1$ metric as a proxy for $W_1^{\mathrm{Sig}}$ by restricting the admissible test functions to linear functionals:
$$\text{Sig-}W_1(\mu, \nu) = \sup_{L \text{ linear}, \, |L|_{\mathrm{Lip}} \leq 1} \mathbb{E}_{s \sim \hat{\mu}}[\langle L, s \rangle] - \mathbb{E}_{s \sim \hat{\nu}}[\langle L, s \rangle].$$
The Sig-$W_1$ metric was initially proposed in Ni et al. (2021), where the Lipschitz norm of $L$ is obtained by endowing the underlying signature space with the $l^2$ norm. Here, we consider a more general case, where the norm of the signature space is chosen as $l^p$ for some $p > 1$.
In Lemma 4.5, we show that when the signature space is endowed with the $l^p$ norm on $T^p(E)$, the set of all tensor series elements with finite $l^p$ norm, then Sig-$W_1$ admits the analytic formula
$$\text{Sig-}W_1(\mu, \nu) = \big\| \mathbb{E}_{\mu}[S(X)] - \mathbb{E}_{\nu}[S(X)] \big\|_q, \qquad \tfrac{1}{p} + \tfrac{1}{q} = 1.$$
The significance of this result is that the Sig-$W_1$-GAN framework reduces the challenging min-max problem to supervised learning, without severe loss of accuracy when compared with the Wasserstein distance on the path space. Figure 1, computed on the two-dimensional VAR(1) dataset, illustrates that the SigCWGAN helps stabilize the training process and accelerates convergence compared with the CWGAN when the same conditional generator is used for both methods.
In the conditional setting studied here, we lift both $(X_{\text{past}}, X_{\text{future}})$ into the signature space, that is, $(X_{\text{past}}, X_{\text{future}}) \mapsto Z := (S(X_{\text{past}}), S(X_{\text{future}}))$. The corresponding Sig-$W_1$ distance is given by
$$\text{Sig-}W_1\big(\mu(x_{\text{past}}), \nu(\theta^{(g)}, x_{\text{past}})\big) = \big\| \mathbb{E}_{\mu(x_{\text{past}})}[S(X_{\text{future}})] - \mathbb{E}_{\nu(\theta^{(g)}, x_{\text{past}})}[S(X_{\text{future}})] \big\|,$$
where $Z$ denotes $(S(X_{\text{past}}), S(X_{\text{future}}))$. From the Doob-Dynkin lemma, we know that the conditional expectations are measurable functions of $X_{\text{past}}$. Assuming the continuity of the conditional expectation, and by the universal approximation results, these can be approximated arbitrarily well by a linear functional of the signature of the past path,
$$\mathbb{E}_{\mu(x_{\text{past}})}[S(X_{\text{future}})] \approx L\big(S(x_{\text{past}})\big).$$
Due to the linearity of the functional $L$, the solution of the above approximation problem can be estimated by linear regression. Unlike the classical $W_1$-GAN described above, the conditional expectation under the data measure needs to be computed only once. Complete training is then reduced to solving the following supervised learning problem:
$$\min_{\theta^{(g)}} \mathbb{E}\Big[ \big\| L\big(S(X_{\text{past}})\big) - \mathbb{E}_{\nu(\theta^{(g)}, X_{\text{past}})}[S(X_{\text{future}})] \big\| \Big].$$
Note that for each $\theta^{(g)}$, one needs to approximate $\mathbb{E}_{\nu(\theta^{(g)}, x_{\text{past}})}[S(X_{\text{future}})]$ using Monte Carlo simulations. A complete approximation algorithm also requires Monte Carlo approximation of the outer expectation and truncation of the signature map (see Section 5.2 for exact details). The flowchart of the SigCWGAN algorithm is given in Figure 2.

Related work
In the time series domain, the unconditional generative model was approached by various works such as Koshiyama et al. (2021) and Wiese et al. (2020). Among the signature-based models, Kidger et al. (2019) employ the signature transform within neural network models. TimeGAN (Yoon et al., 2019) adds a supervised loss to the adversarial loss to force the network to adhere to the dynamics of the training data during sampling. The supervised loss of TimeGAN is defined in terms of the sample-wise discrepancy between the true latent variable $h_{t+1}$ and the generated one-sample estimator $\hat{h}_{t+1}$ given $h_t$. However, even if the estimator $\hat{h}_{t+1}$ has the same conditional distribution as $h_{t+1}$, the supervised loss may not equal zero, which suggests that the proposed loss function might not be suitable for capturing the conditional distribution of the latent variable $h_{t+1}$ given $h_t$. The conditional moment matching network (CMMN) introduced in Ren et al. (2016) derives a conditional MMD criterion based on the kernel mean embedding of conditional distributions, which avoids the approximation issues of the conditional WGANs mentioned above. However, the performance of CMMN depends on the kernel choice, and it is yet unclear how to choose the kernel on the path space. While our SigWGAN method is built on the conditional WGANs and the signature features, we would like to highlight the difference of our method to the conditional WGAN and its link to CMMD. SigCWGAN resolves the computational bottleneck of the conditional WGANs given the past time series by using the analytic formula for the conditional discriminator without training. Building upon Ni et al. (2021), our work expands the SigWGAN framework from its initial application to unconditional generative models to enable conditional generative modeling. Moreover, one can view the SigCWGAN as the combination of an unnormalized Sig-MMD (Chevyrev & Oberhauser, 2022) and CMMD, which has not been explored in the literature. It is worth noting that we also extend the definition of Sig-$W_1$ in Ni et al. (2021) from the $l^2$ norm on the signature space to the general $l^p$ norm for some $p > 1$. We provide Table 1 to summarize the commonly used notation of our paper.

TABLE 1: Notation summary.
$X_{t,\text{past}}$: the $p$ lagged values of $X_t$, that is, $(x_{t-p+1}, \ldots, x_t) \in \mathbb{R}^{d \times p} =: \mathcal{X}$.
$X_{t,\text{future}}$: the next $q$-step forecast of $X_t$, that is, $(x_{t+1}, \ldots, x_{t+q}) \in \mathbb{R}^{d \times q} =: \mathcal{Y}$.
$p$: the window size of the past path $X_{t,\text{past}}$.
$q$: the window size of the future path $X_{t,\text{future}}$.
$S_{t,\text{past}}$: the signature of $X_{t,\text{past}}$.
$S_{t,\text{future}}$: the signature of $X_{t,\text{future}}$.

SIGNATURES AND EXPECTED SIGNATURES
In order to introduce formally the optimal conditional time series discriminator, in this section we recall basic definitions and concepts from rough path theory.

Tensor algebra space
We start by introducing the tensor algebra space of $E$, in which the signature of an $E$-valued path takes values. For simplicity, fix $E = \mathbb{R}^d$ throughout the rest of the paper; $E$ has the canonical basis $\{e_1, \ldots, e_d\}$. Consider the successive tensor powers $E^{\otimes n}$ of $E$.² If one thinks of the elements $e_i$ as letters, then $E^{\otimes n}$ is spanned by the words of length $n$ in the letters $\{e_1, \ldots, e_d\}$, and can be identified with the space of real homogeneous noncommuting polynomials of degree $n$ in $d$ variables, that is, $(e_I := e_{i_1} \otimes \cdots \otimes e_{i_n})_{I = (i_1, \ldots, i_n) \in \{1, \ldots, d\}^n}$. We give the formal definition of the tensor algebra series as follows.

Embed time series in the path space
The signature feature takes a continuous-function perspective on discrete time series. It allows a unified treatment of irregular time series (e.g., variable length, missing data, uneven spacing, asynchronous multidimensional data) on the path space (Chevyrev & Kormilitzin, 2016). To embed a time series into the signature space, we first lift the discrete time series to a continuous path of bounded 1-variation. Let $\mathbf{x} = (x_i)_{i=1}^n \in \mathbb{R}^{d \times n}$ be a $d$-dimensional time series of length $n$. We embed $\mathbf{x}$ into $X: [0, n] \to \mathbb{R}^e$ with $e = d + 1$ as follows: (1) linearly interpolate the cumulative sum process of $\mathbf{x}$ to get a $d$-dimensional piecewise linear path; (2) add the time dimension as the 0th coordinate of $X$.
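For concreteness, a minimal NumPy sketch of this embedding (the function name and array shapes are our own):

```python
import numpy as np

def embed_time_series(x):
    """Lift a d-dimensional time series of shape (n, d) to a time-augmented
    piecewise linear path of shape (n + 1, d + 1): (1) cumulative sum of x,
    prepended with zero; (2) time added as the 0th coordinate."""
    n, d = x.shape
    cumsum = np.vstack([np.zeros((1, d)), np.cumsum(x, axis=0)])  # hat{x}_0 = 0
    time = np.arange(n + 1, dtype=float).reshape(-1, 1)           # 0, 1, ..., n
    return np.hstack([time, cumsum])
```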
Throughout the rest of the paper, we restrict our discussion to the path space $\Omega_0(J, E)$. However, our methodology applies to other ways of transforming discrete time series to the path space, provided that the embedding ensures the uniqueness of the signature. The commonly used path transformations with such a uniqueness property are listed in Section B.1.

The signature of the path
We first introduce the $n$-fold iterated integral of a path $X \in \Omega_0([0, T], \mathbb{R}^e)$. Let $I = (i_1, \ldots, i_n)$ be a multi-index of length $n$, where $i_1, \ldots, i_n \in \{0, 1, 2, \ldots, e - 1\}$. Let $X^{(i)}$ denote the $i$th coordinate of $X$, which is a real-valued function. The iterated integral of $X$ indexed by $I$ is defined as
$$S(X)^I = \int_{0 < t_1 < \cdots < t_n < T} \mathrm{d}X^{(i_1)}_{t_1} \cdots \mathrm{d}X^{(i_n)}_{t_n}.$$
Collecting the iterated integrals of $X$ over all possible indices of length $n$ gives rise to the $n$th fold iterated integral of $X$, which can also be written in tensor form. Figure 3 (left) shows the one-fold iterated integral of $X$, which is the increment of $X$, that is, $X_3 - X_0$; for the two-fold iterated integrals, $S^{(i,i)}(X) = \frac{1}{2}(\Delta X^{(i)})^2$, while $S^{(0,1)}(X)$ and $S^{(1,0)}(X)$ are the blue and yellow areas in Figure 3 (right), respectively. Now we are ready to introduce the signature of a path $X$: it is the infinite collection of all iterated integrals, $S(X) := (1, \mathbb{X}^1, \mathbb{X}^2, \ldots)$, where $\mathbb{X}^n$ denotes the $n$th fold iterated integral of $X$.
Proof. It is a consequence of the factorial decay of the signature of a path of bounded 1-variation (c.f. Lemma A.2 in Appendix A). □

Lemma 3.6 (Uniqueness of signature). For any $X \in \Omega_0([0, T], E)$, the signature of $X$ uniquely determines $X$.
Proof. We refer to the proof of Lemma 2.14 in Levin et al. (2016). □

The universality and uniqueness of the signature, described in Section 2.1, make it an excellent candidate as a feature extractor for time series.
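To make the object concrete, the following self-contained NumPy sketch computes the truncated signature of a piecewise linear path, using the facts that a linear segment with increment $\Delta$ has level-$k$ signature term $\Delta^{\otimes k}/k!$ and that signatures of concatenated paths combine via Chen's identity; in practice one would use a dedicated package such as iisignature or signatory. The function names are our own, and the example reuses embed_time_series from Section 3.2.

```python
import numpy as np

def segment_signature(inc, depth):
    """Signature of a linear segment with increment inc: level k is inc^{(tensor)k}/k!."""
    levels = [np.array(1.0)]
    for k in range(1, depth + 1):
        levels.append(np.multiply.outer(levels[-1], inc) / k)
    return levels

def chen_product(a, b, depth):
    """Chen's identity: level n of a concatenated path is sum_k a_k (tensor) b_{n-k}."""
    out = []
    for n in range(depth + 1):
        out.append(sum(np.multiply.outer(a[k], b[n - k]) for k in range(n + 1)))
    return out

def signature(path, depth):
    """Truncated signature of a piecewise linear path given as an (L, e) array.
    Returns [level-0 scalar 1, level-1 array (e,), level-2 array (e, e), ...]."""
    sig = segment_signature(path[1] - path[0], depth)
    for i in range(1, len(path) - 1):
        sig = chen_product(sig, segment_signature(path[i + 1] - path[i], depth), depth)
    return sig

# Example: level 1 recovers the total increment of the path.
path = embed_time_series(np.array([[0.5], [-1.0], [2.0]]))
assert np.allclose(signature(path, 2)[1], path[-1] - path[0])
```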
As we mainly work on the signature space in later sections, we provide a remark on the structure of the range of $S$ on $\Omega_0([0, T], E)$.
Remark 3.7. Let $K$ denote a compact set of $\Omega_0([0, T], E)$. Then the range $S(K)$ is a compact set of $T^p(E)$ endowed with the $l^p$ topology. We defer the proof to Lemma A.3 in Appendix A.

Expected signature
Lemma 3.8. Let $\mu, \nu$ be two measures defined on the path space $\Omega_0(J, E)$. Then, for $\hat{\mu}(A) := (S_{\#}\mu)(A)$ and $\hat{\nu}(A) := (S_{\#}\nu)(A)$ with $A$ in the $\sigma$-algebra of $S(\Omega_0(J, E))$, we have $\mu = \nu$ if and only if $\hat{\mu} = \hat{\nu}$.

Proof. This is an immediate consequence of the bijectivity of the signature map $S$ when restricted to $\Omega_0(J, E)$. □

By Proposition 6.1 in Chevyrev et al. (2016), we have the following result.

Theorem 3.9. Let $\mu$ and $\nu$ be two measures on the path space $\Omega_0(J, E)$. If the expected signatures $\mathbb{E}_{X \sim \mu}[S(X)]$ and $\mathbb{E}_{X \sim \nu}[S(X)]$ are well defined, have infinite radius of convergence, and coincide, then $\mu = \nu$.

In other words, under the regularity condition, the distribution $\mu$ on the path space is characterized by $\mathbb{E}_{X \sim \mu}[S(X)]$. We call $\mathbb{E}_{X \sim \mu}[S(X)]$ the expected signature of the stochastic process $X$ under the measure $\mu$. Intuitively, the signature of a path plays the role of a noncommutative polynomial on the path space. Therefore, the expected signature of a random process can be viewed as an analogue of the moment generating function of a $d$-dimensional random variable. For example, the expected Stratonovich signature of Brownian motion determines the law of Brownian motion (Lyons et al., 2015). However, it is challenging to establish a general condition guaranteeing an infinite radius of convergence (ROC). In fact, the study of the expected signature of stochastic processes is an active area of research. For example, the expected signature of fractional Brownian motion with Hurst parameter $H \geq 1/2$ is shown to have infinite ROC (Fawcett, 2002; Passeggeri, 2020), whereas the ROC of the expected signature of Brownian motion stopped upon the first exit from a domain is finite (Boedihardjo et al., 2021; Li & Ni, 2022). Chevyrev et al. (2016, Theorem 6.3) provide a sufficient condition for the infinite ROC of the expected signature, potentially offering an alternative way to show infinite ROC without directly examining the decay rate of the expected signature.

SIG-WASSERSTEIN METRIC
In this section, we formalize the derivation of the Signature Wasserstein-1 (Sig-$W_1$) metric introduced in Section 2.1. The Sig-$W_1$ metric is a generalization of the one proposed in Ni et al. (2021), obtained by considering the general $l^p$ metric on the signature space.
Let $f: \Omega \to \mathbb{R}$, where $\Omega$ is a generic metric space. Define
$$\|f\|_{\mathrm{Lip}} := \sup_{x, y \in \Omega, \, x \neq y} \frac{|f(x) - f(y)|}{d(x, y)},$$
where $d$ is the metric on $\Omega$. Let $\mu$ and $\nu$ be two compactly supported measures on the path space $\Omega_0([0, T], E)$ such that the corresponding induced measures on the signature space, $\hat{\mu}$ and $\hat{\nu}$ respectively, have a compact support $K \subset S(\Omega_0([0, T], E)) \subset T^p(E)$. Recall that
$$W_1^{\mathrm{Sig}}(\hat{\mu}, \hat{\nu}) = \sup_{\|f\|_{\mathrm{Lip}} \leq 1} \mathbb{E}_{\hat{\mu}}[f(s)] - \mathbb{E}_{\hat{\nu}}[f(s)].$$
From the definition of the supremum, there exists a sequence of functions $f_n: K \to \mathbb{R}$ with bounded Lipschitz norm along which the supremum $W_1^{\mathrm{Sig}}(\hat{\mu}, \hat{\nu})$ is attained. By the universality of the signature, for any $\epsilon > 0$ and each $f_n$ there exists a linear functional $L_n: K \to \mathbb{R}$ that approximates $f_n$ uniformly, that is, $\sup_{s \in K} |f_n(s) - L_n(s)| \leq \epsilon$. As $L_n: K \to \mathbb{R}$ is linear, there is a natural extension of $L_n$ mapping from $T^p(E)$ to $\mathbb{R}$. Motivated by the above observation, to approximate $W_1^{\mathrm{Sig}}(\hat{\mu}, \hat{\nu})$ we restrict the admissible set of test functions to linear functionals $L: S(\Omega_0([0, T], E)) \to \mathbb{R}$, which leads to the following definition.

Definition 4.1 (Sig-$W_1$ metric). For two measures $\mu, \nu$ on the path space $\Omega_0([0, T], E)$ such that their induced measures $\hat{\mu}$ and $\hat{\nu}$ have a compact support $K \subset S(\Omega_0([0, T], E))$,
$$\text{Sig-}W_1(\mu, \nu) := \sup_{L \text{ linear}, \, \|L\|_{\mathrm{Lip}} \leq 1} \mathbb{E}_{\hat{\mu}}[L(s)] - \mathbb{E}_{\hat{\nu}}[L(s)].$$
Here we omit the domain $T^p(E)$ in the Lipschitz norm $\|L\|_{\mathrm{Lip}}$ for notational simplicity.
Remark 4.2. Despite the motivation of Sig-$W_1$ as an approximation of $W_1^{\mathrm{Sig}}$, it is hard to establish theoretical results on the link between these two metrics. The main difficulty comes from the fact that the uniform approximation of a continuous function $f$ by a linear map $L$ on $K$ does not guarantee the closeness of their Lipschitz norms. We conjecture that, in general, $W_1^{\mathrm{Sig}}(\hat{\mu}, \hat{\nu})$ is not equal to Sig-$W_1(\mu, \nu)$. It would be interesting, though technically challenging, to find sufficient conditions under which these two metrics coincide.
To derive the analytic formula for the Sig-$W_1$ metric, we introduce the following auxiliary lemma on the $l^p$ norm of the tensor space $T^p(E)$ and its dual space.

Lemma 4.3. Let $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$. For any linear functional $L \in T^p(E)^*$, it holds that
$$\sup_{\|a\|_p \leq 1} |L(a)| = \|L\|_q.$$
Similarly, for any $a \in T^p(E)$, it holds that
$$\sup_{\|L\|_q \leq 1} |L(a)| = \|a\|_p.$$
We refer to the proof of Lemma 4.3 in Appendix A.3.
Remark 4.4. The sequence space $\ell^p(I)$ is defined as $\ell^p(I) := \{x = (x_i)_{i \in I} : \sum_{i \in I} |x_i|^p < \infty\}$, where $I$ is a general index set and $p \geq 1$. It is well known that the dual space of $\ell^p(I)$ for $p \geq 1$ is naturally isomorphic to $\ell^q(I)$. This isomorphism is exactly the map $\ell^q(I) \setminus \{0\} \to \ell^p(I)^* \setminus \{0\}$, $L \mapsto L^*(\cdot)$, used in our proof. Similarly, the dual space of $\ell^p(I)^*$ has a natural isomorphism with $\ell^p(I)$ for any $p > 1$.
By exploiting the linearity of the functional $L \in T^p(E)^*$, we can compute the Lipschitz norm of $L$ analytically when $d$ is the $l^p$ norm of $T^p(E)$, without the need for numerical optimization. By Lemma 4.3, the Lipschitz norm of $L$ is the $l^q$ norm of $L$, that is, $\|L\|_{\mathrm{Lip}} = \|L\|_q$, where $\frac{1}{p} + \frac{1}{q} = 1$. This simplification of the Lipschitz norm enables us to derive an analytic formula for the corresponding Sig-$W_1$ metric. Throughout the rest of the paper, by default we use the Sig-$W_1$ metric for which $d$ is the $l^2$ norm on $T((E))$, that is,
$$\text{Sig-}W_1(\mu, \nu) = \big\| \mathbb{E}_{\mu}[S(X)] - \mathbb{E}_{\nu}[S(X)] \big\|_2.$$
In practice, we truncate Sig-$W_1(\mu, \nu)$ up to degree $M$, that is,
$$\text{Sig-}W_1^{(M)}(\mu, \nu) = \big\| \mathbb{E}_{\mu}[S_M(X)] - \mathbb{E}_{\nu}[S_M(X)] \big\|_2,$$
where $S_M$ denotes the signature truncated at degree $M$.
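For illustration, the truncated metric can be estimated from two sets of sample paths in a few lines of NumPy, reusing the signature sketch from Section 3 (an illustrative sketch, not the released implementation):

```python
def truncated_sig_w1(paths_mu, paths_nu, depth):
    """Empirical Sig-W1 truncated at `depth`: l2 distance between the averaged
    truncated signatures of two sets of sample paths (each an (L, e) array)."""
    def expected_sig(paths):
        flats = [np.concatenate([np.atleast_1d(lvl).ravel()
                                 for lvl in signature(p, depth)])
                 for p in paths]
        return np.mean(flats, axis=0)
    return np.linalg.norm(expected_sig(paths_mu) - expected_sig(paths_nu))
```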

SIG-WASSERSTEIN GANS FOR CONDITIONAL LAW
In this section, we introduce a general framework, the conditional Sig-Wasserstein GAN (SigCWGAN), based on the Sig-$W_1$ metric, to learn the conditional distribution $\mathrm{Law}(X_{\text{future}} \mid X_{\text{past}})$ from data. The SigCWGAN algorithm is composed of two main steps: 1. We apply a one-off linear regression to learn the conditional expected signature under the true measure, $\mathbb{E}_{X_{\text{future}} \sim \mu(x_{\text{past}})}[S(X_{\text{future}})]$ (see Section 5.1); 2. We solve an optimization problem to find the optimal parameters $\theta^{(g)}$ of the conditional generator, using loss (5) (see Section 5.2).
In the last subsection, we propose a conditional generator, the AR-FNN generator, which is a nonlinear generalization of classical autoregressive models using a feed-forward neural network. It can generate future time series of arbitrary length.

Learning the conditional expected signature under the true measure
By Equation (4) and the universality of the signature (Theorem 2.1), the problem of estimating the conditional expected signature under the true measure $\mu(x_{\text{past}})$ can be viewed as a linear regression task, with the signature of the past path as the explanatory variable and the signature of the future path as the response (Levin et al., 2016).
More specifically, given a long realization $x := (x_1, \ldots, x_T) \in \mathbb{R}^{d \times T}$ and fixed window sizes of the past and future paths $p, q > 0$, we construct the samples of past/future path pairs $(x_{\text{past}}^{(i)}, x_{\text{future}}^{(i)})$ in a rolling window fashion, where the $i$th sample is given by $(x_{i:i+p-1}, x_{i+p:i+p+q-1})$.
Assuming stationarity of the time series, the samples of past and future signature pairs $\big(S_{M_1}(x_{\text{past}}^{(i)}), S_{M_2}(x_{\text{future}}^{(i)})\big)_i$ are identically distributed, where $M_1$ and $M_2$ are the degrees of the signature of the past and future paths; these can be chosen by cross-validation in terms of the fitting result. One may refer to Fermanian (2022) for further discussion on the choice of the signature truncation degree.
In principle, any linear regression method on the signature space could be applied to solve this problem using the above constructed data. When we further assume that, under the true measure,
$$S_{M_2}(X_{\text{future}}) = L\big(S_{M_1}(X_{\text{past}})\big) + \epsilon,$$
where $L$ is linear and $\mathbb{E}[\epsilon \mid S_{M_1}(X_{\text{past}})] = 0$, then ordinary least squares (OLS) regression can be used directly. This simple linear regression model on the signature space achieves satisfactory results on the numerical examples of this paper, but it could be replaced by more sophisticated regression models when dealing with other datasets.
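A minimal sketch of this one-off regression step, assuming scikit-learn and reusing the signature helper sketched in Section 3 (names and data layout are our own):

```python
from sklearn.linear_model import LinearRegression

def flat_sig(path, depth):
    """Flatten a truncated signature into a single feature vector."""
    return np.concatenate([np.atleast_1d(lvl).ravel()
                           for lvl in signature(path, depth)])

def fit_conditional_expected_signature(past_paths, future_paths, m1, m2):
    """One-off OLS estimate of x -> E[S_{M2}(X_future) | X_past = x] on
    signature features; returns a fitted multi-output linear model."""
    X = np.stack([flat_sig(p, m1) for p in past_paths])
    Y = np.stack([flat_sig(f, m2) for f in future_paths])
    return LinearRegression().fit(X, Y)
```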
We highlight that this supervised learning module for $\mathbb{E}_{\mu(x_{\text{past}})}[S_{M_2}(X_{\text{future}})]$ is one-off and can be carried out prior to the generative learning. This is in striking contrast to conditional WGAN training, which requires learning $\mathbb{E}_{\mu(x_{\text{past}})}[f_{\theta^{(d)}}(X_{\text{future}})]$ every time the discriminator $f_{\theta^{(d)}}$ is updated, and hence saves significant computational cost.

Sig-Wasserstein GAN algorithm for conditional law
We recall that, in order to quantify the goodness of the conditional generator $\nu(\theta^{(g)}, x_{\text{past},t}) := G(\theta^{(g)}, \cdot, x_{\text{past},t})_{\#}\mu_Z$, we defined the loss
$$\text{loss}(\theta^{(g)}) := \mathbb{E}\Big[ \big\| \hat{L}\big(S_{M_1}(X_{\text{past}})\big) - \mathbb{E}_{\nu(\theta^{(g)}, X_{\text{past}})}[S_{M_2}(X_{\text{future}})] \big\|_2 \Big],$$
where $\hat{L}$ denotes the linear regression estimator of the conditional expectation $x \mapsto \mathbb{E}_{\mu}[S_{M_2}(X_{\text{future}}) \mid X_{\text{past}} = x]$. Given the conditional generator $G(\theta^{(g)}, \cdot)$, the conditional expected signature $\mathbb{E}_{\nu(\theta^{(g)}, x_{\text{past},t})}[S(X_{\text{future}})]$ can be estimated by the Monte Carlo method. We denote by $\hat{\nu}$ the empirical approximation of $\nu(\theta^{(g)}, x_{\text{past},t})$, computed by sampling future trajectories $\hat{X}^{(\theta)}_{t+1:t+q}$ using $G(\theta^{(g)}, \cdot)$ and the conditioning variable $x_{\text{past},t}$. This leads to the empirical loss function (14), obtained by replacing both expectations with sample averages.

The inputs of the training algorithm are: the data $(x_t)_{t=1}^T$, the signature degree $M_1$ of the past path and $M_2$ of the future path, the length $p$ of the past path and $q$ of the future path, the learning rate $\eta$, the batch size $B$, the number of epochs $N$, and the number of Monte Carlo samples $N_{\mathrm{MC}}$. At each iteration, we randomly select a set of time indices of batch size $B$, denoted by $\mathcal{T}_B$; for each $t \in \mathcal{T}_B$, the Monte Carlo estimate of the generator's conditional expected signature is computed from $N_{\mathrm{MC}}$ simulated future paths. Using the empirical loss function (14), one updates the generator parameters $\theta^{(g)}$ with a stochastic gradient descent algorithm until convergence or until the maximum number of epochs is reached. See Algorithm 1 for pseudocode.
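The following PyTorch sketch shows one way the empirical loss could be computed; it assumes the signatory package for a differentiable truncated signature and a hypothetical generator interface generator(x_past, q) that draws fresh noise internally and returns a batch of future paths.

```python
import torch
import signatory  # assumed dependency providing a differentiable signature

def sigcwgan_loss(generator, x_past, cond_exp_sig, q, depth, n_mc):
    """Empirical SigCWGAN loss: l2 distance between the regressed conditional
    expected signature under the data measure (cond_exp_sig, shape (B, sig_dim))
    and a Monte Carlo estimate of it under the generator."""
    mc = []
    for _ in range(n_mc):
        x_fake = generator(x_past, q)                  # (B, q, d), fresh noise inside
        mc.append(signatory.signature(x_fake, depth))  # (B, sig_dim)
    mc_sig = torch.stack(mc).mean(dim=0)
    return (cond_exp_sig - mc_sig).norm(p=2, dim=1).mean()
```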

The conditional AR-FNN generator
In this subsection, we further assume that the target time series $X$ is stationary and satisfies the following autoregressive structure:
$$X_{t+1} = f\big(X_{\text{past},t}, \varepsilon_{t+1}\big), \quad (15)$$
where $f: \mathcal{X} \times \mathcal{Z} \to \mathbb{R}^d$ is continuous, $(\varepsilon_t)_t$ are i.i.d. random variables, and $\varepsilon_{t+1}$ and $X_{\text{past},t}$ are independent. Time series of this kind include the autoregressive (AR) model and the autoregressive conditional heteroskedasticity (ARCH) model. The proposed conditional AR-FNN generator is designed to capture the autoregressive structure of the target time series by using the past path $X_{\text{past},t}$ as an additional input. The function $f$ in Equation (15) is represented by a feed-forward neural network with residual connections (He et al., 2016) and parametric ReLUs as activation functions (He et al., 2015) (see Section B.2 for a detailed description).
We first consider a step-1 conditional generator $G_1(\theta^{(g)}, \cdot): \mathbb{R}^{d \times p} \times \mathcal{Z} \to \mathbb{R}^d$, which takes the past path and a noise vector $Z_1$ to generate a random variable mimicking the conditional distribution of the step-1 forecast $\mathrm{Law}(X_{t+1} \mid X_{\text{past},t} = x)$. Here the noise vector $Z_1$ has the standard normal distribution on $\mathcal{Z} = \mathbb{R}^{d_Z}$.
One can generate future time series of arbitrary length $q \geq 1$ given $x_{\text{past}}$ by applying $G_1(\theta^{(g)}, \cdot)$ in a rolling window fashion with i.i.d. noise vectors $(Z_t)_t$, as follows. Given $x_{\text{past}} = (\mathbf{x}_1, \ldots, \mathbf{x}_p) \in \mathbb{R}^{d \times p}$, we define the time series $(\hat{\mathbf{x}}_t)_t$ inductively: we initialize the first $p$ terms $\hat{\mathbf{x}}_{1:p}$ as $x_{\text{past}}$, and then for $t \geq p$ we apply $G_1(\theta^{(g)}, \cdot)$ with the $p$-lagged values of $\hat{\mathbf{x}}$ as the conditioning variable and the noise $Z_{t+1}$ to generate $\hat{\mathbf{x}}_{t+1}$; in formula,
$$\hat{\mathbf{x}}_{t+1} = G_1\big(\theta^{(g)}, (\hat{\mathbf{x}}_{t-p+1}, \ldots, \hat{\mathbf{x}}_t), Z_{t+1}\big). \quad (16)$$
This yields the step-$q$ conditional generator $G_q(\theta^{(g)}, \cdot): \mathbb{R}^{d \times p} \to \mathbb{R}^{d \times q}$ defined by $x_{\text{past}} \mapsto (\hat{\mathbf{x}}_{p+1}, \ldots, \hat{\mathbf{x}}_{p+q})$, where $(\hat{\mathbf{x}}_{p+1}, \ldots, \hat{\mathbf{x}}_{p+q})$ is given by Equation (16). We omit $q$ in $G_q$ for simplicity. (See Algorithm 2 in the Supplementary Material.)
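A PyTorch sketch of this rolling-window generation (the one-step generator interface step1(window, z) is a hypothetical placeholder for $G_1(\theta^{(g)}, \cdot)$):

```python
import torch

def rolling_generate(step1, x_past, q, noise_dim):
    """Generate q future steps by applying a step-1 conditional generator in a
    rolling window with fresh i.i.d. Gaussian noise. x_past: (B, p, d);
    step1(window, z) -> (B, d) returns the next value given the last p lags."""
    window, out = x_past, []
    for _ in range(q):
        z = torch.randn(x_past.shape[0], noise_dim)
        x_next = step1(window, z)
        out.append(x_next)
        window = torch.cat([window[:, 1:], x_next.unsqueeze(1)], dim=1)  # slide
    return torch.stack(out, dim=1)  # (B, q, d)
```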

NUMERICAL EXPERIMENTS
To benchmark SigCWGAN, we consider the baseline conditional WGAN (CWGAN), comparing performance and training time. Besides, we benchmark SigCWGAN against three representative generative models for time-series generation: (1) TimeGAN (Yoon et al., 2019); (2) RCGAN (Hyland et al., 2018), a conditional GAN; and (3) GMMN (Li et al., 2015), an unconditional MMD with Gaussian kernel. For a fair comparison, we use the same neural network generator architecture, namely the three-layer AR-FNN described in Section B.2, for all the above generative models. Furthermore, we compare the proposed SigCWGAN with the generalized autoregressive conditional heteroskedasticity (GARCH) model, a popular econometric time series model.
To demonstrate the model's ability to generate realistic multidimensional time series in a controlled environment, we consider synthetic data generated by the vector autoregressive (VAR) model, which is a key illustrative example in TimeGAN (Yoon et al., 2019). We also use two financial datasets, the SPX/DJI index data and Bitcoin-USD data, to validate the efficacy of the proposed SigCWGAN model in empirical applications. An additional example of synthetic data generated by an ARCH model is provided in the appendix.
To assess the goodness of fit of a generative model, we consider three main criteria: (a) the marginal distribution of the time series; (b) the temporal and feature dependence; (c) the usefulness (Yoon et al., 2019), that is, synthetic data should be as useful as the real data when used for the same predictive purposes (train-on-synthetic, test-on-real). In the following, we give the precise definitions of the test metrics. More specifically, we use $X_{\text{real}} := (x^{(i)}_{\text{future}})_{i=1}^N$ and $X_{\text{fake}} := (\hat{x}^{(i)}_{\text{future}})_{i=1}^N$ to compute the test metrics, where $\hat{x}^{(i)}_{\text{future}}$ is a simulated future trajectory sampled from the conditional generator $\nu(\theta^{(g)}, x^{(i)}_{\text{past}})$. Here $X_{\text{real}}$ and $X_{\text{fake}}$ are samples of the $\mathbb{R}^{d \times q}$-valued random variable $X_{\text{future}}$ under the real and synthetic measures, respectively. The test metrics are defined below.
• Metric on marginal distribution: For each feature dimension $i \in \{1, \ldots, d\}$, we compute two empirical probability density functions (epdfs), based on histograms, of the real data and the synthetic data, denoted by $\hat{d}^{(i)}_{\text{real}}$ and $\hat{d}^{(i)}_{\text{fake}}$. We take the absolute difference of these two epdfs, averaged over feature dimensions, as the metric on the marginal distribution.
• Metric on dependency: (1) Temporal dependency: We use the absolute error of the autocorrelation estimators computed from real and synthetic data to assess the temporal dependency. For each feature dimension $i \in \{1, \ldots, d\}$, we compute the auto-covariance of the $i$th coordinate of the time series with lag value $\tau$ under the real and synthetic measures, denoted by $\gamma^{(i)}_{\text{real}}(\tau)$ and $\gamma^{(i)}_{\text{fake}}(\tau)$, respectively; the lag-1 autocorrelation is then $\gamma^{(i)}(1)/\gamma^{(i)}(0)$. The ACF score is defined as the absolute difference of the lag-1 autocorrelations of the real and synthetic data. Note that $\gamma^{(i)}_{\text{real}}(\tau)$ and $\gamma^{(i)}_{\text{fake}}(\tau)$ can be estimated empirically by Equations (C.1) and (C.2) in Appendix C, respectively, which allows us to compute the ACF score on the dataset; a simple estimator is also sketched after this list. In addition, we present the ACF plot, which illustrates the autocorrelation of each coordinate of the time series for different lag values. The quality of the synthetic data is evaluated by how closely its ACF plot resembles that of the real data, as this indicates the synthetic data's ability to capture long-term temporal dependencies.
(2) Feature dependency: For $d > 1$, we use the $l^1$ norm of the difference between cross-correlation matrices. Let $\rho_{\text{real}}$ and $\rho_{\text{fake}}$ denote the correlation matrices of the features of the time series under the real and synthetic measures, respectively. The metric on feature correlation between the real and synthetic data is given by the $l^1$ norm of the difference of the two correlation matrices, that is, $\|\rho_{\text{real}} - \rho_{\text{fake}}\|_1$. We defer the estimation of the correlation matrices of true and fake data to Appendix C.
• $R^2$ comparison: Following Esteban et al. (2017) and Yoon et al. (2019), we consider the problem of predicting next-step temporal vectors using the lagged values of the time series, on real and synthetic data. First, we train a supervised learning model on real data and evaluate it on real data in terms of $R^2$ (TRTR). Then we train the same supervised learning model on synthetic data and evaluate it on the real data in terms of $R^2$ (TSTR). The closer the two $R^2$ values, the better the generative model. To assess the ability of the proposed SigCWGAN to generate longer time series, we consider the $R^2$ score for the regression task of predicting the next $\tau$ steps, where $\tau$ can even be larger than $q$. The train/test split is 80%/20% in all numerical examples. We conduct hyper-parameter tuning for the signature truncation level. We set $p = 2$ in the $l^p$ norm used in the Sig-$W_1$ metric. Appendix B contains additional information on implementation details of SigCWGAN, including path transformations and the network architecture of the generator. We refer to Appendix C for more details on the evaluation metrics. We also provide extensive supplementary numerical results for the VAR(1) data, ARCH(1) data, and empirical data in Appendix D. An implementation of SigCWGAN can be found at https://github.com/SigCGANs/Conditional-Sig-Wasserstein-GANs.
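For illustration, simple estimators of two of these metrics are sketched below (assumptions: a basic autocorrelation estimator, not necessarily identical to Equations (C.1)-(C.2), and a linear next-step predictor for the TSTR/TRTR comparison):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def acf_score(real, fake, lag=1):
    """Absolute difference of lag-`lag` autocorrelations, averaged over feature
    dimensions. real, fake: arrays of shape (N, length, d)."""
    def lag_autocorr(x):
        x = x - x.mean(axis=(0, 1), keepdims=True)
        num = (x[:, :-lag] * x[:, lag:]).mean(axis=(0, 1))
        den = (x ** 2).mean(axis=(0, 1))
        return num / den
    return np.abs(lag_autocorr(real) - lag_autocorr(fake)).mean()

def tstr_r2(synth_x, synth_y, real_x, real_y):
    """Train-on-synthetic, test-on-real R^2: fit a next-step predictor on
    synthetic (lags, next value) pairs and score it on real data."""
    model = LinearRegression().fit(synth_x, synth_y)
    return r2_score(real_y, model.predict(real_x))
```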

Synthetic data generated by vector autoregressive model
In the $d$-dimensional VAR(1) model, the time series $(X_t)_{t=1}^T$ is defined recursively for $t \in \{1, \ldots, T-1\}$ through
$$X_{t+1} = \phi X_t + \epsilon_{t+1}, \quad (17)$$
where $(\epsilon_t)_{t=1}^T$ are i.i.d. Gaussian random variables with covariance matrix $\sigma \mathbf{1} + (1 - \sigma) I_d$, with $\mathbf{1}$ the all-ones matrix and $I_d$ the $d \times d$ identity matrix. Here, the coefficient $\phi \in [-1, 1]$ controls the autocorrelation of the time series and $\sigma \in [0, 1]$ the correlation of the $d$ features. In our benchmark, we investigate the dimensions $d = 1, 2, 3$ and various $(\phi, \sigma)$. We set $T = 40{,}000$ and $p = q = 3$. In this example, the optimal degree of the signature of both past and future paths is 2.
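For reproducibility of the setup, a NumPy sketch of this data-generating process follows (the function name is our own; the covariance structure follows our reading of the model description above):

```python
def sample_var1(T, d, phi, sigma, seed=0):
    """Simulate the VAR(1) benchmark X_{t+1} = phi * X_t + eps_{t+1},
    with eps ~ N(0, sigma * ones((d, d)) + (1 - sigma) * I_d)."""
    rng = np.random.default_rng(seed)
    cov = sigma * np.ones((d, d)) + (1 - sigma) * np.eye(d)
    eps = rng.multivariate_normal(np.zeros(d), cov, size=T)
    x = np.zeros((T, d))
    for t in range(T - 1):
        x[t + 1] = phi * x[t] + eps[t + 1]
    return x
```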
First, we empirically demonstrate that the proposed SigCWGAN can serve as an enhancement of the CWGAN model. One can see from Figure 4 that, when the CWGAN training is fed a more reliable estimator of the conditional mean under the real measure, $\mathbb{E}_{\mu(x_{\text{past}})}[f(X_{\text{future}})]$, the training tends to converge faster. However, the commonly used one-sample estimator $f(x_{\text{future}})$ in CWGAN training may suffer from large variance, leading to inefficient training. In contrast, SigCWGAN alleviates this problem through its supervised learning module. Additionally, the simplification of the min-max game to an optimization problem in SigCWGAN leads to further acceleration and stabilization of training, and hence brings a performance boost, as shown in Table 2. Figure 4 illustrates that SigCWGAN fits the conditional law better than CWGAN, as its estimated mean (and standard deviation) is closer to that of the true model. Moreover, Tables D.1-D.3 show that SigCWGAN consistently beats CWGAN for varying $d$, $\phi$, and $\sigma$.
We then proceed with the comparison of SigCWGAN with the other state-of-the-art baseline models. Across all dimensions, we observe that SigCWGAN has comparable performance to, or outperforms, the baseline models in terms of the metrics defined above. In particular, we find that as the dimension increases, the performance of SigCWGAN exceeds the baselines. We illustrate this finding in Figure 5, which shows the relative error of TSTR $R^2$ when varying the dimensionality of the VAR(1) model. Observe that SigCWGAN maintains a very low relative error, while the performance of the other models deteriorates significantly, especially GMMN. Moreover, we validate the training stability of the different methods. Figure 6 shows the development of the loss function and ACF scores over the course of training for the three-dimensional VAR(1) model. It indicates the stability of the SigCWGAN optimization across training iterations, in contrast to all the other algorithms, especially RCGAN and TimeGAN, which involve a min-max optimization, as identified in the first challenge in Section 2. While the ACF scores of the baseline models oscillate heavily, the SigCWGAN ACF score and Sig-$W_1$ distance converge nicely towards zero. Also, although the MMD loss converges nicely towards zero, the corresponding ACF scores do not converge. This highlights the stability and usefulness of the Sig-$W_1$ distance as a loss function.
To assess the efficiency of the different algorithms, we train all of them for the same amount of time (2 min) and compare the resulting test metrics. Table 2 shows the higher efficiency of SigCWGAN, which yields the best performance in terms of all metrics except the metric on the marginal distribution.
Furthermore, SigCWGAN has the advantage of generating realistic long time series: the marginal density function of a synthetic sampled path of 80,000 steps is much closer to that of the real data than for the baselines, see Figure 7.

SPX and DJI index dataset
The dataset of the S&P 500 index (SPX) and Dow Jones index (DJI) consists of time series of the indices and their realized volatility, retrieved from the Oxford-Man Institute's "realized library" (Heber et al., 2009). We aim to generate a time series of both the log return of the close prices and the log of median realized volatility of (a) the SPX only; (b) the SPX and DJI. Here we choose the length of the past and future paths to be 3. By cross-validation, the optimal degree of signature ($M_1 = M_2$) is 3 for the SPX dataset and 2 for the SPX/DJI dataset.

TABLE 3: Numerical results of the stock datasets. In each cell, the left/right numbers are the results for the SPX data and the SPX/DJI data, respectively. We use the relative error of TSTR $R^2$ against TRTR $R^2$ as the $R^2$ metric.

Table 3 shows that SigCWGAN achieves superior or comparable performance to the other baselines. SigCWGAN generates realistic synthetic data for the SPX and DJI, as shown by the comparison of marginal distributions with the real data in Figure 8. For the SPX-only data, GMMN performs slightly better than our model in terms of the fit of the lag-1 autocorrelation and the marginal distribution (≤0.0013), but it suffers from poor predictive performance and feature correlation, see Table 3 and Figure 9. When SigCWGAN is outperformed, the difference is negligible. Furthermore, the test metrics of our model, that is, the ACF loss and the density metric, evolve much more smoothly over training than those of the other baseline models, as shown in Figure D.7. Moreover, the ACF plot in Figure 10 shows that SigCWGAN fits the autocorrelation better across lag values, indicating superior performance in capturing long-range temporal dependency.

It is worth noting that our SigCWGAN model outperforms GARCH, the classical and widely used time series model in econometrics, on both the SPX and SPX/DJI data, as shown in Table 3. The poor performance of the GARCH model can be attributed to its parametric nature and potential model mis-specification when applied to empirical data.

Bitcoin-USD dataset
The Bitcoin-USD dataset contains hourly data of the Bitcoin price in USD from 2021 to 2022. We use the data in 2021 for training and the data in 2022 for testing, as illustrated in Figure 11. We apply our method to learn the distribution of the log returns over the next 6 hours given the past 24 hours. We encode the future and past paths with their signatures of depth 4. Our model achieves, in particular, a better fit of the marginal distribution (2.0532 vs. 2.803). The better ability of SigCWGAN to capture the temporal dependency is also verified by the additional results on the autocorrelation metric and the $R^2$ metric for different lag values, provided in Tables D.8 and D.9.

CONCLUSION
In this paper, we developed the conditional Sig-Wasserstein GAN for time series generation, based on an explicit approximation of the $W_1$ metric on the signature feature space. This eliminates the need to approximate a costly critic/discriminator and, as a consequence, dramatically simplifies training. Our method achieves state-of-the-art results on both synthetic and empirical datasets.
Our proposed conditional Sig-Wasserstein GAN proves effective for generating time series of moderate dimension. However, it may suffer from the curse of dimensionality caused by a high path dimension. It would be interesting to explore how to combine SigCWGAN with an implicit generative model to learn a low-dimensional latent embedding and hence cope with the high-dimensional path case. Moreover, on the theoretical level, it is worth investigating the conditions under which the $W_1$ metric on the signature space coincides with the Sig-$W_1$ metric.

ACKNOWLEDGMENTS
H.N. is supported by the EPSRC under the program Grant EP/S026347/1. H.N. and L.S. are supported by the Alan Turing Institute under the EPSRC Grant EP/N510129/1. All authors thank the anonymous referees for constructive feedback, which greatly improved the paper. Moreover, H.N. extends her gratitude to Siran Li, Terry Lyons, Chong Lou, Jiajie Tao, and Hang Lou for their helpful discussions.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the Conditional-Sig-Wasserstein-GANs repository at https://github.com/SigCGANs/Conditional-Sig-Wasserstein-GANs. The empirical data were derived from the following resources available in the public domain: (1) the Oxford-Man Institute's "realized library", https://realized.oxford-man.ox.ac.uk/data; (2) https://github.com/David-Woroniuk/Historic_Crypto.

ENDNOTES
1. For any distribution $\mu \in \mathcal{P}(\mathbb{R}^{d \times w})$, one can construct a stochastic process $X: \Omega \to \mathbb{R}^{d \times w}$ such that $\mathrm{Law}(X) = \mu$, see Dudley (1989, Proposition 9.1.2 and Theorem 13.1.1).
2. The tensor power $E^{\otimes n}$ is defined via the tensor product. Consider two vector spaces $V$ and $W$ over the same field with bases $B_V$ and $B_W$, respectively. The tensor product of $V$ and $W$, denoted by $V \otimes W$, is the vector space with basis $\{b \otimes b' : b \in B_V, b' \in B_W\}$, equipped with a bilinear map $\otimes$. Here $b \otimes b'$ can be regarded as a function $B_V \times B_W \to \mathbb{R}$ mapping $(c, c')$ to $\mathbb{1}_{c = b, \, c' = b'}$.
3. One may refer to Definition A.1 in Appendix A for the $p$-variation metric of a path.
4. The definition of the infinite radius of convergence of the expected signature can be found in Definition A.4 of Appendix A.
For each $p \geq 1$, the $p$-variation norm of a path $X \in C^p(J, E)$ is denoted by $\|X\|_{p\text{-var}}$ and defined as
$$\|X\|_{p\text{-var}} := \left( \sup_{D \subset J} \sum_{t_i \in D} \big| X_{t_{i+1}} - X_{t_i} \big|^p \right)^{1/p},$$
where the supremum is taken over all finite partitions $D$ of $J$. Recall that $\pi_M$ is the projection map from a tensor algebra element to its truncation up to level $M$. To differentiate from $\pi_M$, we also introduce another projection map $\Pi_n: T((E)) \to E^{\otimes n}$, which maps any $a = (a_0, a_1, \ldots, a_n, \ldots)$ to its $n$th term $a_n$.
For concreteness, we state the decay rate of the signature for paths of finite 1-variation; a similar statement of factorial decay holds for paths of finite $p$-variation (Lyons et al., 2007).

Expected signature of stochastic processes. Definition A.4. Let $X$ denote a stochastic process whose signature is well defined almost surely. Assume that $\mathbb{E}[S(X)]$ is well defined and finite. We say that $\mathbb{E}[S(X)]$ has infinite radius of convergence if and only if, for every $\lambda \geq 0$,
$$\sum_{n \geq 0} \lambda^n \big| \Pi_n\big(\mathbb{E}[S(X)]\big) \big| < \infty.$$

The signature Wasserstein-1 metric (Sig-$W_1$)

In the following, we provide the proof of Lemma 4.3.
The proof of Equation (12) is similar to the above: we only need to show that the supremum taken over $\|a\|_p = 1$ coincides with that over $\|a\|_p \leq 1$. Again we only treat the case $L \neq 0$, as the case $L = 0$ is trivial, and the argument proceeds as above. □

APPENDIX B: CONDITIONAL SIGNATURE WASSERSTEIN GANS
In this section, we provide the algorithmic details of the conditional signature Wasserstein GANs for practical applications.

B.1 Path transformations
The core idea of SigCWGAN is to lift the time series to the signature feature space as a principled and more effective feature extraction. In practice, the signature feature is often combined with several of the following path transformations:
• Time-joined transformation (Definition 4.3, Levin et al., 2013);
• Cumulative sum transformation: it maps every $(x_i)_{i=1}^n$ to $\hat{x}_j := \sum_{i=1}^j x_i$, for all $j \in \{1, \ldots, n\}$, with $\hat{x}_0 = 0$ (eq. (2.20) in Chevyrev & Kormilitzin, 2016).
In our analysis of the Sig-$W_1$ metric, we used the time-augmented path to embed the discrete time series into a continuous path for ease of discussion. However, to use the Sig-$W_1$ metric to differentiate two measures on the path space, the only requirement on the embedding of a discrete time series into a continuous path is that it ensures a bijection between the time series and its signature. Therefore, in practice, other embeddings can be chosen to achieve this; for example, applying the lead-lag transformation to the time series ensures a one-to-one correspondence between the time series and its signature.

B.2 AR-FNN architecture
We give a detailed description of the AR-FNN architecture below. For this purpose, let us begin by defining the employed transformations, namely the parametric rectified linear unit (PReLU), $\phi_\alpha(x) = \max(0, x) + \alpha \min(0, x)$, and the residual layer $x \mapsto x + \phi_\alpha(Ax + b)$, where $\phi_\alpha$ is applied component-wise.
The AR-FNN is defined as a composition of PReLUs, residual layers, and affine transformations. Its inputs are the past $p$ lags of the $d$-dimensional process we want to generate as well as the $d_Z$-dimensional noise vector. A formal definition is given below.
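A PyTorch sketch of such an architecture (the hidden width, depth, and module names are our own choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """x -> x + PReLU(Ax + b), the residual layer described above."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.lin(x))

class ARFNN(nn.Module):
    """Step-1 conditional generator: (p past lags, noise) -> next value."""
    def __init__(self, d, p, noise_dim, hidden=64, n_blocks=2):
        super().__init__()
        blocks = [nn.Linear(d * p + noise_dim, hidden), nn.PReLU()]
        blocks += [ResidualBlock(hidden) for _ in range(n_blocks)]
        blocks += [nn.Linear(hidden, d)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x_past, z):
        # x_past: (B, p, d), flattened to (B, p*d); z: (B, noise_dim)
        return self.net(torch.cat([x_past.flatten(1), z], dim=1))
```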
The pseudocode for generating the next $q$-step forecast using $G(\theta^{(g)}, \cdot)$ is given in Algorithm 2.

ALGORITHM 2: Pseudocode for generating the next $q$-step forecast using $G(\theta^{(g)}, \cdot)$.
Input: $x_{t-p+1:t}$, $\theta^{(g)}$. Output: $\hat{x}_{t+1:t+q}$.
1: $\hat{x}_{\text{future}} \leftarrow$ a matrix of zeros of dimension $d \times q$
2: for $j = 1 : q$ do
3:  sample $Z_j$ from the i.i.d. standard normal distribution
4:  apply $G_1(\theta^{(g)}, \cdot)$ to the latest $p$ lags and $Z_j$ to generate $\hat{x}_{t+j}$, and store it in $\hat{x}_{\text{future}}$
5: end for

APPENDIX C: NUMERICAL IMPLEMENTATIONS
We use the following public code bases for implementing the three baselines below:

Test metrics of different models
We apply SigCWGAN, CWGAN, and the other above-mentioned methods to different VAR(1) datasets with various hyper-parameter settings. The summary of the test metrics of all models on $d$-dimensional VAR(1) data for $d = 1, 2, 3$ can be found in Tables D.1-D.3, respectively.
Additionally, apart from the $R^2$ comparison in Figure 5, we provide bar charts comparing the performance of the different methods on the VAR data in terms of the other test metrics in Figures D.1-D.3.

D.3 SPX and DJI dataset
We provide supplementary results on the SPX and DJI dataset. The summary of the test metrics of the different models is given in Table D.5. The test metrics over the training process of each method on (1) the SPX dataset and (2) the SPX and DJI dataset can be found in Figures D.7 and D.8, respectively.

Bitcoin dataset

We provide additional numerical results on the Bitcoin dataset as follows.

FIGURE 1: Comparison across three performance metrics (see Section 6) of training SigCWGAN with loss function (5) and CWGAN with loss function (2), on VAR(1) data with i.i.d. Gaussian innovations with covariance matrix $\sigma \mathbf{1} + (1 - \sigma) I_d$, autocorrelation coefficient $\phi = 0.8$, and covariance parameter $\sigma = 0.8$. The explicit form of the model allows for an unbiased approximation of the conditional expectation in (2) using Monte Carlo samples. The colors blue and orange indicate the relevant distance/score for each dimension.
FIGURE 2: The flowchart of SigCWGAN. $\hat{L}$ denotes the linear regression estimator of the conditional expectation $x \mapsto \mathbb{E}_{\mu}[S(X_{\text{future}}) \mid X_{\text{past}} = x]$.

FIGURE 3: (Left) Embedding of the one-dimensional time series $\mathbf{x} = (x_1, x_2, x_3)$ into a path $X$ (in blue) in the path space. First we compute $(\hat{x}_i)_{i=0}^3$, the cumulative sum of $(x_i)_{i=1}^3$, that is, $\hat{x}_0 = 0$ and $\hat{x}_i = \sum_{j \leq i} x_j$ for $i = 1, 2, 3$; then we linearly interpolate to a continuous path in $\Omega_0([0, 3], \mathbb{R}^2)$. (Right) Embedding of the time series into the path space and visualization of the low order signature.
FIGURE 4: (Left) Performance of CWGAN in terms of the metric on the marginal distribution for varying $N_{\mathrm{MC}}$. This experiment is conducted on a three-dimensional VAR(1) dataset generated by Equation (17) with $\phi = \sigma = 0.8$. We use the Monte Carlo estimator of the conditional mean, generated by the ground-truth model, for the CWGAN training; the larger the number $N_{\mathrm{MC}}$ of Monte Carlo samples, the better the estimate of the conditional mean under the true measure. (Right) Comparison of the performance of SigCWGAN and CWGAN in terms of fitting the conditional distribution of future time series given one past path sample on a one-dimensional VAR(1) dataset.
FIGURE 5: Comparison of the predictive score across the VAR(1) datasets. The three numbers in the bracket indicate the hyperparameters $(d, \phi, \sigma)$ used to generate the corresponding VAR dataset. The predictive score is computed by taking the absolute difference of the $R^2$ obtained from TSTR and TRTR.

FIGURE 6: (Upper panel) Evolution of the training loss functions. (Lower panel) Evolution of the ACF scores; each color represents the ACF score of one feature dimension. Results are for the three-dimensional VAR(1) model based on Equation (17) with $\phi = 0.8$ and $\sigma = 0.8$.
FIGURE 7: Comparison of the marginal distributions of one long sampled path (80,000 steps) with the real distribution.

FIGURE 8: Comparison of the marginal distributions of the generated SigCWGAN paths and the SPX and DJI data.
FIGURE 9: Comparison of real and synthetic cross-correlation matrices for SPX and DJI log-return and log-volatility data. On the far left, the real cross-correlation matrix from SPX and DJI data is shown. The x/y-axes represent the feature dimension, while the color of the $(i, j)$th block represents the corresponding correlation; the colorbar on the far right indicates the range of values taken.
FIGURE 10: ACF plot for each channel on the SPX/DJI dataset. The x-axis represents the lag value (with maximum lag equal to 100) and the y-axis the corresponding autocorrelation. The length of the real/generated time series used to compute the ACF is 1000. The number in the bracket under each model is the sum of the absolute differences between the correlation coefficients computed from real (dashed line) and generated (solid line) samples.
FIGURE 11: The evolution of the close value (left) and log return (right) of BTC-USD from January 2021 to January 2023.

Lemma A.2 (Factorial decay of the signature). Let $X \in C^1(J, E)$. Then there exists a constant $C > 0$ such that for all $n \geq 0$,
$$\big| \Pi_n\big(S(X)\big) \big| \leq \frac{C^n \, \|X\|_{1\text{-var}}^n}{n!}.$$

Lemma A.3. Let $K$ denote a compact set of $\Omega_0([0, T], E)$. Then the range $S(K)$ is a compact set of $T^p(E)$ endowed with the $l^p$ topology.

Proof. The proof boils down to showing the continuity of the signature map $S$ from $\Omega_0([0, T], E)$ with the 1-variation norm to $T^p(E)$ with the $l^p$ topology. Let $X, Y \in \Omega_0([0, T], E)$ be controlled by the control function $\omega$, for example, $\omega(s, t) := \max(|X|_{1\text{-var};[s,t]}, |Y|_{1\text{-var};[s,t]})$ for all $0 < s < t < T$, and let $|X_{[s,t]} - Y_{[s,t]}|_{1\text{-var}} \leq \epsilon\, \omega(s, t)$ for some $\epsilon \in \mathbb{R}^+$. Then, by the continuity of the signature map in Theorem 3.10 of Lyons et al. (2007) and the admissibility of the norm, for every integer $n \geq 1$ the term $|\Pi_n(S(X)) - \Pi_n(S(Y))|$ is bounded by a constant multiple of $\epsilon\, \omega(0, T)^n / n!$; summing over $n$ yields continuity, and the compactness of $S(K)$ follows as the continuous image of a compact set.

FIGURE D.4: Comparison of the performance on the cross-correlation metric across all algorithms and benchmarks.
TABLE D.4: Numerical results of the ARCH(p) datasets.

FIGURE D.5: Exemplary development of the considered distances and score functions during training for the two-dimensional VAR(1) model with autocorrelation coefficient $\phi = 0.8$ and covariance parameter $\sigma = 0.8$. The colors blue and orange indicate the relevant distance/score for each dimension.
FIGURE D.6: Exemplary development of the considered distances and score functions during training for the three-dimensional VAR(1) model with autocorrelation coefficient $\phi = 0.8$ and covariance parameter $\sigma = 0.8$. The colors blue, orange, and green indicate the relevant distance/score for each dimension.
FIGURE D.7: Exemplary development of the considered distances and score functions during training for the SPX data.
FIGURE D.8: Exemplary development of the considered distances and score functions during training for the SPX and DJI data.
FIGURE D.9: Comparison of real and synthetic cross-correlation matrices for SPX data. On the far left, the real cross-correlation matrix from SPX log-return and log-volatility data is shown. The x/y-axes represent the feature dimension, while the color of the $(i, j)$th block represents the correlation of $X^{(i)}$ and $X^{(j)}$; the colorbar on the far right indicates the range of values taken. Observe that the historical correlation between log-returns and log-volatility is negative, indicating the presence of leverage effects, that is, when log-returns are negative, log-volatility is high.
TABLE D.7: Autocorrelation metric for the stock datasets for different lag values. In each cell, the left/right numbers are the results for the SPX data and the SPX/DJI data, respectively.
TABLE 2: Numerical results of VAR(1) for $d = 3$ with fixed training time of 2 min.
TABLE 4: Numerical results of the BTC-USD data experiment. We use the relative error of TSTR $R^2$ against TRTR $R^2$ as the $R^2$ metric.

FIGURE D.1: Comparison of the performance on the density metric across all algorithms and benchmarks.
FIGURE D.2: Comparison of the performance on the absolute difference of lag-1 autocorrelation across all algorithms and benchmarks.
FIGURE D.3: Comparison of the performance on the Sig-$W_1$ metric across all algorithms and benchmarks.
TABLE D.2: Numerical results of VAR(1) for $d = 2$.
TABLE D.3: Numerical results of VAR(1) for $d = 3$.

We implement extensive experiments on ARCH(p) with different $p$-lag values, that is, $p \in \{2, 3, 4\}$, and choose the optimal degree of signature 3. The numerical results are summarized in Table D.4; the best results among all models are highlighted in bold.

TABLE D.5: Numerical results of the stock datasets.

TABLE D.6: $R^2$ metric (%) of the stock datasets for different lag values. In each cell, the left/right numbers are the results for the SPX data and the SPX/DJI data, respectively.
TABLE D.8: Autocorrelation metric for the BTC dataset for different lag values.
TABLE D.9: $R^2$ metric (%) of the BTC dataset for different lag values.