Semi-supervised Surface Wave Tomography with Wasserstein Cycle-consistent GAN: Method and Application on Southern California Plate Boundary Region


The shear wave velocity (Vs) inversion problem in surface wave tomography, i.e., mapping from surface wave velocity dispersion curves to 1-D Vs depth profiles, is highly nonlinear and underdetermined (e.g., Qiu et al., 2019). Conventional methods, such as linearized inversion (e.g., Herrmann et al., 2013), the near-neighbor algorithm (e.g., Wathelet, 2008), and the nonlinear Bayesian Markov Chain Monte Carlo method (MCMC; e.g., Roy & Romanowicz, 2017; Shen et al., 2013), provide reliable results when an initial model with sufficient accuracy is available. Hu et al. (2020) proposed a convolutional neural network (CNN) based method that has the potential to alleviate the dependency on the accuracy of the initial Vs model while preserving the speed of the inversion.
The workflow of CNN based Vs inversion is shown in Figure 1a. A labeled dataset is split into a training set and a validation set. The "labeled data" usually consist of a known Vs model and its corresponding theoretical dispersion curves (e.g., Hu et al., 2020), and provide learnable examples to supervise the training of networks. The neural network stops updating when the prediction accuracy on the validation set reaches its optimum. The trained network is then applied to the observed dispersion data, later referred to as "unlabeled data", to output the best fitting Vs model. Since only the labeled dataset is used in the training process, the quality of the Vs model generated by the CNN depends on the similarity between the initial model and the true structures (Hu et al., 2020).
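The supervised workflow above can be sketched with a toy stand-in, assuming a hypothetical linear "network" in place of the CNN: the labeled set is split into training and validation parts, the model is updated iteratively, and training stops when validation misfit stops improving.

```python
import numpy as np

# Toy stand-in for the supervised workflow of Figure 1a.  The linear model,
# sizes, learning rate, and patience are all hypothetical; the point is the
# train/validation split and validation-based early stopping.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # "dispersion curves" (features)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.05 * rng.normal(size=200)    # "Vs labels" with noise

X_tr, y_tr = X[:150], y[:150]                   # training split
X_va, y_va = X[150:], y[150:]                   # validation split

w = np.zeros(3)
lr = 0.01
best_va, best_w, patience = np.inf, w.copy(), 0
for epoch in range(500):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # MSE gradient
    w -= lr * grad
    va = float(np.mean((X_va @ w - y_va) ** 2))         # validation misfit
    if va < best_va:
        best_va, best_w, patience = va, w.copy(), 0     # new optimum
    else:
        patience += 1
        if patience >= 20:                              # stop at the optimum
            break
```

The model kept at the end (`best_w`) is the one that minimized the validation misfit, mirroring the "stops updating when the prediction accuracy of the validation set reaches its optimum" criterion.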
In comparison, generative adversarial networks (GAN; Goodfellow et al., 2014) introduce an adversarial network (discriminator) that incorporates both the labeled and unlabeled datasets into the training process (i.e., semi-supervised; Figure 1b), in an effort to alleviate the strong labeled dataset dependency of the CNN. In addition, we introduce the Cycle-consistent GAN (Cycle-GAN; Zhu et al., 2017; Yi et al., 2017), in which a data generative network that learns to reconstruct the input data from its predicted model is added. It enforces the model and data generative subnets to be self-consistent and penalizes the reconstruction misfit, consequently reducing the variance of both the forward and backward generative networks. Compared to CNN or GAN, Cycle-GAN has been proven to generate predictions for seismic trace interpolation (e.g., Kaur and Fomel, 2019) and impedance inversion (e.g., Wang et al., 2019) with better accuracy under the same setup. To further improve the training stability (Arjovsky and Bottou, 2017) of the GAN algorithm, we adopt the structure of WGAN-GP, i.e., using the Wasserstein distance and adding a gradient penalty (GP; Gulrajani et al., 2017) in the adversarial loss function (Arjovsky et al., 2017). The resulting hybrid method (hereinafter, Wcycle-GAN) combines the structures of Cycle-GAN and WGAN-GP, and outperforms conventional machine learning algorithms in biomedical translation (McDermott et al., 2018) and seismic impedance inversion (Cai et al., 2020).
In this paper, we demonstrate the application of the Wcycle-GAN method to Vs inversion using dispersion data derived for the SC plate boundary region, one of the most well-studied areas in the world. To better evaluate the seismic hazard in SC, several tomographic velocity models have been developed using different datasets with various resolutions. The two Community Velocity Models (CVM), CVM-H15.1 (Shaw et al., 2015) and CVM-S4.26 (Lee et al., 2014), derived via full waveform tomography, were widely used as initial models in previous surface wave tomography studies of this area (e.g., Barak et al., 2015; Berg et al., 2018; Qiu et al., 2019).
We first describe the preparation of the training dataset, combining a labeled dataset generated using the CVM-H15.1 with unlabeled data, the Rayleigh wave velocity dispersion maps from Qiu et al. (2019), in section 2. The network architecture of the Wcycle-GAN designed for this specific dataset and the training process are presented in section 3. We then input the unlabeled data to the best trained Wcycle-GAN and obtain the final 3-D Vs model as the output.
The final 3-D Vs model and the corresponding data misfits are presented in section 4. Compared to the models generated from the conventional CNN algorithm and linearized inversion (Qiu et al., 2019), our Vs model yields smaller data misfits and improved images of structures near major faults. It is important to note that this is the first machine learning based Vs inversion study that incorporates unlabeled data in the training process, and the method has the potential to be applied to surface wave dispersion datasets collected at various scales and from regions where subsurface structures are poorly constrained by previous studies.

Rayleigh Wave Phase and Group Velocities - Unlabeled Data
We use the isotropic phase and group velocity maps of fundamental mode Rayleigh waves from Qiu et al. (2019) as the unlabeled dataset, which is used both in the training process and in the generation of the final 3-D Vs model. Travel times of surface waves reconstructed from ambient noise cross correlations for a seismic network with 346 stations in SC (triangles in Figure 2) are first measured at each station pair over a period range of 2 to 20 s. Eikonal tomography is then applied to resolve isotropic phase and group velocity maps and corresponding uncertainties with a grid size of 0.05°×0.05° (grid lines in Figure 2) for periods between 2.5 s and 16 s. Details of the Rayleigh wave velocity dispersion maps can be found in Qiu et al. (2019).
In this study, we use velocity dispersions in the period range between 3 s and 16 s to construct the unlabeled data, as the velocity maps at 2.5 s are less robust (i.e., large uncertainties) and only cover a small part of the SC plate boundary region. The dispersion curve and its uncertainty at each grid cell are interpolated and discretized into 17 samples, at an interval of 0.5 s from 3 to 6 s and 1 s from 6 to 16 s. Since the uncertainties are estimated from Eikonal tomography by analyzing velocity maps derived for different virtual sources (Section 4 of Qiu et al., 2019), uncertainty values less than 0.05 km/s are set to 0.05 km/s to account for errors from other sources (e.g., dispersion picking, accuracy of the trained network, etc.). Grid cells with a phase or group velocity dispersion curve that has fewer than 8 sample points are excluded. In total, the unlabeled data consist of 4076 pairs of Rayleigh wave phase (Figure S1) and group velocity (Figure S2) dispersion curves, and the corresponding uncertainties are utilized to calculate the data misfit distribution in section 4.
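The discretization above can be sketched as follows; the raw dispersion curve and its uncertainties are synthetic placeholders, but the 17-sample period grid and the 0.05 km/s uncertainty floor follow the description in the text.

```python
import numpy as np

# 17 period samples: 0.5 s spacing from 3-6 s, 1 s spacing from 6-16 s.
periods = np.concatenate([np.arange(3.0, 6.0, 0.5),    # 3, 3.5, ..., 5.5
                          np.arange(6.0, 17.0, 1.0)])  # 6, 7, ..., 16

# Hypothetical raw picks (the real ones come from Eikonal tomography).
raw_T = np.linspace(2.5, 18.0, 40)
raw_c = 2.0 + 0.08 * raw_T                  # phase velocity (km/s), illustrative
raw_sig = np.full_like(raw_T, 0.02)         # raw uncertainties (km/s)

c = np.interp(periods, raw_T, raw_c)        # resample onto the 17-point grid
sigma = np.maximum(np.interp(periods, raw_T, raw_sig), 0.05)  # 0.05 km/s floor
```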

Community Velocity Model and Synthetic Dispersion Curves - Labeled Data
We take advantage of the CVM resolved from full waveform tomography in constructing the labeled dataset for training the network. The CVM-H15.1 (later referred to as "CVM-H") is preferred to CVM-S4.26 in the network training because of its inclusion of topography, smaller misfit to the observed dispersion data, and model simplicity, as discussed in Qiu et al. (2019). 16480 1-D profiles of Vs, Vp, and density are extracted from the CVM-H with a grid spacing of 0.03º×0.03º for the region covered by the unlabeled data (Figure 2). These 1-D profiles are discretized into 98 layers with a thickness of 0.5 km from 0 km to 49 km (relative to the earth surface) and a half space below 49 km. The study area is confined to a longitude range from 120.2⁰W to 114.9⁰W and a latitude range from 32.6⁰N to 36.0⁰N. Each 1-D Vs profile (Figure S3a) is labeled by its synthetic Rayleigh wave phase and group velocity dispersion curves (Figure S3b). The synthetic velocity dispersion curves are calculated using the Computer Programs in Seismology (CPS) software package (Herrmann, 2013), with the 1-D profiles of Vs, Vp, and density at the target location as input.
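The layer discretization can be sketched as below; the Vs profile itself is a hypothetical smooth gradient, not an actual CVM-H extraction, and in the study the dispersion labels for each profile are computed with the CPS package rather than the toy line here.

```python
import numpy as np

# 98 layers of 0.5 km thickness from 0 to 49 km, plus a half space below
# 49 km: 99 Vs values per 1-D profile (matching the 99x1 network input).
layer_tops = np.arange(0.0, 49.0, 0.5)    # tops of the 98 layers (km)
depths = np.append(layer_tops, 49.0)      # half-space node -> 99 entries

vs = 3.0 + 1.5 * depths / 49.0            # hypothetical Vs(z), km/s
```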

Methodology
In contrast to the conventional CNN, GAN incorporates a discriminative network that enables the use of unlabeled data. In CNN applications (Figure 3a), we train a model generative network ($G_m$) using labeled data only, by iteratively minimizing the point-wise misfit between the translated model (i.e., the $G_m$ predictions) and the real Vs model. The misfit is also known as the estimation loss ($\mathcal{L}_{est}$) and can be measured in cross-entropy or least-square format. GAN runs updates of the generative and discriminative networks separately within a single iteration (Figure 3b left column). In the first step, the trainable parameters are fixed in $G_m$ and the model discriminator ($D_m$) is updated. The discriminator is renewed to separate the real Vs models from the outputs of the model generator. Numerically, this is implemented by forcing $D_m$ to output a binary discrimination, where "1" stands for real model samples in the labeled dataset and "0" represents the outputs from $G_m$. Next, the model discriminator is fixed, and the generator is updated to "fake" the discriminator and score "1" with the translated model. A similar process is conducted for the unlabeled data (Figure 3b right column), except that we do not have real Vs models to feed into the model generator in the first step. In this way, the discriminative network searches for a transformation to maximize the difference between real and translated models while the generator seeks to minimize it. The corresponding loss function is named the adversarial loss ($\mathcal{L}_{adv}$), which can be calculated in cross-entropy (Goodfellow et al., 2014), least-square (Mao et al., 2017), or Wasserstein distance (Arjovsky et al., 2017) form.
Cycle-GAN (Figure 3c) further extends the GAN algorithm with the concept of "cycle-consistency", by introducing an extra data generative ($G_d$) and discriminative ($D_d$) network. For simplicity, we separate the algorithm into a data cycle (green arrows in Figure 3c) and a model cycle (purple arrows in Figure 3c). In the data cycle for the labeled data, besides the computation of the adversarial loss, the translated model (i.e., the output from $G_m$) is fed into $G_d$ to reconstruct the original dispersion data (i.e., the input to $G_m$). The point-wise reconstruction misfit (cycle-consistent loss, $\mathcal{L}_{cyc}$) is minimized during the iterations. Similar to the linearized Vs inversion, where we compute the predicted data from the current best model using known physical relations, in the Cycle-GAN we compute the reconstructed data but replace the physical modeling with a data generative network. In the model cycle (bottom left of Figure 3c), we generate the translated data from the real Vs model and estimate the adversarial loss using the data discriminator $D_d$. The translated data are then fed into $G_m$ to generate the reconstructed model, and the cycle-consistent loss of the model reconstruction is penalized (Figure 3c left column).
The unlabeled data go through a similar process in the data cycle (Figure 3c right column). However, since their corresponding Vs models are unknown, there is no model cycle for the unlabeled data.
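A toy sketch of the data cycle for unlabeled samples: a fixed linear "data generator" stands in for the forward physics (model to data), and a linear "model generator" is trained purely from the cycle-consistent reconstruction misfit, with no true models ever shown to the training. Everything here (linear maps, sizes, learning rate) is a hypothetical stand-in for the actual subnets.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(17, 99)) / np.sqrt(99)   # fixed forward map: model -> data
m_true = rng.normal(size=(99, 500))           # unknown "Vs models" (never used below)
d = A @ m_true                                # unlabeled "dispersion data"

B = np.zeros((99, 17))                        # trainable model generator: data -> model
lr = 0.05
loss_hist = []
for _ in range(1200):
    r = d - A @ (B @ d)                       # reconstruction residual d - G_d(G_m(d))
    loss_hist.append(float(np.mean(r ** 2)))  # cycle-consistent MSE
    grad = -2.0 * A.T @ (r @ d.T) / d.shape[1]
    B -= lr * grad                            # gradient step on B only
```

With the data generator fixed, the cycle loss is convex in the model generator, so the reconstruction misfit decreases steadily; in the actual Wcycle-GAN both generators are neural networks trained jointly.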
Our approach to resolving Vs structures from Rayleigh wave velocity dispersion curves is based on a Cycle-GAN algorithm that utilizes the Wasserstein adversarial loss. We present the details of the Wcycle-GAN based surface wave tomography as follows.

Sub Neural Network Structures
The architecture of the proposed Wcycle-GAN consists of four sub neural networks (Figures S3c-S3f): two generative subnets ($G_m$ and $G_d$) and two discriminative subnets ($D_m$ and $D_d$).
Different from Hu et al. (2020), a 1-D rather than 2-D neural network is implemented for all the subnets for network simplicity. Unlike image translation (Isola et al., 2017) or seismic impedance inversion (Cai et al., 2020) problems, surface wave dispersion curves and Vs models have different ranges of values and dimensions. In this study, the input dimension of the dispersion data to the neural networks is 17x2, with the phase and group velocities as two separate channels, while the Vs model is 99x1 (Section 2). Considering the difference in Vs model and dispersion data dimensions, we design specific architectures for the model and data generative subnets (Figures S3d and S3e). In the model generator, we double the number of filters at each convolutional layer, similar to the VGG16 network (Simonyan and Zisserman, 2014); the number of filters at each convolutional layer from shallow to deep is 32, 64, 128, and 256, respectively. Accordingly, in the data generative subnet ($G_d$), we first upsample the Vs model to a dense feature map with a dimension of 17x256, and sequentially halve the number of filters in the following convolutional layers. For both the model and data discriminative subnets (Figures S3c and S3f), we double the number of filters in the convolutional layers and apply a sigmoid activation function in the fully connected layer to output probability values between 0 and 1.
In all the subnets, the convolutional layers use 1-D convolution with a kernel size of 3x1 and zero padding on the boundary. The stride equals 1, except where a stride of 2 is used to reduce the number of trainable parameters. To accelerate the training process, at each convolutional layer we apply batch normalization (Ioffe and Szegedy, 2015) after the ReLU (Nair and Hinton, 2010) activation, and we initialize the weight parameters in the convolutional layers using the He initialization (He et al., 2015). In addition, as suggested by Gulrajani et al. (2017), we replace the batch normalization in the adversarial subnets with layer normalization (Ba et al., 2016).
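A back-of-envelope check of the filter progression described above, assuming the standard Conv1D parameterization (kernel × input channels × output channels, plus one bias per output channel); the 2-channel input holds the phase and group velocity curves.

```python
def conv1d_params(kernel, c_in, c_out, bias=True):
    """Trainable parameters in a standard 1-D convolution layer."""
    return kernel * c_in * c_out + (c_out if bias else 0)

# Channel progression of the model generator: 2-channel input (phase + group
# velocities), then filters doubling 32 -> 64 -> 128 -> 256, kernel size 3.
channels = [2, 32, 64, 128, 256]
kernel = 3
per_layer = [conv1d_params(kernel, i, o)
             for i, o in zip(channels[:-1], channels[1:])]
total = sum(per_layer)   # convolutional parameters only, biases included
```

Such a count covers only the convolutional stacks; the fully connected layers in the discriminators add further parameters.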

Loss Function
To optimize both the generative and adversarial subnets, the loss function in the Wcycle-GAN combines the adversarial loss, the cycle-consistent loss, and the estimation loss, and can be written as

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{cyc} + \lambda_2 \mathcal{L}_{est}, \tag{1}$$

where $\mathcal{L}_{adv}$, $\mathcal{L}_{cyc}$, and $\mathcal{L}_{est}$ stand for the Wasserstein adversarial loss, the cycle-consistent loss, and the estimation loss, respectively. The hyperparameters $\lambda_1$ and $\lambda_2$ are the weighting factors.
We introduce the notations used in the following discussions: $d$ denotes a dispersion data sample, $m$ a Vs model sample, $G_m$ and $G_d$ the model and data generative subnets, and $D_m$ and $D_d$ the corresponding discriminative subnets. The gradient penalty loss $\mathcal{L}_{gp}$ enforces the discriminator to be 1-Lipschitz continuous, which is assumed in optimizing the Wasserstein GAN (Arjovsky et al., 2017):

$$\mathcal{L}_{gp} = \mathbb{E}_{\hat{m}}\left[\left(\left\|\nabla_{\hat{m}} D_m(\hat{m})\right\|_2 - 1\right)^2\right], \tag{2}$$

where $\hat{m}$ is sampled along straight lines between real and generated models; the detailed implementation of $\mathcal{L}_{gp}$ can be found in Gulrajani et al. (2017). In practice, the weighting factor $\gamma$ on $\mathcal{L}_{gp}$ should be large enough to avoid exploding gradients (Gulrajani et al., 2017); in this study, we set $\gamma = 100$ to ensure good numerical stability (Cai et al., 2020). The Wasserstein adversarial loss for the model cycle then reads

$$\mathcal{L}_{adv} = \mathbb{E}_{d}\left[D_m\left(G_m(d)\right)\right] - \mathbb{E}_{m}\left[D_m(m)\right] + \gamma\,\mathcal{L}_{gp}. \tag{3}$$

Note that the computation of the Wasserstein adversarial loss is slightly different from that of the conventional adversarial loss; the corresponding mathematical derivations can be found in Arjovsky et al. (2017).
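A minimal numeric illustration of the gradient penalty idea: it drives the norm of the critic's input gradient toward 1 at samples interpolated between real and generated data. For a linear critic $D(x) = w \cdot x$ the input gradient is $w$ everywhere, so the penalty reduces exactly to $(\|w\|_2 - 1)^2$; real critics require automatic differentiation. All values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
real = rng.normal(size=(8, 2))
fake = rng.normal(size=(8, 2))
eps = rng.uniform(size=(8, 1))
x_hat = eps * real + (1 - eps) * fake           # random interpolates

w = np.array([0.6, 0.8])                        # critic weights, ||w||_2 = 1
grads = np.tile(w, (len(x_hat), 1))             # grad_x D(x_hat) = w for linear D
gp = float(np.mean((np.linalg.norm(grads, axis=1) - 1.0) ** 2))

w_bad = np.array([3.0, 4.0])                    # ||w_bad||_2 = 5: violates 1-Lipschitz
gp_bad = float((np.linalg.norm(w_bad) - 1.0) ** 2)   # -> 16.0
```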
The cycle-consistency loss (Zhu et al., 2017) couples the two generators. $S(\cdot,\cdot)$ stands for a measurement of the difference between two samples, computed in this method as the mean-square error (MSE). Using the labeled data as an example, the data-cycle term of $\mathcal{L}_{cyc}$ is computed as the difference between the input data $d$ and the reconstructed data $G_d(G_m(d))$, i.e., the output after the original data has passed consecutively through the model ($G_m$) and data ($G_d$) generative subnets:

$$\mathcal{L}_{cyc} = S\left(d,\, G_d(G_m(d))\right) + S\left(m,\, G_m(G_d(m))\right). \tag{4}$$

We also penalize the estimation loss in the Wcycle-GAN algorithm to constrain the fitting of the labeled dataset, by computing the MSE between the translated samples and the ground truth in both the model and data domains:

$$\mathcal{L}_{est} = S\left(m,\, G_m(d)\right) + S\left(d,\, G_d(m)\right). \tag{5}$$

The complete loss functions can be found in the supplementary materials.
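A numeric sketch of the composite objective with the MSE measure; the linear stand-ins for $G_m$ and $G_d$ and the toy samples are hypothetical, and the critic term is omitted, but the cycle and estimation terms follow equations (1), (4), and (5) with the weighting factors used later in training.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

d = np.array([1.0, 2.0, 3.0])             # labeled "dispersion data"
m = np.array([2.0, 4.0, 6.0])             # its "Vs model" label

G_m = lambda x: 2.0 * x                   # toy data -> model generator
G_d = lambda x: 0.5 * x                   # toy model -> data generator

L_cyc = mse(d, G_d(G_m(d))) + mse(m, G_m(G_d(m)))   # both reconstruction cycles
L_est = mse(m, G_m(d)) + mse(d, G_d(m))             # labeled-pair misfits
L_adv = 0.0                               # critic term omitted in this toy
L = L_adv + 5.0 * L_cyc + 3.0 * L_est     # lam1 = 5, lam2 = 3

G_m_bad = lambda x: 2.0 * x + 1.0         # a mistuned model generator
L_cyc_bad = mse(d, G_d(G_m_bad(d)))       # nonzero reconstruction misfit
```

With the perfectly inverse toy generators all terms vanish; the mistuned generator shows how any inconsistency between the two mappings is penalized by the cycle term.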

Training Neural Networks and Evaluation
Before feeding the dispersion data and Vs models into the neural network, we apply linear transformations (see supplementary materials) to normalize them into the interval [-1, 1] to speed up the convergence of the training process. Outputs of the neural network in the data and model domains are transformed back to their original amplitudes according to the linear transformation relations before computing misfits. For a comparative study, we apply both the conventional 1-D CNN and the proposed Wcycle-GAN method to the Vs inversion in the SC region; the structure of the CNN is the same as the model generative subnet ($G_m$) in the Wcycle-GAN.
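The linear [-1, 1] normalization and its inverse can be sketched as below; the bounds are hypothetical, as the study's actual transform constants are given in its supplementary materials.

```python
import numpy as np

def to_unit(x, lo, hi):
    """Linearly map values in [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def from_unit(y, lo, hi):
    """Inverse map from [-1, 1] back to physical units."""
    return lo + (y + 1.0) * (hi - lo) / 2.0

vs = np.array([1.5, 3.2, 4.7])            # km/s, illustrative
vs_n = to_unit(vs, 1.0, 5.0)              # network-side representation
vs_back = from_unit(vs_n, 1.0, 5.0)       # recovered before misfit computation
```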
For the training process of both the CNN and the Wcycle-GAN (Figure S4), the iteration stops when the root-mean-square (RMS) misfit between the predicted and true shear velocities in the labeled data,

$$\mathrm{RMS} = \sqrt{\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \overline{\left(\mathbf{Vs}_{pred,i} - \mathbf{Vs}_{true,i}\right)^{2}}},$$

is below 0.07 km/s, where $N_{batch}$ is the number of Vs models in a batch, $\mathbf{Vs}_{pred,i}$ and $\mathbf{Vs}_{true,i}$ are the predicted and true Vs models in the labeled data, respectively, and the overbar denotes averaging over the model layers. For the hyperparameter selection, we choose $\lambda_1 = 5$ and $\lambda_2 = 3$ for training the Wcycle-GAN. The training batch size is 160 for the labeled data and 80 for the unlabeled data. We use Adam (Kingma and Ba, 2014) for optimization with a learning rate of $5 \times 10^{-5}$ and other parameters as default. For the CNN training, the neural network could further lower the RMS misfit of the labeled data at later epochs, which may result in overfitting.
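The stopping criterion can be checked numerically as below, assuming the RMS averages over both the batch and the 99 model layers (the exact averaging convention is an assumption here); the arrays are hypothetical.

```python
import numpy as np

def batch_rms(vs_pred, vs_true):
    """RMS misfit over a (batch, layers) block of Vs predictions, km/s."""
    return float(np.sqrt(np.mean((vs_pred - vs_true) ** 2)))

vs_true = np.full((4, 99), 3.5)           # toy batch of 4 labeled models
vs_pred = vs_true + 0.05                  # uniform 0.05 km/s error
rms = batch_rms(vs_pred, vs_true)
stop = rms < 0.07                         # stopping criterion satisfied
```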
Finally, we apply the trained model generative network ($G_m$) to the observed dispersion data and output the final Vs model. To evaluate the performance of the models obtained from different methods, we compute the chi-square misfit between the predicted data calculated using the final Vs model and the observed dispersion data at every grid point:

$$\chi^2 = \frac{1}{N} \sum_{i=1}^{N} \frac{\left(d_i^{pre} - d_i^{obs}\right)^{2}}{\sigma_i^{2}}, \tag{8}$$

where $N = 17 \times 2$ is the number of observed dispersion data points, $d_i^{pre}$ and $d_i^{obs}$ are the theoretical and observed dispersion wave speeds (i.e., phase and group velocities) at the $i$th data point, and $\sigma_i$ is the corresponding data uncertainty. A good data fit is achieved when the normalized $\chi^2$ misfit is close to 1 (Bevington, 1969; Zelt et al., 2003).
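The normalized chi-square misfit of equation 8 can be computed as below; the synthetic example has predictions off by exactly one standard deviation at each of the 17x2 points, so the misfit lands at the noise level.

```python
import numpy as np

def chi_square(d_pre, d_obs, sigma):
    """Normalized chi-square misfit (equation 8)."""
    return float(np.mean(((d_pre - d_obs) / sigma) ** 2))

d_obs = np.full(34, 3.0)                  # observed phase+group velocities (km/s)
d_pre = d_obs + 0.05                      # predictions off by 0.05 km/s
sigma = np.full(34, 0.05)                 # uncertainty floor (km/s)
chi2 = chi_square(d_pre, d_obs, sigma)    # ~1: data fit at the noise level
```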

Results
The advantages of the proposed Wcycle-GAN method are demonstrated using surface wave dispersion data obtained from the SC plate boundary region. We first present the 3-D Vs model obtained from the Wcycle-GAN method and compare it with that of Qiu et al. (2019) and the surface geology (section 4.1). Then, models derived from different machine learning algorithms (e.g., CNN) are compared to illustrate the advantages of incorporating unlabeled data into the network training process (section 4.2).

Output 3-D Vs Model
For training the Wcycle-GAN, the results converge after 1700 epochs. The trained network is applied to the observed dispersion data and generates the final 3-D Vs model by assembling all the 1-D Vs predictions. Because of the limited period range (i.e., 3-16 s) of the input Rayleigh wave dispersion curves, the Vs model resolved outside the 3-20 km depth range is not well constrained (Qiu et al., 2019). Therefore, we only focus on the Vs model at depths of 3-15 km.
Depth slices at 5 km and 10 km for the initial model (CVM-H) and the differences between the initial and final models are presented in Figure S5. The largest differences between our final model and the CVM-H are found underneath the basins and near the Salton Trough in the top 3-10 km, consistent with Qiu et al. (2019).
Figure 4 shows the depth slices of the Vs model resolved at 5 km and 10 km from various methods (see Figure S6 for depth slices at 3 km and 15 km). At shallow depths (e.g., in the top 3-7 km; Figures 4c and S6a-b), we see a good agreement between our final model (Figures 4c and 4g) and the surface geology, such as low velocity anomalies at the southern Central Valley, LA Basin, Ventura Basin, and the Salton Trough, and areas with high velocity in the Peninsular Ranges (e.g., Berg et al., 2018; Lee et al., 2014; Tape et al., 2010). It is important to note that our model better resolves the low velocity zone within the junction between the San Jacinto Fault (SJF) and the San Andreas Fault (SAF) compared to the CVM-H (Figures S5a-b).
At greater depths (e.g., below 10 km; Figures 4g and S6c-d), a sharp velocity contrast from west to east in the Peninsular Ranges is observed, which is related to the Hemet stepover (Marliyani et al., 2013). Clearer velocity contrasts across major fault systems, such as the Elsinore Fault (EF), SJF, and SAF, are depicted in the map views of the final Vs model (Figures 4g and S6c-d), suggesting the derived Vs model yields higher resolution compared to the CVM-H.
These observations agree well with the large-scale features found in the Vs model of Qiu et al. (2019). In addition, the differences between the two models at different depth slices, shown in Figure S7, are rather small. The consistent observation of the largest velocity updates beneath basins, the coherent large-scale velocity structures, and the small model differences suggest a cross-validation of both the Wcycle-GAN and the Eikonal tomography model. Unlike the conventional linearized Vs inversion (e.g., Qiu et al., 2019), in which extra spatial filtering is applied to achieve a smoothed 3-D Vs model, our final Vs model in map view suggests that the Wcycle-GAN method inherently guarantees a spatial smoothness similar to that of the surface wave velocity dispersion maps (Figure S1). The proposed Wcycle-GAN method shows potential to improve the lateral consistency of neighboring 1-D models, a significant drawback in current dispersion-curve based 1-D Vs inversion. We note that, while presenting the Vs model in map view better shows the large-scale features that are consistent with the surface geology, it is hard to demonstrate variations in structures at depth, such as the geometry of the major fault systems (e.g., width of the low-velocity zone and dipping faults). Thus, in section 5, we further illustrate three depth cross sections (blue lines in Figure 2) of our final Vs model for a detailed discussion of the resolved fault structures.
Figure 5 shows histograms of the chi-square misfit of the dispersion data, computed following equation 8, for Vs models obtained from different methods. To calculate the misfit, a compressional velocity (Vp) model constructed by assuming the same Vp/Vs ratio as the CVM-H and the density model of the CVM-H are used. Map views of the $\chi^2$ misfits are depicted in Figure S8.
The misfits are lower using the Wcycle-GAN model than using the Vs model of Qiu et al. (2019) in the Salton Trough region, suggesting our final Vs model is more reasonable in that area. The average misfit of the Wcycle-GAN based model (0.949; Figure 5c) is slightly smaller than 1, suggesting the final Vs model fits the observed dispersion data well. Although the average misfit value of our model is a bit higher than that (0.864; Figure 5d) of Qiu et al. (2019), we note the misfit values are also sensitive to the input Vp and density models, which might not be accurate as we assume the Vp/Vs ratio and density to be the same as those of the CVM-H.

Comparison with the Conventional CNN Algorithm
In this section, we compare the Vs model from the Wcycle-GAN method with that derived from the conventional CNN algorithm. The training parameters (e.g., batch size, learning rate) and stopping criteria are the same as illustrated in section 3. A similar property of the Wasserstein metric has been observed in near-surface seismic velocity estimation using full-waveform inversion (Yang et al., 2018). The better accuracy in fitting the observed dispersion data and the improved spatial continuity of the Vs model from the Wcycle-GAN method demonstrate the effectiveness of the proposed method in incorporating an advanced loss function, cycle consistency, and unlabeled data into the training process.

Discussions
We suggest the proposed Wasserstein Cycle-GAN is a robust data-driven method. On one hand, the Wasserstein adversarial loss with gradient penalty provides good training stability and convergence characteristics compared with cross-entropy or least-squares losses. Figure S9 shows a comparative study using different metrics for the adversarial loss. Using the least-square loss may result in underfitting of the labeled data, such as an incorrect prediction of the velocity jump at the Moho depth. Both cross-entropy and least-square adversarial losses can result in strong artifacts and negative velocity gradients in the Vs predictions using unlabeled data. In comparison, the Wasserstein loss results in high model prediction quality using either labeled or unlabeled data. The Vs model from the Wcycle-GAN method is smoother and laterally more continuous than models derived from the supervised method (Section 4.2). On the other hand, the proposed method incorporates the observed dispersion data into the training process, which improves the generalization ability of the trained network. In addition, for the weighting factors in the loss function, changes in the hyperparameters $\lambda_1$ and $\lambda_2$ have relatively small effects on the final derived Vs model, but a future study of the effects of the two hyperparameters would be beneficial for optimizing the Wcycle-GAN method.
To further discuss the importance of incorporating unlabeled data in the training process, we perform a third experiment, in which the same Wcycle-GAN structure is used but trained without the unlabeled data. The weighting factor $\lambda_2$ of the loss function (equation 1) is set to 10, different from section 3.2, since only the labeled data is used for training; the resulting Vs model is shown in Figures 4b and 4f. We also note that our Wcycle-GAN method requires a smaller amount of labeled data in training. To demonstrate this, we reduce the amount of labeled data by down-sampling with a grid spacing of 0.1ºx0.1º (originally 0.03ºx0.03º). This results in a selection of 1890 out of the original 16480 labeled data, which is much less than the number (4076) of observed dispersion curves. Figures 6a and 6c show the depth slices of the Vs model from the Wcycle-GAN method trained with the down-sampled labeled data. The resulting Vs models are similar between the trainings using the reduced and the full labeled datasets. Figure 5e shows the data misfit of the Vs model from the network trained with the reduced labeled dataset. There is only a small increase in the mean misfit, i.e., from 0.949 to 1.10, compared to the results trained with the full labeled dataset. It is important to note that the average misfit value of 1.1 is still much smaller than those of the supervised methods (Figures 5a and 5b). The result suggests redundancy in the labeled data and further demonstrates the strength of the proposed Wcycle-GAN method in resolving a high accuracy Vs model using a small amount of labeled data. This also saves training time, as it takes only ~4 hours after reducing the amount of labeled data by almost a factor of 10.
An extension of the proposed Wcycle-GAN algorithm is to incorporate location (i.e., longitude and latitude) as prior information in the training process, which can further enhance the accuracy of the Vs inversion. Map views of the Vs model at 5 km and 10 km, derived from the proposed method after incorporating the latitude and longitude of both the labeled and unlabeled data in the training process, are presented in Figures 6b and 6d, respectively.
Details of how to incorporate location information into the network training can be found in the supplementary materials. The Vs models resolved from networks trained with and without the input of location information are nearly identical at large scales (e.g., tens of kilometers; Figures 4c, 4f, 6b, and 6d). The data misfits (~0.9 in Figure 5f) are slightly smaller after incorporating the location information into the training process. Therefore, we show the cross sections of the Vs model resolved from the network trained with location information in Figure 7 to infer the structures of the major fault systems. We note that the incorporation of location information for both the labeled and unlabeled data will have a greater impact on the results when applied to Vs inversion at regional or global scales.
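One plausible way to supply such location priors, sketched here purely as an assumption (the study's actual scheme is in its supplementary materials), is to normalize longitude and latitude over the study region and append them as extra input features.

```python
import numpy as np

# Study-region bounds from Section 2 (west longitudes as negative degrees).
lon_range = (-120.2, -114.9)
lat_range = (32.6, 36.0)

def with_location(disp, lon, lat):
    """Append normalized (lon, lat) to a flattened (17, 2) dispersion sample.

    Hypothetical feature-augmentation scheme, not the paper's implementation.
    """
    u = 2.0 * (lon - lon_range[0]) / (lon_range[1] - lon_range[0]) - 1.0
    v = 2.0 * (lat - lat_range[0]) / (lat_range[1] - lat_range[0]) - 1.0
    return np.concatenate([disp.ravel(), [u, v]])

x = with_location(np.zeros((17, 2)), -117.5, 34.0)   # 34 + 2 features
```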
We show the cross sections DD', EE', and FF' (blue lines in Figure 2), the same as those shown in Figure 1 of Qiu et al. (2019), of the final Vs model between 3 km and 20 km to infer the structures of the EF, SJF, and SAF at depth. In profile DD', the low velocity zones indicate that both the SJF and SAF are nearly vertical. This is consistent with the fault geometry near San Gorgonio Pass (SGP) in the Community Fault Model for SC (CFMv5; Plesch et al., 2007).
In addition, we observe a pronounced low-velocity body (dashed circle, Figure 7) between depths of 15-20 km, consistent with the results of Qiu et al. (2019) (Figure S10c). This low velocity anomaly at great depth, with ~5-7% lower velocities compared to the surrounding media, is likely related to the large damage volume beneath the SGP estimated in Ben-Zion and Zaliapin (2019).
In profile EE', we observe a broad (~5-km-wide) flower-shaped (i.e., width decreases with depth) fault damage zone with ~2-3% average velocity reduction for the SAF in the top 8-10 km that is clearly dipping towards the northeast. The estimated dipping angle of the SAF in profile EE' is ~60⁰. This dipping angle is consistent with the observation in Qiu et al. (2019), but the flower-shaped fault damage zone is less clear in their results (Figure S10g). In addition, the low velocity anomaly beneath the Eastern California Shear Zone (ECSZ) is slightly deeper than that in Qiu et al. (2019). In profile FF', the SAF is similarly highlighted by a flower-shaped low-velocity zone dipping towards the northeast with a similar angle (~60⁰) in the top 10 km. Different from EE', the low velocity zone is more pronounced (~4-5%) in FF', likely indicating that the rocks inside the fault zone are more damaged in the southwest. The flower-shaped fault zone structures in EE' and FF' are consistent with the model of Fuis et al. (2016) derived for the southern section of the SAF by jointly inverting gravity and magnetic data. In addition, the observed ~60⁰ dipping angle in both EE' and FF' agrees well with the previous estimation from magnetic data (~65⁰; Fuis et al., 2012). It is important to note that the model of Qiu et al. (2019) is subject to the choice of damping parameter in, and spatial smoothing after, the Vs inversion. Therefore, through the cross section comparisons, we again demonstrate the robustness of our Vs model from the Wcycle-GAN method and confirm with a different method that the flower-shaped damage zone and the fault dipping towards the northeast observed for the southern section of the SAF in Qiu et al. (2019) are reliable. These features have important implications, such as a better understanding of the strong ground motions produced by earthquakes that will occur on the SAF.

Conclusions
We implement the Wcycle-GAN method for Vs inversion in surface wave tomography. Previous studies applied CNN based Vs inversion to Rayleigh wave dispersion data in China and the southern California (SC) plate boundary regions; the results show the effectiveness of the CNN technique and demonstrate that the quality of the training dataset can affect the accuracy of the output Vs model. In this study, we develop a deep-learning-based semi-supervised inversion framework (Wcycle-GAN).

Figure 4.
For the CNN case, 120 epochs are needed to achieve convergent training. Training the Wcycle-GAN takes longer than the CNN method because of the extra effort of training the adversarial networks, but it remains sufficiently efficient: the 1700 epochs took only ~12 hours using a single NVIDIA GeForce RTX 2080 graphics card. After the training process, for both the CNN and the Wcycle-GAN, the trained model generator is applied to the observed dispersion data.

Figure 5.
Figures 4b and 4f present map views of the output Vs model from this experiment. Strong local velocity jumps and artificial lateral heterogeneities are seen in the model, compared with the Vs model map views in Figures 4c and 4g. Training the Wcycle-GAN without unlabeled data results in larger data misfits (~2.3 on average), shown clearly both in histogram (Figure 5b) and map view (Figure S8), compared to those of the proposed Wcycle-GAN method (0.949). Therefore, incorporating the unlabeled data into the training process is essential for providing a robust and reliable Vs model when using machine learning based methods to solve the Vs inversion problem.

The proposed method incorporates unlabeled data into the network training process. Compared to the CNN-based method, it shows improved prediction accuracy and better training stability, and requires only a small amount of labeled data. We demonstrate these improvements using the fundamental mode Rayleigh wave velocity dispersion data derived in the Southern California plate boundary region. The final Vs model obtained from the proposed method shows clearer images of structures near faults in the top 15 km, specifically the low velocity damage zone centered on the southern section of the San Andreas fault that dips ~60° to the northeast. In addition, integrating longitude and latitude information into the Wcycle-GAN algorithm further improves the prediction accuracy as well as the spatial continuity of the final Vs model, particularly in the cross sections. For future studies, we would like to investigate the potential of this method for reducing the amount of labeled data by leveraging random sampling or a sampling strategy based on clustering analysis (Eymold & Jordan, 2019).

Acknowledgments
The networks are implemented using the deep-learning framework of TensorFlow. The training and prediction processes are conducted using a single NVIDIA GeForce RTX 2080 GPU with 8 GB of memory. Fruitful discussions with Dr. Jing Hu at the University of Science and Technology of China (USTC) are greatly appreciated. We are grateful to the Department of Earth, Environmental and Planetary Sciences, Rice University, for supporting this study and A. Cai's PhD research.

Figure captions

Figure 1.
Figure 1. Workflows of (a) the CNN-based Vs inversion, which uses only the labeled dataset in training, and (b) the semi-supervised GAN-based Vs inversion, in which an adversarial network (discriminator) incorporates both the labeled and unlabeled datasets into the training process.

Figure 2 .
Figure 2. Map of the Southern California plate boundary region. The thick black lines depict surface traces of major faults, coastlines, and state boundaries. The yellow triangles are seismic stations used in Qiu et al. (2019) to derive the Rayleigh wave velocity dispersion maps with a grid size of 0.05°×0.05° (grid lines). Three cross sections (i.e., DD' to FF'; blue lines) of the final Vs model are presented in Figure 7. The cross sections DD' to FF' are at the same locations as those in Qiu et al. (2019). SAF - San Andreas Fault; SJF - San Jacinto Fault; EF - Elsinore Fault; ECSZ - Eastern California Shear Zone.

Figure 3 .
Figure 3. Algorithm comparison between the convolutional neural network (CNN), generative adversarial network (GAN), and Wasserstein Cycle-GAN (Wcycle-GAN). The subscripts m and d denote the shear velocity model and the dispersion data, respectively. The CNN (a) computes a point-wise misfit (estimation loss: ℒ_est) between real samples and translated samples generated by a model generative network (G_m). The GAN (b) introduces an adversarial network (D_m) and computes the difference between the distributions of real and generated samples using an adversarial loss (ℒ_adv), updating the generator and discriminator separately in a single iteration. The Wcycle-GAN (c) uses the Wasserstein metric for the adversarial loss in (b). In addition, a data generative subnet (G_d) is incorporated to learn the forward modeling from velocity model to dispersion data, together with a corresponding data discriminator (D_d). The use of G_d enables an extra constraint, the cycle-consistency loss (ℒ_cyc), which is estimated from the misfit between the input real sample and the reconstructed sample. The complete Wcycle-GAN penalty function is a linear combination of the three types of loss (ℒ_est, ℒ_adv, and ℒ_cyc).
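The linear combination of the three loss terms described in the Figure 3 caption can be sketched as follows. The stand-in generators, loss weights, and toy data below are illustrative placeholders, not the networks or hyperparameters of this study:

```python
import numpy as np

def l1(a, b):
    """Point-wise L1 misfit between two arrays."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))

def wcycle_gan_penalty(loss_est, loss_adv, loss_cyc,
                       w_est=1.0, w_adv=1.0, w_cyc=10.0):
    """Linear combination of the three loss terms (L_est, L_adv, L_cyc);
    the weights here are illustrative, not values taken from the study."""
    return w_est * loss_est + w_adv * loss_adv + w_cyc * loss_cyc

# Toy cycle-consistency term: chain the two generators so the input
# dispersion data d is mapped to a model and back, d -> G_m(d) -> G_d(...).
G_m = lambda d: d - 3.0  # stand-in "model generative" subnet
G_d = lambda m: m + 3.0  # stand-in "data generative" subnet (its inverse)
d = np.array([3.0, 4.0, 5.0])
loss_cyc = l1(d, G_d(G_m(d)))  # perfect reconstruction -> 0.0
total = wcycle_gan_penalty(loss_est=0.5, loss_adv=0.2, loss_cyc=loss_cyc)
print(loss_cyc, total)
```

Because G_d inverts G_m exactly in this toy case, the cycle-consistency term vanishes and the penalty reduces to the weighted estimation and adversarial terms; in training, a nonzero ℒ_cyc penalizes any inconsistency between the forward and backward generators.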

Figure 4 .
Figure 4. Comparison of depth slices of the output 3-D Vs models from four different methods. Depth slices at 5 km (left column) and 10 km (right column) for (a), (e) the CNN-based model; (b), (f) the Wcycle-GAN (WCGAN) based model trained without the unlabeled data; (c), (g) the proposed Wcycle-GAN based model; (d), (h) the Eikonal tomography based model of Qiu et al. (2019).

Figure 5 .
Figure 5. Chi-square misfit histograms for Vs models derived from six different methods: (a) the CNN-based method; (b) the Wcycle-GAN (WCGAN) based model trained without the unlabeled data; (c) the Wcycle-GAN based model with the full labeled and unlabeled data; (d) the model from Qiu et al. (2019); (e) the Wcycle-GAN based model using a down-sampled set of 1890 labeled data; (f) the Wcycle-GAN based method with location information added as extra channels in the network training.

Figure 6 .
Figure 6. Depth slices of the shear velocity model at 5 km (left column) and 10 km (right column) for (a), (c) the Wcycle-GAN (WCGAN) based model using a down-sampled set of 1890 unlabeled data and (b), (d) the Wcycle-GAN based model with location information added as extra prior information in the network training (WCGAN + Position).

Figure 7 .
Figure 7. Cross sections (blue lines in Figure 2) of the Vs model resolved from the Wcycle-GAN network trained with location information. Colors in the panels on the left show the velocity values, whereas the panels on the right illustrate velocity perturbations, in percent, relative to the 1-D average Vs depth profile. The black curve depicts an exaggerated topography variation. The black dashed line in each profile represents the inferred fault plane for the SJF in DD' and the SAF in EE' and FF'. The dashed ellipse in DD' outlines a low velocity anomaly that is likely associated with rock damage inferred in Ben-Zion & Zaliapin (2019). EF = Elsinore Fault; SJF = San Jacinto Fault; ECSZ = Eastern California Shear Zone.
The calculation of the Wasserstein adversarial loss can be described in two steps. First, we fix the trainable parameters in the generator G_md and update the discriminator D_d by minimizing

ℒ_D = E_m[D_d(G_θmd(m))] − E_d*[D_d(d*)] + λ E_d̂[(‖∇_d̂ D_d(d̂)‖₂ − 1)²],

where d̂ is sampled uniformly along straight lines between pairs of real and generated dispersion samples, and λ is the gradient penalty weight (Gulrajani et al., 2017). Second, we fix D_d and update the generator by minimizing ℒ_G = −E_m[D_d(G_θmd(m))]. Here, m and d stand for the labeled Vs models and their corresponding synthetic dispersion data, respectively; d* is the unlabeled real dispersion data; θ represents the trainable parameters in the networks; and G_θ(·) is the neural network operator that generates translated samples from its input. For instance, θ_md denotes the trainable parameters in the generative subnet that maps Vs models to dispersion data, and G_θmd(m) is the translated dispersion data generated by that subnet.
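The discriminator update above can be made concrete with a deliberately simplified sketch. The critic below is linear, so its input gradient is constant and the gradient penalty of Gulrajani et al. (2017) can be evaluated analytically without automatic differentiation; the critic form, toy data, and penalty weight λ are illustrative assumptions, not the architecture or hyperparameters used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(d, w, b):
    """Linear Wasserstein critic D(d) = w . d + b (illustrative only)."""
    return d @ w + b

def critic_loss_wgan_gp(d_real, d_fake, w, b, lam=10.0):
    """Critic objective to minimize: E[D(fake)] - E[D(real)] + lam * GP.
    For a linear critic, grad_d D(d) = w everywhere, so the gradient
    penalty (||grad D|| - 1)^2 reduces to (||w|| - 1)^2 at every
    interpolated sample d_hat, with no autodiff required."""
    wasserstein = critic(d_fake, w, b).mean() - critic(d_real, w, b).mean()
    gp = (np.linalg.norm(w) - 1.0) ** 2
    return wasserstein + lam * gp

# "Real" (unlabeled) dispersion samples vs. generator output, 5 periods each.
d_real = rng.normal(3.5, 0.1, size=(64, 5))   # faster, observed curves
d_fake = rng.normal(3.0, 0.1, size=(64, 5))   # slower, generated curves
w = np.full(5, 1.0 / np.sqrt(5.0))            # unit-norm weights -> GP ~ 0
loss = critic_loss_wgan_gp(d_real, d_fake, w, b=0.0)
# The critic scores the real samples higher, so the loss is negative.
print(loss < 0.0)  # → True
```

In the full method the critic is a deep network, the gradient penalty is computed by automatic differentiation at samples interpolated between real and generated data, and the generator step then ascends the critic's score of G_θmd(m).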