Deep Learning for Geophysics: Current and Future Trends

Recently, deep learning (DL), a new data-driven technique compared with conventional approaches, has attracted increasing attention in the geophysical community, resulting in many opportunities and challenges. DL has been shown to have the potential to predict complex system states accurately and to relieve the “curse of dimensionality” in large temporal and spatial geophysical applications. We address the basic concepts, state-of-the-art literature, and future trends by reviewing DL approaches in various geoscience scenarios. Exploration geophysics, earthquakes, and remote sensing are the main focuses. More applications, including Earth structure, water resources, atmospheric science, and space science, are also reviewed. Additionally, the difficulties of applying DL in the geophysical community are discussed, and the trends of DL in geophysics in recent years are analyzed. Several promising directions are provided for future research involving DL in geophysics, such as unsupervised learning, transfer learning, multimodal DL, federated learning, uncertainty estimation, and active learning. A coding tutorial and a summary of tips for rapidly exploring DL are presented for beginners and interested readers in geophysics.

Examples of Data-Driven Tasks in Geophysics
Figure 1. An illustration of model-driven and data-driven methods. On the left are the research topics in geophysics, ranging from the Earth's core to outer space. On the right are the observation means used at present. In the middle are examples of model-driven and data-driven methods. In model-driven methods, the principles of geophysical phenomena are induced from a large amount of observed data based on physical causality, and the resulting models are then used to deduce geophysical phenomena in the future or in the past. In data-driven methods, the computer first induces a regression or classification model without considering physical causality. This model then performs tasks such as classification on incoming datasets.
the wave equation becomes increasingly complex, the numerical implementation of the equation becomes nontrivial, and the computational cost increases considerably for large-scale scenarios. Different from traditional model-driven methods, machine learning (ML) is a type of data-driven approach that trains a regression or classification model through a complex nonlinear mapping with adjustable parameters based on a training data set. The comparison of model-driven and data-driven approaches is summarized in Figure 1. For decades, ML methods have been widely adopted in various geophysical applications, such as exploration geophysics (Huang et al., 2006; Helmy et al., 2010; Jia & Ma, 2017; Lim, 2005; Poulton, 2002; Zhang et al., 2014), earthquake localization, aftershock pattern analysis (DeVries et al., 2018), and Earth system analysis (Reichstein et al., 2019). A review article about ML in solid Earth geoscience was recently published in Science (Bergen et al., 2019).
YU AND MA 10.1029/2021RG000742 4 of 36
The topic includes a variety of ML techniques, from traditional methods, such as logistic regression, support vector machines, random forests, and neural networks, to modern methods, such as deep neural networks and deep generative models. The article stresses that ML will play a key role in accelerating the understanding of the complex, interacting, and multiscale processes of Earth's behavior.
In the ML community, an artificial neural network (ANN) is one such regression or classification model; it is analogous to the human brain and consists of layers of neurons. An ANN with more than one layer, that is, a deep neural network (DNN), is the core of a recently developed ML method named deep learning (DL) (LeCun et al., 2015). DL mainly encompasses supervised and unsupervised approaches, depending on whether labels are available or not. Supervised approaches train a DNN by matching the inputs and labels and are usually used for classification and regression tasks. Unsupervised approaches update the parameters by building a compact internal representation and are then used for clustering or pattern recognition. In addition, DL includes semi-supervised learning, where partial labels are available, and reinforcement learning, where a human-designed environment provides feedback for the DNN. Figure 2 summarizes the relationship from artificial intelligence to DL and the classification of DL approaches. DL has shown potential in overcoming the limitations of traditional approaches in various areas. The performance of DL is even superior to that of humans in specific tasks, such as image classification (top-5 classification errors of 5.1% for humans vs. 3.57% for DL; He et al., 2016) and the game of Go.
The geophysical community has shown great interest in DL in recent years. Figure 3 shows the published papers related to artificial intelligence in two major geophysical societies, that is, the Society of Exploration Geophysicists (SEG) and the American Geophysical Union (AGU). A clear exponential growth is observed in both libraries due to the use of DL techniques. Moreover, DL has provided several astonishing results to the geophysical community. For instance, on the STanford EArthquake Dataset (STEAD), the earthquake detection accuracy is improved to 100%, compared to the 91% accuracy of the traditional STA/LTA (short time average over long time average) method (Mousavi, Zhu, Sheng, et al., 2019; Mousavi et al., 2020). DL makes it possible to characterize the Earth with high resolution on a large scale (Chattopadhyay et al., 2020; Chen et al., 2019; Zhang, Stanev, & Grayek, 2020). DL can even be used for discovering physical concepts (Iten et al., 2020), such as the fact that the solar system is heliocentric.
Our review introduces DL-related literature covering a variety of geophysical applications, from the Earth's deep core to distant outer space, and mainly focuses on exploration geophysics, earthquake science, and remote sensing as a geophysical data observation method. This review first provides a glance at the most recent DL research related to geophysics, along with an analysis of the changes and challenges DL brings to the geophysical community, and then discusses future trends. Figure 4 presents the topics included in this review. In addition, we provide a cookbook for beginners who are interested in DL, from geophysics students to researchers.
The first section, above, briefly introduces the background of geophysics and DL. The remainder of this review is organized as follows. Section 2 introduces the basic concepts of DL. Section 3 reviews DL applications in geophysical areas. Section 4 discusses future trends as an extension of this review. Section 5 summarizes this review. A tutorial section for beginners is given in the appendix.

The Theory of Deep Learning
Readers who are already familiar with the general theory of DL may skip to Section 3. We denote scalars by italic letters, vectors by bold lowercase letters, and matrices by bold uppercase letters. In geophysics, a large number of regression or classification tasks can be reduced to
$$\mathbf{y} = \mathbf{L}\mathbf{x}, \tag{1}$$
where $\mathbf{x}$ denotes the unknown parameters, $\mathbf{y}$ the observed data, and $\mathbf{L}$ the forward operator that relates them. In model-driven routines, an optimization objective loss function is established with an additional constraint, such as the sparsity constraint in dictionary learning. In data-driven routines, given an extensive training set, a mapping between $\mathbf{x}$ and $\mathbf{y}$ is established by training, as done in DL, which is especially suitable for situations where $\mathbf{L}$ is not precisely known.
To bring the reader into DL gradually, this paper first introduces another approach, that is, dictionary learning (Aharon et al., 2006), since the theoretical frameworks of dictionary learning and DL are similar. In dictionary learning, an adaptive dictionary is learned as a representation of the target data. The key features of dictionary learning are single-level decomposition, unsupervised learning, and linearity. Single-level decomposition means that one dictionary is used to represent a signal. Unsupervised learning means no labels are provided during dictionary learning. Besides, only the target data are used without an extensive training set. Linearity implies that the data decomposition on the dictionary is linear. The above features make the theory of dictionary learning simple. This review will help readers transfer existing knowledge on dictionary learning to DL.

Dictionary Learning
To solve Equation 1, an optimization function $E(\mathbf{x}; \mathbf{y})$ with a regularization term $R$ is constructed:
$$E(\mathbf{x}; \mathbf{y}) = D(\mathbf{L}\mathbf{x}, \mathbf{y}) + R(\mathbf{x}),$$
where $D$ is a similarity measurement function. Typically, the $L_2$-norm $\|\mathbf{L}\mathbf{x} - \mathbf{y}\|_2$ is used under the assumption of a Gaussian distribution for the error. Tikhonov regularization ($R(\mathbf{x}) = \|\mathbf{x}\|_2^2$) and sparsity are two popular regularization terms. In sparsity regularization, $R(\mathbf{x}) = \|\mathbf{W}\mathbf{x}\|_1$, where $\mathbf{W}$ is a sparse transform with several vectorized bases. $\mathbf{W}$ is also termed the dictionary. The goal of dictionary learning is to train an optimized sparse transform $\mathbf{W}$, which is used for the sparse representation of $\mathbf{x}$. The objective function of dictionary learning involves learning $\mathbf{W}$ via matrix decomposition with constraints $R_w$ and $R_v$ on the dictionary $\mathbf{W}$ and the coefficient $\mathbf{v}$, where $\mathbf{W}$ and $\mathbf{v}$ are optimized alternately, that is, by dictionary updating and sparse coding. Here we introduce two dictionary learning approaches: K-SVD and data-driven tight frame (DDTF).
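The two regularization terms above can be compared on a small synthetic problem. The following NumPy sketch is only an illustration under stated assumptions: the operator `L` is a known random matrix, the weight `lam` is a hypothetical trade-off parameter that the text does not specify, and the sparse transform is taken to be the identity so that a plain iterative shrinkage scheme applies.

```python
import numpy as np

def tikhonov_solve(L, y, lam):
    """Closed-form minimizer of ||L x - y||_2^2 + lam * ||x||_2^2."""
    n = L.shape[1]
    return np.linalg.solve(L.T @ L + lam * np.eye(n), L.T @ y)

def soft_threshold(z, t):
    """Proximal operator of t * ||z||_1 (shrinks entries toward zero)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(L, y, lam, n_iter=500):
    """Iterative shrinkage for 0.5 * ||L x - y||_2^2 + lam * ||x||_1, with W = I."""
    step = 1.0 / np.linalg.norm(L, 2) ** 2   # inverse Lipschitz constant of the gradient
    x = np.zeros(L.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - step * (L.T @ (L @ x - y)), step * lam)
    return x

# Synthetic test problem (hypothetical sizes and values)
rng = np.random.default_rng(0)
L = rng.standard_normal((40, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
y = L @ x_true + 0.01 * rng.standard_normal(40)

x_tik = tikhonov_solve(L, y, lam=0.1)   # dense estimate with small energy
x_l1 = ista(L, y, lam=0.1)              # estimate containing exact zeros
```

The Tikhonov term leads to a linear system, whereas the sparsity term requires an iterative solver; dictionary learning additionally replaces the fixed transform with a learned one.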
Figure 5. An illustration of dictionary learning: data-driven tight frame. The dictionary is initialized with a spline framelet. After training based on a post-stack seismic data set, the trained dictionary exhibits apparent structures.
K-SVD (where SVD is singular value decomposition) (Aharon et al., 2006) regularizes the sparsity of v and normalizes the energy of W. K-SVD uses orthogonal matching pursuit for sparse coding and several tricks in dictionary updating. First, one component of the dictionary is updated at a given time, and the remaining terms are fixed. Second, a rank-1 approximation SVD algorithm is used to obtain the updated dictionary and coefficients simultaneously, thereby accelerating convergence and reducing computational memory. K-SVD is applied in geophysics with extensions to improve efficiency (Nazari Siahsar et al., 2017).
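The dictionary-updating step described above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes a synthesis layout in which the data matrix `X` is approximated by `D @ A`, with the atoms stored as columns of `D` (playing the role of the dictionary W) and the sparse codes stored in `A`. The sparse-coding step by orthogonal matching pursuit is omitted, and all sizes are hypothetical.

```python
import numpy as np

def ksvd_atom_update(X, D, A, k):
    """Update atom k of D and its coefficients in A with a rank-1 SVD.

    X: (n, N) data, D: (n, K) unit-norm atoms as columns, A: (K, N) sparse codes.
    """
    users = np.flatnonzero(A[k])           # samples whose code uses atom k
    if users.size == 0:
        return D, A                        # unused atom: leave unchanged
    D, A = D.copy(), A.copy()
    # Residual without the contribution of atom k, restricted to its users
    E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                      # new atom has unit energy
    A[k, users] = s[0] * Vt[0]             # coefficients are updated simultaneously
    return D, A

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 50))
D = rng.standard_normal((8, 12))
D /= np.linalg.norm(D, axis=0)             # normalize the energy of each atom
A = rng.standard_normal((12, 50)) * (rng.random((12, 50)) < 0.2)

err_before = np.linalg.norm(X - D @ A)
D_new, A_new = ksvd_atom_update(X, D, A, k=0)
err_after = np.linalg.norm(X - D_new @ A_new)
```

Because the rank-1 approximation is optimal for the restricted residual and the sparsity pattern is preserved, the fitting error cannot increase after the update.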
Despite the success of K-SVD in signal enhancement and compression, dictionary updating is still time-consuming for high-dimensional and large-scale datasets, such as 3D prestack data in seismic exploration. K-SVD includes one SVD step to update one dictionary term. Can the entire dictionary be updated by one SVD to improve efficiency? A data-driven tight frame (Cai et al., 2014; Liang et al., 2014) was proposed by enforcing a tight frame constraint on the dictionary W. The tight frame condition is slightly weaker than orthogonality, and the perfect reconstruction property still holds under it. With the tight frame property, dictionary updating in DDTF is achieved with one SVD, which is hundreds of times faster than K-SVD. DDTF has been applied in high-dimensional seismic data reconstruction (Yu et al., 2015, 2016). An example of a learned dictionary with 3D DDTF for a seismic volume is shown in Figure 5.
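The single-SVD dictionary update can be illustrated for the simplest special case, in which the dictionary is a square orthogonal matrix (a particular tight frame). The sketch below is written under that assumption and is not the published DDTF code: sparse coding uses hard thresholding with a hypothetical threshold, the dictionary is initialized with a random orthogonal matrix rather than a spline framelet, and the patch matrix `X` is synthetic.

```python
import numpy as np

def hard_threshold(C, t):
    """Keep coefficients with magnitude above t and zero the rest."""
    return C * (np.abs(C) > t)

def ddtf_dictionary_update(X, V):
    """Minimize ||V - W X||_F over orthogonal W with a single SVD."""
    P, _, Qt = np.linalg.svd(X @ V.T)
    return Qt.T @ P.T

def ddtf(X, W0, thresh, n_iter=10):
    """Alternate sparse coding and one-SVD dictionary updating."""
    W = W0.copy()
    for _ in range(n_iter):
        V = hard_threshold(W @ X, thresh)   # sparse coding
        W = ddtf_dictionary_update(X, V)    # whole dictionary in one SVD
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 200))                   # 200 vectorized patches (hypothetical)
W0, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal start
W = ddtf(X, W0, thresh=1.0)

# One update never increases the fitting error for fixed coefficients
V0 = hard_threshold(W0 @ X, 1.0)
W1 = ddtf_dictionary_update(X, V0)
err_old = np.linalg.norm(V0 - W0 @ X)
err_new = np.linalg.norm(V0 - W1 @ X)
```

In contrast to the atom-by-atom update of K-SVD, every row of `W` is replaced at once, which is where the speed-up reported in the text comes from.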

Deep Learning
Unlike dictionary learning, DL treats geophysical problems as classification or regression problems. A DNN $F$ is used to approximate $\mathbf{x}$ from $\mathbf{y}$, that is, $\mathbf{x} \approx F(\mathbf{y}; \Theta)$, where $\Theta$ is the parameter set of the DNN. In classification tasks, $\mathbf{x}$ is a one-hot encoded vector representing the categories. $\Theta$ is obtained by building a high-dimensional approximation between two sets, that is, the labels and inputs. The approximation is achieved by minimizing a loss function $E$ that measures the misfit between the labels and the DNN outputs, which yields an optimized $\Theta$. If $F$ is differentiable, a gradient-based method can be used to optimize $\Theta$. However, a large Jacobian matrix is involved when calculating $\nabla_{\Theta} E$, making it infeasible for large-scale datasets. The back-propagation method (Rumelhart et al., 1986) was proposed to compute $\nabla_{\Theta} E$ while avoiding the explicit computation of the Jacobian matrix. In unsupervised learning, the label $\mathbf{x}$ is not known, so additional constraints are required, such as making $\mathbf{x}$ identical to $\mathbf{y}$.
In each layer, nine of the learned filters are shown. A great number of hierarchical structures are observed in different layers. Layer 1 exhibits edge structures, layer 2 shows small structures of seismic events, and layer 3 shows small portions of seismic sections. The filters in layers 2 and 3 are blank near the edges, which may be caused by the boundary effect of the convolutional filter. Layer 4 gives larger seismic portions, which are approximations to the training data. The filters in layer 4 look more similar to each other than the training datasets do because the deep neural network (DNN) tries to learn the similar and hierarchical patterns that compose the data.
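Back-propagation, as mentioned above, obtains the gradient with respect to all parameters by passing the output error backwards through the layers, without forming a Jacobian matrix. The sketch below assumes a hypothetical two-layer network with a tanh activation and a squared-error loss, neither of which is prescribed by the text, and checks one gradient entry against a finite difference.

```python
import numpy as np

def forward(theta, y):
    """F(y; theta) = W2 tanh(W1 y + b1) + b2; also returns the hidden activation."""
    W1, b1, W2, b2 = theta
    h = np.tanh(W1 @ y + b1)
    return W2 @ h + b2, h

def loss_and_grads(theta, y, x):
    """Squared-error loss and its gradients by back-propagation."""
    W1, b1, W2, b2 = theta
    out, h = forward(theta, y)
    r = out - x                        # derivative of the loss w.r.t. the output
    E = 0.5 * np.sum(r ** 2)
    gW2, gb2 = np.outer(r, h), r
    dh = W2.T @ r                      # propagate backwards through layer 2
    dz = dh * (1.0 - h ** 2)           # through the tanh nonlinearity
    gW1, gb1 = np.outer(dz, y), dz
    return E, (gW1, gb1, gW2, gb2)

rng = np.random.default_rng(0)
theta = (rng.standard_normal((5, 3)), np.zeros(5),
         rng.standard_normal((2, 5)), np.zeros(2))
y_in = rng.standard_normal(3)          # input
x_label = rng.standard_normal(2)       # label
E, grads = loss_and_grads(theta, y_in, x_label)

# Finite-difference check of one entry of the first-layer weights
eps = 1e-6
W1_plus, W1_minus = theta[0].copy(), theta[0].copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
E_plus, _ = loss_and_grads((W1_plus,) + theta[1:], y_in, x_label)
E_minus, _ = loss_and_grads((W1_minus,) + theta[1:], y_in, x_label)
fd = (E_plus - E_minus) / (2 * eps)
```

Each gradient is an outer product of a back-propagated error and a stored activation, so the cost of the backward pass is comparable to that of the forward pass.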
DL and dictionary learning are related through three aspects: the depth of decomposition, the amount of training data, and the nonlinear operators. Dictionary learning is usually a single-level matrix decomposition problem. Double sparsity (DS) dictionary learning was proposed to explore deep decomposition (Rubinstein et al., 2010). The motivation of DS is that the learned dictionary atoms still share several underlying sparse patterns with respect to a generic dictionary. In other words, the dictionary is represented by a sparse coefficient matrix multiplied by a fixed dictionary, such as the discrete cosine transform. Inspired by DS dictionary learning, can we propose triple, quadruple, or even centuple dictionary learning? We know that cascading linear operators is equivalent to applying a single linear operator. Therefore, using more than one fixed dictionary does not improve the signal representation ability compared to that of one fixed dictionary if no additional constraints are provided. In DL, nonlinear operators are combined in such a deep structure. An ANN with one hidden layer and nonlinear operators can represent any complex function with a sufficient number of hidden neurons. To fit an ANN with many hidden neurons, we need an extensive training set, while dictionary learning involves only one target data set. To compare the learned features of dictionary learning in Figure
Figure 7. Understanding deep learning (DL) from different perspectives. Optimization: DL is basically a nonlinear optimization problem that solves for the optimized parameters to minimize the loss function of the outputs and labels. Dictionary learning: The filter training in DL is similar to that in dictionary learning. High-dimensional mapping: The deep neural network (DNN) in DL is basically a high-dimensional mapping from the input to the labels. Optimal transport: A generative adversarial network can be interpreted by the theory of optimal transportation, which involves transformation between the given white noise and the data distribution. Manifold learning: The representation of training samples in the latent space of a DNN is similar to learning a low-dimensional manifold that contains all the data samples. Ordinary differential equation: A recurrent neural network is basically a solution of an ordinary differential equation with the Euler method.
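The remark that cascading linear operators is equivalent to applying a single linear operator, and that nonlinear operators break this equivalence, can be checked numerically. The following sketch uses two random matrices as hypothetical stand-ins for fixed dictionaries and a ReLU as the nonlinear operator.

```python
import numpy as np

rng = np.random.default_rng(0)
D1 = rng.standard_normal((6, 8))     # hypothetical outer dictionary
D2 = rng.standard_normal((8, 10))    # hypothetical inner dictionary
v = rng.standard_normal(10)

def linear_cascade(v):
    return D1 @ (D2 @ v)

def nonlinear_cascade(v):
    return D1 @ np.maximum(D2 @ v, 0.0)   # ReLU between the two operators

# The cascade of two linear operators collapses to one matrix
D_single = D1 @ D2
same_as_single = np.allclose(linear_cascade(v), D_single @ v)

# A linear map g satisfies g(v) + g(-v) = 0; the ReLU cascade does not
linear_is_odd = np.allclose(linear_cascade(v) + linear_cascade(-v), 0.0)
nonlinear_is_odd = np.allclose(nonlinear_cascade(v) + nonlinear_cascade(-v), 0.0)
```

Since the ReLU cascade violates a property that every linear map has, no single matrix can reproduce it, which is why depth adds representation ability only when nonlinearities are present.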
The theory of DL can be understood from angles other than dictionary learning (Figure 7). On the one hand, DL can be treated as an ultra-high-dimensional nonlinear mapping from the data space to the feature space or the target space, where the nonlinear mapping is represented by a DNN. Therefore, DL is basically a high-dimensional nonlinear optimization problem. On the other hand, recurrent neural networks (RNNs) are basically a solution of an ordinary differential equation with the Euler method (Chen et al., 2018). A generative adversarial network (GAN) (Creswell et al., 2018; Goodfellow et al., 2014) can also be interpreted by the theory of optimal transportation, since the targets of a GAN are mainly manifold learning and probability distribution transformation, that is, transformation between the given white noise and the data distribution (Lei et al., 2020). RNNs and GANs are two specific DNNs and will be introduced in the next subsection.
Figure 8. (a) In a fully connected neural network (FCNN), the inputs of one layer are connected to every unit in the next layer. f stands for a nonlinear activation function. In (b-f), we omit the details of the layers and maintain the shape of each network architecture. (b) A vanilla convolutional neural network (CNN) is a cascade of convolutional layers, pooling layers, nonlinear layers, etc. In a CNN, the outputs of the convolutional layers are either the same size as or smaller than the input, depending on the strides used for convolution. Pooling layers reduce the size of the extracted features. In regression or classification tasks, the output usually has the same dimension as or a smaller dimension than the input ((b) shows the latter situation). The difference between regression and classification is that the outputs are continuous variables in regression tasks and discrete variables representing categories in classification tasks. (c) The dimension of the latent feature space in a convolutional autoencoder (CAE) may be either larger or smaller than that of the data space; (c) shows the latter. (d) Skip connections in U-Net are used to bring low-level features to a high level. (e) In a GAN, low-dimensional random vectors are used to generate a sample from the generator, and then the sample is classified as true or false by the discriminator. (f) In a recurrent neural network (RNN), the output or hidden state of the network is used as input in a cycle.
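One common reading of the ordinary differential equation view mentioned above is that a recurrence of the form h ← h + Δt f(h) is exactly a forward Euler step for dh/dt = f(h). The sketch below illustrates this under stated assumptions: the first example uses a known decay equation so that the Euler error can be measured, and the second uses a hypothetical right-hand side tanh(Wh + b) with random weights in place of a trained network.

```python
import numpy as np

def euler_recurrence(f, h0, dt, n_steps):
    """Apply h <- h + dt * f(h) repeatedly, as a forward Euler solver would."""
    h = np.asarray(h0, dtype=float)
    for _ in range(n_steps):
        h = h + dt * f(h)
    return h

# Known ODE dh/dt = -h with h(0) = 1, so that h(1) = exp(-1)
decay = lambda h: -h
err_coarse = abs(euler_recurrence(decay, 1.0, 0.1, 10) - np.exp(-1.0))
err_fine = abs(euler_recurrence(decay, 1.0, 0.001, 1000) - np.exp(-1.0))

# Same recurrence with a hypothetical learned right-hand side tanh(W h + b)
rng = np.random.default_rng(0)
W, b = 0.5 * rng.standard_normal((4, 4)), np.zeros(4)
h_T = euler_recurrence(lambda h: np.tanh(W @ h + b), np.ones(4), 0.1, 20)
```

Each loop iteration plays the role of one recurrent step that reuses the same parameters, and refining the step size brings the discrete state closer to the continuous solution.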

Deep Neural Network Architectures
The key components of DL are the training set, network architectures and parameter optimization. The architectures of DNNs vary in different applications; here, we introduce several commonly used architectures.
A fully connected neural network (FCNN) (Figure 8a) is an ANN composed of fully connected layers, where the inputs of one layer are connected to every unit in the next layer. In each unit, the weighted summation of the inputs passes through a nonlinear activation function f. Typical choices of f in DL are the rectified linear unit (ReLU), sigmoid, and tanh functions, as shown in Figure 9a. The number of layers in an FCNN has a significant effect on the fitting and generalization abilities of the model. However, FCNNs were restricted to a few layers due to the computational capacity of the available hardware, the vanishing and exploding gradient problems during optimization, etc. With the development of hardware and optimization algorithms, ANNs tend to become deeper. On the other hand, if a raw data set is input directly into an FCNN, a massive number of parameters is required since each pixel corresponds to one feature, especially for high-dimensional inputs. Features are therefore used to reduce the dimension at the input layer and, as a result, the number of parameters in the model. An FCNN requires preselected features, relies fully on experience, and ignores the structure of the input entirely. Automated feature selection algorithms have been proposed (Qi et al., 2020) but require high computational resources. To reduce the number of parameters in an FCNN and consider local coherency in an image, convolutional neural networks (CNNs) (Figure 8b) were proposed to share network parameters with convolutional filters.
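The forward pass of an FCNN and the parameter growth discussed above can be made concrete with a short sketch. The layer widths and the 64 × 64 image size are hypothetical choices for illustration, and the weights are random rather than trained.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fcnn_forward(x, layers, f=relu):
    """layers is a list of (W, b); f is applied after every layer except the last."""
    for W, b in layers[:-1]:
        x = f(W @ x + b)               # weighted summation, then activation
    W, b = layers[-1]
    return W @ x + b

def count_fc_params(sizes):
    """Weights plus biases of fully connected layers with the given unit counts."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]                 # hypothetical layer widths
layers = [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
out = fcnn_forward(rng.standard_normal(8), layers, f=np.tanh)

# One fully connected layer on a flattened 64 x 64 image versus one shared 3 x 3 filter
fc_params = count_fc_params([64 * 64, 64 * 64])
conv_params = 3 * 3 + 1
```

Under these assumptions the single fully connected layer needs 16,781,312 parameters, whereas the shared convolutional filter needs 10, which illustrates the motivation for parameter sharing.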
CNNs have developed rapidly since 2010 for image classification and segmentation; popular CNNs include VGGNet (Simonyan & Zisserman, 2015) and AlexNet (Krizhevsky et al., 2017). CNNs are also used in image denoising (Zhang, Zuo, Chen, et al., 2017) and super-resolution tasks (Dong et al., 2014). A CNN uses original data rather than selected features as an input set and uses convolutional filters to restrict the inputs of a neuron to a local range. The convolutional filters are shared by different neurons in the same layer. As shown in Figure 9b, one typical block in a CNN consists of one convolutional layer, one nonlinear layer, one batch normalization layer, and one pooling layer. Convolutional layers and nonlinear layers provide the basic components of a CNN. Batch normalization layers prevent gradient explosion and stabilize the training. Pooling layers subsample the input to extract key features. The simplest CNNs are called vanilla CNNs, which are CNNs with simple sequential structures (the same holds for vanilla FCNNs). Vanilla CNNs are reliable for most applications in geophysics, such as denoising, interpolation, velocity modeling, and data interpretation, if many training samples and labels are available. A CNN is invariant to small changes in the inputs due to the pooling layers. However, pooling layers lose information, so a CNN cannot characterize the changes in the input. Capsule networks (Sabour et al., 2017) were proposed to simultaneously keep the invariance and characterize the changes. This is achieved by replacing scalars with vectors as the inputs and outputs of the neurons. The length of the vector represents the probability that an entity exists. The orientation of the vector stands for the parameters of the entity.
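The typical block described above (convolution, nonlinearity, batch normalization, pooling) can be written out directly in NumPy. The sketch assumes a single channel, a single shared filter, a stride of one without padding, and a batch normalization without learned scale and shift; these simplifications are ours and are not taken from the cited networks.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide one shared filter over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def batch_norm(feats, eps=1e-5):
    """Normalize one channel over the batch and spatial axes."""
    return (feats - feats.mean()) / np.sqrt(feats.var() + eps)

def max_pool2x2(x):
    """Keep the maximum of every non-overlapping 2 x 2 window."""
    H, W = x.shape[-2] // 2 * 2, x.shape[-1] // 2 * 2
    x = x[..., :H, :W]
    return x.reshape(*x.shape[:-2], H // 2, 2, W // 2, 2).max(axis=(-3, -1))

def cnn_block(batch, kernel):
    """Convolution, nonlinearity, batch normalization, and pooling, in that order."""
    feats = np.stack([conv2d_valid(img, kernel) for img in batch])
    feats = np.maximum(feats, 0.0)     # ReLU
    feats = batch_norm(feats)
    return max_pool2x2(feats)

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 10, 10))   # four hypothetical 10 x 10 inputs
kernel = rng.standard_normal((3, 3))
out = cnn_block(batch, kernel)             # shape (4, 4, 4)
```

The same nine filter weights are reused at every position of every input, and the pooling step halves each spatial dimension of the extracted features.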
More DL network architectures have been proposed for specific tasks based on vanilla FCNNs or CNNs. An autoencoder learns to reconstruct its inputs through useful representations, using an encoder and a decoder (Makhzani, 2018). The encoder uses nonlinear layers to map the inputs to a latent space. The decoder uses nonlinear layers to decode the latent features into the original data space. Autoencoders are trained in a self-supervised manner. To obtain a meaningful representation, additional constraints are imposed on the network. For example, undercomplete autoencoders limit the latent space to be smaller than the input space, such that the encoder extracts critical features. Sparse autoencoders are usually overcomplete, with a latent space larger than the input space, and impose a sparse regularization on the latent space. Denoising autoencoders and contractive autoencoders learn useful representations by making the autoencoder robust to variations in the input. Convolutional autoencoders (CAE, Figure 8c) use convolutional layers in the encoder and deconvolutional layers in the decoder.
U-Nets (Ronneberger et al., 2015) (Figure 8d) have U-shaped structures and skip connections. The skip connections bring low-level features to high levels. U-Net was first proposed for image segmentation and has been applied in seismic data processing, inversion, and interpretation. The U-shaped structure, with a contracting path and an expanding path, makes every data point in the output contain all information from the input, such that the approach is suitable for mapping data between different domains, such as inverting velocity from seismic records. For a trained U-Net, the input size of the test set must be the same as that of the training set. The data need to be processed patch-wise if their size is not identical to the size required by the U-Net.
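Patch-wise processing, as required when the data size does not match the input size of a trained network, can be sketched as follows. The helper names, the zero padding, and the non-overlapping layout are illustrative assumptions; in practice overlapping patches are often used, which is not shown here. The `network` argument is a placeholder for any function that maps a patch to a patch of the same size.

```python
import numpy as np

def to_patches(data, p):
    """Zero-pad a 2D array to a multiple of p and cut it into p x p patches."""
    H, W = data.shape
    Hp, Wp = -(-H // p) * p, -(-W // p) * p          # round up to a multiple of p
    padded = np.zeros((Hp, Wp), dtype=data.dtype)
    padded[:H, :W] = data
    patches = padded.reshape(Hp // p, p, Wp // p, p).swapaxes(1, 2).reshape(-1, p, p)
    return patches, (H, W)

def from_patches(patches, shape, p):
    """Reassemble the patches and crop the padding away."""
    H, W = shape
    nh, nw = -(-H // p), -(-W // p)
    padded = patches.reshape(nh, nw, p, p).swapaxes(1, 2).reshape(nh * p, nw * p)
    return padded[:H, :W]

def apply_patchwise(data, p, network):
    """Run a fixed-input-size network on every patch and stitch the outputs."""
    patches, shape = to_patches(data, p)
    processed = np.stack([network(patch) for patch in patches])
    return from_patches(processed, shape, p)

data = np.arange(100 * 70, dtype=float).reshape(100, 70)   # hypothetical section
patches, shape = to_patches(data, 32)                       # 4 x 3 = 12 patches
restored = apply_patchwise(data, 32, lambda patch: patch)   # identity "network"
```

With an identity placeholder the stitched output reproduces the input exactly, which is a useful sanity check before a real network is substituted.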
A GAN (Figure 8e) is trained adversarially, with one generator that produces a fake image or any other type of data and one discriminator that distinguishes the produced samples from the real ones. When training the discriminator, the real data set and the generated data set correspond to the labels one and zero, respectively. When the generator is trained, all datasets correspond to the label one. Such a game finally allows the generative network to produce fake images that the discriminative network cannot distinguish from real images. A GAN is used to generate samples with a distribution similar to that of the training set. The generated samples are used for simulating realistic scenarios or expanding the training set. An extended GAN, named CycleGAN, was proposed with two generators and two discriminators for signal processing. In CycleGAN, a two-way mapping is trained between two datasets. The training set of CycleGAN is not necessarily paired as in a vanilla CNN, which makes it relatively easy to construct training sets in geophysical applications.
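The labeling convention described above translates into two binary cross-entropy losses. The sketch below assumes that the discriminator outputs a probability between zero and one; the numerical values of these outputs are hypothetical, and no networks are trained.

```python
import numpy as np

def bce(p, label, eps=1e-12):
    """Binary cross-entropy between predicted probabilities p and a constant label."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """Real samples are labeled one and generated samples are labeled zero."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """When the generator is trained, generated samples are labeled one."""
    return bce(d_fake, 1.0)

d_real = np.array([0.9, 0.8, 0.95])   # hypothetical discriminator outputs on real data
d_fake = np.array([0.2, 0.1, 0.3])    # and on generated data
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

When the discriminator can no longer tell the two sets apart and outputs 0.5 everywhere, its loss equals 2 ln 2, which corresponds to the end of the game described in the text.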
RNNs (Figure 8f) are commonly used for tasks related to sequential data, where the current state depends on the history of inputs fed into the neural network. Long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) is a widely used RNN that controls how much historical information is forgotten or remembered. The main advantage of LSTM is in handling longer sequences than the vanilla RNN, which suffers from the vanishing gradient problem for long sequences. Therefore, the inference accuracy of LSTM increases with the amount of historical information considered. The gated recurrent unit (GRU) (Cho et al., 2014) is a variant of LSTM with a simpler architecture. Compared to LSTM, GRU has similar performance with fewer parameters, so it is computationally cheaper. In geophysical applications, RNNs are mainly used for predicting the next sample of a temporally or spatially sequenced data set. RNNs are also used for seismic wavefield or earthquake signal modeling by simulating the time-dependent discrete partial differential equation.
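To make the gating idea and the parameter comparison concrete, the following sketch implements one GRU step in a common formulation and counts parameters for GRU and LSTM layers. The gate equations follow one widely used convention, and the counts assume one bias vector per gate; other implementations may differ in both respects. The input and hidden sizes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru(d, h, rng):
    """One (W, U, b) triple per gate: update z, reset r, candidate c."""
    return {g: (0.1 * rng.standard_normal((h, d)),
                0.1 * rng.standard_normal((h, h)),
                np.zeros(h)) for g in ("z", "r", "c")}

def gru_step(x, h_prev, p):
    (Wz, Uz, bz), (Wr, Ur, br), (Wc, Uc, bc) = p["z"], p["r"], p["c"]
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)          # how much of the state to overwrite
    r = sigmoid(Wr @ x + Ur @ h_prev + br)          # how much history to expose
    c = np.tanh(Wc @ x + Uc @ (r * h_prev) + bc)    # candidate state
    return (1.0 - z) * h_prev + z * c

def n_params(d, h, n_gates):
    """Input weights, recurrent weights, and biases for n_gates gates."""
    return n_gates * (h * d + h * h + h)

rng = np.random.default_rng(0)
d, h = 16, 32                                       # hypothetical input and hidden sizes
params = init_gru(d, h, rng)
state = np.zeros(h)
for x_t in rng.standard_normal((50, d)):            # a 50-sample sequence
    state = gru_step(x_t, state, params)

gru_count = n_params(d, h, 3)                       # three gates in a GRU
lstm_count = n_params(d, h, 4)                      # four gates in an LSTM
```

For these sizes the GRU layer has 4,704 parameters and the LSTM layer has 6,272, that is, three quarters of the LSTM count under the stated convention.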

DL Geophysical Applications
The most direct method for applying DL in geophysics is transferring geophysical tasks to computer vision tasks, such as denoising or classification. However, in certain geophysical applications, the characteristics of the tasks or data are quite different from those in computer vision. For example, in geophysics, we have large-scale and high-dimensional data but fewer annotated labels. In this section, we introduce how DL approaches relieve the bottlenecks of traditional methods, what difficulties we encounter, and how to solve them. The development of DL applications in exploration geophysics is reviewed first, followed by applications in earthquake science, remote sensing, and other areas.
Figure 11. Comparison of traditional and DL-based methods in exploration geophysics. (a) In random-noise denoising tasks, the curvelet denoising method (Herrmann & Hennenfent, 2008) assumes that the signal is sparse under the curvelet transform, and a matching method is used for denoising. In velocity inversion tasks, full-waveform inversion based on the wave equation is used for forward and adjoint modeling in the optimization algorithm. In fault interpretation tasks, faults are picked by interpreters. (b) The mentioned tasks are treated as regression problems that are optimized with neural networks. Different tasks may require different neural network architectures.

Exploration Geophysics
Exploration geophysics images the Earth's subsurface by inverting collocated physical fields at the surface, among which seismic wavefields are the most commonly used. Seismic exploration uses reflected seismic waves to predict subsurface structures. The main processes of seismic exploration consist of seismic data sampling and processing (denoising, interpolation, etc.), inversion (migration, imaging, etc.), and interpretation (fault detection, facies classification, etc.). Figure 10 summarizes the procedure of exploration geophysics. Figure 11 compares traditional and DL-based methods in exploration geophysics.

Seismic Data Processing
Seismic data are contaminated by different types of noise, such as random noise from the background, ground roll that travels along the surface with high energy and masks useful signals, and multiples that are reflected multiple times between interfaces. One of the long-standing problems in exploration geophysics is to remove noise and improve the signal-to-noise ratio (SNR) of signals. Traditional methods use handcrafted filters or regularization to remove certain kinds of noise by analyzing the corresponding features (Herrmann & Hennenfent, 2008). However, handcrafted filters fail when the signal and noise share a common feature space. DL methods avoid feature selection when used for seismic denoising. For example, the U-Net-based DeepDenoiser can separate signals and noise by learning a nonlinear regression. Moreover, with DnCNN (Zhang, Zuo, Chen, et al., 2017), a CNN for denoising, the same architecture can be used for three kinds of seismic noise while achieving a high SNR (Yu et al., 2019), as long as a corresponding training set is constructed. However, there is still a long way to go. A DNN trained on synthetic datasets does not generalize well to field data. To make the network reusable, transfer learning (Donahue et al., 2014) can be used for field data denoising. Sometimes the labels of clean data are difficult to obtain, and one solution is to use multiple trials involving user-generated white noise to simulate real white noise (Wu, Zhang, Lin, Li, & Liu, 2019).
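The SNR used to evaluate denoising results can be computed as in the following sketch. It assumes the common definition in decibels based on the energy ratio between the clean signal and the residual, which the text does not spell out. The sinusoidal trace is a synthetic stand-in for a clean seismic trace, and the moving average is only a placeholder for any denoiser, including a trained network.

```python
import numpy as np

def snr_db(clean, estimate):
    """10 * log10(signal energy / residual energy), in decibels."""
    clean = np.asarray(clean, dtype=float)
    residual = np.asarray(estimate, dtype=float) - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

t = np.linspace(0.0, 1.0, 500)
clean = np.sin(2 * np.pi * 25 * t)                   # stand-in for a clean trace
rng = np.random.default_rng(0)
noisy = clean + 0.5 * rng.standard_normal(t.size)    # added random noise

# A 5-point moving average as a placeholder for any denoiser
denoised = np.convolve(noisy, np.ones(5) / 5, mode="same")
snr_in, snr_out = snr_db(clean, noisy), snr_db(clean, denoised)
```

Note that this measure requires the clean signal, which is available for synthetic data but usually not for field data, one reason why labels are difficult to obtain in practice.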
An example of scattered ground-roll attenuation is shown in Figure 12 (Yu et al., 2019). Scattered ground roll is mainly observed in desert areas and is caused by the scattering of ground roll when the near surface is laterally heterogeneous. The scattered ground roll is difficult to remove because it occupies the same frequency band as the reflected signals. DnCNN was used to remove scattered ground roll successfully.
Due to environmental or economic limitations, seismic geophones are usually placed irregularly or not densely enough to satisfy the Nyquist sampling criterion. The reconstruction or regularization of seismic data to a dense and regular grid is essential to improve inversion resolution. In the beginning, end-to-end DNNs were proposed for the reconstruction of regularly missing data (Wang, Zhang, Lu, et al., 2019) and randomly missing data (Mandelli et al., 2018). However, the training sets are numerically synthetic and do not generalize well to field data. We can borrow training data from a natural image data set to train DnCNN and then embed it in the traditional projection onto convex sets (POCS; Abma & Kabir, 2006) framework (Zhang, Yang, et al., 2020). The resulting interpolation algorithm generalized well to seismic data. Moreover, no new networks were required for the interpolation of other datasets. Figure 13 gives the training set and a simple interpolation result (Zhang, Yang, et al., 2020).
First arrival picking selects the first jumps of useful signals. It has been automated but still needs intensive human intervention to check picks in the presence of significant static corrections, weak energy, low signal-to-noise ratios, and dramatic phase changes. DL helps improve the automation and accuracy of first arrival picking on realistic seismic data. When DL is used, it is natural to transform first arrival picking into a classification problem by setting the first arrival to one and other locations to zero. However, such a setting can cause imbalanced labels. An interesting approach treats first arrival picking as an image classification problem, where anything before the first arrival is set to zero and all instances after the first arrival are set to one. This method works well for noisy situations and field datasets. After the segmentation image is obtained, a more advanced picking algorithm, such as an RNN, can be applied to take advantage of the global information. Figure 14 shows the results of first arrival picking based on U-Net. We used 8,000 synthetic seismological samples. A gradient constraint was added to the loss function to enhance the continuity of the selected positions. For the output, three classes were set: zeros before the first arrival, ones after the first arrival, and twos for the first arrival. The training data set was contaminated with strong noise and had missing traces. The predicted picking results were close to the labels.
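The three labeling schemes discussed above (ones only at the first arrival, zeros before and ones after, and the three-class variant) can be constructed as in the following sketch. The function name and the pick positions are hypothetical, and assigning the arrival sample itself to class one in the two-class step labels is our assumption.

```python
import numpy as np

def first_arrival_labels(n_samples, picks):
    """Build label images of shape (n_traces, n_samples) from pick indices."""
    t = np.arange(n_samples)[None, :]
    picks = np.asarray(picks)[:, None]
    point = (t == picks).astype(int)       # ones only at the first arrival
    step = (t >= picks).astype(int)        # zeros before, ones from the arrival on
    three = np.where(t < picks, 0, np.where(t == picks, 2, 1))   # three classes
    return point, step, three

picks = [3, 5]                             # hypothetical first-arrival sample indices
point, step, three = first_arrival_labels(8, picks)

positive_fraction_point = point.mean()     # few positives: imbalanced labels
positive_fraction_step = step.mean()       # far more balanced
recovered = np.argmax(step, axis=1)        # first index labeled one in each trace
```

For realistic trace lengths the point labels contain a single positive sample per trace, which illustrates the imbalance mentioned in the text, while the step labels still allow the pick to be recovered as the first sample labeled one.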
This paragraph summarizes further DL-based seismic signal processing work that falls outside the scope above. Signal compression is essential for the storage and transmission of seismic data. Seismic data are traditionally stored at 32 bits per sample. With an RNN that estimates the relationships among samples in a seismic trace and compresses the data, only 16 bits are needed for lossless representations, so that half of the storage is saved (Payani et al., 2019). Seismic registration aligns seismic images for tasks such as time-lapse studies. However, when large shifts and rapid changes exist, this task is extremely difficult. Borrowing the concept of optical flow, a CNN is trained with two seismic images as inputs and the shift as output. The method outperforms traditional methods but depends on the training data set (Dhara and Bagaini, 2020).

Seismic Data Imaging
Seismic imaging is a challenging problem since traditional methods such as tomography and full waveform inversion (FWI) suffer from several bottlenecks.
1. Imaging is time-consuming due to the curse of dimensionality.
2. Imaging relies heavily on human interaction to select proper velocities.
3. Nonlinear optimization needs a good initialization or low-frequency information; however, recorded data lack low-frequency energy.
DL methods help relieve these bottlenecks from several angles.
First, end-to-end DL-based imaging methods use recorded data as inputs and velocity models as outputs, which provides a totally different imaging approach. DL methods avoid the mentioned bottlenecks, providing a next-generation imaging method. The first attempts at DL in stacking ( One important issue is that the input is in the data space and the output is in the model space, both with high-dimensional parameters. U-Net is used to map between these spaces of different dimensions, and downsampling is used to reduce the number of parameters while training the DNN (Yang & Ma, 2019). Figure 15 shows the velocity inversion results from Yang and Ma (2019).
However, end-to-end DL imaging also has disadvantages, such as a lack of training samples and restricted input sizes due to memory limitations. An interesting work used smoothed natural images as velocity models, thus producing a large number of models to construct the training set. Figure 16 shows an example of how a three-channel color image is converted to a velocity model.
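The conversion of a color image to a velocity model can be sketched as follows. This is a minimal illustration, not the procedure of the cited work: the channel averaging, the repeated 3 × 3 box smoothing, and the velocity range of 1,500 to 4,500 m/s are all assumptions chosen for the example, and `image_to_velocity` is a hypothetical helper name.

```python
import numpy as np

def image_to_velocity(rgb, v_min=1500.0, v_max=4500.0, smooth_passes=5):
    """Map an H x W x 3 color image to a smooth velocity model in m/s."""
    # Collapse the three color channels into one intensity channel.
    gray = rgb.astype(float).mean(axis=2)
    # Smooth with repeated 3x3 box filtering (edge values are replicated).
    for _ in range(smooth_passes):
        padded = np.pad(gray, 1, mode="edge")
        gray = sum(
            padded[i:i + gray.shape[0], j:j + gray.shape[1]]
            for i in range(3) for j in range(3)
        ) / 9.0
    # Rescale intensities linearly to the assumed velocity range.
    span = gray.max() - gray.min()
    unit = (gray - gray.min()) / span if span > 0 else np.zeros_like(gray)
    return v_min + unit * (v_max - v_min)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))  # stand-in for a natural image
velocity = image_to_velocity(image)
```

Applying such a mapping to many natural images yields many distinct velocity models, each of which can then be passed to a forward modeling code to generate the paired seismic records.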
To make DL-based imaging applicable to large-scale inputs, more works combine DL with traditional methods to address one of the mentioned bottlenecks, such as extrapolating the frequency range of seismic data from high to low frequencies for FWI (Fang, Zhou, et al., 2020;Ovcharenko et al., 2019), and adding constraints to FWI (Zhang & Alkhalifah, 2019). To mitigate the "curse of dimensionality" problem of global optimization in FWI, CAE is used to reduce the dimension of FWI by optimizing in the latent space (Gao et al., 2019). Another work targets the high computational cost of forward modeling when the high-order finite difference method is used. A GAN is used to produce a high-quality wavefield from a low-quality wavefield computed with a lower-order finite difference in the context of surface-related multiples, ghosts, and dispersion (Siahkoohi et al., 2019). U-Net can be used for velocity picking in stacking (Figure 17, Wang et al., 2021). The inputs are seismological data, and the outputs have values of one where the picks are located and values of zero elsewhere.
An alternative is to replace the FWI objective with an RNN loss function. The structure of an RNN is similar to that of finite-difference time evolution, and the network parameters correspond to the selected velocity model. Therefore, optimizing an RNN is equivalent to optimizing FWI (Sun, Niu, et al., 2020). Such a strategy is extended to the simultaneous inversion of velocity and density (Liu, 2020). Figure 18 shows the structure of a modified RNN based on the acoustic wave equation used in Liu (2020). The diagram represents the discretized wave equation implemented in an RNN with a flow chart. The optimization method in FWI can also be learned by a DNN rather than with a gradient-descent-based approach (Sun & Alkhalifah, 2020). An ML-descent method is proposed to consider the historical information of the gradient based on an RNN rather than handcrafted directions.
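The analogy between finite-difference time stepping and a recurrent cell can be made concrete with a small sketch. It assumes a 1D constant-density acoustic wave equation, second-order differences, fixed (reflecting) boundaries, and a Ricker source. The function names and all numerical values are illustrative and are not taken from the cited papers. The pair of wavefields `(u_prev, u_curr)` plays the role of the hidden state, and `velocity` plays the role of the trainable weights.

```python
import numpy as np

def wave_cell(u_prev, u_curr, velocity, source_t, src_idx, dt, dx):
    """One recurrent step: (u_prev, u_curr) -> (u_curr, u_next)."""
    lap = np.zeros_like(u_curr)
    lap[1:-1] = (u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]) / dx**2
    u_next = 2.0 * u_curr - u_prev + (velocity * dt) ** 2 * lap
    u_next[src_idx] += dt**2 * source_t  # inject the source sample
    return u_curr, u_next

def forward(velocity, source, src_idx, rec_idx, dt, dx):
    """Unroll the cell over time and record the wavefield at one receiver."""
    u_prev = np.zeros_like(velocity)
    u_curr = np.zeros_like(velocity)
    record = np.empty(len(source))
    for step, s in enumerate(source):
        u_prev, u_curr = wave_cell(u_prev, u_curr, velocity, s, src_idx, dt, dx)
        record[step] = u_curr[rec_idx]
    return record

def fwi_loss(velocity, observed, source, src_idx, rec_idx, dt, dx):
    """Least-squares data misfit, i.e. the loss of the unrolled network."""
    synthetic = forward(velocity, source, src_idx, rec_idx, dt, dx)
    return 0.5 * np.sum((synthetic - observed) ** 2)

nx, nt, dx, dt = 201, 1000, 10.0, 1e-3
velocity = np.full(nx, 2000.0)           # m/s; v * dt / dx = 0.2 keeps the scheme stable
time = np.arange(nt) * dt
arg = (np.pi * 15.0 * (time - 0.1)) ** 2
source = (1.0 - 2.0 * arg) * np.exp(-arg)  # 15 Hz Ricker wavelet
synthetic = forward(velocity, source, src_idx=50, rec_idx=150, dt=dt, dx=dx)
```

In a DL framework the same loop would be written with differentiable tensors, so that the gradient of `fwi_loss` with respect to `velocity` is obtained by backpropagation through time.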

Seismic Data Interpretation and Attributes Analysis
Seismic interpretation (faults, layers, dips, etc.) and attribute analysis (impedance, frequency, facies, etc.) help extract subsurface geologic information and locate underground sweet spots. However, both tasks are time-consuming since interventions by experts are required. Preliminary works show that DL has the potential to improve the efficiency and accuracy of seismic interpretation and attribute analysis. The localization of faults, layers, and dips in seismic interpretation is similar to object detection in computer vision. Therefore, DNNs for image detection can be directly applied in seismic interpretation. However, unlike in the computer vision industry, it is difficult to obtain a public training set or to manually construct a training set for field datasets. Building realistic synthetic datasets rather than handcrafted field datasets is more efficient and can produce similar results. Therefore, synthetic samples are used for training. To build an approximately realistic 3D training data set, folding and faulting parameters are chosen randomly within a reasonable range. Then, the data set is used to train a 3D U-Net for the seismic structural interpretation of features, such as faults, layers, and dips, in field datasets. If the detected objects make up only a small proportion of the samples, a class-balanced binary cross-entropy loss function is used to compensate for the data imbalance so that the network is not trained to predict only zeros. An alternative to a synthetic training set is a semi-automated approach that annotates the targets on a coarse scale and predicts them on a fine scale (Wu, Zhang, Lin, Cao, et al., 2019). An example of synthetic post-stack image and field data fault analysis is shown in Figure 19.
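A class-balanced binary cross-entropy can be sketched as follows. The weighting used here, with the positive class weighted by the fraction of negative samples and vice versa, is one common choice and is an assumption of this example; the cited works may weight the classes differently. `balanced_bce` is a hypothetical function name.

```python
import numpy as np

def balanced_bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy with weights that equalize the two classes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    beta = 1.0 - y_true.mean()  # fraction of non-fault (zero) samples
    pos = -beta * y_true * np.log(y_prob)
    neg = -(1.0 - beta) * (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(np.mean(pos + neg))

labels = np.zeros(100)
labels[:5] = 1.0                                 # only 5% of the samples are faults
all_zero = np.full(100, 0.01)                    # network that predicts "no fault" everywhere
good = np.where(labels == 1.0, 0.99, 0.01)       # network that finds the faults
loss_all_zero = balanced_bce(labels, all_zero)
loss_good = balanced_bce(labels, good)
```

With this weighting, the five fault samples carry the same total weight as the 95 non-fault samples, so missing the rare class is penalized as heavily as mislabeling the frequent one.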
Attribute analysis is similar to image classification, where seismic images are the inputs and areas labeled with different attributes are the outputs. Therefore, DNNs for image classification can be directly applied in seismic attribute analysis (Das et al., 2019;Feng, Mejer Hansen, et al., 2020;You et al., 2020). If the attributes cannot be directly computed from the seismic data, a DNN can work in a cascaded way (Das & Mukerji, 2020). If labels are not available, CAE is used for feature extraction, and then a clustering method, such as K-means, is used for unsupervised clustering (He et al., 2018;Qian et al., 2018). Clustering refers to grouping similar attributes in an unsupervised manner. For example, we can use clustering to decide whether a region contains fluvial facies or faults based on stacked sections. CAE and K-means can further be optimized simultaneously for better feature extraction (Mousavi, Zhu, Ellsworth, et al., 2019). To mitigate the dependence of vanilla CNNs on the amount of labeled seismic data available, a 1D CycleGAN-based algorithm was proposed for impedance inversion. The CycleGAN does not require paired training sets; only two sets, one with and one without high fidelity, are needed. To consider the spatial continuity and similarity of adjacent traces, an RNN is used in facies analysis.

Earthquake Science
The goal of earthquake data processing is quite different from that of exploration geophysics; therefore, this section focuses on DL-based earthquake signal processing. The preliminary processing of earthquake signals includes classification to distinguish real earthquakes from noise and arrival picking to identify the arrival times of primary (P) and secondary (S) waves. Further applications involve earthquake location and Earth tomography. DL has shown promising results in these applications.

Earthquake and Noise Classification
Earthquake signal and noise classification is the most fundamental and difficult task in earthquake early warning (EEW). DL is well suited to signal and noise discrimination since it is a classification task. With a sufficient training set, a DNN can achieve up to 99.2% and 99.5% precision in different regions. To detect small and weak earthquake signals in the presence of strong noise and non-earthquake signals, a residual network with convolutional and recurrent units was developed (Mousavi, Zhu, Sheng, et al., 2019). RNNs and CNNs are also used in a more challenging task: distinguishing between anthropogenic sources, such as mining or quarry blasts, and tectonic seismicity (Linville et al., 2019). Specific tasks, such as volcano seismic detection, require more signal categories to be identified (Titos et al., 2019). Volcano seismic signals can be classified into six classes: long-period events, volcanic tremors, volcano-tectonic events, explosions, hybrid events, and tornillos (Malfante et al., 2018). Uncertainty is also considered in volcano-seismic monitoring.
We provide an example of using the wavelet scattering transform (WST) (Mallat, 2012) and a support vector machine for earthquake classification with a limited number of training samples. The WST involves a cascade of wavelet transforms, a modulus operator, and an averaging operator, corresponding to convolutional filters, a nonlinear operator, and a pooling operator in a CNN, respectively. The critical difference between the WST and a CNN is that the filters in the WST are predesigned with the wavelet transform rather than learned. In our case, only 100 records were used for training, and 2,000 records were used for testing. We obtained a classification accuracy as high as 93% with the WST method. Figure 20 shows the architecture of the WST algorithm.
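The three WST operators can be illustrated with a simplified first-order sketch. It is not the implementation used for the result above: the Gaussian band-pass filters stand in for Morlet wavelets, block averaging stands in for the low-pass filter, only one scattering layer is computed, and the center frequencies and pooling length are assumptions.

```python
import numpy as np

def bandpass_bank(n, centers):
    """Predesigned Gaussian band-pass filters in the frequency domain (Morlet-like)."""
    freqs = np.fft.fftfreq(n)  # cycles per sample
    return np.array([np.exp(-0.5 * ((freqs - fc) / (fc / 4.0)) ** 2) for fc in centers])

def scattering_first_order(x, centers=(0.25, 0.125, 0.0625), pool=32):
    """Return zeroth- and first-order scattering coefficients of a 1D signal."""
    n = len(x) - len(x) % pool
    x = np.asarray(x[:n], dtype=float)
    bank = bandpass_bank(n, centers)
    # Wavelet transform (convolution as a product in frequency), then modulus.
    u1 = np.abs(np.fft.ifft(np.fft.fft(x)[None, :] * bank, axis=1))
    # Averaging operator: non-overlapping block means.
    s1 = u1.reshape(len(centers), -1, pool).mean(axis=2)
    s0 = x.reshape(-1, pool).mean(axis=1)
    return s0, s1

samples = np.arange(1024)
record = np.sin(2.0 * np.pi * 0.125 * samples)  # toy record with a single frequency
s0, s1 = scattering_first_order(record)
features = np.concatenate([s0, s1.ravel()])     # vector that a classifier such as an SVM could use
```

For this toy record, the band centered at 0.125 cycles per sample carries the largest first-order coefficients, which shows how the fixed filters separate signals by frequency content without any training.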

Arrival Picking
Arrival picking for earthquakes identifies the arrival times of P and S waves. Traditional automated arrival picking algorithms, such as the short-term average/long-term average (STA/LTA) method, are less precise than human experts and rely on threshold settings. DL-based arrival picking overcomes these shortcomings and helps illuminate the Earth structure clearly (Wang, Xiao, et al., 2019). With a sufficiently large training set, one can achieve picking and classification accuracies remarkably higher than those of STA/LTA (Zhou et al., 2019), even close to or better than those of human experts (Ross et al., 2018, training set of 4.5 million seismograms). If labels are not sufficient, a GAN-based model, EarthquakeGen, can be used to artificially expand labeled data sets (Wang, Zhang, & Li, 2019). The detection accuracy was greatly improved by performing artificial sampling for the training set. Simultaneous earthquake detection and phase picking can further improve the accuracy of both tasks (Mousavi et al., 2020;Zhou et al., 2019).
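For reference, the traditional STA/LTA baseline mentioned above can be sketched in a few lines. The window lengths, the threshold of 4, and the synthetic trace are assumptions chosen for illustration; the threshold in particular has to be tuned per data set, which is the dependence noted in the text.

```python
import numpy as np

def sta_lta(trace, n_sta, n_lta):
    """Ratio of short-term to long-term average energy, both ending at each sample."""
    energy = np.asarray(trace, dtype=float) ** 2
    csum = np.concatenate([[0.0], np.cumsum(energy)])
    ratio = np.zeros(len(energy))
    idx = np.arange(n_lta, len(energy))  # samples with a full long-term window behind them
    sta = (csum[idx + 1] - csum[idx + 1 - n_sta]) / n_sta
    lta = (csum[idx + 1] - csum[idx + 1 - n_lta]) / n_lta
    ratio[idx] = sta / np.maximum(lta, 1e-12)
    return ratio

def pick_onset(trace, n_sta=20, n_lta=200, threshold=4.0):
    """Return the first sample whose STA/LTA ratio exceeds the threshold, or None."""
    above = np.flatnonzero(sta_lta(trace, n_sta, n_lta) > threshold)
    return int(above[0]) if above.size else None

rng = np.random.default_rng(0)
trace = 0.1 * rng.standard_normal(1000)         # background noise
k = np.arange(500)
trace[500:] += np.sin(2.0 * np.pi * 0.05 * k) * np.exp(-k / 100.0)  # event starting at sample 500
onset = pick_onset(trace)
```

The pick lands a few samples after the true onset because the short-term window needs to fill with signal energy before the ratio crosses the threshold.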

Earthquake Location and Other Applications
Earthquake location and magnitude estimation are important in EEW and subsurface imaging. Conventional earthquake location relies heavily on a velocity model and suffers from inaccurate phase picking. A CNN is used for earthquake location by taking the waveforms received at several stations as input and producing a location map as output. This method worked well for earthquakes ($M_L$ < 3.0) with low SNRs, for which traditional methods fail. The prediction results and errors of earthquake source locations are indicated in Figure 21. DL also helps estimate earthquake locations and magnitudes based on signals from a single station (Mousavi & Beroza, 2020a;Mousavi & Beroza, 2020b). Further applications include the association of seismic phases, which involves grouping the phase picks on multiple stations associated with an individual event, and relationship analysis between a strong earthquake and postseismic deformation (Yamaga & Mitsui, 2019).

Remote Sensing-a Geophysical Data Observation Means
Remote sensing is an important means to collect geophysical data and images by using sensors on satellites or aircraft. Remote sensing imagery mainly includes optical images, hyperspectral images, and synthetic aperture radar (SAR) images. Large-scale and high-resolution satellite optical color imagery can be used for precision agriculture and urban planning. To address the issue of object rotation variations, a rotation-invariant CNN for object detection in very high-resolution optical remote sensing images was proposed, where a rotation-invariant layer was introduced by enforcing the training samples before and after rotation to share the same features (Cheng et al., 2016). If the labels are not accurate, a two-step training approach can be used, where the CNN is first initialized with numerous inaccurate reference data and then refined on a small amount of correctly labeled data (Maggiori et al., 2017). To further improve the image resolution, the image contours were extracted with an edge-enhancement GAN to remove the artifacts and noise in super resolution (Jiang et al., 2019).
Images obtained by hyperspectral sensors have rich spectral information, such that different land cover categories can potentially be precisely differentiated. In recent years, numerous works have explored DL methods for hyperspectral image classification (Li, Song, et al., 2019). To consider the spectral-spatial structure simultaneously, a 3D CNN rather than a 2D one should be used to extract the effective features of hyperspectral imagery (Chen, Jiang, et al., 2016). The extracted features are useful for image classification and target detection and open a new window for future research. An alternative means to explore the relationships among different spectrum channels is to use RNN, which regards hyperspectral pixels as sequential data input (Mou et al., 2017).
SAR systems artificially enlarge the aperture of radar to produce high-resolution images. SAR can operate in all-weather and day-and-night conditions. CNN is used for target classification in SAR images, which avoided handcrafted features and provided higher accuracy (Chen, Wang, et al., 2016). To consider both the amplitude and phase information of complex SAR imagery, a complex-valued CNN for SAR image classification was proposed to process complex-valued inputs (Zhang, Wang, et al., 2017).

Other AI Geophysical Applications
We investigate more AI geophysical applications in this section. The topics are roughly arranged in order from the Earth to outer space.

The Earth's Structure
Understanding the structure of the Earth is a challenging task since observations are mainly limited to the Earth's surface. From the surface to the interior, the Earth is roughly divided into the surface, crustal layers, mantle, and core; however, the detailed structures and properties of the Earth are not clear. Soil moisture, an important soil attribute, was predicted for historical periods with high fidelity from two recent years of satellite data, showing LSTM's potential for hindcasting, data assimilation, and weather forecasting (Fang et al., 2017;Fang, Kifer, et al., 2020). High-resolution 3D CT data of rocks are required to determine rock properties but come with a small field of view. A CycleGAN was proposed to obtain super-resolution images from low-resolution ones by training on an unpaired data set. Volcanic deformation was detected by using a CNN to classify interferometric fringes in wrapped interferograms (Anantrasirichai et al., 2018). The crustal thickness in eastern Tibet and the western Yangtze craton was estimated from Rayleigh surface wave velocities with a DNN (Cheng et al., 2019). The mantle thermal state of simplified model planets was predicted based on DL with an accuracy of 99% for both the mean mantle temperature and the mean surface heat flux compared to the calculated values (Shahnas & Pysklywec, 2020).

Water Resources
Water on Earth has a great impact on ecosystems and natural disasters. DL can help address several major challenges in water sciences (Shen, 2018). DL can predict the loop current in the ocean by learning the patterns in sea surface height (SSH). An LSTM was proposed to predict SSH and the loop current in the Gulf of Mexico within 40 kilometers nine weeks in advance (Wang, Zhuang, et al., 2019). Due to the limit of computational memory, the region of interest is split into different sub-regions. Further works directly reconstruct SSH over a large spatial and temporal domain from sparsely sampled data with a CNN (Manucharyan et al., 2021). By using observations from satellites and coastal stations simultaneously, a GAN can be used to reconstruct the SSH of the whole North Sea (Zhang, Stanev, et al., 2020). DL also helps estimate icebergs in the pan-Antarctic near-coastal zone that covers the whole Antarctic continent for monitoring ice melt and sea level rise (Barbat et al., 2019), and coastal inundation for a better understanding of the geospatial and temporal characteristics of coastal flooding.
In addition to oceans, water is stored in different forms, such as rivers, lakes, rain, and ice. DL has found roles in estimating groundwater storage, estimating global water storage in the US (Sun, Scanlon, et al., 2020), measuring accurate river widths by super resolution (Ling et al., 2019), predicting the temperature of lake water (Read et al., 2019), predicting rainfall and runoff (Akbari Asanjan et al., 2018), and retrieving water vapor from remote sensing data (Acito et al., 2020).

Atmospheric Science
Atmospheric science observes and predicts climate, weather, and atmospheric phenomena. Global observation of atmospheric parameters is difficult since the Earth is extremely large and sensor locations are limited. Researchers chose a CNN-based inpainting algorithm to reconstruct missing values in global climate datasets such as HadCRUT4 (Kadow et al., 2020, Figure 22). Air pollution is damaging both the Earth's environment and human health. Researchers used DL to estimate ground-level PM2.5 or PM10 levels by using satellite observations and station measurements (Shen et al., 2018;Tang et al., 2018). DL also helps improve the accuracy of weather forecasting, which is a long-standing challenge in atmospheric science (Bonavita & Laloyaux, 2020;Scher & Messori, 2021). The tracks of typhoons were predicted with a GAN based on satellite images (Rüttgers et al., 2019). A six-hour-advance track with an average error of 95.6 km was produced. Flow-dependent typhoon-induced sea surface temperature cooling was estimated by a DNN and used for improving typhoon predictions (Jiang et al., 2018).

Space Science
Global space parameter estimation and prediction are long-standing tasks in space science. Researchers used a DNN to predict short-term and long-term 3D dynamic electron densities in the inner magnetosphere (Chu et al., 2017). This network can obtain the magnetospheric plasma density at any time and for any location. A regularized GAN is used to reconstruct dynamic total electron content (TEC) maps. Several existing maps were used as references to interpolate missing values in some regions, such as the oceans. The TEC maps can also be predicted two hours in advance with an LSTM or one day in advance with a GAN (Lee et al., 2021). Further, a DNN is used to estimate the relationship between electron temperature and electron density in small regions. The global electron density, which is easily measured, can then be used to predict the global electron temperature.
An aurora is an astronomical phenomenon commonly observed in polar areas. Auroras result from disturbances in the magnetosphere caused by the solar wind. Auroral classification is important for polar and solar wind research. Researchers used a DNN to classify auroral images (Clausen & Nickisch, 2018, Figure 23). The classification results can further be used to produce an auroral occurrence distribution (Zhong et al., 2020). To handle the situation where only a limited number of images were annotated, a CycleGAN model was used to extract key local structures from all-sky auroral images.
The first attempts started with simple FCNN methods followed by complex networks, such as CNN, RNN, and GAN models. With respect to the training set, early works used end-to-end training borrowed from the computer vision area, which requires a large number of annotated labels, while recent works have started to consider unsupervised learning (He et al., 2018) and the combination of DL with a physical model (Chattopadhyay et al., 2020;Wu & McMechan, 2019). In 2020, more works focused on the uncertainty of DL methods (Cao et al., 2020;Grana et al., 2020;Mousavi & Beroza, 2020a). More examples are listed in Table 2. From these trends, we can conclude that an increasing number of researchers are trying to develop DL methods that are specifically designed for geophysical tasks to make DL methods more practical. In the next subsection, we introduce these future trends in detail.

Future Directions for Deep Learning in Geophysics
DL, as an efficient artificial intelligence technique, is expected to discover geophysical concepts and inherit expert knowledge through machine-assisted mathematical algorithms. Despite the success of DL in some geophysical applications, such as earthquake detectors or pickers, its use as a tool for most practical geophysics is still in its infancy. The main problems include a shortage of training samples, low signal-to-noise ratios, and strong nonlinearity. Among these issues, the critical challenge is the lack of training samples in geophysical applications compared to those in other industries. Several advanced DL methods have been proposed to address this challenge, such as semi-supervised and unsupervised learning, transfer learning, multimodal DL, federated learning, and active learning. We suggest that a focus be placed on the subjects below for future research in the coming decade.

Semi-Supervised and Unsupervised Learning
In practical geophysical applications, obtaining labels for a large data set is time-consuming and can even be infeasible. Therefore, semi-supervised or unsupervised learning is required to relieve the dependence on labels. Semi-supervised learning takes advantage of both labeled and unlabeled datasets. Dunham et al. (2019) focused on the application of semi-supervised learning in a situation in which the available labels were scarce. A self-training-based label propagation method was proposed, and it outperformed supervised learning methods in which unlabeled samples were neglected. The combination of AE and K-means is an efficient unsupervised learning method (He et al., 2018;Qian et al., 2018). An autoencoder is used to learn low-dimensional latent features in an unsupervised way, and then K-means is used to cluster the latent features.
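The second stage of the AE and K-means pipeline, clustering the latent features, can be sketched as below. The autoencoder itself is omitted: the array `latent` is a synthetic stand-in for the encoder output, and the plain Lloyd iteration with random initialization is an assumption of this example rather than the setup of the cited works.

```python
import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    """Cluster the rows of `features` into k groups with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its members.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
latent = np.vstack([
    rng.normal(0.0, 0.5, (100, 8)),   # stand-in latent codes of one facies
    rng.normal(10.0, 0.5, (100, 8)),  # stand-in latent codes of another facies
])
labels, centers = kmeans(latent, k=2)
```

In practice, `latent` would be produced by passing seismic patches through the trained encoder, and the resulting cluster indices would be mapped back to the patch locations to form a facies map.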

Transfer Learning
Usually, we must train one DNN for a specific data set and a specific task. For example, a DNN may effectively process land data but not marine data, or a DNN may be effective in fault detection but not in facies classification. Transfer learning (Donahue et al., 2014) is suggested to increase the reusability of a trained network for different datasets or different tasks.
In transfer learning with different datasets, the optimized parameters for one data set can be used as initialization values for learning a new network with another data set; this process is called fine-tuning. Fine-tuning is typically much faster and easier than training a network with randomly initialized weights from scratch. In transfer learning involving different tasks, we assume that the extracted features should be the same in different tasks. Therefore, the first layers in a model trained for one task are copied to the new model for another task to reduce the training time. Another benefit of transfer learning is that with a small number of training samples, we can promptly transfer the learned features to a new task or a new data set. Diagrams of these two transfer learning methods are shown in Figure 24. Further topics in transfer learning include the relationship between the transferability of features (Yosinski et al., 2014) and the distance between different tasks and different data sets (Oquab et al., 2014).

Combination of DL and Traditional Methods
Can we combine traditional and DL approaches to make geophysical mechanics and DL collaborate? Intuitively, such a combination can produce a more precise result than traditional methods and a more reliable result than DL methods.
How can DL be incorporated into traditional methods? In a traditional iterative optimization algorithm, the thresholding-based denoiser can be replaced by a DL denoiser (Zhang, Zuo, Gu, et al., 2017) such that the reconstructed results are improved. Moreover, different tasks can use the same denoiser without training a new one. Another technique, the deep image prior (DIP), uses a DNN architecture as a constraint on the data and incorporates traditional physical models for different tasks (Lempitsky et al., 2018). Similar to the idea of DIP, Wu and McMechan (2019) showed that a DNN generator can be added to an FWI framework. First, a U-Net-based generator $F(v; \Theta)$ with random input $v$ was used to approximate a velocity model $m$ with high accuracy. Then, the FWI objective was written in terms of the network parameters, $E_{\mathrm{FWI}}(\Theta) = \| d_r - P(F(v; \Theta)) \|_2^2$, where $d_r$ is the seismic record and $P$ is the forward wavefield propagator. The gradient of $E_{\mathrm{FWI}}$ with respect to the network parameters $\Theta$ is calculated with the chain rule. U-Net is only used for regularizing the velocity model. After training, one forward propagation of the network will produce a regularized result.
Traditional optimization methods also benefit from the automatic differentiation mechanism in DL, which makes optimization more efficient by replacing conjugate gradient descent or L-BFGS with DL optimization methods, such as SGD and Adam (Sun, Niu, et al., 2020;Wang, Chang, et al., 2020). DL also inspired new directions in the study of traditional nonlinear optimization algorithms, such as ML-descent (Sun and Alkhalifah, 2020) and DL-based adjoint state methods (Xiao et al., 2021).
How can traditional methods be incorporated into DL? With an additional physical constraint on DL methods, fewer training samples are required to obtain a more generalized inference than with purely data-driven DL methods. Raissi et al. (2019) proposed a physics-informed neural network (PINN) that combines training data and physical equation constraints for training. Taking wave modeling as an example, the wavefield was represented with a DNN, $u(x, t) = F(x, t; \Theta)$, such that the acoustic wave equation was $\partial^2 F(x, t; \Theta) / \partial t^2 = c^2 \nabla^2 F(x, t; \Theta)$, where $c$ is the wave speed.
How can DL and traditional methods cooperate? Another benefit of combining data-driven and model-driven approaches is that we can obtain high-resolution solutions on a large scale. The process on a large scale was numerically solved with a low-resolution grid based on physical equations. On a small scale, the process was solved by data-driven DL methods (Chattopadhyay et al., 2020). Therefore, the high computational demand on a fine scale is avoided. DL can also be used for discovering physical concepts (Iten et al., 2020).
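The way a physical equation enters a PINN-style loss can be illustrated with a small sketch. Several simplifications are assumed here: the problem is a 1D acoustic wave equation with constant wave speed `c`, the network is replaced by an arbitrary callable `u(x, t)`, and the derivatives are approximated with central differences instead of the automatic differentiation that a PINN would use. The names `wave_residual` and `pinn_loss` are hypothetical.

```python
import numpy as np

def wave_residual(u, x, t, c, h=1e-3):
    """Residual u_tt - c^2 u_xx of a callable u(x, t), by central differences."""
    u_tt = (u(x, t + h) - 2.0 * u(x, t) + u(x, t - h)) / h**2
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / h**2
    return u_tt - c**2 * u_xx

def pinn_loss(u, x_obs, t_obs, u_obs, x_col, t_col, c, weight=1.0):
    """Data misfit at observed points plus physics misfit at collocation points."""
    data_term = np.mean((u(x_obs, t_obs) - u_obs) ** 2)
    physics_term = np.mean(wave_residual(u, x_col, t_col, c) ** 2)
    return data_term + weight * physics_term

c = 2.0
consistent = lambda x, t: np.sin(x - c * t)          # satisfies the wave equation
inconsistent = lambda x, t: np.sin(x - 0.5 * c * t)  # same initial state, wrong physics

rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 2.0 * np.pi, 20)
t_obs = np.zeros(20)                                 # observations only at t = 0
u_obs = np.sin(x_obs)
x_col = rng.uniform(0.0, 2.0 * np.pi, 200)
t_col = rng.uniform(0.0, 1.0, 200)

loss_consistent = pinn_loss(consistent, x_obs, t_obs, u_obs, x_col, t_col, c)
loss_inconsistent = pinn_loss(inconsistent, x_obs, t_obs, u_obs, x_col, t_col, c)
```

Both candidate wavefields fit the observations at `t = 0` exactly, so the data term alone cannot separate them. The physics term does, which is why fewer labeled samples can suffice once the equation is part of the loss.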
It is more common to hear someone ask, "Does machine learning have a real role in hydrological modeling?" rather than, "What role will hydrological science play in the age of machine learning?" (Nearing et al., 2020). As the authors claim, DL has uncovered the principles in large-scale rainfall-runoff simulations, which cannot be explained by physical models. DL has a great impact on traditional methods, causing a collision between new and old ideas. We believe that DL and physical-based methods will be used together to move science forward for a long time.

Multimodal Deep Learning
To improve the resolution of inversion, the joint inversion of data from different sources has been a popular topic in recent years (Garofalo et al., 2015). One of the advantages of DNNs is that they can fuse information from multiple inputs. In multimodal DL (Ngiam et al., 2011;Ramachandram & Taylor, 2017), inputs are from different sources, such as seismic data and gravity data. Collecting data from different sources can help relieve the bottleneck of a limited number of training samples. Besides, using multimodal datasets can increase the quality and reliability of DL methods (Zhang, Stanev, et al., 2020). Feng, Fang et al. (2020) used data integration to forecast streamflow where 23 variables were used, such as precipitation, solar radiation, and temperature. Figure 25 shows an illustration of multimodal DL.

Federated Learning
To provide a practical training set in DL for geophysical applications, collecting available datasets from different institutes or corporations might be a possible solution. However, data transfer via the internet is time-consuming and expensive for large-scale geophysical datasets. Besides, most datasets are protected and cannot be shared. Federated learning was first proposed by Google (McMahan et al., 2017;Li et al., 2020) to train a DNN with user data from millions of cellphones without privacy or security issues. The encrypted gradients from different clients are assembled in a central server, thus avoiding data transfer. The server updates the model and distributes it to all clients (Figure 26). In a simple federated learning setting, the clients and the server share the same network architecture. As a possible example in geophysics, corporations that do not share their first-arrival annotations could still benefit from federated learning by jointly training a DNN for first arrival picking.
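The gradient-assembly step can be sketched with a toy model. To keep the example short, a linear model stands in for the DNN, the gradients are averaged with weights proportional to the client data sizes, and encryption is left out entirely; all of these are assumptions of the sketch, and the function names are hypothetical.

```python
import numpy as np

def local_gradient(weights, x, y):
    """Gradient of the mean squared error of a linear model, computed on one client."""
    return 2.0 * x.T @ (x @ weights - y) / len(y)

def federated_round(weights, clients, lr=0.1):
    """Server step: average the client gradients and update the shared model."""
    grads = [local_gradient(weights, x, y) for x, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    avg_grad = sum(g * n for g, n in zip(grads, sizes)) / sizes.sum()
    return weights - lr * avg_grad

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 100, 150):              # three institutes with private data of different sizes
    x = rng.standard_normal((n, 2))
    clients.append((x, x @ true_w))   # the data never leave the client

weights = np.zeros(2)
for _ in range(200):
    weights = federated_round(weights, clients)
```

Only gradients cross the network in each round, so the raw records and their annotations stay with their owners while the shared model still converges to the solution that pooled training would reach.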

Uncertainty Estimation
One of the remaining questions associated with applying DL in geophysics is whether the results of DL-based methods, which lack a solid theoretical foundation, can be trusted. DL-based uncertainty analysis methods include Monte Carlo dropout (Gal & Ghahramani, 2016), Markov chain Monte Carlo (MCMC) (de Figueiredo et al., 2019), variational inference (Subedar et al., 2019), etc. For example, in Monte Carlo dropout, dropout layers are added to each original layer to simulate a Bernoulli distribution. With multiple realizations of dropout, the results are collected, and the variance is computed as the uncertainty. DL with uncertainty estimation in inference is reported in areas such as volcano-seismic monitoring, geomagnetic storm forecasting (Tasistro-Hart et al., 2020), weather forecasting (Scher & Messori, 2021;Bonavita & Laloyaux, 2020), soil moisture predictions (Fang, Kifer, et al., 2020), and earthquake location estimation (Mousavi & Beroza, 2020b).
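Monte Carlo dropout can be sketched for a tiny two-layer network as follows. The random weights stand in for a trained model, and the dropout rate and the number of realizations are assumptions chosen for illustration.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, rate=0.2, n_samples=200, seed=0):
    """Mean and variance of the network output over random dropout realizations."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_samples):
        hidden = np.maximum(x @ w1, 0.0)         # ReLU layer
        mask = rng.random(hidden.shape) >= rate  # Bernoulli keep mask
        hidden = hidden * mask / (1.0 - rate)    # inverted dropout keeps the expected value
        outputs.append(hidden @ w2)
    outputs = np.stack(outputs)
    return outputs.mean(axis=0), outputs.var(axis=0)

rng = np.random.default_rng(1)
w1 = rng.standard_normal((4, 16))  # stand-in for trained weights
w2 = rng.standard_normal((16, 1))
x = rng.standard_normal((5, 4))    # five input samples
mean, variance = mc_dropout_predict(x, w1, w2)
_, variance_no_dropout = mc_dropout_predict(x, w1, w2, rate=0.0)
```

Dropout stays active at inference time here, which is the difference from ordinary use of dropout. With the rate set to zero, every realization is identical and the estimated variance vanishes.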

Active Learning
To train a high-precision model using a small amount of labeled data, active learning is proposed to imitate the self-learning ability of human beings (Yoo & Kweon, 2019). An active learning model selects the most useful data based on a sampling strategy for manual annotation and adds these data to the training set; then, the updated data set is used for the next round of training (Figure 27). One of the sampling strategies is based on uncertainty; that is, the samples with high uncertainty are selected. Taking fault detection as an example, if a trained network is not sure whether a fault exists at a given location, we can annotate the fault manually and add the sample to the training set.
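Uncertainty-based sampling can be sketched for the fault detection example. The binary entropy of the predicted fault probability is used as the uncertainty score, which is one common choice and an assumption of this sketch. The probabilities are made-up values standing in for network outputs.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli prediction; largest when p is close to 0.5."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def select_for_annotation(fault_prob, n_queries):
    """Indices of the samples the network is least sure about."""
    scores = binary_entropy(np.asarray(fault_prob, dtype=float))
    return np.argsort(scores)[::-1][:n_queries]

fault_prob = [0.02, 0.48, 0.97, 0.55, 0.10, 0.70]  # predicted fault probability per location
queries = select_for_annotation(fault_prob, n_queries=2)
```

The two selected locations are those with probabilities 0.48 and 0.55. They would be annotated manually and added to the training set before the next round of training.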

Summary
In this review, the key concepts of DL approaches are introduced, a broad range of applications of DL in geophysics are presented with their pros and cons, and the future trends are discussed for geophysical readers who are beginning their journey in DL. DL methods have created both opportunities and challenges in geophysical fields. Pioneering researchers have provided a basis for DL in geophysics with promising results; more advanced DL technologies and more practical problems must now be explored. To close this study, we summarize a roadmap for applying DL in different geophysical tasks in terms of three levels.
• Traditional methods are time-consuming and require intensive human labor and expert knowledge, such as in first-arrival selection and velocity selection in exploration geophysics.
• Traditional methods have difficulties and bottlenecks. For example, geophysical inversion requires good initial values and high-accuracy modeling and suffers from local minima.
• Traditional methods cannot handle some cases, such as multimodal data fusion and inversion.
With the development of new artificial intelligence models beyond DL and advances in research into the infinite possibilities of applying DL in geophysics, we can expect intelligent and automatic discoveries of unknown geophysical principles soon.

A Coding Example of a DnCNN
The implementation of DL algorithms in geophysical data processing is straightforward with existing frameworks, such as Caffe, PyTorch, Keras, and TensorFlow. Here, we provide an example of how to use Python and Keras to construct a DnCNN for seismic denoising. The code requires 12 lines for data set loading, model construction, training, and testing. The data set is preconstructed and includes a clean subset and a noisy subset; the overall data set includes 12,800 samples with a size of 64 × 64 (available at https://bit.ly/33SyXPO).
Any appropriate plotting tool can be used for data visualization. The training takes less than one hour on an NVIDIA 2080 Ti graphics processing unit. Readers can try this code in their own areas as long as a compatible training set is constructed.

Tips for Beginners
We introduce several practical tips for beginners who want to explore DL in geophysics from the perspective of the three most critical steps in DL: data generation, network construction, and training. Although exploration geophysics is used as an example, the tips for data generation and network training are generally applicable to most areas. Network construction generally depends on the task.

Data Generation
As noted by Poulton (2002), "training a feed-forward neural network is ∼10% of the effort involved in an application; deciding on the input and output data coding and creating good training and testing sets is 90% of the work". In DL, we advise that the percentages of the effort for network construction and data set preparation should be ∼40% and 60%, respectively. The shift has two causes. First, most DL approaches use an original data set as the input, thus reducing feature extraction efforts. Second, a wider variety of network architectures and parameters can be used in DL compared to those in traditional neural networks. Overall, constructing a proper training set still plays the more prominent role in DL.
Synthetic datasets can be used effectively in DL, which is advantageous since labeled real datasets are sometimes difficult to obtain. First, using synthetic datasets is the most convenient way to assess the applicability of DL in a specific geophysical application. Second, if a satisfactory result is obtained with synthetic datasets, a few annotated real datasets can be used for transfer learning via parameter tuning. Third, if the synthetic datasets are sufficiently complicated, that is, if the most important factors are considered when generating the datasets, the trained network may be able to process realistic datasets directly.
A synthetic training set should be diverse. First, we suggest using an existing synthetic data set with an open license, instead of generating a data set. For specific tasks, such as FWI, a data set may need to be generated based on a wave equation. Second, data augmentation methods, such as rotation, reflection, scaling, translation, and adding noise, missing traces, or faults to clean datasets, can be used to expand the training set. The goal is to generate extremely large synthetic datasets that are as close to realistic datasets as possible.
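A few of the augmentation operations listed above (reflection, adding noise, and removing traces) can be sketched with NumPy as follows. The function name, the noise level, the fraction of missing traces, and the random stand-in section are illustrative assumptions; in practice they should be matched to the target data.

```python
import numpy as np

def augment(clean, rng, noise_std=0.1, missing_fraction=0.2):
    """Return (input, label) pairs derived from one clean 2-D section.

    clean has shape (n_time, n_traces); columns are traces.
    """
    pairs = []
    for flipped in (clean, clean[:, ::-1]):              # original and reflected section
        noisy = flipped + rng.normal(0.0, noise_std, flipped.shape)
        pairs.append((noisy, flipped))                   # denoising pair
        n_missing = int(missing_fraction * flipped.shape[1])
        dead = rng.choice(flipped.shape[1], size=n_missing, replace=False)
        decimated = flipped.copy()
        decimated[:, dead] = 0.0                         # zero out the missing traces
        pairs.append((decimated, flipped))               # interpolation pair
    return pairs

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 64))                        # stand-in for a clean 64 x 64 sample
pairs = augment(clean, rng)
```

Each clean sample yields four input and label pairs here, so the training set grows without any additional forward modeling.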
To generate realistic datasets, we suggest using existing methods to generate labels that are then checked by a human. For example, in first-arrival picking, an automatic picking algorithm is used to preprocess the datasets, and the results are then provided to an expert who identifies the outliers. We also suggest using active learning (Yoo & Kweon, 2019) to provide a semiautomated labeling procedure. First, all datasets with machine annotation are used to train a DNN; then, the samples with high predicted uncertainty are annotated manually.

Network Construction for Different Tasks
Beginners are advised to start with a DnCNN or U-Net for testing. DnCNNs are suitable for most tasks in which the input and output share the same domain, such as denoising, interpolation, and attribute analysis. The input size of a DnCNN can vary since no pooling layers are involved. However, each output data point is determined by a local field of the input rather than by the entire input. In contrast, U-Net contains pooling layers, and all input points are used to determine an output point. U-Nets are suitable even for tasks in which the inputs and outputs are in different domains, such as FWI. However, the input size of a U-Net is fixed once it is trained, and the data need to be processed patch by patch.
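The size of the local field seen by each DnCNN output point follows from simple arithmetic: every stride-1 convolution without pooling widens the field by the kernel size minus one. The sketch below assumes a 17-layer network with 3 × 3 kernels; the depth and kernel size are assumptions chosen for illustration and are not specified in the text.

```python
def receptive_field(n_layers, kernel_size=3):
    """Receptive field width of stacked stride-1 convolutions without pooling."""
    return 1 + n_layers * (kernel_size - 1)

rf = receptive_field(17)  # 1 + 17 * 2 = 35 samples in each direction
```

Under these assumptions, each output point depends on a 35 × 35 window of the input, so features larger than this window cannot be captured without a deeper network or an architecture with pooling.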
Combining a CAE and K-means is suggested for unsupervised clustering tasks, such as attribute classification. We do not suggest CycleGAN for geophysical tasks since the training process is extremely time-consuming and the results are not stable. An RNN provides a high-performance framework for time-dependent tasks, such as forward wave modeling and FWI. RNNs are also used for regression and classification tasks involving temporal or spatial sequential datasets, such as in the denoising of a single trace.
To adjust the hyperparameters of a DNN and of the optimization algorithm, we suggest using an AutoML toolbox, such as AutoKeras, instead of adjusting the values manually. The basic objective is to search for the best parameter combination within a given sampling range. An exhaustive search is exceptionally time-consuming, and a random search strategy may accelerate the tuning process. Moreover, for most applications, the default architecture gives reasonable results.
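A random search over a sampling range can be sketched as follows. This is not the AutoKeras API; the function random_search, the search space, and the toy objective are hypothetical stand-ins. In practice, the objective would train a network with the sampled hyperparameters and return its validation accuracy.

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Sample hyperparameters from `space` at random and keep the best trial.

    space maps each name to a list of candidate values; objective returns
    a validation score to maximize.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for illustration only.
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "n_layers": [5, 10, 17],
    "batch_size": [32, 64, 128],
}

def toy_objective(params):
    """Stand-in for 'train the network and return the validation accuracy'."""
    return -abs(params["learning_rate"] - 1e-3) - abs(params["n_layers"] - 10) / 100

best_params, best_score = random_search(toy_objective, space, n_trials=30)
```

The number of trials bounds the cost directly, which is the practical advantage over evaluating every combination in the grid.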

Training, Validation, and Testing
The available data set should be split into three subsets: a training set, which is used to optimize the network parameters, a validation set, and a test set. The proportions of the subsets depend on the overall size of the data set. For datasets with 10K-50K samples, the proportions are suggested to be 60%, 20%, and 20%, respectively. For larger datasets (for instance, those with more than 1M samples), much smaller portions (∼1%-5%) are often used for validation and testing, since larger portions would waste data that could be used for training a better model. In a classification task, we suggest using one-hot encoding in training. The validation set is used to evaluate the network during training, and the model with the best validation accuracy is selected rather than the final trained model. If the validation accuracy stops improving or starts to decrease after saturating during training, an early stopping strategy is suggested to avoid overfitting. Network hyperparameters should be tuned according to the validation accuracy. In short, the validation set is used to guide training, and the test set is used to evaluate the model on unseen data; the test set should not be used for hyperparameter tuning.
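The splitting and model selection rules above can be sketched with the standard library alone. The function names, the patience of three epochs, and the accuracy history are illustrative assumptions; deep learning frameworks provide their own early stopping utilities.

```python
import random

def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and split them into training, validation, and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_samples)
    n_val = round(fractions[1] * n_samples)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

def best_epoch_with_early_stopping(val_accuracies, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation accuracies.

    Training stops once the accuracy has not improved for `patience` epochs;
    the model saved at best_epoch is the one to keep.
    """
    best_epoch, best_acc = 0, float("-inf")
    stop_epoch = len(val_accuracies) - 1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch

train_idx, val_idx, test_idx = split_indices(12800)   # 7680 / 2560 / 2560 samples
history = [0.70, 0.78, 0.83, 0.85, 0.84, 0.85, 0.83, 0.82, 0.81]  # hypothetical
best_epoch, stop_epoch = best_epoch_with_early_stopping(history)
```

With this hypothetical history, the model from epoch 3 is kept and training stops at epoch 6, so the later, lower-accuracy epochs are never run.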
Two issues are commonly seen during training: the validation loss is less than the training loss, and the loss is not a number (NaN). Intuitively, the training loss should be less than the validation loss since the model is trained on the training set. Several potential reasons for the first issue are as follows: 1. Regularization, such as dropout, is applied during training but disabled during validation; 2. The training loss is obtained by averaging the loss of each batch during an iteration, whereas the validation loss is computed after the iteration, when the model has already been updated; and 3. The validation set may be less complicated than the training set, especially when only the training set has been augmented. The potential reasons for a NaN loss are as follows: 1. The learning rate is too high; 2. The gradient explodes, especially in an RNN, where the gradient should be clipped; and 3. Zero is used as a divisor, a nonpositive value is passed to a logarithm, or an exponent is assigned too large a value.
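The guards against a NaN loss can be sketched in plain Python as follows. The function names and the threshold eps are assumptions for illustration; DL frameworks offer equivalent built-in operations, and in plain Python the unguarded operations raise exceptions rather than silently returning NaN.

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient values so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

def safe_log(x, eps=1e-7):
    """Logarithm with the argument floored at eps to avoid nonpositive inputs."""
    return math.log(max(x, eps))

def safe_divide(numerator, denominator, eps=1e-7):
    """Division that keeps the denominator at least eps away from zero."""
    if abs(denominator) < eps:
        denominator = eps if denominator >= 0 else -eps
    return numerator / denominator

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 is rescaled to 1
```

Clipping preserves the direction of the gradient and only limits its length, which is why it is preferred over simply truncating individual components.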

Glossary
AE: Autoencoder; an ANN with the same inputs and outputs.
AI: Artificial intelligence; machines are taught to think like humans.
ANN: Artificial neural network; a computing system inspired by the biological neural networks that constitute animal brains.
Aurora: A natural light display in the Earth's sky; it results from disturbances in the magnetosphere caused by the solar wind.
BNN: Bayesian neural network; the network parameters are random variables instead of regular variables.
CAE: Convolutional autoencoder; an AE with shared weights.
CNN: Convolutional neural network; a DNN with shared weights.
DDTF: Data-driven tight frame; a dictionary learning method using a tight frame constraint for the dictionary.
Deblending: In seismic exploration, several explosion sources are shot very close in time to improve efficiency. Then, the seismic waves from different sources are blended. The recorded data set first needs to be deblended before further processing.
Dictionary: A set of vectors used to represent signals as a linear combination.
DIP: Deep image prior; the architecture of a DNN is used as a prior constraint for an image.
DL: Deep learning; a machine learning technology based on a deep neural network.
DnCNN: Denoising convolutional neural network.
DNN: Deep neural network; an ANN with many layers between the input and output layers.
DS: Double sparsity; the data are represented with a sparse coefficient matrix multiplied by an adaptive dictionary. The adaptive dictionary is represented by a sparse coefficient matrix multiplied by a fixed dictionary.
Event: In exploration geophysics, a seismic event means reflected waves with the same phase. In seismology, an event means an earthquake that has occurred.
Facies: A seismic facies unit is a mapped, three-dimensional seismic unit composed of groups of reflections whose parameters differ from adjacent facies units.
Fault: A discontinuity in a volume of rock across which there has been significant displacement as a result of rock-mass movement.
FCN: Fully convolutional network; an FCN is a network that contains no fully connected layers. Fully connected layers do not share weights.
FCNN: Fully connected neural network; an FCNN is a network composed of fully connected layers.
FWI: Full waveform inversion; full waveform information is used to obtain subsurface parameters. FWI is achieved based on the wave equation and inversion theory.
GAN: Generative adversarial network; GANs are used to generate fake images. A GAN contains a generative network and a discriminative network. The generative network tries to produce a nearly real image. The discriminative network tries to distinguish whether the input image is real or generated. Therefore, such a game will eventually allow the generative network to produce fake images that the discriminative network cannot distinguish from real images.
Graphics processing unit (GPU): A parallel computing device. GPUs are widely used for training neural networks in deep learning.
HadCRUT4: Temperature records from the Hadley Centre (sea surface temperature) and the Climatic Research Unit (land surface air temperature).
K-means: A classical clustering algorithm, where K is the number of clusters.
K-SVD: A dictionary learning method using SVD for dictionary updating.
LSTM: Long short-term memory; LSTM considers how much historical information is forgotten or remembered with adaptive switches.
Magnetosphere: Range of the magnetic field surrounding an astronomical object where charged particles are affected.
M_L: Earthquake local magnitude; a method for measuring earthquake scale.
Patch: In dictionary learning, an image is divided into many patches (blocks) that are the same size as the atoms in a dictionary.
PINN: Physics-informed neural network; a physical equation is used to constrain the neural network.
PM: Particulate matter. PM10 are coarse particles with a diameter of 10 micrometers or less; PM2.5 are fine particles with a diameter of 2.5 micrometers or less.
ResNet: Residual neural network; ResNets contain skip connections to jump over several layers. The output of a residual block is the residual between the input and the direct output.
RNN: Recurrent neural network; in time-sequenced data processing applications, RNNs use the output of a network as the input of the subsequent process to consider the historical context.
SAR: Synthetic aperture radar; the motion of a radar antenna over a target is treated as an antenna with a large aperture. The larger the aperture is, the higher the image resolution will be.
Solar wind: A stream of charged particles released from the upper atmosphere of the Sun.
Sparse coding: Input data are represented in the form of a linear combination of a dictionary where the coefficients are sparse.
Sparsity: The number of nonzero values in a vector.
SVD: Singular value decomposition; a matrix factorization method. A = USV^T, where U and V are two orthogonal matrices and S is a diagonal matrix whose elements are the singular values of A. SVD is used for dimension reduction by removing the smaller singular values. SVD is also used for recommendation systems and natural language processing.
Tight frame: A frame provides a redundant, stable way of representing a signal, similar to a dictionary. A tight frame is a frame with the perfect reconstruction property; i.e., W^T W = I.
Tomography: Inversion of the subsurface velocity based on travel time information.
U-Net: U-shaped network; U-Nets have U-shaped structures and skip connections. The skip connections bring low-level features to high levels.
Wave equation: A partial differential equation that controls wave propagation.
WST: Wavelet scattering transform; a transform that involves a cascade of wavelet transforms, a modulus operator, and an averaging operator.

Data Availability Statement
Data were not used, nor created for this research.