Machine Learning Technique Using the Signature Method for Automated Quality Control of Argo Profiles

A profile from the Argo ocean observation array is a sequence of three‐dimensional vectors composed of pressure, salinity, and temperature, appearing as a continuous curve in three‐dimensional space. The shape of this curve is faithfully represented by a path signature, which is a collection of all the iterated integrals. Moreover, the product of two terms of the signature of a path can be expressed as the sum of higher‐order terms. As a result of this algebraic property, a nonlinear function of the profile shape can always be represented by a weighted linear combination of the iterated integrals, which enables machine learning of a complicated function of the profile shape. In this study, we performed supervised learning for existing Argo data with quality control flags by using the signature method and demonstrated the estimation performance by cross validation. Unlike rule‐based approaches, which require several complicated and possibly subjective rules, this method is simple and objective in nature because it relies only on past knowledge regarding the shape of profiles. This technique is critical for realizing automatic quality control for Argo profile data.


Introduction
Argo is an international effort involving the collecting of high-quality temperature and salinity profiles, typically from the upper 2,000 m of the global ocean (Gould et al., 2004). The data are obtained from battery-powered autonomous floats that drift predominantly at a depth where they are stabilized at a constant pressure level. Typically at 10-day intervals, the floats rise to the surface for approximately 6 hr while measuring temperature and salinity. On surfacing, the satellites position the floats and receive the transmitted data. Currently, the array of over 3,000 floats provides 100,000 temperature/salinity profiles annually, distributed over the global oceans at an average 3 • spacing. Quality control (QC) of the massive Argo profile data (ARGO, 2019) must be systematic to keep the quality of the observational data homogeneous and to utilize human resources efficiently. In addition, accurately quantifying the relationship between the profile shape and the effect it has on oceanic processes is essential for understanding the ocean state through the profile observation. Conventionally, significant time and effort are expended assigning a quality control flag to each Argo profile.
Regarding attempts for advanced automatic QC procedures on oceanographic profiles, several studies have been conducted on the Argo CTD profile because of the huge amount of data accumulated for 200 million profiles over 20 years. For example, Udaya Bhaskar et al. (2013) developed a semiautomatic QC procedure using objective mapping to remove anomalous values from the profiles. Udaya Bhaskar et al. (2017) demonstrated automatic QC by defining the convex hulls from the climatological data set. Meanwhile, Ono et al. (2015b) attempted to apply a machine learning method to the delayed-mode QC of Argo profiles toward a possible automatic QC system for an Argo data stream. Similarly, an integrated Argo data flow using machine learning was presented as an automated system with an improved QC ability (presented by Maze, 2017 in the report of the 18th Argo Data Management Meeting). Thus, QC procedures for oceanographic data have been gradually improved by many researchers using advanced tools or methods.
The discrimination procedures involved in the automation of the QC have been performed mainly in a rule-based manner (e.g., Hayashi et al., 2016;Kamikawaji et al., 2016;Ono et al., 2015a). As an alternative and more flexible approach, this study attempts to automate the process via supervised learning of the human judgment process. In doing so, it is essential to quantify the profile shape so that the function 10.1029/2019EA001019 that yields the quality control flag can be expressed as a linear combination of the numerical values that represent the profile shape. The machine learning thereby reduces to a linear optimization problem that can be easily solved. The key tool that enables this quantification is the signature, which is the set of all iterated integrals (Chevyrev & Kormilitzin, 2016;Levin et al., 2013), proposed in the theory of rough path by Lyons et al. (2007).
In this paper, we propose a procedure in which the vector sequence of each Argo profile is first converted into a sequence of real numbers that represents its shape, and then a nonlinear function of the shape is expressed in the form of a linear combination of these numbers; this conversion facilitates machine learning of the nonlinear function. A machine learning experiment involving learning of the function was performed and applied to automatic assignment of QC flags to the profiles.

Theoretical Background
The central concept in this study is the signature, proposed in the theory of rough path by Lyons et al. (2007).
In what follows, we briefly introduce the concept of signature and the notations used in this paper. For more details, refer to Chevyrev and Kormilitzin (2016), As perceived from a reexamination of controlled differential equations (refer to Appendix A), characteristics of a data sequence can be represented by the signature, which comprises the iterated integrals. Note that in this paper, the subscript notation X is used to denote dependence on the parameter ∈ [0, t]; A •n and A ⊗n denote the nth power and n times tensor product, respectively; otherwise, a superscript denotes a component.
Suppose we have a sequence of d-dimensional vectors X u (0 ≤ u ≤ t). Let the time order be 0 < t 1 < · · · < t n < t. We define the iterated integral for indices i 1 , · · · , i n = 1, · · ·, d as where we should be careful about the difference between the font for a sequence of vectors X i 1 t 1 and that for an iterated integral X (i 1 i 2 ···i n ) . By treating all the index values together, we obtain a tensor of order n: and X 0 has a constant value of 1. Moreover, by combining the tensors of order less than or equal to n, we obtain the signature up to degree n: which has (d •n + 1 − 1)/(d − 1) components. For instance, the signature up to degree 2 for a two-dimensional sequence is •2 denotes the second power, and X () = X 0 = 1. Note that the order of integrands matters in ∫ t 0 X 1 0,u dX 2 u and ∫ t 0 X 2 0,u dX 1 u . In general, it is important for the signature to encode the order in which each component changes along the path.
Suppose we have two paths, X = {X } 0 ≤ ≤ s and Y = {Y } s ≤ ≤ t . Their concatenation is the path defined by On the other hand, regarding their signatures,  n (X) = (X 0 , X 1 , · · · , X n ) and  n (Y ) = (Y 0 , Y 1 , · · · , Y n ), we can define the product as 10.1029/2019EA001019 whose components are For instance, the product of the signatures, up to degree 2, for the two-dimensional sequence is In this manner, the set of signatures has a group structure in the free tensor algebra with respect to the product ⊗. Furthermore, Chen's identity (Chen, 1958) is satisfied, which defines a homomorphism from path space with concatenation (5) to signature space with group operation (6).
In the context of geophysics, we can show that some diagnoses for oceanographic conditions are written in terms of iterated integrals. Consider a vertical sequence of vector (P , S , T ) (pressure, salinity, and temperature) in the ocean.
1. The first-order iterated integrals are which are profile depth, sea surface salinity, and sea surface temperature, respectively. 2. The second-order iterated integrals include which represent the square of profile depth, total salinity content, and total heat content, respectively.
We find another example in Appendix B.
Note that P is treated equally to S, T in the above because the seemingly redundant parameter is essential to ensure that the path has no self-intersection and the signature is invariant under the reparameterization of . If one parameterizes T, S with P, the path would be drawn on a two-dimensional T, S surface, which loses considerable information on the shape of the sequence.

Method
The data used in this research were observed by the global array of Argo floats (ARGO, 2019), each of which floats and sinks from the sea surface to a depth of approximately 2,000 m.
Because the shape of a vector sequence (P , S , T ) is only perceived in a certain reference frame, it is convenient to make the original quantities dimensionless; P in dbar, S in psu, and T in • C intoP = P∕2000, S = S∕2, andT = T∕20, where divisor (2000,2,20) is chosen as a typical scale of the components. For simplicity, henceforth, we omit the hat symbol for the component. Figures 1 and 2 show examples of the vertical profiles of temperature, salinity, and pressure, along with the corresponding iterated integrals. By virtue of quality control procedures with manual judgment, the quality control flags are already assigned to all of the data.
Here, we describe the basic concept of the signature method and how to apply it to Argo profiles. We also explain how to construct a procedure for supervised learning using the signature and how to verify the results.

Representing the Argo Profile Shape by Signature 3.1.1. Computation of Signature
Suppose we have d-dimensional profile data {X } 0 ≤ ≤ t that can be seen as a line graph connecting points X 0 = X u 0 , X u 1 , · · · , X u L = X t ; then, we can compute its iterated integrals as follows: , which has starting point X i u and slope X i u,u ′ , the iterated integrals are calculated as and the 0th iterated integral is constant 1. In this case, the signature (up to degree n) is nothing but a commutative exponential function for the vector X u,u ′ : where e i k is the i k -th unit vector. The numerical computation of the signature in this study was performed by using the Python library Esig (Kormilitzin, 2017).

Lead-Lag Transformation
Suppose we have a sequence of d 0 -dimensional (d 0 = 3) vectors with length L + 1: To more precisely grasp the shape of the line graph, we perform a lead-lag transformation (Chevyrev & Kormilitzin, 2016), which defines a sequence of d(= 2d 0 )-dimensional vectors with length L · d + 1: The transition rule for the lead-lag transformation is as follows: 1. Take two copies of X 0 and use it as the initial condition.
2. Update only one component among d components at once. 3. Use the previous value instead if the present value is missing.

Machine Learning Procedure for QC Process
Suppose we have a set of profile data X(m) def = {X (m)} 0≤ ≤t for m = 1,2, · · · , M, whose signature is denoted as X(m) def = (X(m)). Let us consider the problem of assigning the discriminant values to each profile depending on whether a profile matches the quality standard.
1. We first make a model for the rule of QC as a functional form; that is, a linear combination of the iterated integrals X I for all combinations of indices I = (), (i 1 ), (i 1 i 2 ), · · ·, (i 1 · · · i n ) yields the discriminant value.
where is the error. As each index in I runs over 1, 2, · · ·,d, the sequence of iterated integrals in Equation Note that X () represents the constant 1. Such a representation is possible because its nonlinearity is unraveled thanks to the property of shuffle product; for a fixed path X, the product of iterated integrals for indices A and B is expressed by the iterated integral with respect to the shuffle product A⧢B: For example, X (aa) X (b) = X (baa) + X (aba) + X (aab) . This means that a product of iterated integrals is always reduced to the sum of higher-order iterated integrals. Moreover, by virtue of the Stone-Weierstrass theorem, any nonlinear function of the shape of a path can be represented as a linear combination of the iterated integrals.
where each X(m) is a profile sequence, and y(m) = 0,1 is the discrimination value, which is already given to each sample m = 1, 2,· · ·, M as training data. Learning these data is simply deriving the weights w I that minimize an L 1 -regularized cost function: Because the terms in ∑ I |w I | are not quadratic but linear, they have the effect of selecting significant terms under the summation over the set labeled by I. This is the notion of least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), which can help prevent overfitting. The larger the value of is, the smaller the number of selected terms with w I ≠ 0 is. To set an appropriate number of terms, several values of will be tested. 3. Using the coefficients w derived in (16), and substituting into Equation 14 the iterated integrals for a profile not used for training, we obtaiñ, an estimate for y, as follows.
where w I 's are estimated from the minimization of cost (16). 4. The minimization problem is efficiently solved by the coordinate descent (CD) method (Friedman et al., 2007).
For the L 1 -regularization term to apply evenly, each iterated integral X I is preprocessed by subtracting the ensemble mean I train of the training ensemble and dividing by the standard deviation I train of the training ensemble: The same operation is performed for the iterated integrals in cross validation. The minimization problem was solved by using the Python library scikit-learn (Pedregosa et al., 2011).

Assessment of Learning Results
The performance of the binary classifier can be quantitatively assessed by visualizing it with the receiver operating characteristic (ROC) curve (Egan, 1975). We refer to the profiles that pass the quality criterion as negative (normal) y = 1 and the others as positive (bad) y = 0. By shifting the cutoff value y c , one can count the number of positive profiles with̃< c and that of negative profiles with c ≤̃. Then, the samples fall into the four categories in Table 1.
The true-positive rate is defined as N TP /(N TP + N FN ), and the false-positive rate as N FP /(N FP + N TN ). The ROC curve is the two-dimensional plot of false-positive rate versus true-positive rate, by changing the cutoff y c . It has better performance if the trajectory approaches the upper left corner. Therefore, the area under the ROC curve indicates the performance.
Note that, to improve the readability of the histograms, we use a modified estimation value: where we apply transformation y ↤ 1− |1 − y| so that̃(m) ≤ 1.

Experiment Using Principal Component Analysis
Alternatively, principal component analysis (PCA) (e.g., Thomson & Emery, 2014) can be applied to represent the normal profiles. In that case, the experiment for estimating the QC flag is performed as follows.
1. We apply the same nondimensionalization to the T and S sequences as in the signature method and then perform nearest-neighbor interpolation at points 2, 000P = 5, 15, · · · , 1995, which are placed every 10 dbar. Accordingly, the sequences are transformed into a sequence X = (X 1 , · · ·, X L ) T with L = 400.
2. Let  def = {X(m)|m = 1, 2, · · · , M} be the set of all profiles. We randomly choose the training ensemble  tr ⊂  , which comprises negative (normal) samples  tr,n ⊂  tr , and positive (bad) samples  tr,p ⊂  tr . 3. Training is performed by computing the principal components (PCs) for negative training samples  tr,n . Let be the truncated PCs, and X = (X 1 , · · · , X L ) T be the ensemble mean for the training ensemble tr,n . 4. For the m-th profile X(m) ∈  tr (or ∈  ∖  tr for cross validation), the mean square residual for representing it by the first N pc PCs is computed as where I L is an L-dimensional identity matrix. The estimated̃is thereby defined as̃(m) def = −r(m). 5. For a fixed threshold value y c , we assign negative to the mth profile if̃(m) > c and positive otherwise.
The ROC curve is drawn by plotting false-positive rates versus true-positive rates for various y c .

Results and Discussion
We used a data set observed at the locations shown in Figure 3. Each profile is assigned a delayed-mode QC flag by Japan Agency for Marine-Earth Science and Technology (JAMSTEC). We treated profiles with depth spans (the difference between the minimum and maximum depths) of more than 1,000 m, and each profile had approximately L ∼ 100 observation points. The number of profiles was M = 8.2 × 10 4 , and the training data were randomly chosen from these profiles. After applying the lead-lag transformation, each profile was converted into the signature up to degree n = 6.
The machine learning results are summarized by the histogram of estimated values̃for normal samples (y = 1), and the histogram for bad samples (y = 0). Figures 4 and 5 show the histograms when 40% of the data are used for training and the remaining 60% are used for cross validation. We can see that learning is performed properly because there is little difference between the identification of training data and the cross validation. In particular, this approach never misidentifies negative (normal) profiles if the appropriate cutoff y c is used, but it may accept positive (bad) profiles with a probability of 0.6 when y c = 0.5. This property is also reflected in the tendency of the ROC curve ( Figure 6) to be almost tangential to the horizontal axis when x is small but not tangential to y = 1 when y is large. The histogram for positive samples has two clear peaks, which suggests that the ambiguity is not caused by the judgment by the machine learning but by the fact that the original QC flag had a criterion that cannot be decided only by the shape. For example, the original QC, encoded in y, may have a criterion about deviation from climatological variation, which cannot always be detected from profile shape. Moreover, the original QC is partly done through visual checking, for which the criteria can fluctuate between checks. Obviously, both are not represented by the signature, which is static and shape oriented. Figures 7 and 8 show the histograms when 2.5% of the data are used for training and the remainder are used for cross validation. In this case, there is a clear tendency of overlearning, which indicates that the number of learning samples, 2.5%, is insufficient.  The performance of a method can be measured by the area under the ROC curves (AUC). Comparing that for the experiments with various ratios of learning samples, we notice that overlearning occurs when the ratio is less than 20% (Figure 9).
We also compared the results of the experiments with various weights of the regularization term by the AUC. The number of terms under the summation over the set labeled by I in Equation 16 is dependent on . Therefore, if we increase the degrees of freedom of the coefficients w by using a smaller , the performance of the reproduction capability increases. However, the estimation capability begins to saturate at approximately 6,700 degrees of freedom (Figure 10), where = 10 −5 is used. At that point, the complexity appears to become appropriate.
To confirm the efficacy of lead-lag transformation, we performed a similar experiment as in Figures 4 and 5, except without lead-lag transformation. We set = 10 −5 and the proportion of training data to 0.4. Figure 11   shows the ROC curves for the experiment. The curves for the case with lead-lag are to the upper left of those for the case without lead-lag, which indicates that the lead-lag transformation helps improve the estimation of the QC flag.
As a reference case using another representation of the shape, we performed PCA experiments with N pc = 50 and 100 PCs. The proportion of training data was set to 0.4, the same as for the signature case. Figure 12 depicts the ROC curves for the PCA experiments. Although the PCA method also exhibits a considerable skill, the curves stay to the lower right of those for the signature method, which indicates that the signature method is more effective than the PCA method in estimating the QC flag. Meanwhile, the computational cost for the signature method (elapsed time of 573 min by serial processing) is higher than that for the PCA method (elapsed time of 101 min by serial processing), but the required time can be reduced by parallelization or some other algorithmic means, if necessary. Note that the former additionally requires converting each data sequence into the truncated signature, which can be performed separately for each profile as preprocessing (elapsed time of 0.12 min per profile).  Further, comparison with the data from the ARGO intercomparison project was performed. The performances of the real-time assignment of QC flags (Wong et al., 2020) by several institutes are shown in Wedd et al. (2015), when the corresponding assignments by the delayed-mode QC are regarded as the ground truth. Because the sets of profile data differ from those in our case, a direct comparison is not strictly relevant but will still serve as a measure of the performance. Figure 13 shows the false-positive versus true-positive rates for those samples in comparison to the signature case. Apart from the pressure data, all the points for real-time QC data lie on the bottom right side of our ROC curve. This suggests that the signature method may assign the QC flags more efficiently than the real-time QC procedure does, provided that the past assignment Figure 11. ROC curves for the cases with and without lead-lag transformation. The cases with lead-lag transformation (red and blue) are compared to that without lead-lag transformation (green); identification of the training data (broken curves) and cross validation (solid curves) are depicted. The horizontal axis is the false-positive rate, and the vertical axis is the true-positive rate.   Table 5 of Wedd et al. (2015).

10.1029/2019EA001019
results are ready for use. Another advantage of the signature method is that it assigns the flags consistently to all the components (P, S, T) with higher reliability.
Overall, we found that machine learning using the signature method can learn the existing QC flags of Argo profiles and automatically assign the flags to new profiles, but it sometimes overlooks bad samples because of the ambiguity inherent in the original QC flag. Comparative study shows the signature method has a higher skill in estimating the flag than other conventional methods, including the one with the PCA representation and the operational assignments of real-time QC.

Conclusions
In this study, we first demonstrated that the shape of a profile from the Argo ocean observing array can be represented by the iterated integrals. Then, we constructed a model for the function that assigns a QC flag to the shape of a profile, which is expressed as a weighted sum of the iterated integrals.
Subsequently, we performed supervised learning for the weights using the existing QC flags for training data and demonstrated via cross validation that it has good performance in estimating flags for unknown data.
A comparative experiment conducted using the PCA method showed that the signature method, in combination with lead-lag transformation, outperforms the PCA method in estimating the QC flag. This suggests that the signature method is superior to the conventional machine learning technique.
This algorithm can potentially enable automatic assignment of QC flags to new Argo data. The significance of the algorithm is that it objectively and automatically assigns the QC flag only on the basis of past knowledge about the quality of data without imposing any ad hoc rules. Hence, it should enable more objective and efficient QC compared to traditional manual methods or rule-based machine learning.
The signature method is quite effective for expressing the shape of an Argo profile and its nonlinear function quantitatively. The rationale for this advantage is that a nonlinear and complicated function for assigning QC flags can be transformed into a linear combination of the iterated integrals through algebraic transformation (shuffle product) without introducing any errors. This is superior to conventional multivariate regression models, which approximately regard nonlinear dependencies as linear dependencies. Along this line, we can express, as a function of signature, not only QC flags but also any oceanic phenomena.
One application of the signature method is assimilation of the signature of observational data into a general ocean circulation model. For example, we can convert a vertical sequence of observational data and that of model data into iterated integrals. We then construct a cost function that compares the signatures for model and observation, rather than directly comparing the state vectors composed of temperature and salinity at each depth. Although a cost term is for a single horizontal and temporal point, data assimilation, in particular the four-dimensional variational method, can combine the effects from multiple terms via model integration and adjoint integration. By doing so, we gain the advantage that the projection of a vertical profile onto any ocean phenomena attains a linear form, which will result in efficient data assimilation. This is expected because many diagnoses for oceanic conditions are written in terms of iterated integrals, as illustrated in section 2 and Appendix B.

Appendix A: Picard Iteration
To understand the notion of signature, consider how the theory of rough path treats a data sequence acting on a system. Suppose we have a system of ordinary differential equations with respect to Y forced by a path X : where Y is the jth component of vector Y , and F i k is the i,j,kth component of three-dimensional tensor F. Performing the Picard iteration yields a solution: where f is the Coriolis parameter, u,v are velocity, is density, and x,y are the longitudinal and latitudinal coordinates, respectively. For a fixed latitude y, by performing integrations along the x direction and then the P direction, we obtain an estimate for the meridional velocity as where we set v(x ′ ,P 0 ) = 0 as the layer of no motion. Integrating again along the P direction, we obtain the meridional flow rate as where the unit is in [kg s • − 1 ] because of the p coordinate.
Let ∈ [0,1] be a parameter for the order of observational points in a profile. Evaluating the density in Equation B4 with the state equation , we have which has iterated integrals as independent variables. Notice that the shuffle product property transcribes this as a linear combination of iterated integrals. Substituting this into Equation B4 finally yields