Evaluation method for moisture content of oil-paper insulation based on segmented frequency domain spectroscopy: From curve fitting to machine learning

In recent years, frequency domain spectroscopy (FDS) has often been used to evaluate the state of oil-paper insulation in power transformer bushings. However, it is still very difficult to evaluate the moisture content accurately and quickly. To solve this problem, this paper proposes an intelligent algorithm based on random forest regression (RFR) to construct an efficient evaluation method from segmented FDS curves. Furthermore, the characteristics of FDS curves were studied and the method was compared with support vector regression (SVR) and deep neural networks (DNN). The results show that the dielectric loss and the real and imaginary parts of the complex capacitance all shift upward as the moisture content increases, so they can be used as input features of the evaluation model. The moisture content evaluation accuracy of the RFR model over the whole frequency band is higher than that of the SVR and DNN models. As the lower cut-off frequency (the FDS test stop frequency) increases, the FDS test time is greatly shortened, and the accuracy of the RFR model can still meet the


INTRODUCTION
The bushing is the key connection component between the power transformer and the grid. Among bushing types, the capacitive oil-paper bushing plays a vital role in the safe and stable operation of power systems [1][2][3]. Moisture is one of the most important factors affecting the insulation state of a bushing [4,5]. In paper [6], it was found that the catalytic effect of moisture on cellulose degradation is much more serious than that of oxygen. Increased moisture also causes the conductivity and the temperature of the oil-paper insulation to rise. Furthermore, the water vaporizes to generate bubbles, which lowers the partial discharge inception voltage and the breakdown voltage. Each time the moisture content doubles, the life of the insulation is halved [7]. In recent years, frequency domain spectroscopy based on dielectric response theory has attracted the attention of many scholars because of its low test voltage, simple wiring, and non-destructive nature [8,9]. However, because the low frequency points take a long time to measure, the complete FDS test process usually takes several hours. This long test time seriously restricts the schedule of on-site maintenance. Therefore, how to extract the moisture content information from FDS curves accurately and quickly has become a hot topic.
In paper [10], it was found that the FDS curve integral value S_tanδ and the DC conductivity σ_DC have a non-linear relationship with the moisture content of oil-paper insulation. In papers [11,12], FDS was used in the high frequency band and polarization and depolarization current (PDC) measurement in the low frequency band to obtain the response function C(f) of the oil-paper insulation of a transformer, and the moisture content was evaluated by comparison with a database. The above results can be used to evaluate the moisture content of oil-paper insulation, but these methods are mostly empirical formulas based on curve fitting, which have problems such as a complex calculation process, overfitting, and poor generalization performance. With the development of artificial intelligence technology, some scholars have begun to use machine learning to evaluate the moisture content of oil-paper insulation equipment. Mousavi et al. [13] used a genetic algorithm to find the equivalent circuit parameters of a transformer and eliminated the influence of temperature on the FDS curve through a neural network. Ma et al. [14] used a support vector machine (SVM) to evaluate the moisture content of transformers, where the complex permittivity in the FDS curves was the sample attribute and the moisture content of the oil-paper insulation was the label. Fofana et al. [15] used FDS data to train a supervised neural network with 44 input layer neurons and 2 output layer neurons to evaluate the moisture content, and accurately distinguished the influence of moisture and aging on the FDS curve, which is of great significance for evaluating the insulation state of oil-paper. Similarly, a method to diagnose the aging and moisture of bushings was proposed based on multiclass least squares support vector machines optimized by the cuckoo search algorithm [16]. However, it is usually difficult to find a suitable kernel function for SVM, resulting in poor evaluation performance. DNN also has many problems, such as a complex network structure and slow learning speed. In addition, these models all use the whole frequency band data to evaluate moisture content, which is not conducive to reducing the FDS test time.
IET Sci. Meas. Technol. 2021;15:517-526. wileyonlinelibrary.com/iet-smt
Because the random forest, which is built on decision trees, introduces data disturbance and attribute disturbance, it has many advantages, such as a simple structure, no need for feature selection, fast learning speed, and a low risk of falling into a local extremum [17][18][19]. It has been widely used in biology, medicine, information engineering and other fields, and can be used as an intelligent algorithm to identify the moisture content of oil-paper insulation [20,21].
To address the long time needed to obtain input data and the low accuracy when evaluating the moisture content of oil-paper insulation, this paper used segmented FDS curves as the input data set and selected RFR to build the evaluation model. The evaluation accuracy of the RFR, SVR and DNN models on data with different lower cut-off frequencies was also compared, which showed that RFR can evaluate the moisture content of oil-paper insulation accurately and quickly.

Basic theory of frequency-domain spectroscopy
In the time domain, the step response of the total current density, J(t), within an insulation system generated by the electric field strength, E(t), is given by Equation (1), where σ_0 is the DC conductivity and ε_0 is the vacuum permittivity.
Taking the Laplace transform of Equation (1) gives Equation (2), where χ′(ω) and χ″(ω) are the real and imaginary parts of the complex susceptibility. The complex permittivity follows from Equation (2) as Equation (3), and hence the dielectric dissipation factor in the frequency domain is given by Equation (4). Losses caused by both conductance and polarization are included in C″. The complex permittivity and the dielectric dissipation factor both depend on the frequency.
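For orientation, the relations referred to as Equations (1)-(4) have a standard form in dielectric response theory. The block below is a sketch of that textbook form and may differ in notation from the equations intended here: the high-frequency relative permittivity ε_∞ and the dielectric response function f(t) are assumed symbols that the surrounding text does not define.

```latex
\begin{align}
J(t) &= \sigma_0 E(t) + \varepsilon_0 \frac{\mathrm{d}}{\mathrm{d}t}
        \left[\varepsilon_\infty E(t) + \int_0^t f(t-\tau)\,E(\tau)\,\mathrm{d}\tau\right] \\
J(\omega) &= \mathrm{j}\omega\varepsilon_0
        \left[\varepsilon_\infty + \chi'(\omega)
        - \mathrm{j}\left(\frac{\sigma_0}{\varepsilon_0\omega} + \chi''(\omega)\right)\right] E(\omega) \\
\varepsilon^*(\omega) &= \varepsilon'(\omega) - \mathrm{j}\varepsilon''(\omega)
        = \varepsilon_\infty + \chi'(\omega)
        - \mathrm{j}\left(\frac{\sigma_0}{\varepsilon_0\omega} + \chi''(\omega)\right) \\
\tan\delta(\omega) &= \frac{\varepsilon''(\omega)}{\varepsilon'(\omega)}
        = \frac{C''(\omega)}{C'(\omega)}
        = \frac{\sigma_0/(\varepsilon_0\omega) + \chi''(\omega)}{\varepsilon_\infty + \chi'(\omega)}
\end{align}
```

In this form the conduction term σ_0/(ε_0 ω) grows as the frequency falls, so the loss part is increasingly dominated by conductivity at low frequency.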

Preparation of oil-paper bushing samples
In order to simulate the real structure of field capacitive oil-paper bushing, insulation paper units with thickness of 1.04 mm, length of 600 mm and width of 100 mm were prepared [22], as shown in Figure 1.
First, the glassware was wiped clean with alcohol and placed in a vacuum oven. It was dried continuously at 105 °C/100 Pa for five hours to ensure that there was no residual moisture. The insulation paper units were then spread out in the oven and dried at 105 °C/100 Pa for 48 hours. After drying, the moisture content of the paper was measured with the Karl Fischer Coulometer KFT831 and was less than 0.5%.
The new 25# Karamay transformer oil was dried at 105 °C/100 Pa with the same process for 72 hours. It should be noted that transformer oil and insulation paper units cannot be dried in the same oven, because when heated the transformer oil or oil stains would volatilize into oil vapour and be adsorbed on the surface of the insulation paper, thereby reducing the ability of the insulation paper units to exchange water with the air. This not only affects the dryness of the insulation paper units, but also makes it difficult to control the moisture content accurately during natural water absorption.
Since cellulose is strongly hydrophilic and mineral oil is strongly hydrophobic, 97% of the water is stored in the paper, while very little water is dissolved in the oil. Therefore, only the insulation paper was dampened in the experiment, and the transformer oil was kept dry. The paper samples were taken out and immediately placed on a high-precision electronic balance for weighing, and the 110 dry insulation paper samples were divided into four groups containing 28, 28, 27 and 27 samples respectively. Samples with an initial moisture content of 0.41-5.08% were prepared by controlling the duration of natural water absorption, with the moisture content increasing in steps of 0.2%. The samples with moisture content greater than 6% were obtained by artificial humidification. Bushings in the field are usually not so severely damp; the purpose of these samples is to observe whether the variation pattern of the FDS curves at high moisture content is consistent with that at low moisture content, and to provide more data for the models to improve the evaluation accuracy.
In order to eliminate the experimental error and ensure the accuracy of the measurement results, the Karl Fischer Coulometer KFT831 was used to titrate the moisture content of all samples three times, and the average of these results was taken as the final result.
After that, the prepared samples were placed in an oil-filled sealed tank for 48 hours, so that the water distribution reached equilibrium naturally and the bubbles in the insulation paper escaped.

FDS test and analysis
Oil-paper insulation samples with different moisture contents were obtained according to the above method. To obtain stable test results, the temperature was held constant at 40 °C during the test. Seven groups of samples with moisture contents of 0.41%, 1.10%, 2.03%, 2.84%, 3.91%, 5.08% and 6.11% were selected for FDS testing with the DIRANA frequency domain dielectric response equipment. The test frequency band was 1 mHz-5 kHz and the peak voltage was 200 V. After that, the Karl Fischer Coulometer KFT831 was used to recalibrate the moisture contents. The FDS curves of insulation paper samples with different moisture contents are as follows. As shown in Figure 2, the tanδ-f curves move up as the moisture content increases. As the moisture content of the oil-impregnated paper increases, the number of polar molecules in the medium increases, resulting in greater polarization loss. The dissociation rate of impurity ions also increases with moisture content, which raises the carrier concentration, the overall conductivity of the medium and the conductivity loss. At the same time, the polarization establish-
The tanδ-f curves with moisture contents of 0.41%, 1.10%, 2.03% and 2.84% show a tick shape. For the samples with moisture contents of 3.91%, 5.08% and 6.11%, the tanδ-f curves increase slightly in the low frequency band but increase significantly in the middle and high frequency bands.
As shown in Figure 3, with the increase of moisture content, the C′-f curves remain consistent in the high frequency band (100 Hz-5 kHz). The C′-f curves with moisture content of 0.41%, 1.10%, 2.03%, 2.84%, 3.91% mainly change when the frequency is less than 10 Hz. The C′-f curves with moisture content of 5.08% and 6.11% change in the mid-range, and rise sharply with the increase of the moisture content.
FIGURE 4 C″-f curves
As shown in Figure 4, the C″-f curves are similar to the tanδ-f curves, and the overall trend is upward. The C″-f curves with moisture contents of 0.41%, 1.10%, 2.03% and 2.84% show a tick shape over the whole frequency band. The C″-f curves with moisture contents of 3.91%, 5.08% and 6.11% have a "straight line" shape, and the slope is approximately −1 in the frequency band from 1 mHz to 10 Hz. The slope of the C″-f curves with moisture contents of 0.41%, 1.10%, 2.03% and 2.84% is approximately −1 in the frequency band from 1 mHz to 1 Hz. The imaginary part of the complex capacitance C″ mainly characterizes the conductance loss and polarization loss processes of oil-paper insulation. The conductivity loss is mainly reflected in the low frequency band. When the frequency is low enough, the polarization process has enough time to complete. Therefore, the conductivity process in oil-paper insulation is dominant, and the C″-f curves in the low frequency band are straight lines whose slope is approximately −1.
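The slope of −1 follows from the conduction term: when conduction dominates, C″ is proportional to 1/ω, so on logarithmic axes C″ falls by one decade per decade of frequency. The Python sketch below checks this numerically with a least-squares fit in log-log coordinates. The constant `k` and the frequency grid are illustrative assumptions, not measured values.

```python
import math

def loglog_slope(freqs, values):
    """Least-squares slope of log10(values) versus log10(freqs)."""
    xs = [math.log10(f) for f in freqs]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical frequency grid from 1 mHz to 1 Hz (not the measured points).
freqs = [1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 5e-1, 1.0]
k = 1e-10  # assumed constant lumping conductivity and sample geometry
c_imag = [k / (2 * math.pi * f) for f in freqs]  # conduction-only C''

slope = loglog_slope(freqs, c_imag)
print(round(slope, 6))
```

A fitted slope that departs clearly from −1 in a given band would suggest that polarization loss, rather than conduction, contributes noticeably there.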
where Q is the number of samples, θ_n is an independent and identically distributed random variable, and N is the number of regression trees. If x_p is used as input data, the importance I_q of x_p on the tree is: where x_p is the data measured at a certain moisture content condition, x_n is the out-of-bag (OOB) data, Q_OOB is the number of out-of-bag samples, f′(x_n) is the nth sample value in Hence, the importance of x_p in the random forest is:

Feature subset selection
In high-dimensional space, problems such as sparse data samples and increased computational complexity often occur. Selecting a few important features as input features is an effective way to mitigate this "curse of dimensionality". The initial feature space was composed of the dielectric loss tanδ, the real part of the complex permittivity ε′, the imaginary part of the complex permittivity ε″, the real part of the complex capacitance C′ and the imaginary part of the complex capacitance C″. To remove redundant features, the importance of each feature was obtained using an evaluation criterion based on minimizing the correlation coefficient.
Bootstrap samples were indexed by b = 1, 2, …, B, where B is the number of training samples. We set the initial value of b to 1, and generated several regression trees h(x) and the out-of-bag data L_OOB according to Equation (5). Then, we used h(x) to regress the out-of-bag data, perturbed the value of the feature X_j (j = 1, 2, …, 5), and calculated the importance of each feature according to Equation (7). The results are shown in Figure 5.
The imaginary part of the complex capacitance C″ has the largest importance, which is 0.368. This is because, compared with the other curves, the difference between the maximum and minimum values of C″ at any frequency point is more than two orders of magnitude, so the curves for different moisture contents are the most clearly separated.
Because C″ is proportional to ε″ and C′ is proportional to ε′, the importance values within each pair are close to each other. Therefore, only one of the complex capacitance or the complex permittivity was selected to represent these characteristics. This paper selected C″, C′ and tanδ as the input features and rearranged the data to obtain a new sample data set.
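The perturbation step described above can be illustrated with a generic permutation-importance sketch in Python: one feature column of the held-out data is shuffled, and the resulting increase in mean squared error is taken as that feature's importance. This is the general idea only and is not claimed to reproduce Equation (7); the helper names, the toy data and the stand-in model `toy_model` are all hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, targets, seed=0):
    """Increase in MSE on held-out rows when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        perturbed = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(targets, [predict(r) for r in perturbed]) - baseline)
    return importances

# Toy held-out data: the target depends on feature 0 only.
rows = [[float(i), float(i % 3)] for i in range(30)]
targets = [2.0 * r[0] for r in rows]

def toy_model(row):
    """Hypothetical stand-in for a trained regressor."""
    return 2.0 * row[0]

scores = permutation_importance(toy_model, rows, targets)
print(scores)
```

Dividing each score by the sum of all scores would give normalized importances that add up to 1, which is one common way to report them.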

Parameters selection
Parameter selection is one of the most important factors affecting model performance. This paper used an improved grid search method based on rectangular expansion to find the optimal parameters of RFR, SVR and DNN. The parameters that affect the prediction performance of RFR are the number of decision trees, the maximum number of features and the maximum depth of the trees. The parameter selection steps are as follows: 1. The objective function is to make the coefficient of determination R² as close as possible to 1.
where SST is the total sum of squares, SSR is the sum of squares for regression, y is the measured value of the moisture content of the insulation paper samples, as shown in the testing set in Table 2, ȳ is its mean value, and ŷ is the estimated value of the sample moisture content calculated by the evaluation models. The coefficient of determination R² is used to measure the similarity between estimated and measured values. A value close to 1 indicates the best estimate.
2. Since the maximum number of features is 3, it is only necessary to determine the number of decision trees and the maximum depth of trees.
In the first traversal, the number of decision trees takes values e ∈ [E_min, E_max] and the maximum depth of the trees takes values f ∈ [F_min, F_max], which together form a two-dimensional region. We divided the region into S × T grids with steps a = (E_max − E_min)/S and b = (F_max − F_min)/T, so that the parameter combination of each grid point is indexed by s = 1, 2, 3, …, S and t = 1, 2, 3, …, T. As shown in Figure 6, the objective function value of each grid point was calculated separately, and the rectangular area with the largest objective function value was selected as the new parameter value range. The mesh was then redivided with smaller steps. After cross-validation, the optimal parameters of the three models are shown in Table 1.
This method can ensure that all possible regions around the approximate optimal combination can be searched, and it is easier to obtain the global optimal solution. Moreover, the selection of the step size is not directly related to the results, which enhances the adaptability of the grid method.
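A minimal Python sketch of such a coarse-to-fine grid search is given below. It is an interpretation of the procedure described above, not the authors' implementation: the rule for shrinking the rectangle (one grid step on either side of the best point), the number of rounds, and the objective `toy_r2` are assumptions. In practice the objective would be the cross-validated R² of the model, and integer parameters such as the number of trees would be rounded.

```python
def refine_grid_search(objective, e_range, f_range, s=5, t=5, rounds=3):
    """Coarse-to-fine grid search that shrinks the box around the best point."""
    (e_min, e_max), (f_min, f_max) = e_range, f_range
    best = None
    for _ in range(rounds):
        a = (e_max - e_min) / s  # step along the first parameter
        b = (f_max - f_min) / t  # step along the second parameter
        for i in range(s + 1):
            for j in range(t + 1):
                e, f = e_min + i * a, f_min + j * b
                score = objective(e, f)
                if best is None or score > best[0]:
                    best = (score, e, f)
        # New rectangle: one step on each side of the best point, clipped
        # to the original search range.
        _, e_best, f_best = best
        e_min, e_max = max(e_range[0], e_best - a), min(e_range[1], e_best + a)
        f_min, f_max = max(f_range[0], f_best - b), min(f_range[1], f_best + b)
    return best

def toy_r2(e, f):
    """Hypothetical smooth objective with its maximum at e = 600, f = 12."""
    return 1.0 - ((e - 600.0) / 1000.0) ** 2 - ((f - 12.0) / 20.0) ** 2

score, e_opt, f_opt = refine_grid_search(toy_r2, (100.0, 1100.0), (2.0, 22.0))
print(score, e_opt, f_opt)
```

Each round evaluates only (S + 1) × (T + 1) points, so three rounds reach a resolution that a single uniform grid would need far more evaluations to match.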

Model building
The dielectric loss tanδ, the real part of the complex capacitance C′ and the imaginary part of the complex capacitance C″ of the FDS curves at 40 °C/1 mHz-5 kHz were selected as input attributes, and the moisture content of the oil-paper insulation samples was taken as the output label.
There are 55 sets of sample data, and each set contains 17 frequency points. In Section 2.2, 110 samples were divided into four groups. According to the stratified sampling method, in the first and second group, 28 paper samples with the moisture content interval close to 0.2% were selected as the training set, and in the third and fourth group, 27 paper samples with little difference in corresponding moisture content from the training set were selected as the testing set. The data are shown in Table 2. The model building process is shown in Figure 7. The construction process of the RFR is more complicated than that of the SVR and DNN. The characteristics of the three models have been marked in the flowchart. Among them, RFR introduces data disturbance and attribute disturbance, which can enhance the generalization performance of the model and prevent overfitting. At the same time, by combining the results of multiple decision trees, the evaluation accuracy of the model can be improved.
The RFR model building process is as follows.
1. Data sets with the same size as the training set were extracted using the bootstrap sampling method.
2. The optimal parameters of the model were obtained using the improved grid search method based on rectangular expansion, and the generalization performance of the learner was evaluated through tenfold cross-validation.
3. For each node of the 600 decision trees, a subset containing two attributes was selected randomly from the attribute set, and then the optimal attribute was selected from this subset. The optimal partition attribute was obtained by the Gini index, $\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|y|} p_k^2$, where p_k (k = 1, 2, …, |y|) is the proportion of class k in the sample set D. The regression results of the 600 decision trees are arithmetically averaged to get the final model output.
4. Since there are several data points for each moisture content, the predicted results are averaged and compared with the real value. We used the coefficient of determination R² to measure the similarity between estimated and measured values, as shown in Equation (8).
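The steps above can be sketched in plain Python as a small bagged ensemble. This is a simplified illustration under stated assumptions, not the model used here: each tree is a depth-1 stump rather than a tuned-depth tree, splits are chosen by squared error rather than the Gini index, only 50 trees are built instead of 600, and the toy feature rows (standing in for tanδ, C′ and C″) and targets are invented.

```python
import random

def fit_stump(rows, targets, feature_ids):
    """Depth-1 regression tree: best single split by squared error."""
    best = None
    for j in feature_ids:
        for threshold in sorted({r[j] for r in rows})[1:]:
            left = [y for r, y in zip(rows, targets) if r[j] < threshold]
            right = [y for r, y in zip(rows, targets) if r[j] >= threshold]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, ml, mr)
    if best is None:  # every candidate feature is constant in this sample
        mean = sum(targets) / len(targets)
        return lambda row: mean
    _, j, threshold, ml, mr = best
    return lambda row: ml if row[j] < threshold else mr

def fit_forest(rows, targets, n_trees=50, max_features=2, seed=0):
    """Bagging with a random feature subset per tree; output is the mean."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]           # data disturbance
        feats = rng.sample(range(n_features), max_features)  # attribute disturbance
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feats))
    return lambda row: sum(tree(row) for tree in trees) / len(trees)

# Invented toy data: three features that all rise with the target.
rows = [[0.01 * i, 1.0 + 0.001 * i, 0.1 * i] for i in range(1, 21)]
targets = [0.3 * i for i in range(1, 21)]

model = fit_forest(rows, targets)
print(model(rows[0]), model(rows[-1]))
```

Averaging many trees, each trained on a different bootstrap sample and feature subset, is what gives the ensemble its resistance to overfitting compared with a single tree.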

QUANTITATIVE EVALUATION OF MOISTURE CONTENT
In order to evaluate the moisture content of oil-paper insulation more quickly, this paper used RFR to identify the moisture content of FDS curves with different lower-cut-off frequencies, and compared the results with SVR and DNN models. The divided frequency bands are shown in Figure 8.
The data to be tested cover five different frequency bands: 1 mHz-5 kHz, 0.01 Hz-5 kHz, 0.1 Hz-5 kHz, 1 Hz-5 kHz and 10 Hz-5 kHz. For convenience, they are numbered 1-5. As a standard practice, at least two cycles of the sinusoidal test voltage are recorded to accurately ascertain the amplitude and phase differences between the voltage and current signals. Under field conditions, on the other hand, testers usually determine the test time according to the trend and amplitude of the initial stage (100 mHz-5 kHz) of the FDS curves. When the initial amplitude is small, the measurement is stopped at a higher frequency; otherwise, the full frequency test is carried out. Here we only considered the laboratory test condition. Therefore, the times required to obtain the data of these frequency bands are 51, 23, 5, 1 and 1 min respectively.
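The effect of the lower cut-off frequency on test time can be illustrated with a short Python sketch. Two cycles at 1 mHz alone take 2000 s, or about 33 min, so the lowest points dominate the sweep. The 1-2-5 frequency grid below is a hypothetical choice, since the actual 17 test points are not listed, and the estimate ignores instrument overhead, so it is not expected to reproduce the reported 51, 23, 5, 1 and 1 min.

```python
def segment(freqs, lower_cutoff):
    """Keep only the frequency points at or above the lower cut-off."""
    return [f for f in freqs if f >= lower_cutoff]

def min_sweep_time_s(freqs, cycles=2):
    """Lower bound on sweep time: `cycles` full periods at every point."""
    return sum(cycles / f for f in freqs)

# Hypothetical 1-2-5 grid spanning 1 mHz to 5 kHz (assumed, not the real points).
decades = [1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3]
freqs = [m * d for d in decades for m in (1, 2, 5)]

cutoffs = (1e-3, 1e-2, 1e-1, 1.0, 10.0)
times = [min_sweep_time_s(segment(freqs, c)) for c in cutoffs]
for cutoff, seconds in zip(cutoffs, times):
    print(f"{cutoff:g} Hz lower cut-off: {seconds / 60:.1f} min")
```

Raising the cut-off by one decade removes the slowest points, so the estimated time drops by roughly a factor of ten each time.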
For the above data of five frequency bands, the evaluation accuracy and time of the RFR, SVR and DNN models are shown in Table 3.
The closer R² is to 1, the better the model performance. The closer MSE, RMSE and MAE are to 0, the better the model performance. As shown in Table 3, the evaluation accuracy of RFR in frequency bands 1 to 3 is higher than that of SVR and DNN under the four metrics, which verifies the evaluation accuracy of the RFR. At the same time, since the SVR model only contains two parameters, the test results can be obtained in a short time after the kernel function is determined. The DNN model contains 7,546,000 connection weight coefficients that need to be adjusted, so the training time is the longest, about 41 s. RFR is an integration of multiple decision trees, and its
In order to further analyse the evaluation accuracy of the model, the accuracy comparison of each frequency band with R² as the index is shown in Figure 9.
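The four metrics can be computed as in the Python sketch below. Here R² uses the common definition 1 − SS_res/SS_tot, which is an assumption: it coincides with an SSR/SST form for ordinary least-squares fits but can differ for other models. The sample values are invented for illustration and are not taken from Table 2 or Table 3.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE for paired measured/estimated values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
    }

# Invented moisture contents in percent.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(measured, estimated)
print(metrics)
```

MSE and RMSE penalize large individual errors more heavily than MAE, which is why reporting all of them alongside R² gives a fuller picture of model accuracy.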
For the frequency band from 1 mHz to 5 kHz, RFR has the highest evaluation accuracy, with an R² of 0.9137; the R² of the SVR model is 0.8969 and the R² of the DNN model is 0.9012.
When the frequency band of the test data is 0.01 Hz-5 kHz, the R² values of RFR, SVR and DNN are 0.8718, 0.8969 and 0.9012 respectively. Compared with the results from 1 mHz to 5 kHz, the R² of RFR is reduced by only 0.0419. In addition, the acquisition time of the 0.01 Hz-5 kHz data is 28 min shorter than that of the whole frequency band data. At the cost of a modest loss in evaluation accuracy, the speed of model evaluation is greatly improved, and the moisture content of oil-paper insulation can still be identified with high accuracy. At the same time, as the lower cut-off frequency increases, the number of input data points is gradually reduced, resulting in a continuous decrease in the accuracy of the model. When the lower cut-off frequency is greater than 0.1 Hz, the accuracy of DNN is higher than that of SVR and RFR. This is due to the small amount of data, which causes all three models to over-fit, so the evaluation accuracy is less reliable in these two frequency bands. Although the test time is reduced accordingly, the moisture content of oil-paper insulation cannot be evaluated accurately.
Therefore, the data in 0.01 Hz-5 kHz frequency band can be used to evaluate the moisture content accurately and quickly, while the data in other frequency bands cannot be used due to their low accuracy.
As shown in Figure 10, the evaluation accuracy of RFR from 0.01 Hz to 5 kHz is higher than that of SVR and DNN. A frequency band containing fewer frequency points provides less data than the whole frequency band, which is equivalent to missing values in the input data and makes it harder for a model to produce accurate results. However, the RFR model is an ensemble learning method based on decision trees. Since a decision tree can assign samples with missing attribute values to different child nodes with different probabilities, RFR is less sensitive to data loss. In contrast, SVR and DNN have no strategy for dealing with missing samples, resulting in lower evaluation accuracy than RFR.
At the same time, all three models tend to overestimate the moisture content of samples with low moisture content. This is because the polarization process at low moisture content is weak, and the changes in tanδ, C′ and C″ are small. The tanδ at low moisture content differs little from that at higher moisture content, which leads to the overall higher evaluation results of these models.

CONCLUSION
In this paper, FDS curves of oil-paper insulation samples with different moisture contents were studied, and a new method for evaluating the moisture content based on RFR was proposed. The experiments show that the tanδ-f curves move up as the moisture content increases; the influence of moisture content on the real part of the complex capacitance C′ is mainly reflected in the low frequency band; and the imaginary part of the complex capacitance C″ increases with moisture content over the whole frequency band.
Among the three moisture evaluation models, RFR has the highest accuracy in evaluating the moisture content represented by the FDS curves at different lower cut-off frequencies. Moreover, when the frequency band of the model input data is 0.01 Hz-5 kHz, the evaluation accuracy of the RFR model does not decline significantly, while the test data acquisition time is reduced by 28 min. Thus, a frequency band containing fewer frequency points can be used to address the long time needed to obtain input data in the moisture content evaluation process, providing a new method to measure the moisture content of oil-paper bushings on site accurately and quickly.