Transfer learning based dynamic security assessment

Correspondence: Hoi Andy Lam, School of Computer Science, The University of Sydney, Building J12, 1 Cleveland Street, Darlington, NSW 2008, Australia. Email: andylamhoi@gmail.com

Abstract: With the increasing deployment of wide-area monitoring systems (WAMS) and phasor measurement units (PMUs), together with artificial neural networks (ANNs) and high-performance distributed computation techniques for smart grid and smart metering environments, online dynamic security assessment (DSA) plays a key role in the early detection of unstable events affecting power system security. This is especially important in post-fault operation, where the time DSA takes to detect an unstable event is critical to emergency remedial control action. However, excessive update training constrains the effective use of ANNs for online DSA in both pre-fault and post-fault operation. This paper describes how transfer learning is successfully employed to shorten the training time for online DSA. Transfer learning also improves validation accuracy when the training dataset of a scratch ANN model is insufficient. In addition, a new approach that combines a densely connected convolutional network (DenseNet) with kernel principal component analysis (KPCA) is proposed to eliminate the traditional step of dimensionality reduction.


INTRODUCTION
In recent years, major power blackouts have occurred in various countries. Together with the increasing integration of renewable energy into the power supply system and the active response of the demand side, these uncertainties bring an unprecedented level of complexity to power system stability. Traditional DSA technology has not been able to provide an effective and timely system security assessment to meet this challenge. The conventional DSA analysis methods are based on time-domain (T-D) simulation techniques and energy functions. However, these are not efficient enough for high-speed online assessment when a large model parameter and contingency set is encountered, and the corresponding remedial control actions are then no longer reliable under dynamic system conditions.
With the wide application of technologies such as WAMS and PMUs in power systems, fast online DSA has been adopted as an effective means of detecting the unstable status of power system stability by continuously feeding the feature data of the power system's operating parameters in real time [1][2][3].
The DSA is applicable to the assessment of the power system security status at both the pre-fault and post-fault stages. It monitors whether the system will remain stable with respect to an anticipated but not yet occurred fault condition by evaluating the pre-fault steady state of the power system features, and it also predicts the system security status under an ongoing post-fault disturbance. As such, the appropriate remedial control action is activated to prevent the power system from operating in insecure regions when it encounters an unexpected fault condition or an ongoing disturbance.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Generation, Transmission & Distribution published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
In a smart grid, the data-driven method based on artificial intelligence (AI) architecture and data mining technology can effectively achieve rapid evaluation for online DSA [4]. It learns a mapping relationship that correlates the input power system features with the corresponding secure/insecure operating regions of the power system. This method can rapidly classify a potential fault into a secure category or an insecure category.
In [5,6], decision trees (DT) and random forests (RF) were used for online DSA. In [7], a support vector machine (SVM) was developed for rotor angle stability analysis. In [8,9], artificial neural networks (ANN) were proposed to perform DSA. In [10], SVM and different neural networks (NN), including probabilistic NN, multilayer perceptron (MLP), recurrent NN with long short-term memory (RNN-LSTM), extreme learning machine (ELM), stacked auto-encoders (SAE) and convolutional neural network (CNN), were tested on power systems with very high accuracy; the authors could not conclude which method was more suitable for transient stability assessment (TSA). In [11], a binary classification CNN model was generated for transient stability prediction, and a multi-class classification CNN model was built for unstable generator identification. The authors demonstrated that the proposed CNN models provided more accurate results than DT, MLP, RF and k-nearest neighbour (KNN) for transient stability prediction and unstable generator identification. In [12], a transfer learning technique was employed to minimise the marginal and conditional distribution differences between the trained data and unknown data in a power system, so that one trained model could assess unlearned faults.
AI is capable of predicting the security status of the power system in real time after being trained on a dynamic security database of the power system. However, if the training dataset is too small, the result is a poor approximation: an over-constrained model will underfit a small training dataset. If the training dataset is too large, it slows down the rapid update of the online AI model needed to identify the potential risk of a power outage. The timing of insecure event identification is crucial to the effectiveness of emergency remedial control action: the earlier the unstable status of the power system is detected, the greater the likelihood that the power system can be returned to a stable condition. Transfer learning can tackle this challenge, improving validation accuracy through the finetuning technique when the training dataset is too small, and shortening the online update time through the feature extraction technique when the training dataset is too large.
Using ImageNet pretrained CNN features, impressive results have been obtained on several image classification datasets, as well as in object detection, action recognition, human pose estimation, image segmentation, optical flow, image captioning and other tasks. In [13], the authors demonstrated that finetuning on ImageNet classes that were not present in PASCAL-DET was sufficient to learn features that were also good for the PASCAL classes. This concept is employed in this paper: the non-electrical features pretrained from ImageNet on an AI model are transferred to improve the performance of the same model on the training dataset obtained from the simulation of the power system. The feature data obtained from the simulation of stability studies, after being selected by KPCA or ReliefF, is converted into a two-dimensional (2D) image input to train the scratch AI model. The input takes the form of a 2D histogram whose two coordinates are the feature samples and the normalised weight properties of each feature sample, representing the distribution of the data. This simpler representation has a lower dimension than the feature dataset selected by KPCA and ReliefF.
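The conversion of selected feature data into a 2D histogram image can be sketched as follows. This is a minimal NumPy illustration; the exact binning and normalisation used in the paper are not specified, so the 64-bin resolution and the scaling choices here are assumptions:

```python
import numpy as np

def features_to_histogram_image(samples, weights, bins=64):
    """Convert feature samples and their normalised weight properties
    into a 2D histogram 'image' suitable as a CNN input. One axis is
    the feature sample value, the other its normalised weight."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.max()            # normalise weights to [0, 1]
    img, _, _ = np.histogram2d(samples, weights, bins=bins)
    img = img / img.max()                        # scale bin counts to [0, 1]
    return img

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)                  # stand-in for selected features
weights = rng.uniform(0.1, 1.0, size=1000)       # stand-in for feature weights
img = features_to_histogram_image(samples, weights)
print(img.shape)
```

Each simulated operating point would produce one such image, so the CNN sees the distribution of the feature data rather than the raw high-dimensional vector.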
In [14], the authors reported that DenseNet outperformed VGG19, ResNet50 and Xception on a biomedical image binary classification task in an application of the transfer learning method with a small dataset. In this paper, a binary classification DenseNet model is used to demonstrate the effectiveness of the transfer learning approach, although DenseNet can also be used for multi-class classification. As DenseNet has a powerful capacity for feature analysis across different image patterns, this advantage is applied to analyse the 2D histograms of simulated OPs as different image inputs.
The pre-fault input data used in this work is converted to 2D histogram image data that is different from, but related to, the dataset on ImageNet. Inductive transfer learning is applicable to this scenario. The performance of the transfer learning approach with ImageNet is then compared to a scratch DenseNet trained on the operating points (OPs) obtained from the Western System Coordinating Council (WECC) 3-machine, 9 bus system [15][16][17]. Smaller datasets from three scenarios on the New England 39 bus system [18] are tested for further validation.
In addition, the transient stability status of a power system following a fault can be predicted early from the measured post-fault values of the feature data obtained from the simulation of stability studies [7]. The feature data collected immediately after clearing a fault is used as input to an SVM classifier to predict the transient stability status. As these feature data serve as training data for the SVM, the approach proposed in this paper can convert them into 2D data, after which transfer learning techniques can be applied.
The main contributions of this paper are as follows; to the best of our knowledge, similar applications have not been reported in the field of DSA.
1. The non-electrical feature dataset of ImageNet is used to improve the performance of an AI model trained on the power system.
2. DenseNet with KPCA is proposed to eliminate the traditional step of dimensionality reduction.
3. The feature extraction technique of transfer learning is applied to shorten the training completion time of DenseNet when the training dataset is too large.
4. The finetuning technique of transfer learning is applied to improve the validation accuracy of DenseNet when the training dataset is too small.
5. A 2D histogram is used to represent the feature data of the power system for training the AI model.

COMMON LEARNING ALGORITHMS
In deep learning, the CNN, as one of the major computational approaches for image recognition and image classification, is widely used in different fields [19][20][21][22][23]. As CNNs draw representational power from extremely deep or wide architectures, new problems of vanishing gradients and model degradation emerge.
The recent network frameworks of ResNet [24], Highway Networks [25] and FractalNets [26] vary in network topology to tackle these drawbacks, but they all share the same core idea: they create short paths from early layers to later layers. DenseNet [27] deviates from these concepts in that network performance is improved through feature reuse and bypass settings of the information flow. It connects each layer to every other layer in a feed-forward fashion, so that some high-level features may be produced in the deeper final layers of the network. This gives it fewer parameters than ResNet, making training easier, and alleviates vanishing gradients and model degradation.
Assume an image $X_0$ is input to a traditional convolutional feed-forward network comprising $L$ layers, where $\ell$ indexes the layer implementing a non-linear transformation $H_\ell(\cdot)$. $H_\ell(\cdot)$ can be a composite function of operations such as batch normalisation (BN), ReLU, pooling or convolution (Conv). The feature output of the $\ell$th layer is denoted $X_\ell$:

$$X_\ell = H_\ell(X_{\ell-1}) \quad (1)$$

DenseNet instead connects the $\ell$th layer to the feature maps of all preceding layers:

$$X_\ell = H_\ell([X_0, X_1, \ldots, X_{\ell-1}]) \quad (2)$$

where $[\cdot]$ refers to the concatenation of the feature maps produced in layers $0$ to $\ell-1$. The input of the $\ell$th layer is therefore not only related to the output of the $(\ell-1)$th layer but also has access to the outputs of all preceding layers. The feature-map size must be the same within a Denseblock so that concatenation across its layers is possible. Transition layers, used for indirect information flow, are set between Denseblocks to achieve down-sampling. A single classifier on top of the network supervises all layers through the transition layers, making the model structure and gradient computation simpler than deeply-supervised nets (DSN), which attach classifiers to every hidden layer.

If the feature maps are viewed as the global state of the network, the training goal of each layer is to determine the updated value to be added to the collective knowledge of its Denseblock. Assuming each non-linear transformation $H_\ell$ produces $k$ feature maps, the $\ell$th layer has $k_0 + (\ell - 1) \times k$ input feature maps, where $k_0$ is the number of channels at the block input. The growth rate $k$ determines how much information each layer contributes to the global state. DenseNet can have very narrow layers: a relatively small growth rate is sufficient to achieve state-of-the-art performance. This characteristic distinguishes it significantly from other neural networks.
DenseNet-B uses a 1 × 1 Conv (bottleneck) before each 3 × 3 Conv to reduce the computational burden, so the improved non-linear transformation of the input feature maps becomes BN-ReLU-Conv(1 × 1)-BN-ReLU-Conv(3 × 3). If a Denseblock contains $m$ feature maps, the transition layer generates $\theta m$ output feature maps, where $\theta$ is the compression factor. The model is called DenseNet-C if $\theta < 1$, and DenseNet-BC if the bottleneck is also used. DenseNet-BC is the most parameter-efficient variant of DenseNet: it requires only about one-third of the parameters of a ResNet with the same model accuracy. Comparing a 1001-layer ResNet with over 10 M parameters and a 100-layer DenseNet-BC with only 0.8 M parameters, the two converge in the same number of training epochs.
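The channel bookkeeping described above can be sketched in a few lines. This is an illustrative calculation, not part of the paper's implementation; the block sizes and growth rate below are example values:

```python
import math

def dense_block_channels(k0, k, num_layers):
    """Input channel count seen by each layer of a Denseblock.

    With growth rate k, layer l (1-indexed) receives the k0 block-input
    channels plus k new feature maps from each of the l-1 earlier layers,
    i.e. k0 + (l - 1) * k."""
    return [k0 + (l - 1) * k for l in range(1, num_layers + 1)]

def transition_channels(m, theta):
    """A transition layer compresses m feature maps to floor(theta * m)."""
    return math.floor(theta * m)

# Example: a Denseblock with 24 input channels, growth rate k = 12, 6 layers.
inputs = dense_block_channels(k0=24, k=12, num_layers=6)
print(inputs)                      # [24, 36, 48, 60, 72, 84]
block_output = inputs[-1] + 12     # last layer's input plus its k new maps
print(transition_channels(block_output, theta=0.5))  # DenseNet-C halves it: 48
```

The linear (rather than multiplicative) growth of channel counts is what keeps DenseNet's parameter count low even at large depth.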

A TRANSFER LEARNING BASED TRANSIENT STABILITY ASSESSMENT FRAMEWORK
Transfer learning is not a concept specific to deep learning, but it differs fundamentally from the traditional machine learning approach of training a model from scratch. Users can re-use knowledge (i.e. features, weights, etc.) from another pretrained model for a related task in a new model. Reference [28] gives the definitions of "transfer learning" as follows.
Given a source domain $\mathcal{D}_S$ and learning task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ and learning task $\mathcal{T}_T$, transfer learning aims to improve the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge in $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$. If the target task $\mathcal{T}_T$ differs from the source task $\mathcal{T}_S$, this is the definition of inductive transfer learning, regardless of whether the source and target domains are the same.

A domain $\mathcal{D}$ consists of two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$ and $\mathcal{D} = \{\mathcal{X}, P(X)\}$.

A task $\mathcal{T}$ consists of two components: a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$, denoted $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, which is not observed but can be learned from training data of pairs $\{x_i, y_i\}$, where $x_i \in X$ and $y_i \in \mathcal{Y}$. The function $f(\cdot)$ can be used to predict the corresponding label $f(x)$ of a new instance $x$. From a probabilistic viewpoint, $f(x)$ can be written as $P(y|x)$, so $\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}$.

When the learning tasks $\mathcal{T}_S$ and $\mathcal{T}_T$ are different, either (1) the label spaces between the domains are different, i.e. $\mathcal{Y}_S \neq \mathcal{Y}_T$, or (2) the conditional probability distributions between the domains are different, i.e. $P(Y_S|X_S) \neq P(Y_T|X_T)$. When the domains are different, either (1) the feature spaces between the domains are different, i.e. $\mathcal{X}_S \neq \mathcal{X}_T$, or (2) the feature spaces are the same but the marginal probability distributions of the domain data are different, i.e. $P(X_S) \neq P(X_T)$. When the target and source domains are the same, i.e. $\mathcal{D}_S = \mathcal{D}_T$, and their learning tasks are the same, i.e. $\mathcal{T}_S = \mathcal{T}_T$, the problem becomes a traditional machine learning problem.
The inductive transfer learning can be considered as multitask learning if there are ample labelled data available from the source domain for target labelled data to induce an objective predictive model in the target domain. It helps to achieve high performance in the target task by transferring knowledge from the source task.
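The definitions above can be encoded directly to check which regime a given source/target pair falls into. The sketch below is purely illustrative; the domain and task descriptors (e.g. "2D OP histograms") are stand-in labels for this paper's setting, not data structures from any library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    feature_space: str      # the feature space X
    marginal: str           # identifier for the marginal distribution P(X)

@dataclass(frozen=True)
class Task:
    label_space: frozenset  # the label space Y
    predictive_fn: str      # identifier for P(Y|X)

def is_transfer_learning(d_s, t_s, d_t, t_t):
    """Transfer learning applies when D_S != D_T or T_S != T_T."""
    return d_s != d_t or t_s != t_t

def is_inductive(t_s, t_t):
    """Inductive transfer learning: the source and target tasks differ."""
    return t_s != t_t

source = Domain("ImageNet RGB images", "P_imagenet")
target = Domain("2D OP histograms", "P_power_system")
t_src = Task(frozenset(range(1000)), "imagenet_classifier")  # 1000 classes
t_tgt = Task(frozenset({"secure", "insecure"}), "dsa_classifier")

print(is_transfer_learning(source, t_src, target, t_tgt),
      is_inductive(t_src, t_tgt))
```

For the ImageNet-to-DSA setting in this paper both the domain and the task differ, so the problem is inductive transfer learning.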
The feature-representation approach to the inductive transfer learning problem aims at finding "good" feature representations to minimise the domain classification error. If plenty of labelled data in the source domain is available, supervised learning methods can be used to construct a feature representation. In [29], a two-stage approach to domain adaptation was presented: a generalisable feature representation with appropriate weights across different domains is selected to train a general classifier, and the features specifically useful for the target domain are then picked up by employing semi-supervised learning.
The feature-projection mapping approach is used to map the data of each domain from the high-dimensional feature space to a low-dimensional latent feature space. After projection mapping, the marginal distributions of the source domain and target domain data are close to each other in the low-dimensional latent feature space. In this way, the labelled source domain sample data can be used to train the classifier and predict the target test data, and the training model of a supervised learning algorithm can be used. In [30], a new dimensionality reduction method was proposed that minimises the Maximum Mean Discrepancy between the source domain data and the target domain data in the latent space. In [31], a structural correspondence learning (SCL) algorithm was proposed to learn a projection mapping of pivot features from the feature spaces of the source and target domains to a shared, low-dimensional, real-valued feature space, by defining a set of pivot features on the unlabelled data from both domains.
The idea behind supervised feature-construction methods for the inductive transfer learning setting is similar to multi-task learning: a low-dimensional representation shared across related tasks is learned. The common features can be learned by solving an optimisation problem of the following form:

$$\min_{A,\,U} \; \sum_{t \in \{S, T\}} \sum_{i=1}^{n_t} L\!\left(y_{t_i}, \langle a_t, U^{\mathsf{T}} x_{t_i} \rangle\right) + \gamma \,\|A\|_{2,1}^2 \quad (3)$$

In this equation, $S$ and $T$ denote the tasks in the source domain and target domain, $A = [a_S, a_T]$ is the matrix of task parameters, $U$ is an orthogonal matrix mapping the original high-dimensional data to a low-dimensional representation, and $L$ is a loss function. The learned new representation can reduce the classification error of each task. The $(r, p)$-norm of $A$ is defined as $\|A\|_{r,p} := \big( \sum_{i=1}^{d} \|a^i\|_r^p \big)^{1/p}$. The above optimisation problem estimates the low-dimensional representations $U^{\mathsf{T}} X_T$, $U^{\mathsf{T}} X_S$ and the parameters $A$ of the model at the same time. It can be further transformed into an equivalent convex optimisation formulation and solved efficiently.

A SCRATCH DENSENET MODEL FOR TRANSIENT STABILITY ASSESSMENT
The later the training of the AI model is completed, the lower the likelihood that the risk of a blackout can be detected early.
In practice, it is common to pretrain a DenseNet on a very large dataset and then use it either as an initialisation or as a fixed feature extractor. This helps when a scratch deep neural network for stability prediction takes too long to complete training on a large feature dataset of the power system. Also, if the training data is insufficient, e.g. when the power system undergoes topology changes, validation accuracy is poor.
In this paper, ImageNet is used to pretrain the proposed DenseNet. Then, transfer learning techniques are applied so that the pretrained model outperforms the scratch DenseNet model trained only on the dataset obtained from the simulation of stability studies. The training dataset is described in Section 5. The two major transfer learning scenarios are as follows:

a. DenseNet as a fixed feature extractor: The weights of the entire DenseNet are frozen except those of the final fully connected layer. This last fully connected layer is replaced with a new one with random weights, and only this layer is trained. The feature codes of all images are extracted to train this layer as a classifier, while the rest of the DenseNet is treated as a fixed feature extractor for the new dataset.

b. Finetuning the DenseNet: Instead of random initialisation, the weights of the DenseNet pretrained on ImageNet are finetuned by continuing the backpropagation. As the earlier layers of a DenseNet contain more generic features, an option is to finetune only some higher-level portion of the network rather than all the layers. The rest of the training runs as usual.
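Scenario (a) can be illustrated framework-free. The sketch below stands a fixed random projection in for the frozen pretrained DenseNet and trains only a new logistic-regression head; all sizes, the toy dataset and the learning rate are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pretrained network: a fixed random projection
# followed by a tanh non-linearity. In the paper this role is played by
# a DenseNet pretrained on ImageNet.
W_frozen = 0.1 * rng.normal(size=(20, 8))

def extract_features(x):
    """Fixed feature extractor: these weights are never updated."""
    return np.tanh(x @ W_frozen)

# Toy binary dataset standing in for secure/insecure operating points.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Only the new classifier head (logistic regression) is trained.
feats = extract_features(X)
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # sigmoid
    w -= 0.5 * feats.T @ (p - y) / len(y)        # gradient step on the head only
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((feats @ w + b) > 0) == (y > 0.5))
print(f"training accuracy of the new head: {acc:.2f}")
```

Because only the small head is updated, each training step is far cheaper than backpropagating through the whole network, which is why feature extraction shortens training completion time.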
As described in Section 3, if the feature dataset of the power system is used to pretrain the DenseNet before the feature extraction of transfer learning is performed, the approach becomes traditional machine learning. If the DenseNet has already been trained on a feature dataset, the proposed transfer learning approach provides the flexibility to re-use this DenseNet model as a pretrained deep neural network. The weights of the DenseNet are then updated from its last checkpoint through the finetuning or feature extraction techniques and saved for later use.
Since the target dataset is similar in context to the original dataset of the pretrained DenseNet at the last checkpoint, the feature extraction technique can rapidly train the DenseNet to a validation accuracy close to that of the pretrained model. If the new dataset is very small, training a DenseNet from scratch on it may affect its ability to generalise, often causing overfitting. The finetuning technique tackles this problem and obtains a better result.
The flow chart of transfer learning-based framework is shown in Figure 1.

DATABASE GENERATION
In [32], five indices were reviewed for the assessment of power system transient stability: the maximum rotor angle deviation, maximum speed deviation, maximum acceleration, transient rotor angle severity index (TRASI) and transient stability index (TSI). The maximum rotor angle deviation, maximum speed deviation and maximum acceleration are specific to an individual generator, feeding back the running characteristics of each generator, but they do not provide a specific instability value for the entire system. TRASI and TSI measure the stability status of the whole system. TRASI measures the maximum rotor angle difference before and after a system fault to determine severity, but research on this subject has commonly used the angle-based stability index TSI, as described in Equation (4) [32]. Its concept is also straightforward to implement with the Power System Toolbox (PST) software, so it is used as the transient stability index in this paper. According to the criterion of Equation (5), secure and insecure labels are used to classify the transient stability status. The popular WECC 3-machine, 9 bus system and the New England 39 bus system are considered in this paper. The simple 9 bus system demonstrates well how the proposed DenseNet and transfer learning techniques are tested for transient stability classification. The generator at bus #2 is assigned as a wind farm, and the power system network and parameters are obtained from [33]. The effectiveness of the proposed approach is then further validated on the New England 39 bus system.
The power angle-based stability index is defined as follows:

$$\eta = \frac{360^\circ - \delta_{\max}}{360^\circ + \delta_{\max}} \times 100 \quad (4)$$

$$\eta > 0 \;\text{(stable)}, \qquad \eta < 0 \;\text{(unstable)} \quad (5)$$

where $\delta_{\max}$ is the maximum angle separation between any two generators at the same instant in the post-fault response; $\eta > 0$ and $\eta < 0$ correspond to stable and unstable conditions, respectively [34]. Under the above transient stability criterion, 4,900 OPs are sampled, evenly distributed across the variation of the generator output power at bus #2 and the load demand, producing 3,500 pre-fault secure OPs and 1,400 pre-fault insecure OPs for the 9 bus test system. The OPs cover load patterns ranging from 60% to 180% and wind farm output capacity ranging from 0% to 100% of the base OP. A three-phase fault is simulated at bus #4 using the PST software, and the fault is cleared after 0.05 s.
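Computing the TSI of Equations (4) and (5) from simulated rotor-angle trajectories can be sketched as follows; the synthetic swing curves below are illustrative stand-ins for PST output:

```python
import numpy as np

def transient_stability_index(delta):
    """TSI of Equation (4). delta: rotor angles in degrees,
    shape (time_steps, generators), for the post-fault response."""
    # Maximum angle separation between any two generators at the same instant.
    delta_max = np.max(delta.max(axis=1) - delta.min(axis=1))
    return (360.0 - delta_max) / (360.0 + delta_max) * 100.0

def label(delta):
    """Equation (5): eta > 0 -> secure (1), eta < 0 -> insecure (0)."""
    return int(transient_stability_index(delta) > 0)

t = np.linspace(0, 5, 500)
# A stable swing: angle separations stay well below 360 degrees.
stable = np.stack([10 * np.sin(2 * np.pi * t + p) for p in (0.0, 0.4, 0.8)], axis=1)
# An unstable case: one machine's angle diverges (pole slip),
# driving the separation past 360 degrees.
unstable = stable.copy()
unstable[:, 0] += 100 * t

print(label(stable), label(unstable))
```

Applying `label` to every simulated OP yields the secure/insecure targets used to train the classifier.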
Three other smaller datasets of 500 OPs each are generated from the New England 39 bus system using DigSilent PowerFactory software, as shown in Table 1. A Monte Carlo technique is applied to generate random load ratios within the pre-determined range between 0.52 and 1.2. Table 1 records the protection measures for the three scenarios, in which a three-phase fault occurs on a bus at 0.5 s to obtain different stability datasets. In the bus #3 scenario, the fault is cleared at 0.75 s; it generates 207 pre-fault secure OPs and 293 pre-fault insecure OPs, and no transmission line is tripped. In the bus #19 and bus #22 scenarios, the faults are cleared at 0.7 s and 0.6 s, and the transmission lines between buses #19-#33 and #21-#22 are tripped, respectively. The bus #19 scenario generates 280 pre-fault secure OPs and 220 pre-fault insecure OPs; the bus #22 scenario generates 217 pre-fault secure OPs and 283 pre-fault insecure OPs. The validation data is selected as 25% of the above datasets.
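The Monte Carlo load sampling can be sketched as below. This is an illustrative NumPy version only; in the paper the OPs are actually generated inside DigSilent PowerFactory, and the uniform distribution and seed here are assumptions:

```python
import numpy as np

def sample_load_ratios(n_ops, low=0.52, high=1.2, seed=42):
    """Monte Carlo sampling of per-OP load ratios within [low, high],
    mirroring the 500-OP New England 39 bus dataset generation."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=n_ops)

ratios = sample_load_ratios(500)
print(f"{len(ratios)} OPs, load ratio range "
      f"[{ratios.min():.2f}, {ratios.max():.2f}]")
```

Each sampled ratio scales the system load before the fault is applied, so the 500 OPs span the pre-determined loading range.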

Dimensionality reduction
The advantage of feature selection has recently attracted much attention in deep learning studies: it helps to reduce training time, overfitting and the complexity of the feature datasets, and improves data understandability and validation accuracy. The cost of data acquisition may also be significantly reduced if the right subset is chosen.

The power system is naturally high-dimensional in its features, and only those features which characterise the system stability status serve as useful inputs. The ReliefF algorithm [35,36], well known from its application to genetic analyses where epistasis is common, is used in this paper to identify feature interactions without exhaustively checking every pairwise interaction. It significantly reduces the initial dimensionality of the feature datasets with less calculation time than an exhaustive pairwise search. The features are assigned different weights according to the relevance of each feature to the category. In addition, KPCA for dimensionality reduction in non-linear space is used to compare against the effectiveness of the ReliefF algorithm through the validation accuracy of the DenseNet model for transient stability prediction.
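The weighting idea behind ReliefF can be illustrated with a minimal Relief-style sketch (single nearest hit/miss; the full ReliefF used in the paper generalises this to k neighbours and multi-class data). The toy dataset and iteration count are illustrative:

```python
import numpy as np

def relief_weights(X, y, n_iter=200, seed=0):
    """Minimal Relief-style feature weighting for binary labels.
    Features whose values separate the classes gain weight; features
    that differ equally for hits and misses stay near zero."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalise feature diffs
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                            # exclude the point itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest same-class
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest other-class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (span * n_iter)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)        # only feature 0 is informative
w = relief_weights(X, y)
print(w[0] > w[1])                   # the informative feature gets more weight
```

Sorting the resulting weights in descending order is what produces a ranking like that of Table 2.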
Dimensionality reduction has been studied widely for image processing in the deep learning community. Traditional dimensionality reduction methods try to project the original data onto a low-dimensional coordinate system while preserving the critical properties of the original data for a CNN model. In [37], the maximum variance unfolding (MVU) technique was proposed to maximise the variance of the embedding while preserving the local distances between neighbouring observations, extracting a low-dimensional representation of the data. MVU applies principal component analysis (PCA) [38,39] to the kernel matrix K to choose the base eigenvectors for mapping data in KPCA. The concepts of PCA and KPCA are summarised as follows.

PCA
PCA is a transformation that diagonalises the covariance matrix of the feature data $x_j$, $j = 1, \ldots, m$, $x_j \in \mathbb{R}^d$, with the data centred so that $\sum_{j=1}^{m} x_j = 0$:

$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^{\mathsf{T}} \quad (6)$$

The above problem can be solved by the following eigenvalue equation [40]:

$$\lambda v = C v \quad (7)$$

where $\|v\| = 1$, the $\lambda \ge 0$ are the eigenvalues and the $v \in \mathbb{R}^d$ are the eigenvectors.

The orthogonal projection onto the eigenvectors $v_k$ of the principal components for a test point $x$ can be found as:

$$x' = \sum_{k=1}^{n'} (v_k \cdot x)\, v_k \quad (8)$$

where $n' \le m$ and the $x_j'$ are the vectors after PCA transforms them into the $n'$-dimensional lower space.

Referring to [41,42], the scatter matrix $S$ is defined through the statistical expectation operator:

$$S = E\left[x x^{\mathsf{T}}\right] \quad (9)$$

where $j = 1, \ldots, m$ and $x_j \in \mathbb{R}^d$. Each sample can be expanded in the eigenvector basis as:

$$x_j = \sum_{k=1}^{m} a_{k,j}\, v_k, \qquad a_{k,j} = v_k^{\mathsf{T}} x_j \quad (10)$$

where the $a_{k,j}$ are the projections of the $x_j$ on $v_k$. Combining Equation (9) and Equation (10), and given by [42], the variance $\sigma^2$ is a function of the eigenvector $v_k$:

$$\sigma^2(v_k) = v_k^{\mathsf{T}} S\, v_k \quad (11)$$

Equation (11) can be represented as an eigenvalue problem with nontrivial solutions at a local maximum or minimum, as shown in Equation (12):

$$S v_k = \lambda_k v_k \quad (12)$$

where $\lambda_1 > \lambda_2 > \ldots > \lambda_k > \ldots > \lambda_m$, and $\lambda_k$ is the eigenvalue corresponding to $v_k$. The error $E_{n'}$ in representing the original data $x_j$ by the reduced data $x_j'$ is the sum of the discarded eigenvalues, $E_{n'} = \sum_{k=n'+1}^{m} \lambda_k$. As the eigenvectors $v_1, v_2, \ldots, v_m$ correspond to $\lambda_1 > \lambda_2 > \ldots > \lambda_m$, keeping the lowest-order components, i.e. the eigenvectors with the largest eigenvalues, gives the smallest representation error. This implies the variance is maximal in the directions of the retained eigenvectors.
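The PCA derivation above corresponds to a few lines of NumPy. The sketch below, on synthetic data whose third axis carries almost no variance, checks that the discarded eigenvalue (and hence the representation error) is negligible:

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.
    X: (m, d) data matrix. Returns projected data and sorted eigenvalues."""
    Xc = X - X.mean(axis=0)                  # centre: sum of x_j becomes 0
    C = Xc.T @ Xc / len(Xc)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :n_components]            # keep the top n' eigenvectors
    return Xc @ V, eigvals

rng = np.random.default_rng(0)
# 3-D data that is essentially 2-D: the third direction is tiny noise.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 3)) \
    + 0.01 * rng.normal(size=(500, 3))
Z, eigvals = pca(X, n_components=2)
# The discarded eigenvalues bound the representation error E_{n'}.
print(Z.shape, eigvals[2] / eigvals.sum() < 1e-3)
```

The smallest eigenvalue here accounts for a negligible fraction of the total variance, so projecting to two components loses almost nothing, exactly the situation dimensionality reduction exploits.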

KPCA
The dynamic security problem takes the form of a large number of non-linear differential-algebraic equations. PCA is applicable to linear correlation among the variables, and its performance degrades on non-linear problems. Kernel PCA is more suitable for transforming data confined to a low-dimensional non-linear subspace of the power system [43,44]. It is a non-linear extension of PCA [45], described as follows.

First step: The data $x_i \in \mathbb{R}^d$, which are not linearly separable, are mapped to a feature space $\mathcal{F}$ of higher dimension ($> d$), $\Phi: \mathbb{R}^d \to \mathcal{F}$, so that the new subspace is linearly separable and PCA can be performed in $\mathcal{F}$ for a specific choice of $\Phi$ [46]:

$$x_i \mapsto \Phi(x_i) \quad (13)$$

after which PCA in $\mathcal{F}$ proceeds analogously to Equation (8). The dot product for the Gaussian kernel is:

$$\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \quad (14)$$

Second step: Suppose the vectors $\{\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_m)\}$ defining the feature space $\mathcal{F}$ are centralised, i.e. $\sum_{j=1}^{m} \Phi(x_j) = 0$. The covariance of these vectors is:

$$\hat{C} = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j)\, \Phi(x_j)^{\mathsf{T}} \quad (15)$$

Performing PCA on Equation (15) requires the eigenvalues $\lambda \ge 0$ and nonzero eigenvectors $V \in \mathcal{F}$ that satisfy $\lambda V = \hat{C} V$. All eigenvectors in the feature space $\mathcal{F}$ can be expressed as a linear combination of the $\Phi(x_k)$:

$$V = \sum_{i=1}^{m} \alpha_i\, \Phi(x_i) \quad (16)$$

Substituting Equation (16) into the eigenvalue problem and defining the $m \times m$ Gram matrix $K$, $K_{ij} = \Phi(x_i) \cdot \Phi(x_j)$ (also known as the kernel matrix), simplifies it to:

$$m \lambda \alpha = K \alpha \quad (17)$$

The projection on the normalised eigenvector $V^k$ in the feature subspace $\mathcal{F}$ for a test point $x$ can be found as:

$$V^k \cdot \Phi(x) = \sum_{i=1}^{m} \alpha_i^k\, K(x_i, x) \quad (18)$$

The kernel function $\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j)$, which acts as a dot product satisfying Mercer's theorem for the mapping into the feature space $\mathcal{F}$, reduces the computational cost of processing the non-linear information. Using the Gaussian kernel of Equation (14), the dot product in feature space reduces to a function in the input space.
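The two KPCA steps can be sketched in NumPy; the data here is synthetic, and the Gram-matrix centering term is included because real data does not satisfy the centred-features assumption exactly:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), Equation (14)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def kpca(X, n_components, sigma):
    """Kernel PCA: solve m*lambda*alpha = K*alpha (Equation (17)) on the
    centred Gram matrix, then project the training points."""
    m = len(X)
    K = gaussian_kernel_matrix(X, sigma)
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre the Phi(x_j) in F
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1]
    eigvals, alphas = eigvals[order], eigvecs[:, order]
    # Scale alphas so the feature-space eigenvectors V^k have unit length.
    alphas = alphas[:, :n_components] / np.sqrt(
        np.maximum(eigvals[:n_components], 1e-12))
    return Kc @ alphas  # projections of the training points

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Z = kpca(X, n_components=2, sigma=20.0)
print(Z.shape)
```

All computation happens through the m × m kernel matrix, so the (possibly infinite-dimensional) feature space is never materialised; only the bandwidth and the number of retained components must be chosen.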
Referring to Equation (14), the performance of a Gaussian kernel is highly dependent on the choice of bandwidth $\sigma$. As such, simulations with different bandwidths of the Gaussian kernel were carried out using the stability data of the different operating scenarios described in Section 5. Nine bandwidth sizes were selected: 0.05, 0.75, 1, 10, 20, 40, 60, 80 and 100.
Using ReliefF, the weights of all the features from the simulation of the 9 bus system were calculated and are shown in Table 2 in descending order. Their distinct values indicate their ability to classify stability instances.

WECC 3-machine 9 bus power system
The proposed scratch DenseNet model for transient stability prediction was trained using the data obtained from dimensionality reduction by the ReliefF and KPCA methods described in Section 5. To verify the benefits of the proposed transfer learning approach, the feature extraction and finetuning techniques were repeated with the same training and validation datasets for comparison with the scratch model. The classification results of the different techniques for the above test data are shown in Cases 1 and 2.
To compare against the effectiveness of ReliefF, the data obtained using the KPCA method for non-linear transformation is used to train the scratch DenseNet model.

The selected features listed in Table 3 are reduced to the same dimensions as with ReliefF. The respective validation accuracies are shown in Table 4, indicating that the bandwidth parameter $\sigma = 20$ gives the best validation accuracy of 0.990283. This value is higher than the best validation accuracy of 0.982843 obtained from the simulation data of ReliefF, as shown in Table 5. The bandwidth $\sigma = 20$ was therefore selected for the transfer learning approach.
According to Section 4, the simulation results of the scratch DenseNet model with data from ReliefF and KPCA are compared with the transfer learning approach, as shown in Figures 2 and 3 and Tables 5-7.
The scratch training and validation results of the DenseNet with data from ReliefF and KPCA at Epoch 14/14 are shown in Table 5. This is the scratch model, which does not use the transfer learning technique during training.
As described in Section 3, the proposed transfer learning approach can improve validation accuracy or reduce training completion time compared to the scratch DenseNet model, which does not use transfer learning, as illustrated in Figures 2 and 3. The following two case studies demonstrate the effectiveness of transfer learning on the 9 bus system when the original pretrained dataset comes from ImageNet.
Case 1: Feature extraction
The feature extraction technique of transfer learning is applied to verify whether the training completion time is reduced. The training and validation results of DenseNet with data from ReliefF and KPCA at Epoch 14/14 are shown in Table 6. The feature extraction technique reduces the training completion time from 633 m 24s to 313 m 47s and from 802 m 54s to 310 m 39s for the data obtained from ReliefF and KPCA, respectively. As previously described, if the feature dataset of the power system is used to pretrain the DenseNet before feature extraction of transfer learning is performed, the approach becomes traditional machine learning, and the validation accuracy is close to that of the pretrained DenseNet.
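The feature extraction technique can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact code: a tiny `nn.Sequential` stands in for the DenseNet backbone (in practice one would load `torchvision.models.densenet121` with its ImageNet-pretrained weights), and the two-class head for the stable/unstable label is an assumed design.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the ImageNet-pretrained DenseNet backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Feature extraction: freeze every backbone weight so the transferred
# features are kept fixed and only the new classifier head is trained.
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(8, 2)  # binary transient-stability classifier (assumed)
model = nn.Sequential(backbone, head)

# Only the head's parameters reach the optimiser, which is why training
# completes much faster than training the whole network from scratch.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 8*2 weights + 2 biases = 18
```

Because the frozen backbone contributes no gradients, each epoch updates only the small head, which is consistent with the roughly halved training completion times reported in Table 6.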
Case 2: Finetuning
The finetuning technique of transfer learning is applied to verify whether the validation accuracy is improved. The training and validation results of DenseNet with data from ReliefF and KPCA at Epoch 14/14 are shown in Table 7.
Thus, the best validation accuracies are increased from 0.982843 to 0.995098 and from 0.990283 to 0.995951 for the data obtained from ReliefF and KPCA, respectively. In Figures 2 and 3, the red dashed line shows the very positive effect of transferring features from the ImageNet-trained model to the DenseNet model and finetuning them. It generalises better than the scratch DenseNet model, which does not use the transfer learning technique. It not only enables training without overfitting on small target datasets but also boosts generalisation performance even when the target dataset is as small as 4900 OPs.
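The finetuning variant can be sketched in the same style. Again this is an illustrative assumption, with a tiny network standing in for the pretrained DenseNet: the difference from Case 1 is that all layers remain trainable, typically with a small learning rate so the transferred features are only nudged rather than overwritten.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in for the pretrained DenseNet; in practice the ImageNet
# weights of torchvision's densenet121 would be the starting point.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)

# Finetuning: no layer is frozen, but the learning rate is kept small
# (value assumed here) so the pretrained features are gently adapted.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# One illustrative update step on a dummy batch of 2D-histogram images.
x = torch.randn(4, 3, 16, 16)
y = torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()
```

Updating every layer costs more time per epoch than feature extraction, but it lets the whole network adapt to the target distribution, which matches the higher best validation accuracies reported in Table 7.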

New England 39 bus power system
The proposed transfer learning approach, which uses ImageNet as the pretrained dataset, is further validated on the New England 39 bus power system. The simulation results, shown in Tables 8-10 and Figures 4-6, confirm the findings of Cases 1 and 2: the feature extraction technique reduces the training completion time, and the finetuning technique improves the validation accuracy.
In the scenario of bus #3, the feature extraction technique reduces the training completion time from 373 m 27s to 163 m 7s (43.68% of the scratch time) at Epoch 29/29. The best validation accuracy also increases from 0.616 to 0.696. The finetuning technique improves the validation accuracy from 0.616 to 0.824, as shown in Table 8 and Figure 4.
In the scenario of bus #19, the feature extraction technique reduces the training completion time from 354 m 50s to 155 m 16s (43.76% of the scratch time) at Epoch 29/29. The finetuning technique improves the validation accuracy from 0.936 to 0.952, as shown in Table 9 and Figure 5.
In the scenario of bus #22, the feature extraction technique reduces the training completion time from 347 m 18s to 158 m 47s (45.72% of the scratch time) at Epoch 29/29. The finetuning technique improves the validation accuracy from 0.848 to 0.872, as shown in Table 10 and Figure 6.
The simulation results using the dataset obtained from the New England 39 bus system, as shown in Tables 8-10 and Figures 4-6, further validate the findings of Section 6.1 for the feature extraction and finetuning techniques. In the scenario of bus #3, no transmission line is tripped as a protection measure. In the scenarios of bus #19 and #22, transmission lines are tripped, respectively. On a large, complex power system, it is general practice to trip the fault-related transmission lines to isolate the fault location and stabilise the power system under severe fault conditions. The scenarios of bus #19 and #22, which achieve high validation accuracy, are therefore more applicable to industrial practice.

CONCLUSION
The simulation results of transient stability prediction, supported by the feature extraction and finetuning techniques with the pretrained ImageNet dataset and the target dataset obtained from the WECC 9 bus system after dimensionality reduction by ReliefF and KPCA, are shown in Figures 2 and 3 and Tables 5-7. The results illustrate that, when the transfer learning approach is applied, the feature extraction technique reduces the training completion time and the finetuning technique improves the validation accuracy. They also demonstrate that the finetuning technique maintains stronger convergence ability, with higher validation accuracy, than either the feature extraction technique or the scratch DenseNet model. The simulation results using the target dataset obtained from the New England 39 bus system with the pretrained ImageNet dataset, as shown in Tables 8-10 and Figures 4-6, further validate the above conclusion.
Although ReliefF and KPCA present the target dataset in different feature spaces, the DenseNet model with the finetuning technique achieves high validation accuracies of 0.995098 and 0.995951 in Section 6.1. The feature extraction technique reduces the training completion time to 49.54% and 38.69% of the scratch time, respectively.
Besides, the results demonstrate that KPCA is more effective than ReliefF at processing features for dimensionality reduction. The DenseNet also successfully extracts features from the 2D histogram images of simulated OPs for further processing of transient stability assessment. The proposed approach works for pre-fault DSA, and the procedure is also applicable to post-fault DSA.
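The conversion of simulated OPs into 2D histogram images can be sketched with NumPy's `histogram2d`. The two quantities binned here (per-unit voltages and rotor angles) and the 32-bin resolution are assumptions for illustration; the paper's exact channel layout may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical simulated trajectories for one operating point.
voltage = rng.normal(1.0, 0.05, size=1000)   # per-unit bus voltages
angle = rng.normal(0.0, 10.0, size=1000)     # rotor angles in degrees

# Bin the joint distribution into a 32x32 grid and normalise to [0, 1],
# yielding a grey-scale "image" a DenseNet can consume.
H, _, _ = np.histogram2d(voltage, angle, bins=32)
img = H / H.max()
print(img.shape)  # (32, 32)
```

Stacking such histograms for several variable pairs would give the multi-channel image input expected by an ImageNet-pretrained network.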