1. Introduction
The electric power system is the most complex energy transport system built to date. It is presently undergoing a rapid and intensive transformation as it transitions toward a renewable and sustainable future. The most important aspect of that transformation is the accelerating change in the mix of generating resources, which reduces the proportion of conventional power plants (with large synchronous generators) while increasing the share of inverter-based resources (IBRs), i.e., wind and solar power plants. This change in the generation mix represents a fundamental shift in the operational characteristics of the power system, with potentially significant reliability implications for the electric grid [1]. Namely, generating resources need to provide, on an ongoing basis, voltage control, frequency support, and ramping capability in order to balance and maintain the grid [2]. However, the vast majority of IBRs today do not participate in this support. At the same time, generating resources need to provide system stability during transient events brought on by faults (i.e., short circuits), severe weather, and other internal and external disturbances. On this front, the new resource mix will significantly (and irreversibly) reduce power system inertia, thereby exacerbating the power system transient stability assessment (TSA) problems of tomorrow. The response of power grids with low system inertia to disturbances necessitates new approaches for dealing with TSA issues [3]. The electric grid is also under increasing strain from the growing share of electric vehicles, which increase the overall grid load and electricity demand. These combined stresses on the electric power system are without precedent and pose considerable technical challenges that need to be successfully tackled if the electrification of the transport sector, and the transition to renewable energy, is to be realized.
These developments are accompanied by further innovations in the power sector, one of which is the introduction of wide-area monitoring systems (WAMSs) to the electric grid, based on extensive communication infrastructure and distributed phasor measurement units (PMUs). WAMSs bring a “big data” paradigm into the operational portfolio of transmission system operators and provide novel avenues for tackling various power system problems, from load flow, state estimation, and relay protection to transient analyses [4,5]. Furthermore, electric power systems are integrating an increasing number of power electronic components (FACTS and AC/DC devices), thereby changing the response characteristics of the system to various disturbances [6]. All these aspects set new challenges and, at the same time, provide opportunities for novel technical solutions. Namely, some well-known traditional solutions, particularly in the domain of transient stability analysis, have acknowledged weaknesses that may compromise the security of the bulk power system under the increased stress emanating from the accelerated pace of its ongoing transformation.
The transient stability analysis of the electric power system is currently the focus of intense research [7], as it is becoming increasingly obvious that traditional solutions are not future-proof and are poorly adjusted to the rapid changes in modern systems. Neither time-domain simulations, direct methods (equal-area and Lyapunov methods), nor various hybrid schemes are able to deliver solutions that satisfy the reliability requirements of modern power systems [8]. Hence, a new breed of methods is slowly emerging, based on artificial intelligence and WAMS data troves.
Machine learning (ML) and its subset of deep learning have already been applied to the TSA problem, with promising results on benchmark test cases. For example, Tan et al. used a deep imbalanced learning framework for the TSA of a power system [9]. Liu et al. introduced transient stability assessment models based on ensemble learning with network topology [10], distance metric learning [11], and kernel regression. Su et al. proposed a probabilistic stacked denoising autoencoder for power system transient stability prediction with wind farms [12]. An autoencoder is a special kind of artificial neural network (ANN) that can be trained using an unsupervised learning approach. Tian et al. introduced a transient stability boundary constructed using the broad learning approach [13]. Ren and Xu recommended a method for the TSA based on generative adversarial networks, where two ANNs contested with each other in an unsupervised manner [14]. Also, Ren et al. employed transfer learning for the transient stability analysis [15]. Transfer learning is the ability to reuse ANNs pretrained on one (general) task and repurpose them (with fine-tuning) for different but related (specialized) tasks. Li et al. suggested an intelligent TSA framework with continual learning ability [16]. An ensemble of decision trees was employed in [17,18,19] to tackle the TSA problem. The support vector machine (SVM) is a popular ML approach used by many authors for tackling the TSA problem, e.g., [20,21,22]. In addition, Bashiri et al. introduced a learning framework for the TSA that relied on a twin convolutional SVM [23].
Inherent in the deep learning approach is the fact that the classical feature engineering step of the ML model-building process is absent, replaced by the automatic learning of features by the deep model itself during training [24]. This removes the need for expert knowledge and instead relies on deep ANN architectures to learn useful features. However, considerable preprocessing of the data (signals) is still needed for any deep learning model, even without classical feature engineering. Another important aspect of the TSA problem, which any machine learning model needs to tackle, is the fact that the data space is multidimensional, often with many thousands of features or more. This poses a challenge to training ML models, and some kind of dimensionality reduction technique or embedding process is typically employed as part of the model training. For example, Chen [25] applied an indirect principal component analysis (PCA) as a dimensionality reduction technique for a power system transient stability assessment. Bellizio et al. proposed a causality-based feature selection [26]. Gnanavignesh and Shenoy introduced a relaxation-based spatial-domain decomposition method [27]. Arvizu et al. used a diffusion maps approach for dimensionality reduction in transient simulations [28]. Still other signal feature extraction and dimensionality reduction techniques have been proposed, e.g., [29,30].
This paper is concerned with examining the use of manifold learning (as part of the wider ML landscape) for power system TSA, based on WAMS data and power system simulations. Manifold learning is an approach to the nonlinear dimensionality reduction of data through a low-dimensional embedding that preserves maximum information. The paper features both supervised and unsupervised learning with different embedding methods in order to demonstrate their feasibility in the domain of power system transient stability analysis with WAMS-PMU data. Namely, our overview of the published peer-reviewed research did not reveal comprehensive comparisons of different embedding methods for power system TSA with PMU data. Hence, we believe that this paper fills that need and brings new insights to the application of manifold learning to the power system TSA problem. The following methods are examined [31]: (1) principal component analysis, (2) kernel PCA (kPCA), (3) truncated singular value decomposition (tSVD), (4) isomap embedding, (5) locally linear embedding (LLE) and its variants, (6) spectral embedding, (7) multidimensional scaling (MDS), and (8) t-distributed stochastic neighbor embedding (t-SNE). These methods are applied to the WAMS-PMU data derived from the IEEE New England 39-bus test-case benchmark power system [32]. The model hyperparameters are optimized using cross-validated metrics and a custom-tailored simulated-annealing stochastic optimization algorithm. When supervised learning is applied, it is based on a support vector machine classifier.
Our research hypothesis is that a low-dimensional embedding derived from the WAMS-PMU data (features) preserves information and can detect power system TSA events in both supervised and unsupervised settings. However, it is shown that not all embedding methods produce equally good results on the TSA data, and that some are better suited than others. We find that a combination of kPCA with the SVM classifier provides good overall results in the domain of supervised learning. The same can be said about the isomap embedding and MDS when used with the SVM. However, unlike kPCA, the embeddings produced by isomap and MDS are not divisible into separate clusters, which makes them less favorable. We also find that LLE and its variants cannot produce useful embeddings with these data. Furthermore, where unsupervised learning is concerned, we find that only kPCA is able to generate separable clusters. Finally, we find that the SVM classifier trained on top of several of the tested embeddings is able to detect power system disturbances from the WAMS-PMU data.
The paper is organized in the following manner. Section 2 first introduces, in Section 2.1, the dataset that emanates from the IEEE New England test power system simulations and describes the steps of its processing pipeline. Next, in Section 2.2, it presents a simulated annealing stochastic optimization algorithm and its application to hyperparameter tuning. Section 2.3 introduces unsupervised and supervised learning for the power system transient stability assessment, with different embedding methods. The results of applying the models to the IEEE New England dataset are provided in Section 3, followed by a discussion in Section 4 and a conclusion in Section 5.
2. Materials and Methods
The materials part introduces the dataset and describes its processing steps: statistical sampling, feature engineering, shuffling, splitting, and normalizing. The methods part first describes the simulated annealing algorithm and then briefly introduces the embedding models. Next, it provides details on the training process (i.e., supervised and unsupervised learning), including the hyperparameter optimization and model scoring.
2.1. Power System Stability Dataset
The power system stability dataset was produced from an extensive set of time-domain simulations of the IEEE New England 39-bus benchmark test power system, with varying degrees of loading and three different short-circuit types scattered throughout the network. Further details on the benchmark power system, time-domain simulations, and dataset construction can be found in [33]. The full 3.8 GB dataset of PMU signals was deposited on Zenodo with a CC BY license [34]. These raw signals were further processed according to the pipeline graphically depicted in Figure 1. It consisted of the following steps: statistical processing of the data cases (from the systematic into the stochastic domain), extraction of features from the time-domain signals, splitting of the data into training and test sets, and scaling. The processing of data cases by random sampling without replacement followed a probability distribution of short-circuit events found in actual power systems [33]. The splitting strategy used a stratified shuffle split in order to preserve the class imbalance between stable (majority) and unstable (minority) cases. Scaling standardized the training and test datasets by removing the mean and scaling to unit variance.
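As a concrete illustration of the splitting and scaling steps, the following minimal sketch (assuming a scikit-learn implementation; the feature matrix X, label vector y, test-set fraction, and random seed are placeholders) reproduces the stratified shuffle split and standardization described above.

```python
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, test_size=0.2, seed=0):
    """Stratified shuffle split that preserves class imbalance, followed by standardization."""
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y))
    scaler = StandardScaler()                     # removes the mean and scales to unit variance
    X_train = scaler.fit_transform(X[train_idx])  # the scaler is fit on the training set only
    X_test = scaler.transform(X[test_idx])        # the same transformation is applied to the test set
    return X_train, X_test, y[train_idx], y[test_idx]
```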
Table 1 presents selected features extracted from the time-domain PMU signals. Each time-domain signal was represented by only two points in time, based on the prefault and postfault values of the measured quantity. It should be stated here that, depending on the fault location within the network, the coordination between generator out-of-step protection and the distance relay protection of the incident transmission lines should determine the time instants used for the feature selection. The labeling of data cases as stable or unstable was carried out by means of the power system transient stability index (TSI) [8,33]. The statistically processed dataset with 350+ features was also deposited on Zenodo with a CC BY license [35].
It should be mentioned that we used expert knowledge in selecting these features and that we had already removed some redundant features (e.g., one of the three phases, all phase angle values, etc.). Notwithstanding that, the number of features was still quite large for this small power system. More importantly, it grows considerably with the number of buses and generators in the system and could easily become difficult to manage and process (i.e., the curse of dimensionality). Hence, the main obstacle to the successful training of ML models was controlling and/or reducing this large number of features without a significant loss of information.
2.2. Simulated Annealing Algorithm
Simulated annealing is a general-purpose stochastic optimization algorithm that usually consists of four main parts [36]: (1) an objective function that is being optimized, (2) a perturbation method that is used for generating candidate solutions, (3) an acceptance criterion, and (4) a temperature schedule (i.e., cooling). Our improvements on the classical algorithm consist of a burn-in period, a multidimensional random walk based on the Student's t-distribution as the perturbation method, and an early stopping criterion. This combination of burn-in with early stopping balances exploration and exploitation while reducing the possibility of overfitting.
The objective function in our case constructed the model for which the hyperparameters were being optimized and returned a cross-validated metric by which it was scored.
We employed an n-dimensional random walk as the perturbation method, where n is the number of the model's hyperparameters. For that purpose, we drew uncorrelated random samples from a multivariate statistical distribution in the following manner. During the initial phase of the algorithm's exploration of the search space (i.e., during a select number k of initial iterations, i ≤ k), random samples were drawn from the multivariate Student's t-distribution as follows:

$$\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \quad u \sim \chi^2_{\nu}, \quad \mathbf{w} = \mathbf{z}\sqrt{\nu/u}, \qquad (3)$$

with z being a random 1D vector drawn from the multivariate normal distribution, where Σ is the diagonal matrix of covariances of the hyperparameters (treated as uncorrelated random variables); u is a random variable drawn from the χ² distribution with ν degrees of freedom, and w is the resulting random 1D vector from the multivariate Student's t-distribution with ν degrees of freedom. This phase of the algorithm (where i ≤ k) is called the burn-in period, and it emphasizes the exploration of the search space (by using low values of the ν parameter). After the burn-in period, the samples for the subsequent steps (i > k) are drawn directly from the multivariate normal distribution, but (generally) with a lower variance, as follows:

$$\mathbf{w} = c\,\mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \qquad (4)$$

where c is the factor by which the initial standard deviations (that form the diagonal covariance matrix) are scaled. If there is a statistical correlation between hyperparameters, that information can be readily introduced through the covariance matrix itself.
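The burn-in draws can be generated with the standard construction of a multivariate Student's t sample from a normal vector and a chi-squared variate, which is what the reconstruction of Equation (3) above assumes. A minimal NumPy sketch (with illustrative parameter values) is as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbation(sigmas, nu=3.0, scale=0.1, burn_in=True):
    """Random-walk step w for the hyperparameter search (Equations (3) and (4))."""
    z = rng.normal(0.0, sigmas)          # z ~ N(0, diag(sigmas**2)), uncorrelated hyperparameters
    if burn_in:                          # Equation (3): heavy-tailed Student's t step
        u = rng.chisquare(nu)
        return z * np.sqrt(nu / u)
    return scale * z                     # Equation (4): normal step with scaled-down deviations
```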
Hence, for the ith iteration, the random walk takes the step x_i = x_(i−1) + w, where w comes from Equation (3) if i ≤ k or from Equation (4) if i > k; x is the 1D vector holding the (either new or old) coordinates of the hyperparameter values inside the multidimensional search space. It can be seen that the burn-in phase (the first k iterations of the algorithm) takes longer steps (drawn from the Student's t-distribution, which has more mass in the tails than a normal distribution) and is thus able to explore more of the search space. After the burn-in period, the step size is reduced and, in addition, the starting position of the vector x at the beginning of iteration k + 1 is taken from the best coordinates found during the burn-in period of the previous k iterations. This starts the exploitation phase. Furthermore, bounds can be imposed on the hyperparameters within the search space, such that each variable x_j from the 1D vector x can have upper and lower bounds, i.e., x_low,j ≤ x_j ≤ x_up,j, where x_low,j is the lower bound and x_up,j is the upper bound. If, during the random walk, a step takes any variable x_j outside its imposed bounds, that step is reversed, i.e., x_j ← x_j − w_j if x_j + w_j > x_up,j, and x_j ← x_j − w_j if x_j + w_j < x_low,j. In other words, if the step brings a variable above the upper bound, it is reversed to the left; alternatively, if the step brings it below the lower bound, it is reversed to the right.
The acceptance criterion is based on the Metropolis algorithm [31], where the difference between two successive evaluations of the objective function is computed as ΔE = E_new − E_old. If ΔE < 0, the new solution is readily accepted. On the other hand, the case ΔE ≥ 0 is treated probabilistically. First, a random number r is drawn from the uniform distribution U(0, 1). Then, the probability that the new solution is accepted is derived from the Boltzmann probability distribution as follows:

$$p = \exp\!\left(-\frac{\Delta E}{T}\right),$$

where T is the current temperature. It can be seen that this probability is related to the temperature and that it decreases as the system state cools down. The new solution is then accepted only if p > r; otherwise, it is rejected. As can be seen, this last step is stochastic and allows us to accept inferior solutions at the beginning, when the temperature is still high, which helps to explore the search space. Later, when the temperature has decreased, the acceptance of inferior solutions is far less probable.
The temperature schedule (i.e., cooling) followed an exponential law, where the temperature for the ith iteration is determined as follows:

$$T_i = T_0 \exp\!\left(-\frac{i}{\tau}\right),$$

where T_0 is the initial temperature and τ is a constant that determines the cooling time. The total number of iterations is determined directly from the cooling schedule, i.e., the algorithm proceeds until the system has cooled down to a certain predetermined temperature.
After each iteration step, the accepted solution is compared to the best solution found thus far and takes its place if it has a lower value (i.e., is closer to the optimum). Furthermore, when the improvement of the best solution between iterations satisfies |ΔE_best| ≤ ε, where ε is an arbitrarily small number, the algorithm is not making progress toward convergence. This is monitored, and if there is no progress after a certain number of iterations (i.e., a waiting period), the algorithm is terminated under the early stopping criterion.
The complete simulated annealing algorithm, described in pseudocode, is depicted in Algorithm 1.
Algorithm 1: Simulated annealing algorithm with burn-in and early stopping.

    input: k, ν, Σ, c, T_0, τ, T_min, x_low, x_up, stop
    x ← x_0; E ← f(x); x_best ← x; E_best ← E              ▹ initial values; f(·) is the objective function
    i = 0; wait = 0
    while T > T_min do
        if i ≤ k then                                       ▹ burn-in phase
            z ~ N(0, Σ); u ~ χ²_ν; w ← z·sqrt(ν/u)          ▹ Equation (3)
        else
            z ~ N(0, Σ); w ← c·z                            ▹ Equation (4)
        x_new ← x + w                                       ▹ random walk
        for j = 1, …, n do
            if x_new,j < x_low,j then x_new,j ← x_j − w_j   ▹ below the lower bound
            else if x_new,j > x_up,j then x_new,j ← x_j − w_j  ▹ above the upper bound
        E_new ← f(x_new); ΔE ← E_new − E
        if ΔE < 0 then
            x ← x_new; E ← E_new
        else                                                ▹ Metropolis acceptance
            r ~ U(0, 1); p ← exp(−ΔE/T)
            if p > r then
                x ← x_new; E ← E_new
        T ← T_0·exp(−i/τ)                                   ▹ cooling schedule
        if E < E_best then
            x_best ← x; E_best ← E; wait = 0
        else
            wait += 1
            if wait > stop then
                break                                       ▹ early stopping
        if i = k then
            x ← x_best; E ← E_best                          ▹ exploitation starts from the burn-in best
        i += 1
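A compact Python rendering of Algorithm 1 is sketched below; it follows the description in this section, but the objective function, bounds, and all numerical parameter values are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def simulated_annealing(objective, x0, sigmas, lower, upper, k=50, nu=3.0,
                        scale=0.1, T0=1.0, tau=100.0, T_min=1e-3, stop=30, seed=0):
    """Minimize `objective` with burn-in, Metropolis acceptance, and early stopping."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    E = objective(x)
    x_best, E_best = x.copy(), E
    i, wait, T = 0, 0, T0
    while T > T_min:
        if i <= k:                                      # burn-in: Student's t steps, Equation (3)
            z = rng.normal(0.0, sigmas)
            w = z * np.sqrt(nu / rng.chisquare(nu))
        else:                                           # exploitation: smaller normal steps, Equation (4)
            w = scale * rng.normal(0.0, sigmas)
        x_new = x + w
        out = (x_new < lower) | (x_new > upper)         # reverse any step that leaves the bounds
        x_new[out] = x[out] - w[out]
        E_new = objective(x_new)
        dE = E_new - E
        if dE < 0 or np.exp(-dE / T) > rng.uniform():   # Metropolis acceptance criterion
            x, E = x_new, E_new
        if E < E_best:                                  # track the best solution found so far
            x_best, E_best, wait = x.copy(), E, 0
        else:
            wait += 1
            if wait > stop:                             # early stopping: no progress for `stop` iterations
                break
        if i == k:                                      # exploitation starts from the burn-in best
            x, E = x_best.copy(), E_best
        T = T0 * np.exp(-i / tau)                       # exponential cooling schedule
        i += 1
    return x_best, E_best
```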
2.3. Embedding Models
The following embedding methods, as already mentioned, are briefly introduced hereafter: PCA, kPCA, tSVD, isomap embedding, LLE and its variants (modified and Hessian), spectral embedding, MDS, and t-SNE. Interested readers are advised to consult [31] for more information and additional mathematical background on these methods.
2.3.1. PCA, kPCA, and tSVD
Principal component analysis is a well-known method for decomposing a multivariate dataset into a set of orthogonal components (i.e., principal components). It employs a linear projection and relies on the SVD algorithm. Kernel PCA is a direct extension of PCA that introduces nonlinear dimensionality reduction through the use of kernels (i.e., the kernel trick); see [31] for more information. Different kernels can be applied, but we examined only the radial basis function (RBF) kernel. Truncated SVD is another method closely related to PCA. As the name suggests, it uses a singular value decomposition of the feature matrix and retains only its largest singular values, thereby reducing the dimensionality of the feature space [31]. If k is the dimensionality of the projection, then a truncated SVD applied to the feature matrix X produces a low-rank approximation

$$\mathbf{X}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^{\mathsf{T}},$$

where Σ_k holds the top k singular values (on its main diagonal), while U_k and V_k hold, respectively, the left- and right-singular vectors.
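As an illustration of this low-rank truncation, the short NumPy sketch below forms X_k and the corresponding k-dimensional sample representation directly from a full SVD (scikit-learn's TruncatedSVD yields the same projection without densifying sparse inputs); the function name is illustrative.

```python
import numpy as np

def truncated_svd(X, k=2):
    """Rank-k truncation X_k = U_k * S_k * V_k^T and the k-dimensional embedding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]     # low-rank approximation of the feature matrix
    embedding = U[:, :k] * s[:k]          # coordinates of the samples in the reduced space
    return X_k, embedding
```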
2.3.2. Isomap Embedding
Isomap embedding can be seen as an extension of kernel PCA. It seeks a lower-dimensional embedding that maintains the geodesic distances between all data points, through the following three-step process [37]: (1) construct a neighborhood graph from the distances between data points, (2) compute the shortest paths on the neighborhood graph, and (3) construct a low-dimensional embedding from the partial eigenvalue decomposition. The nearest-neighbor search is based on the ball-tree algorithm [38]. The embedding is encoded in the eigenvectors corresponding to the largest eigenvalues of the isomap kernel.
2.3.3. Locally Linear Embedding
Locally linear embedding is yet another extension of PCA, which seeks a lower-dimensional projection of the data that preserves distances within local neighborhoods. This method can be viewed as an application of a series of local PCAs, which are then globally compared to find the best nonlinear embedding [38]. The method also comprises a three-step process, as follows [39]: (1) assign a neighborhood to each data point based on local distances, (2) compute the weights W_ij that best linearly reconstruct each data point from its neighbors, and (3) compute the low-dimensional embedding from these weights W_ij, by finding the smallest eigenmodes of the sparse symmetric matrix

$$\mathbf{M} = (\mathbf{I} - \mathbf{W})^{\mathsf{T}}(\mathbf{I} - \mathbf{W}),$$

where the weight W_ij is nonzero only if x_j is a neighbor of x_i and 0 otherwise. Although it relies only on linear algebra, since the data points are reconstructed only from their neighbors, this method can still produce highly nonlinear embeddings. If one were to use multiple weight vectors in each neighborhood (instead of a single one), that would result in the so-called modified LLE method. Furthermore, if the locally linear structure were to be recovered by means of a Hessian-based quadratic form, that would result in the so-called Hessian eigenmapping. This is a variant of LLE also known as Hessian-based LLE; see [40] for more information.
2.3.4. Spectral Embedding
Spectral embedding with Laplacian eigenmaps is yet another nonlinear embedding method, closely related to the locally linear embedding discussed above. It preserves local information and implicitly emphasizes the natural clusters in the data. This method constructs a low-dimensional representation of the data using a spectral decomposition of the graph Laplacian. It also consists of three individual steps, as follows [41]: (1) construct the adjacency graph from the nearest neighbors, (2) weight the graph edges by forming the matrix W, where W_ij = 1 if vertices i and j are connected by an edge and W_ij = 0 otherwise, and (3) compute the eigenvalues and eigenvectors of the generalized eigenvalue problem of the form

$$\mathbf{L}\mathbf{f} = \lambda \mathbf{D}\mathbf{f},$$

where D is the diagonal weight matrix with entries D_ii = Σ_j W_ji and L is the Laplacian matrix L = D − W.
2.3.5. Multidimensional Scaling
The central concept in multidimensional scaling is the dissimilarity data δ_ij, expressed through distances in geometric spaces. The MDS algorithm minimizes the so-called stress function for the configuration of points x_1, …, x_n in t-dimensional space, which is described by the following root-mean-square residual:

$$S = \sqrt{\frac{\sum_{i<j}\left(d_{ij} - \hat{d}_{ij}\right)^2}{\sum_{i<j} d_{ij}^{\,2}}},$$

where d_ij are the distances between points x_i and x_j, and the d̂_ij are those values that minimize S, subject to the constraint that the d̂_ij have the same rank order as the dissimilarity measures δ_ij. The stress function is seen as a measure of how well the low-dimensional representation matches the data. The minimization of this multiparameter stress function can be tackled by means of steepest descent or some other suitable optimization algorithm; see [42] for more information.
2.3.6. t-Distributed Stochastic Neighbor Embedding
The t-SNE method is based on converting the affinities of data points into probabilities, where the affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student's t-distributions. The t-SNE method then minimizes the Kullback–Leibler (KL) divergence between the joint probability distribution P in the high-dimensional space and the joint probability distribution Q in the low-dimensional space [43]:

$$\mathrm{KL}(P\,\|\,Q) = \sum_{i}\sum_{j \neq i} p_{ij}\log\frac{p_{ij}}{q_{ij}},$$

with p_ii = 0 and q_ii = 0. The pairwise similarities for the low-dimensional map y_1, …, y_n are given by:

$$q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_{k}\sum_{l \neq k}\left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}},$$

and for the high-dimensional map it is stated that p_ij = (p_{j|i} + p_{i|j})/(2n), where p_{j|i} is the conditional probability [43]

$$p_{j|i} = \frac{\exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)},$$

scaled by the variance σ_i² of the Gaussian that is centered on the data point x_i.
The KL divergence is minimized by means of the steepest descent algorithm [43]. The t-SNE method focuses on the local structure of the data and tends to extract clustered local groups more strongly than the other previously mentioned algorithms. This ability to group samples based on the local structure might be beneficial in some circumstances.
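For reference, the embedding methods compared in this work can all be instantiated as scikit-learn transformers that map the scaled feature matrix to two components; the sketch below lists them in one place, with placeholder hyperparameter values standing in for those found by the simulated annealing.

```python
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.manifold import (Isomap, LocallyLinearEmbedding, SpectralEmbedding,
                              MDS, TSNE)

embeddings = {
    "PCA": PCA(n_components=2),
    "kPCA": KernelPCA(n_components=2, kernel="rbf", gamma=0.1,
                      fit_inverse_transform=True),
    "tSVD": TruncatedSVD(n_components=2),
    "isomap": Isomap(n_components=2, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10, method="standard"),
    "modified LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                           method="modified"),
    "Hessian LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                          method="hessian"),
    "spectral": SpectralEmbedding(n_components=2, n_neighbors=10),
    "MDS": MDS(n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=30.0),
}
# Each estimator maps the scaled feature matrix to a 2D embedding, e.g.:
# Z = embeddings["kPCA"].fit_transform(X_train)
```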
2.4. Model Training with Supervised and Unsupervised Learning
Producing a low-dimensional embedding from high-dimensional data entails training the associated model by means of unsupervised learning. However, some models (such as PCA) can be trained using supervised learning as well; see [31] for more information. In addition, a classifier can be trained, using supervised learning, on top of the embedding produced by any of the above-introduced (dimensionality reduction) models. This essentially establishes a pipeline consisting of the embedding model (as a transformer) and a classifier. We used a support vector machine (SVM) with an RBF kernel as the classifier. The entire process of model training (both supervised and unsupervised) is graphically represented in Figure 2.
As can be seen from Figure 2, the unsupervised learning loop uses only the data features (matrix X), while the supervised learning loops additionally use the targets (vector y) that are fed to the classifier. In addition, supervised learning can optimize the entire pipeline (outer loop) or only the classifier (inner loop). The hyperparameter optimization (of both the embedding method and the SVM classifier) was carried out using the same simulated annealing algorithm described in Section 2.2. Only the objective function differed, as explained next.
2.4.1. Unsupervised Learning
Unsupervised learning used only the features of the dataset and no targets when optimizing the hyperparameters of the embedding models. We demonstrated this approach using two different embedding models: (a) kPCA and (b) t-SNE.
- (a)
The kPCA model has only one hyperparameter, the γ value of the underlying RBF kernel. The objective function here is the mean squared error (MSE). Namely, kPCA first constructs a low-dimensional embedding and then reconstructs the original high-dimensional data by the inverse transform. The MSE between the original data and their reconstruction was minimized using the simulated annealing algorithm described in Section 2.2 (a code sketch of this objective is provided after this list).
- (b)
The t-SNE model has one important hyperparameter, the perplexity (P) [38]. The objective function here is based on the Kullback–Leibler divergence and can be described by the following relation [44]:

$$S(P) = 2\,\mathrm{KL} + \log(n)\,\frac{P}{n},$$

where KL is the Kullback–Leibler divergence (between the high- and low-dimensional joint distributions), P is the perplexity, and n is the number of samples. This function was minimized using the simulated annealing algorithm described in Section 2.2 (both objectives are sketched below).
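The two objective functions can be written, for illustration, as plain Python functions that Algorithm 1 minimizes over γ and the perplexity, respectively; this sketch assumes scikit-learn, the two-component setting and variable names are illustrative, and the fitted TSNE estimator exposes its final KL divergence as the kl_divergence_ attribute.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE

def kpca_objective(gamma, X):
    """(a) Reconstruction MSE of a 2D kPCA embedding with an RBF kernel."""
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    Z = kpca.fit_transform(X)               # low-dimensional embedding
    X_hat = kpca.inverse_transform(Z)       # reconstruction in the original space
    return np.mean((X - X_hat) ** 2)

def tsne_objective(perplexity, X):
    """(b) Perplexity-selection criterion built on the final KL divergence."""
    tsne = TSNE(n_components=2, perplexity=perplexity)
    tsne.fit_transform(X)
    n = X.shape[0]
    return 2.0 * tsne.kl_divergence_ + np.log(n) * perplexity / n
```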
Unsupervised learning can be extended a step further by applying clustering on top of the low-dimensional embedding; see, for example, [45]. We used K-means clustering to separate the low-dimensional space of samples into stable and unstable cases [38]. Since we knew a priori that there are only two classes (stable and unstable), we tasked the K-means clustering with (hopefully) finding these two clusters within the embedding space (a minimal sketch follows below).
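A minimal sketch of this fully unsupervised variant, assuming scikit-learn and a placeholder γ value, is given below; it returns the 2D kPCA embedding together with the two-cluster K-means assignment of each sample.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA

def unsupervised_tsa(X, gamma=0.1):
    """2D kPCA embedding followed by two-cluster K-means (stable vs. unstable)."""
    Z = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X)
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(Z)   # 0/1 cluster label per sample
    return Z, clusters
```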
2.4.2. Supervised Learning
Supervised learning used an SVM binary classifier with a radial basis function (RBF) kernel and class-weight balancing for classifying cases into stable and unstable ones. An SVM constructs a decision function of the form:

$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i \in \mathrm{SV}} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b\right),$$

where the summation over the sample pairs (x_i, y_i) runs over the set of support vectors (SV), while K(x_i, x) is the kernel:

$$K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert^2\right).$$

The unknown coefficients α_i and b are determined by solving the following optimization problem [31]:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\left(\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0,$$

where C is the penalty that acts as an inverse regularization parameter, ξ_i is a slack variable, and n is again the number of samples. The penalty (C) and the RBF kernel coefficient (γ) are the two crucial hyperparameters of the SVM that need to be optimized.
The supervised loop can be applied either to the SVM classifier alone or to the entire pipeline (see Figure 2). In the case where it is applied to the classifier alone, there are two hyperparameters that need to be optimized: the penalty C and the kernel coefficient γ. In this case, the classifier "sits" on top of the embedding. The objective function here was the negative value of the cross-validated F1-metric as a classification score. On the other hand, if the supervised loop is applied to the entire pipeline, it simultaneously optimizes the hyperparameters of the embedding model in addition to those of the SVM. We demonstrated this approach on a pipeline consisting of kPCA as the embedding model (with an RBF kernel) and an SVM classifier (also with its own RBF kernel). In this case, there were a total of three hyperparameters (one for the kPCA and two for the SVM). Kernel PCA has its own γ parameter for the RBF kernel, which is independent of and unrelated to that of the SVM classifier. All three hyperparameters were optimized (by means of the simulated annealing) at the same time. The objective function was again the negative value of the cross-validated F1-metric.
It is important to note that the hyperparameter search space during optimization (both supervised and unsupervised) was expressed in the common logarithm space (i.e., the decadic logarithm). Namely, since the parameters can span values across several orders of magnitude, their decadic logarithms occupy a much narrower interval, and searching within this narrower band significantly improves the convergence speed.
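For illustration, the full supervised pipeline and its objective can be sketched as follows (scikit-learn assumed; the three hyperparameters are passed in log10-space, mirroring the search-space transformation just described, and all names are illustrative).

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def pipeline_objective(log10_params, X, y):
    """Objective for Algorithm 1: negative cross-validated F1 of the kPCA + SVM pipeline."""
    gamma_kpca, C, gamma_svm = 10.0 ** np.asarray(log10_params)   # back-transform from log-space
    model = Pipeline([
        ("embed", KernelPCA(n_components=2, kernel="rbf", gamma=gamma_kpca)),
        ("clf", SVC(kernel="rbf", C=C, gamma=gamma_svm, class_weight="balanced")),
    ])
    return -cross_val_score(model, X, y, cv=3, scoring="f1").mean()
```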
2.4.3. Scoring Models
The SVM binary classifier, trained with supervised learning on top of the embedding, was scored on the test set. The scoring metric is reported as the mean value with a standard deviation (±σ), obtained from a 3-fold cross-validation. The process of scoring the model is graphically visualized in Figure 3. It can be seen that the embedding model (as a transformer) and the SVM classifier constituted a pipeline, which was fed the test dataset along with the optimal hyperparameters obtained from the training phase. The classifier produced predictions, which were compared to the ground-truth values (i.e., the test-set targets), and a certain metric was reported as the score (i.e., a measure of the test accuracy). Since the dataset was class-imbalanced, we used the F1-score as the harmonic mean of the classifier's precision and recall, i.e.,

$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$

where TP is a true positive, FP is a false positive, and FN is a false negative outcome (from the confusion matrix of the classifier).
It can be seen that both the objective function and the scoring method were based on the same metric, which was not necessary but streamlined the comparison. This scoring process enabled us to compare different embedding models in terms of quality. Namely, the SVM classifier had a higher score if the embedding on which it was trained preserved more of the informational content.
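The reported score can be computed directly from the confusion matrix as in the definition above; a short sketch follows (scikit-learn's f1_score returns the same value directly).

```python
from sklearn.metrics import confusion_matrix

def f1_from_confusion(y_true, y_pred):
    """F1 as the harmonic mean of precision and recall, from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```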
3. Results
The previously discussed embedding models were applied to the IEEE New England 39-bus power system dataset (Section 2.1). This section presents the results and other findings derived from that application. In order to make the task more challenging for this small power system, each embedding model was required to transform the original high-dimensional space (with 350+ features) into a two-dimensional (2D) space while preserving as much information as possible. Each embedding model was individually trained using the unsupervised learning approach (Section 2.4). A good embedding should feature only two (distinct) separable clusters, one (larger) for the stable and another (smaller) for the unstable cases. In order to further compare the quality of the embeddings, the SVM classifier was individually trained (using the supervised learning approach) on top of each of the embeddings. A classifier with a higher score indicated that the embedding model was able to preserve more of the information content and hence produced a better low-dimensional representation.
First, in order to demonstrate the convergence of our modified simulated annealing algorithm for hyperparameter tuning, Figure 4 presents, as an example, the algorithm's convergence when optimizing the hyperparameters of the kPCA model using the unsupervised learning approach. It can be seen that the algorithm attained convergence before reaching 200 iterations.
The models produce embeddings in a 2D space, which can be easily visualized and further inspected. Hence, Figure 5 shows an example of a "good" embedding (from kPCA, on the left side) and a "bad" embedding (from spectral embedding, on the right side). It can be clearly seen that kPCA produced two distinct clusters, while spectral embedding was not able to provide a meaningful separation between the stable and unstable cases. The color in Figure 5 was superimposed on the data points from the labels (after the fact) in order to visually emphasize the stable and unstable cases (this information, however, was not available during the unsupervised training of these models).
Moreover, we found that, of all the embedding models we tested on this dataset, only kPCA produced exactly two separable clusters. Furthermore, as can be seen from Figure 5 (left side), these clusters (roughly) coincided with the stable and unstable classes. Other embedding models, such as the isomap embedding and MDS, preserved information that could be valuable to a classifier trained on top of them but did not produce separable clusters (we found this to be the case with PCA and tSVD as well) or produced too many meaningless clusters (which was the case with t-SNE). For example, Figure 6 presents the embedding produced by t-SNE from unsupervised learning (unlabeled on the left and labeled on the right), where the color was added after the fact for visual emphasis. At the same time, Figure 7 presents the embedding produced by MDS, without labels (on the left) and with labels added after the fact (on the right). It is clear from Figure 7 that this low-dimensional embedding produced a single (somewhat loose) cluster of data points; similar (but denser) single clusters resulted from applying PCA and tSVD as well. However, if labels were present in the data, then these embeddings would still preserve enough information for training a classifier on top of them (with supervised or even semisupervised learning).
In addition, when the kPCA model was trained together with the SVM classifier using the pipeline approach (i.e., using the outer optimization loop from Figure 2), we found that this prioritized the classifier's accuracy (as would be expected) and that the resulting embedding occasionally did not feature two clearly distinct clusters. An example of this outcome is depicted in Figure 8, with labeled data on the left and unlabeled data on the right. Of course, the labels must be known a priori in order to apply this (supervised learning) approach.
Furthermore, and interestingly, we found that locally linear embedding (LLE), including all of the variants that we tested (i.e., modified and Hessian), produced a 2D embedding with very low informational content. Hence, classifiers trained on top of these embeddings had low accuracy, as shown later. Moreover, we found that the embedding produced by LLE (and its variants) was very sensitive to the number of neighbors used to produce it. In order to demonstrate these effects, Figure 9 provides the embedding computed by LLE with 10 (left) and 20 (right) neighbors. This high sensitivity was not found with the other embedding methods and makes LLE less useful for this particular application.
Finally, an SVM classifier was independently trained by means of supervised learning (i.e., using the inner optimization loop from Figure 2) on top of each embedding, and its classification metric (i.e., the threefold cross-validated F1-score) is reported, along with its optimal hyperparameters (C, γ), in Table 2. The last column shows the number of iterations of the simulated annealing algorithm, along with a Y/N indicator reporting whether early stopping was triggered. It should be mentioned that the SVM classifier was not calibrated.
In a fully unsupervised learning approach, one would take the unlabeled data and train the embedding model (which hopefully produces only two clusters), which could then be discovered by an appropriate clustering algorithm (as another unsupervised learning method). We demonstrate this approach here on the embedding produced by the kPCA method trained using unsupervised learning; see Figure 5 (left). The K-means clustering was applied on top of that embedding. The produced clusters are graphically presented in Figure 10 (left), where the color indicates the cluster membership of each data point. If this is compared to Figure 5 (left), it can be seen that many of the points from the clusters also belong to the actual classes, which means that a completely unsupervised learning approach (kPCA plus K-means) is able to detect power system disturbances from the WAMS-PMU data. Furthermore, Figure 10 (right) presents the decision boundary of the SVM classifier trained on top of that embedding with a supervised learning approach. An overlap between the cluster and classifier boundaries can be discerned from Figure 10 by comparing the left- and right-hand sides, corroborating the fact that clustering on top of the kPCA embedding can separate power system TSA cases and identify the unstable ones.
4. Discussion
We compared our proposed simulated annealing approach to hyperparameter optimization with a random search approach [38] and found that it produced better results with the same scoring method, while (generally) using fewer iterations. We further found that the random search was less consistent, since it exhibited a larger deviation of the cross-validated scores. This is not completely unexpected, since a random search simply draws samples from (in our case exponential) distributions and has no overarching guiding principle for finding the objective function's optimum. On the other hand, our simulated annealing balances exploration with exploitation through the use of the burn-in and the cooling schedule and avoids overfitting by means of early stopping. Also, early stopping enables the simulated annealing to stop searching before all iterations have been exhausted, which is not possible with a random search [38]. However, the random search algorithm, unlike simulated annealing, can be run in parallel on multicore architectures, which helps reduce its execution time.
As far as the different embedding methods are concerned, we found that the isomap embedding and MDS provided good bases for supervised classification (as can be seen from Table 2) but, unfortunately, could not produce separable clusters. Hence, they are not well suited for a fully unsupervised learning approach with this kind of data. The same can be said about the PCA and tSVD methods, which also could not produce separable clusters. Furthermore, although an SVM trained on top of the t-SNE embedding had a high score (as evident from Table 2), it suffered from a similar deficiency, which emanated from the fact that that embedding contained many small and artificial clusters (see Figure 6). Hence, the t-SNE method could not be combined with clustering in a fully unsupervised learning approach, in contrast to what has been proposed for some other datasets, e.g., [45]. This finding is characteristic of the studied dataset and may be due to its peculiar information content (time instants of phase voltages, currents, active and reactive powers, etc.). We also found that only kPCA, when applied to this dataset, produced an embedding that could be utilized in a fully unsupervised manner, as evidenced by Figure 10 (left side).
It is important to mention that there are several sources of randomness that need to be considered here: in the dataset itself (statistical sampling), in the way it was split into training and test sets, in the embedding process (finding nearest neighbors, etc.), and in the simulated annealing (which rests on a random walk). All of these aspects contribute to the variability in the classifier scores between different runs. However, we found a persistent pattern whereby some embedding methods, when combined with the SVM, reliably (and consistently) produced better results. These were the PCA, kPCA, and MDS methods (even their 2D embeddings were consistent between runs, despite different data splits). This is a reassuring finding. Also, the tSVD often produced an embedding (almost) identical to that of the PCA, which is not surprising considering their similar backgrounds. On the other hand, we found that LLE (and its variants) was quite brittle, sensitive to inputs, and produced embeddings with a low informational content. Finally, although t-SNE often produced embeddings that happened to be fertile ground for the SVM training, it was not consistent between runs and could not produce meaningful (nor separable) clusters.
Also, it can be stated that the classifier scores improved across the board when more than two dimensions were retained for the low-dimensional embedding, which was to be expected. For example, PCA with only seven principal components was able to explain 90% of the variance in the dataset. Classifiers trained on embeddings with more than two dimensions generally produce higher F1-scores than their two-dimensional counterparts. However, if the goal is to use unsupervised (or semisupervised) learning, then two- and three-dimensional embeddings are particularly appealing, since the results of the applied clustering can be easily visualized, inspected, and explained (i.e., retaining the properties of a white-box model). We showed that kPCA with K-means clustering produced satisfactory results on the IEEE New England 39-bus dataset. We also found that, contrary to the findings on other datasets [45], t-SNE was not able to produce meaningful clusters. Neither could the other embedding models that we tested.
In addition, we also tested an LLE variant that implements the local tangent space alignment algorithm [38]. This approach proved no better than any other LLE variant. Furthermore, it can be mentioned in passing that we also examined a stacked autoencoder for embedding this same dataset and found that it could not consistently produce separable clusters [33]. However, we did not carry out exhaustive tests of different autoencoder architectures, and our conclusions are therefore limited by this fact. Finally, there are still other embedding methods which, due to limited space, could not be tested here.
5. Conclusions
This paper introduced manifold learning to the electric power system transient stability analysis, based on a dataset derived from extensive transient simulations of the IEEE New England 39-bus benchmark power system. The majority of published papers that apply machine learning approaches to the power system TSA problem are based on supervised learning, with very few exceptions. This paper, however, explored both supervised and unsupervised learning for the power system TSA problem and compared and contrasted different embedding methods. The main contributions of this paper to the state of the art can be summarized through the following findings: (1) kPCA is able to produce a two-dimensional embedding with clearly separable clusters of stable and unstable cases; (2) K-means clustering can be linked with kPCA for a fully unsupervised learning approach; (3) several embedding methods (e.g., multidimensional scaling, isomap embedding) produce a low-dimensional embedding which preserves information without yielding clearly separable clusters; (4) an SVM classifier can attain a high accuracy when trained on top of these low-dimensional embeddings; (5) the t-SNE embedding could not produce meaningful clusters; and (6) simulated annealing can be used for hyperparameter optimization and provides better results than a random search approach.
It should be mentioned that some of these conclusions may be restricted to the examined dataset, although some findings reported here (e.g., the SVM classifier performance) have been corroborated by other researchers. Certainly, further research is needed to independently verify and extend these findings to other power system datasets. Furthermore, additional research is needed to stress-test these findings in environments beyond a benchmark test power system. This would include introducing various levels of noise into the dataset, along with missing values, measurement errors, etc., that corrupt the data in different ways. Also, the resilience of different embeddings to data drift could be examined. These are some of the future research directions that we can foresee at the present time.