The theoretical approaches used in this study include principal component analysis (PCA), the maximum information coefficient (MIC), the Savitzky–Golay convolution smoothing algorithm, and the long short-term memory (LSTM) neural network.
2.2. Maximum Information Coefficient
The maximum information coefficient (MIC), introduced by D.N. Reshef and his team in 2011, is a measure of correlation between variables based on the theory of mutual information (MI) [22]. Mutual information is a metric that quantifies the shared information between two random variables, U and V. The MIC calculation relies on both the marginal and joint probability densities of the variables in question. A higher MI value indicates a stronger statistical dependence between U and V. MI is calculated using entropy. The entropy of a variable U with probability distribution p(u) can be expressed using the following formula:

H(U) = -\sum_{u} p(u) \log_2 p(u)
The mutual information I(U, V) between two variables U and V is defined as follows:

I(U, V) = \sum_{u} \sum_{v} p(u, v) \log_2 \frac{p(u, v)}{p(u)\, p(v)}

where p(u, v) is the joint probability density of the two random variables U and V, and p(u) and p(v) are the marginal probability distributions of U and V, respectively.
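As a concrete illustration, the following Python sketch estimates the mutual information of two samples after discretizing their scatter plot on an nx-by-ny equal-width grid, which is the gridded form of MI that the MIC procedure evaluates. The function name grid_mi and the equal-width binning are our own illustrative choices, not part of the original method description.

import numpy as np

def grid_mi(x, y, nx, ny):
    """Mutual information of samples x, y after discretizing the scatter
    plot on an nx-by-ny equal-width grid (a plug-in estimate of I(U, V))."""
    joint, _, _ = np.histogram2d(x, y, bins=(nx, ny))
    pxy = joint / joint.sum()             # joint probabilities p(u, v)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(u)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(v)
    nz = pxy > 0                          # skip zero cells to avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())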
The MIC calculation involves partitioning the scatter plot of a dataset D into grids of different resolutions and computing, for each grid, the mutual information between the two variables U and V. The highest normalized mutual information value among these grids is then chosen as the MIC value for the variables. The calculation formula for MIC is as follows:

\mathrm{MIC}(D) = \max_{x y < B(n)} \frac{\max I(D, x, y)}{\log_2 \min(x, y)}

Here, n represents the sample size, x and y are the numbers of bins along the two axes, and B(n) represents the maximum number of grid cells. Typically, B(n) is chosen as n^{0.6}.
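Building on grid_mi above, a simplified MIC search might look as follows. This sketch only considers equal-width grids, whereas the original algorithm of Reshef et al. also optimizes the partition boundaries for each grid size (libraries such as minepy implement that full search); it is a minimal illustration of the normalization and the B(n) = n^{0.6} budget, not a reference implementation.

def mic(x, y):
    """Simplified MIC over equal-width grids with nx * ny <= B(n)."""
    n = len(x)
    b = max(4, int(n ** 0.6))   # grid-size budget B(n) = n^0.6
    best = 0.0
    for nx in range(2, b // 2 + 1):
        for ny in range(2, b // nx + 1):
            # normalize by log2(min(nx, ny)) so the score lies in [0, 1]
            best = max(best, grid_mi(x, y, nx, ny) / np.log2(min(nx, ny)))
    return best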
2.3. Savitzky–Golay Convolution Smoothing Algorithm
The Savitzky–Golay filter (SG) is a time-domain filtering method that effectively removes noise from waveform signals while preserving their shape and width. It achieves this by employing local polynomial least squares fitting. Due to its ability to smooth and denoise data, the Savitzky–Golay filter has found extensive applications in various fields [23].
The effectiveness of data smoothing depends on the chosen window width, which determines the number of data points used to calculate the average value during the smoothing process. The smoothing process can be represented by the following equation:

x_k^{*} = \frac{1}{H} \sum_{i=-m}^{m} h_i\, x_{k+i}

To minimize the overall impact of the algorithm on the waveform, each point's value is multiplied by a smoothing coefficient h_i, where H is the normalization factor. To address the limitations of traditional smoothing algorithms, the least squares principle and polynomial fitting are used to overcome the drawback of fixed smoothing coefficients h_i.
Suppose the smoothing window has a size of n = 2m + 1, and the points within it are indexed by z = (−m, −m + 1, ..., 0, 1, ..., m − 1, m). The points within the window are fitted using a polynomial of degree k − 1, as follows:

y_z = a_0 + a_1 z + a_2 z^2 + \cdots + a_{k-1} z^{k-1}
By applying the least squares method, we can obtain a system of k linear equations. When the number of data points n is greater than the number of unknowns k, the system is overdetermined and admits a least squares solution, so the fitting parameters A can be determined. Hence, the following relationship holds:

Y_{(2m+1) \times 1} = Z_{(2m+1) \times k} \cdot A_{k \times 1} + E_{(2m+1) \times 1}

It can be described using the following matrix form:

\begin{pmatrix} y_{-m} \\ y_{-m+1} \\ \vdots \\ y_{m} \end{pmatrix} = \begin{pmatrix} 1 & -m & \cdots & (-m)^{k-1} \\ 1 & -m+1 & \cdots & (-m+1)^{k-1} \\ \vdots & \vdots & & \vdots \\ 1 & m & \cdots & m^{k-1} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{k-1} \end{pmatrix} + \begin{pmatrix} e_{-m} \\ e_{-m+1} \\ \vdots \\ e_{m} \end{pmatrix}

The least squares solution for matrix A, denoted as \hat{A}, is given by

\hat{A} = (Z^{T} \cdot Z)^{-1} \cdot Z^{T} \cdot Y

The smoothed (filtered) values for the data are then

\hat{Y} = Z \cdot \hat{A} = Z \cdot (Z^{T} \cdot Z)^{-1} \cdot Z^{T} \cdot Y = B \cdot Y

where B = Z \cdot (Z^{T} \cdot Z)^{-1} \cdot Z^{T} is the smoothing matrix. The "\cdot" in the above formulas represents matrix multiplication.
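The derivation above maps directly to a few lines of NumPy. The sketch below builds the smoothing matrix B = Z (Z^T Z)^{-1} Z^T for a window of n = 2m + 1 points and applies the centre-row weights to a 1-D signal. The function names and the default m = 2, k = 3 (a classic 5-point quadratic filter) are illustrative choices; in practice, scipy.signal.savgol_filter provides an equivalent, boundary-aware implementation.

import numpy as np

def sg_matrix(m, k):
    """Smoothing matrix B = Z (Z^T Z)^{-1} Z^T for a window of n = 2m + 1
    points fitted with a polynomial of degree k - 1."""
    z = np.arange(-m, m + 1)              # window positions z = -m, ..., m
    Z = np.vander(z, k, increasing=True)  # columns 1, z, z^2, ..., z^(k-1)
    return Z @ np.linalg.inv(Z.T @ Z) @ Z.T

def sg_smooth(y, m=2, k=3):
    """Smooth a 1-D signal with the centre-row weights h_i of B.
    Boundaries are handled crudely here via zero padding ('same' mode)."""
    w = sg_matrix(m, k)[m]                # weights for the window centre
    return np.convolve(y, w[::-1], mode="same")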
2.4. Long Short-Term Memory Neural Network
The long short-term memory (LSTM) network is a type of recurrent neural network designed to address the vanishing gradient problem in traditional recurrent neural networks. It is commonly employed for handling time-series data with long time intervals and delays [24]. The internal structure of an LSTM model is depicted in Figure 1. By utilizing different modules within the network, LSTM can assess the importance of current system information and determine whether to retain or forget this information. Hence, LSTM is an effective approach for tackling long-term dependency issues. The structural components of an LSTM unit include an input gate i_t, a forget gate f_t, an output gate o_t, and a self-connected memory cell state C_t.
In the LSTM network, the forget gate determines how much information from the previous time step's cell state C_{t-1} should be preserved in the current time step's cell state C_t. The input to the forget gate is composed of the previous time step's hidden state h_{t-1} and the current time step's input x_t. By using a sigmoid function, a decision vector f_t is generated, which determines the extent to which information from the previous time step's cell state C_{t-1} should be forgotten. The result of the forget gate is obtained by element-wise multiplication between the decision vector f_t and the previous time step's cell state C_{t-1}. The specific calculation formula is as follows:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
Through this calculation, the forget gate determines which information from the previous time step's cell state should be forgotten, based on the previous hidden state and the current input. This mechanism enables LSTM to flexibly retain and discard information when processing time series, thus capturing dependencies within sequences more effectively.
The input gate is responsible for determining how much new information should be added to the current state. By using the tanh function, candidate information \tilde{C}_t is generated. Then, the decision vector i_t, generated by the sigmoid function, determines how much content from the candidate information \tilde{C}_t can be added to the cell state C_t. The specific calculation formulas are as follows:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
Through this calculation, the importance of new information is determined by the input gate. The candidate information \tilde{C}_t is then multiplied element-wise by the decision vector i_t, and the product contributes to the updated cell state C_t. This mechanism allows LSTM to update the memory state based on the current input and the previous hidden state, thereby capturing long-term dependencies in time series more effectively.
The output gate decides the output of the current cell and the hidden state that will be conveyed to the next cell. The input of the output gate consists of the previous hidden state h_{t-1} and the present input x_t. By using the sigmoid function, a vector o_t is generated, which determines how much information should be output from the current cell state C_t. Additionally, the present cell state C_t is passed through the tanh activation function, and these two variables are multiplied element-wise to obtain the hidden state h_t at the current time step. The specific calculation expressions are as follows:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)
Through this calculation, the output gate determines the output of the current cell state and the hidden state that will be passed to the next cell. This mechanism allows LSTM to control the output and transmission of information according to the importance of the current state, enabling it to adapt better to different time-series tasks.
Both the input gate and the forget gate contribute to the cell state C_t. The cell state C_t is obtained by adding the product of the previous cell state C_{t-1} and the forget gate decision vector f_t to the product of the input gate decision vector i_t and the candidate information \tilde{C}_t. The specific expression is as follows:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
In the above equations, W_f, W_i, W_C, and W_o are weight matrices between different layers, while b_f, b_i, b_C, and b_o are the corresponding bias vectors; \sigma(\cdot) represents the sigmoid activation function, and * denotes element-wise multiplication.
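To make the gate equations concrete, the following NumPy sketch implements one LSTM time step exactly as written above. The function name lstm_step and the dictionary layout of the weights are our own illustrative conventions, not a reference implementation from the paper or any framework.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W maps gate name -> weight matrix of shape
    (hidden, hidden + input); b maps gate name -> bias of shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])       # concatenated [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate decision vector
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate decision vector
    C_tilde = np.tanh(W["C"] @ z + b["C"])  # candidate information
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate vector
    h_t = o_t * np.tanh(C_t)                # hidden state at time t
    return h_t, C_t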
The recurrent structure of LSTM, as shown in Figure 2, consists of multiple LSTM modules. Through the connections and interactions between these modules, LSTM can effectively handle long-term sequence prediction tasks.
In this structure, each LSTM module consists of an input gate, a forget gate, an output gate, and a cell state. Each module performs calculations based on the current input and the previous hidden state, and then outputs the current hidden state and cell state. By connecting multiple LSTM modules in layers, effective modeling and prediction of long-term sequences can be achieved.
The recurrent structure of LSTM allows for the transmission and interaction of information in the temporal dimension. Through gate mechanisms, the flow of information is controlled, enabling LSTM to effectively capture long-term dependencies in time-series data.
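As a usage illustration of the lstm_step sketch above, the loop below unrolls the cell over a sequence, passing the hidden state and cell state from one module to the next. All dimensions and the random weights are hypothetical placeholders, not values from this study.

import numpy as np

# Hypothetical dimensions for illustration only.
input_dim, hidden, T = 3, 4, 10
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.1, size=(hidden, hidden + input_dim)) for g in "fiCo"}
b = {g: np.zeros(hidden) for g in "fiCo"}

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(T, input_dim)):
    h, C = lstm_step(x_t, h, C, W, b)  # states flow between successive modules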