1. Introduction
With the development of deep learning, recurrent neural networks (RNNs), as one of the most representative deep neural networks (DNNs), have been widely used in various fields, including mechanical fault diagnosis, emotion analysis, and stock forecasting [1,2,3].
Although RNNs can theoretically capture information from variable-length sequences, their performance in practical applications is often suboptimal due to the problems of gradient vanishing and explosion [4,5]. Specifically, during the training process, the gradients of the weights tend to decay or grow rapidly in the back-propagation steps, resulting in instability when updating network weights and hindering the ability of the RNN to model long-term dependencies [6,7]. While the problem of exploding gradients can be tackled with a simple clipping strategy [8,9], there is no easy way to adequately address vanishing gradients in a vanilla RNN.
One of the most popular alternatives to the vanilla RNN is long short-term memory (LSTM) [10,11], which addresses the gradient vanishing problem in the training process. The core advantage of LSTM is that the hidden state is updated by the superposition of multiple component-wise “gates” instead of through transfer operators such as matrix multiplication. Although LSTM alleviates the limitations of the RNN to some extent, some problems remain. Concretely, LSTM significantly increases the model’s complexity, making the model not only inefficient to train but also difficult to interpret [12]. These problems also exist in another RNN variant, namely the Gated Recurrent Unit (GRU) model [13].
As such, to address the limitations inherent in LSTM and RNNs, in this work, we propose a Light Recurrent Unit (LRU) model, which offers a compact structure coupled with enhanced interpretability. The design philosophy of the LRU is to balance model performance and computational efficiency. It employs a compact structure, reducing the number of gating units to lower the network parameters and computational complexity while maintaining accuracy on long sequences. This simplification not only makes it easier to track hidden state changes across successive time steps but also enhances model interpretability. In addition, the activation function is modified to accelerate the convergence of the training process; this modification also improves the interpretability of the learned function and strengthens the memory capacity of RNNs for capturing long-term dependencies. Through these innovations, the LRU can retain long-term memory effectively and thereby better handle tasks involving long sequences.
Moreover, the design of the LRU theoretically suits environments with limited resources. Due to its simplified structure and reduced computational requirements, the LRU has the potential for deployment on platforms with constrained computational resources, such as mobile devices and embedded systems. This design takes into account the typically limited computational power and battery life of these devices, necessitating efficient algorithms to handle complex tasks. By reducing the number of gating units, the LRU decreases computational load and memory demands, theoretically exhibiting superior performance in such environments.
For example, the study by Zhang et al. [14] presents a low-cost, low-power, and privacy-preserving facial expression recognition system based on edge computing, evaluating four deep learning algorithms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and deep neural networks (DNNs). This demonstrates the practical application potential of lightweight RNN models in resource-constrained environments. Similarly, the study by Al-Nader et al. [15] proposes a novel scheduling algorithm for improving the performance of multi-objective, safety-critical wireless sensor networks using long short-term memory (LSTM), further supporting the application potential of our model in these specific settings. Future research can further validate the performance of the LRU in these settings.
The main contributions of this paper can be summarized as follows.
The proposed LRU introduces a latent hidden state that is highly interpretable. It has the minimal number of gates among all possible gated recurrent structures: a single gate controls whether past memories should be kept or not, so that the requirements for training data, model tuning, and training time are reduced while model accuracy is maintained.
The proposed LRU leverages Stack Recurrent Cells (SRCs) with a modified activation function, consequently improving the gradient flow in deep networks. This modification leads to accelerated convergence of the network and enhances the interpretability of the learned model parameters.
Experimental results on various tasks demonstrate that LRU can keep long-term memory to better process long sequences. Despite reduced model complexity, LRU has overall better accuracy as well as faster convergence speed compared to LSTM.
The remaining sections are organized as follows.
Section 2 discusses related work on representative RNN variants.
Section 3 demonstrates the design of the proposed LRU model.
Section 4 shows that LRU obtains competitive or even better performance compared to previous RNN models on various tasks that require long-term dependencies.
Section 5 draws conclusions.
2. Background for RNN
In recent times, self-attention-based models, including the Transformer architecture [16] and its derivatives, have excelled in numerous tasks. For example, Huang et al. [17] examined the relationship between peer feedback classified by deep learning and online learning burnout, demonstrating the potential of deep learning techniques in educational settings. Zheng et al. [18] proposed a modified Transformer architecture based on relative position encoding, which showed outstanding performance across various tasks. However, RNNs have significantly fewer parameters and computational demands compared to Transformer-based models, making them still very common in many applications.
By assigning additional weights to the network graph, RNNs create a loop mechanism within the network that allows the network to explicitly learn and utilize the context information of the input sequence, and they are therefore well-suited for processing tasks involving sequence input. RNN architectures have made an enormous improvement in a wide range of machine learning problems with sequential input involved [19,20,21].
In the widespread application of RNNs, particularly in tasks like text classification and sentiment analysis that require learning long-term dependencies, optimizing RNN parameters remains challenging. The primary reason is the “vanishing gradient” problem, which hampers learning long-term dependencies. RNNs rely on temporal unfolding to fit and predict time series data, updating parameters based on multiple time steps. However, the vanishing gradient problem limits this unfolding, causing updates to be influenced only by recent time steps and not capturing distant historical information. As the temporal distance of dependencies increases, the difficulty of training RNNs also rises. In response to these challenges, researchers have explored various approaches. The following sections highlight some of the prominent directions in current RNN research aimed at addressing these issues.
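The mechanism can be seen in a few lines of Python: back-propagating through a vanilla RNN multiplies the gradient by one Jacobian per time step, and with a modest recurrent weight scale the gradient norm collapses within a few dozen steps (an illustrative sketch with synthetic values, not from the paper):

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): the gradient back-propagated
# through a vanilla RNN is a product of per-step Jacobians W^T diag(tanh'(a_t)).
# With a recurrent matrix whose spectral radius is below 1 and |tanh'| <= 1, the
# gradient norm shrinks roughly geometrically with the number of time steps.
rng = np.random.default_rng(0)
n = 64                                          # hidden size (arbitrary)
W = rng.normal(0, 0.5 / np.sqrt(n), (n, n))     # small-scale recurrent weights
grad = np.ones(n)                               # gradient arriving at the last step

for t in range(1, 51):
    tanh_prime = 1.0 - np.tanh(rng.normal(0, 1, n)) ** 2   # derivative of tanh, always <= 1
    grad = W.T @ (tanh_prime * grad)            # one step of back-propagation in time
    if t % 10 == 0:
        print(f"steps back-propagated: {t:2d}, gradient norm: {np.linalg.norm(grad):.2e}")
```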
2.1. RNN with Special Initialization
Some researchers have attempted to capture long-term dependencies in simple, non-gated RNNs through better weight initialization. Le et al. [22] proposed IRNN, which uses an identity matrix to initialize the recurrent weight matrix. The critical innovation in IRNN is to produce a near-identity projection of the hidden states. However, this model is reported to be fragile to hyperparameter settings and to fail easily in training [23,24,25]. Talathi et al. [26] proposed np-RNN based on IRNN. Their np-RNN adds a stronger constraint on the initial recurrent weights by forcing them to be a normalized positive-definite matrix, with all eigenvalues except the highest less than one. While these models help to ease the gradient vanishing problem at the beginning of training, they cannot completely avoid the issue throughout the entire training process.
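For illustration, IRNN-style initialization can be expressed in a few lines of PyTorch (our own sketch under the usual reading of [22], i.e., identity recurrent weights with a ReLU nonlinearity; this is not the cited authors' code):

```python
import torch
import torch.nn as nn

# Illustrative IRNN-style setup: the recurrent weight matrix is set to the identity
# and the hidden-to-hidden nonlinearity is ReLU, so an unperturbed hidden state is
# simply copied forward in time at the start of training.
rnn = nn.RNN(input_size=32, hidden_size=128, nonlinearity="relu", batch_first=True)
with torch.no_grad():
    rnn.weight_hh_l0.copy_(torch.eye(128))    # identity recurrent weights
    rnn.bias_hh_l0.zero_()                    # zero recurrent bias
    rnn.bias_ih_l0.zero_()                    # zero input bias

x = torch.randn(4, 100, 32)                   # (batch, time, features) toy input
out, h_n = rnn(x)                             # standard forward pass
```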
2.2. RNN with Structure Constraints
Another direction for addressing the gradient vanishing problem in RNNs is to add certain constraints to the model structure [27]. Mikolov et al. [28] proposed the Structurally Constrained Recurrent Network (SCRN), which forces a diagonal block of the recurrent matrix to be equal to a reweighted identity throughout training. They argue that the reweighted identity block in the recurrent matrix changes its state slowly, which helps the entire network capture a longer history. Hu et al. [29] analyzed the gating mechanisms in LSTMs and proposed a structure called the Recurrent Identity Network (RIN), which adds an extra identity map projection to the hidden layer. These models, however, cannot efficiently improve model performance, especially when compared with gated RNNs.
3. Proposed Model
In this section, we first introduce the traditional RNN and LSTM models and point out their existing issues. We then introduce the proposed LRU model. Compared with traditional models, the contextual information stored in the hidden layer of LRU can be transmitted over a longer distance in the time domain, and we further modify the activation function to achieve faster model convergence and improved interpretability.
3.1. RNN and LSTM
As shown in Figure 1a, the state update in a basic RNN can be described as follows:

$h_t = \sigma(U x_t + W h_{t-1} + b)$ (1)

where index $t$ indicates the current position in the input sequence, $x_t$ and $h_t$ are the input and hidden state at time $t$, $U$ and $W$ denote the parameter matrices related to $x_t$ and $h_{t-1}$, respectively, $b$ is a bias term, and $\sigma$ is an element-wise activation function that applies to the hidden states. The term $W h_{t-1}$ in the RNN outlines that the last hidden state vector is recurrently fed back to the input to compute the next state.
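As a concrete illustration, the update above can be written in a few lines of Python (a minimal sketch with arbitrary dimensions and tanh as the activation; the names are ours, not the paper's code):

```python
import numpy as np

# Minimal sketch of the basic RNN update in Equation (1):
# h_t = tanh(U x_t + W h_{t-1} + b), with the hidden state fed back recurrently.
def rnn_step(x_t, h_prev, U, W, b):
    """Compute the next hidden state from the current input and previous state."""
    return np.tanh(U @ x_t + W @ h_prev + b)

m, n = 16, 32                              # input and hidden dimensions (arbitrary)
rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(n, m)), rng.normal(size=(n, n)), np.zeros(n)

h = np.zeros(n)                            # initial hidden state
for x_t in rng.normal(size=(10, m)):       # a toy sequence of 10 steps
    h = rnn_step(x_t, h, U, W, b)          # the last hidden state is reused at each step
```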
Gated RNNs, especially LSTM and its variants, address the gradient vanishing problem in RNNs mainly by introducing component-wise gates to control the information flow within the network. The proposed LSTM model in [10] differs from the original RNN in three aspects.
(1) There is one more state vector $c_t$ in the hidden layer in addition to $h_t$. Here, $c_t$ is designed as a “concealed” state that maintains long-term memories, while $h_t$ maintains short-term memories. Further, $c_t$ is also called the Constant Error Carousel (CEC), since it is updated additively instead of by using a matrix operation. This design ensures that the gradient flow remains stable when updating $c_t$.
(2) An input gate $i_t$ is applied to the input to determine which part of the new information can be added to $c_t$ at time $t$. The inputs are transformed into the update $\tilde{c}_t$. Further, $i_t \odot \tilde{c}_t$ and the previous cell state together form the new cell state $c_t$.
(3) An output gate $o_t$ is added to control which part of $c_t$ should be output as the hidden state $h_t$.
In [30], a forget gate $f_t$ was first applied to $c_{t-1}$ to determine which part of the old memory should be kept, before adding new information to $c_t$. This new LSTM structure with a forget gate has been widely applied thereafter. The overall structure of this three-gated LSTM is illustrated in Figure 1b, and the update rules are as follows:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$ (2)

In Equation (2), $W_j$ and $U_j$ ($j = i, f, o, c$) are parameter matrices, and $b_j$ are bias vectors. The activation $\sigma$ for the gates usually uses the logistic sigmoid to constrain the value of gates within $(0, 1)$. For each dimension of information, a gate of value 0 means that the gate is “closed”, and 1 means that the gate is “fully open”. For example, $i_t$ of value 0 forces the model to discard any input information at time $t$, and $f_t$ of value 1 allows the model to keep all previous memories. The value of each gate is determined during training by the current input and hidden information. A tanh nonlinearity is applied when computing $\tilde{c}_t$ and $h_t$, while ⊙ indicates the element-wise product.
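For reference, the whole three-gated update can be sketched in Python as follows (an illustrative, framework-free rendering of Equation (2); the parameter names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of the three-gated LSTM update in Equation (2). params holds the
# matrices W_j, U_j and biases b_j for j in {i, f, o, c}.
def lstm_step(x_t, h_prev, c_prev, params):
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])      # input gate
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])      # forget gate
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])      # output gate
    c_tilde = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev + params["b_c"])  # candidate update
    c_t = f_t * c_prev + i_t * c_tilde        # additive cell-state update (the CEC)
    h_t = o_t * np.tanh(c_t)                  # exposed hidden state
    return h_t, c_t
```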
When LSTM was first proposed, researchers attempted to increase the model’s complexity for better performance. For example, studies have explored adding “peephole” connections between $c_{t-1}$ and the three gates ($i_t$, $f_t$, $o_t$), allowing the cell state to influence the gates more directly. While these modifications have shown performance improvements in some applications, they can also introduce additional complexity and may not always be effective [9,31]. On the other hand, some gates have been empirically found to be less effective than others [32,33].
Recently, researchers have started to investigate the redundancy within the LSTM structure [32,34,35,36]. Among these models, the most representative one is the Gated Recurrent Unit (GRU) [33], which lowers the number of gates in LSTM by removing the output gate. Although GRU has fewer trainable weights than LSTM (about 3/4), it still achieves similar performance on various tasks [32,37]. This phenomenon rests on the fact that, among the three gates in LSTM, the forget gate is essential [6,33,38], while the effect of the input and output gates is less obvious [32]. We argue that it is possible to design a simplified gated RNN model that further lowers complexity while maintaining accuracy.
3.2. Proposed LRU
This section describes in detail the design of the proposed Light Recurrent Unit (LRU). Our motivation is to present an RNN model with both an accessible structure and high interpretability. The overall structure and data flow of LRU are illustrated in
Figure 2. At any time step
t, LRU takes as input
and updates its hidden state
as follows:
Here, is a transformed input that adds information to , and is a vector that controls both:
The portion that is remembered in the last state , for each component;
The portion that is added to from for each component.
Therefore,
couples the input and forget gates in LSTM by specifying that
Compared with the update vector
that has similar functionality in LSTM,
omits the influence from the last hidden state. This change makes it easier to track the changes in hidden states in consecutive time steps, as will be further discussed in
Section 3.4.
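The resulting cell is small enough to write out directly. The following is a minimal Python sketch of our reading of the update above (parameter names are illustrative, not taken from the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of one LRU step: a single forget gate f_t interpolates between the
# previous hidden state and a candidate computed from the current input alone.
def lru_step(x_t, h_prev, W, W_f, U_f, b_f):
    h_tilde = np.tanh(W @ x_t)                        # candidate: no recurrent term
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)     # the only gate
    h_t = f_t * h_prev + (1.0 - f_t) * h_tilde        # coupled forget/input update
    return h_t
```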
3.3. Stack Recurrent Cells
Similar to other RNN variants, LRU can be stacked with multiple recurrent layers to improve its memorization capacity. This can be achieved by feeding the output vector of the previous layer as the input to the next layer. Formally, considering an $L$-layered LRU, the computation at time step $t$ is specified by the following equations:

$\tilde{h}_t^{(l)} = \tanh(W^{(l)} h_t^{(l-1)})$
$f_t^{(l)} = \sigma(W_f^{(l)} h_t^{(l-1)} + U_f^{(l)} h_{t-1}^{(l)} + b_f^{(l)})$
$h_t^{(l)} = f_t^{(l)} \odot h_{t-1}^{(l)} + (1 - f_t^{(l)}) \odot \tilde{h}_t^{(l)}$ (5)

where $l = 1, \dots, L$ and $h_t^{(0)} = x_t$.
The highway network [39] has been proven to improve the gradient flow in deep networks such as CNNs. It uses a skip connection that directly links the hidden state to the input, allowing the gradient to propagate to the previous layer directly. This component can be applied in a stacked LRU by specifying that, for $l \geq 2$, $W^{(l)}$ is fixed to the identity and the tanh activation is replaced with identity mapping, such that:

$h_t^{(l)} = f_t^{(l)} \odot h_{t-1}^{(l)} + (1 - f_t^{(l)}) \odot h_t^{(l-1)}$ (6)

The following pseudocode in Algorithm 1 outlines this process, demonstrating the step-by-step procedure involved in computing the relationship described in Equation (6).
Description of the main variables:
$x_t$: Input sequence.
$h_t^{(l)}$: Hidden state of layer $l$ at time step $t$.
$\tilde{h}_t^{(l)}$: Candidate hidden state of layer $l$ at time step $t$.
$f_t^{(l)}$: Forget gate of layer $l$ at time step $t$.
$W^{(l)}$: Weight matrix for the candidate hidden state of layer $l$.
$U_f^{(l)}$: Weight matrix for the forget gate of layer $l$.
$W_f^{(l)}$: Weight matrix for the forget gate input of layer $l$.
$b_f^{(l)}$: Bias for the forget gate of layer $l$.
$\sigma$: Activation function (sigmoid).
$\tanh$: Activation function (hyperbolic tangent).
⊙: Element-wise multiplication.
Algorithm 1 Computation of an L-layered Light Recurrent Unit (LRU)
1: Input: input sequence $x_t$
2: Output: hidden states $h_t^{(l)}$ for all layers
3: Step 1: Initialization
4: Initialize the hidden state for the first layer: $h_t^{(0)} = x_t$
5: Step 2: Computation for each layer
6: for $l = 1$ to $L$ do
7:   Compute the candidate hidden state: $\tilde{h}_t^{(l)} = \tanh(W^{(l)} h_t^{(l-1)})$
8:   Compute the forget gate: $f_t^{(l)} = \sigma(W_f^{(l)} h_t^{(l-1)} + U_f^{(l)} h_{t-1}^{(l)} + b_f^{(l)})$
9:   Update the hidden state: $h_t^{(l)} = f_t^{(l)} \odot h_{t-1}^{(l)} + (1 - f_t^{(l)}) \odot \tilde{h}_t^{(l)}$
10: end for
11: Step 3: Output
12: Return the hidden states $h_t^{(l)}$ for all layers
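Under the same assumptions, a stacked forward pass over a whole sequence can be sketched as follows (illustrative Python with zero initial states; the parameter names are ours, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of an L-layered (stacked) LRU forward pass over a sequence, following
# Algorithm 1. layers is a list of dicts, one per layer, each holding W, W_f, U_f, b_f.
def stacked_lru_forward(xs, layers, hidden_size):
    """xs: array of shape (T, input_size); returns the top layer's hidden states."""
    T = xs.shape[0]
    h = [np.zeros(hidden_size) for _ in layers]      # h_0^(l) = 0 for every layer
    outputs = []
    for t in range(T):
        inp = xs[t]                                  # h_t^(0) = x_t
        for l, p in enumerate(layers):
            h_tilde = np.tanh(p["W"] @ inp)                           # candidate state
            f = sigmoid(p["W_f"] @ inp + p["U_f"] @ h[l] + p["b_f"])  # forget gate
            h[l] = f * h[l] + (1.0 - f) * h_tilde                     # state update
            inp = h[l]                               # output of layer l feeds layer l+1
        outputs.append(h[-1].copy())
    return np.stack(outputs)
```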
3.4. Analysis
In this section, the properties and behavior of the proposed LRU are discussed.
First, LRU can be regarded as a gated RNN derived from LSTM, but with many fewer parameters. This property makes the learning process faster and less vulnerable to overfitting. Assuming the input vector dimension is $m$ and the hidden state dimension is $n$, LSTM has four sets of parameters that determine $i_t$, $f_t$, $o_t$, and $\tilde{c}_t$, resulting in a parameter number of $4(mn + n^2)$ (bias terms omitted for simplicity). Meanwhile, LRU only has two sets of parameters, one for calculating the forget gate $f_t$ and the other for $\tilde{h}_t$. The total parameter number in LRU is only $2mn + n^2$. In the case where $m = n$, the parameter number of LRU is $3/8$ that of LSTM; in the case of $m \ll n$, LRU has only $1/4$ the parameter size of LSTM. In short, given that LRU has only one gate and no recurrent connection in its candidate state, LRU can be regarded as a minimal design among gated RNN units. Despite this simplicity, the utilization of the forget gate allows LRU to process sequence learning without suffering from the gradient vanishing problem, as demonstrated later in the experimental section on various tasks.
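As a quick sanity check of these ratios, the two counts can be compared directly (a throwaway sketch; the dimensions are arbitrary):

```python
# Parameter counts with biases omitted, as in the comparison above.
def lstm_params(m, n):
    return 4 * (m * n + n * n)      # four blocks: i_t, f_t, o_t, and the candidate

def lru_params(m, n):
    return 2 * m * n + n * n        # forget gate (W_f, U_f) plus candidate (W)

for m, n in [(128, 128), (16, 512)]:
    ratio = lru_params(m, n) / lstm_params(m, n)
    print(f"m={m}, n={n}: LRU/LSTM parameter ratio = {ratio:.3f}")
# m = n gives 3/8 = 0.375; m << n approaches 1/4 = 0.25.
```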
Second, the removal of recurrent connections from the candidate state in LRU also allows us to describe the learned model from a quite straightforward perspective. Since the hidden state in LRU is updated in an additive, non-recurrent fashion, at each step, the hidden state can be regarded as a weighted average of all previous inputs in the same layer. Formally,

$h_t^{(l)} = \sum_{k=1}^{t} w_k^{(l)} \odot \tilde{h}_k^{(l)}$, where $w_k^{(l)} = (1 - f_k^{(l)}) \odot \prod_{j=k+1}^{t} f_j^{(l)}$

Therefore, the hidden state at time $t$ can be tracked back to all previous inputs in the same layer, with the assigned weights $w_k^{(l)}$ indicating the inputs' relative importance. It is easy to prove that the weights $w_k^{(l)}$ sum up to 1 for each $l$. Therefore, $h_t^{(l)}$ can be viewed as a weighted average of previous inputs. It should be noted that these weights are also component-wise vectors, allowing us to analyze the behavior of each neuron of the hidden state. This property can also be regarded as a soft attention mechanism, as will be shown in Section 4.2.
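This unrolled form can be checked numerically with synthetic gate values (our own illustration, not the paper's code): running the recurrence and the explicit weighted sum gives the same hidden state, and the weights sum to 1 up to the term attached to the initial state.

```python
import numpy as np

# Check that the LRU hidden state unrolls into a weighted combination of past
# candidate states, using randomly drawn gate values and candidates.
rng = np.random.default_rng(1)
T, n = 8, 4
f = rng.uniform(0.1, 0.9, size=(T, n))        # forget-gate values, one vector per step
h_tilde = rng.normal(size=(T, n))             # candidate states, one vector per step

# Recurrent computation: h_t = f_t * h_{t-1} + (1 - f_t) * h~_t, with h_0 = 0.
h = np.zeros(n)
for t in range(T):
    h = f[t] * h + (1.0 - f[t]) * h_tilde[t]

# Explicit unrolled computation with per-step weights w_k = (1 - f_k) * prod_{j>k} f_j.
weights = np.stack([(1.0 - f[k]) * np.prod(f[k + 1:], axis=0) for k in range(T)])
h_unrolled = (weights * h_tilde).sum(axis=0)

print(np.allclose(h, h_unrolled))                 # True: the two computations agree
print(weights.sum(axis=0) + np.prod(f, axis=0))   # all ones: weights sum to 1 up to the h_0 term
```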
As noted above, despite its simplicity, the utilization of the forget gate allows LRU to process sequence learning without suffering from the gradient vanishing problem. In traditional RNNs, the gradient vanishing phenomenon during training primarily arises from two sources. Firstly, the repeated multiplication by the hidden state weight matrix causes the gradient to be suppressed in most positions, a problem that becomes particularly pronounced during long-sequence training and ultimately leads to gradient vanishing. Secondly, commonly used activation functions such as sigmoid and tanh cause an overall scale reduction during gradient back-propagation, further exacerbating the problem. The LRU effectively mitigates these issues by introducing a simplified gate structure and a direct candidate hidden state update mechanism. As demonstrated later in the experimental section, the LRU excels in sequence learning across various tasks.