This work used a comprehensive methodology for diagnosing SLE with Stacked Deep Learning Classifiers (SDLC) trained on data obtained from the Gene Expression Omnibus (GEO) database. The dataset offered a multimodal view of SLE pathogenesis, comprising transcriptome profiles, clinical characteristics, and laboratory results. We preprocessed the data to handle missing values, standardise features, and reduce bias.
We then used the TensorFlow framework to build and train our models, creating deep architectures with multiple neural network layers. Batch normalisation and dropout regularisation were applied to improve generalisability and reduce overfitting. The SDLC model was trained with supervised learning, with hyperparameters optimised through cross-validation.
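As a concrete illustration of this preprocessing, the sketch below applies median imputation, z-score standardisation, and a stratified train/test split. It is a minimal sketch, not the study's exact pipeline: `X` and `y` are placeholders for the GEO-derived feature matrix and SLE/control labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess(X: np.ndarray, y: np.ndarray):
    # Stratified split keeps the SLE/control ratio identical in both sets.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    # Fit imputation and scaling on the training data only, to avoid leakage.
    imputer = SimpleImputer(strategy="median").fit(X_tr)
    scaler = StandardScaler().fit(imputer.transform(X_tr))
    apply = lambda A: scaler.transform(imputer.transform(A))
    return apply(X_tr), apply(X_te), y_tr, y_te
```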
2.3. Stacked Deep Learning Classifier (SDLC)
This section presents a Stacked Deep Learning Classifier (SDLC) for SLE diagnosis that combines two networks: an attention-based CNN (ACNN) and a Stacked Bi-LSTM (SBLSTM). The SDLC framework combines these deep learning architectures to extract spatial and sequential patterns from multi-modal data sources, improving diagnostic precision and interpretability.
Figure 2 shows the proposed methodology, a Stacked Deep Learning Classifier (SDLC).
Machine Learning (ML) encompasses algorithms that learn from data and generate predictions based on that acquired knowledge. Commonly used ML algorithms include support vector machines, decision trees, and k-nearest neighbours; these typically require manually engineered features to perform well. Deep Learning (DL), a subset of ML, instead uses neural networks with multiple layers to learn feature representations from raw data autonomously. This hierarchical learning capability makes DL well suited to complex datasets with minimal manual intervention. Our study introduces a Stacked Deep Learning Classifier, a hierarchical ensemble of deep learning models with three components: (i) an Attention-based Convolutional Neural Network (ACNN), which uses multiple convolutional layers to extract informative features from transcriptomic data; (ii) a Stacked Bi-LSTM (SBLSTM), a recurrent network that uses multiple LSTM layers to capture temporal dependencies in the data; and (iii) a meta-classifier, which combines the predictions of the ACNN and SBLSTM through ensemble learning to produce the final prediction. By stacking these models, we combine complementary patterns and features from transcriptomic data, clinical features, and laboratory results, improving the accuracy of Systemic Lupus Erythematosus (SLE) diagnosis.
Our Stacked Deep Learning Classifier, which combines the strengths of deep learning and ensemble techniques, outperforms traditional ML methods in accuracy, making it a valuable tool for precision medicine in the management of SLE.
One component of the SDLC architecture is an attention-based convolutional neural network (ACNN), which extracts spatial information from diverse data such as gene expression profiles, clinical data, and medical imaging. The ACNN comprises convolutional and max-pooling layers, which capture local spatial information, together with attention mechanisms that focus on the most informative features in the convolutional feature maps. Dot-product attention allows the ACNN to dynamically prioritise different spatial regions, improving feature representation and diagnostic accuracy.
The SDLC’s SBLSTM (Stacked Bi-LSTM) component complements the ACNN by capturing temporal dependencies and sequential patterns inherent in sequential data such as time-series clinical measurements or longitudinal patient records. Bi-LSTM networks can learn from past and future information, making them well-suited for modelling sequential data with long-range dependencies. By leveraging bidirectional processing, the Bi-LSTM can effectively capture complex temporal dynamics and subtle patterns in multi-modal data, improving diagnostic performance.
2.4. Attention-Based CNN Model (ACNN)
The model comprises multiple layers from the input, beginning with convolutional layers followed by max-pooling layers, which extract and down-sample features. Specifically, we use three convolutional layers, each followed by a max-pooling layer, gradually increasing the number of filters to capture progressively more complex patterns. The kernel size of the convolutional layers is set to (3, 3) to capture local spatial information while maintaining computational efficiency. In addition, two attention mechanisms are integrated to focus selectively on relevant features.
Figure 3 depicts the ACNN model’s architecture. By applying convolutional filters, the ACNN model can identify complex patterns in gene expression data that are indicative of SLE, thereby improving the accuracy of the diagnostic process.
The first attention mechanism computes attention weights via a dot product between the feature maps and the average-pooled feature vector. These attention weights are applied to the feature maps to generate a weighted sum that highlights crucial spatial information. After the output passes through additional convolutional and max-pooling layers, a second attention mechanism is applied analogously to refine the feature representation further. Finally, the output of the last max-pooling layer is flattened and connected to fully connected layers for classification, with the final layer producing class probabilities through a softmax activation function. This attention-based CNN design combines spatial and attention-based information to enhance feature representation and improve classification performance.
The model's multi-layered design, shown in Table 2, has the first layer serving as an input that takes images with three colour channels. Using a (3, 3) kernel size and ReLU activation function, three convolutional layers (conv2d) are added after the input layer: one with 64 filters, another with 32 filters, and the last with 128 filters. The convolution operation in discrete form can be expressed as

$$(f * g)[n] = \sum_{m} f[m]\, g[n - m]$$

where $(f * g)[n]$ represents the result of the convolution of functions $f$ and $g$ at position $n$, and $f[m]$ and $g[n - m]$ represent the values of the functions $f$ and $g$ at positions $m$ and $n - m$, respectively. The sum is taken over all possible values of $m$, which typically depends on the support of the functions involved. After every convolutional layer, there is a max-pooling layer (max_pooling2d) with a (2, 2) pool size that cuts the feature maps' spatial dimensions in half. The max-pooling operation with a pooling size of $p \times q$ in a 2D setting can be expressed as

$$Y[i, j, c] = \max_{0 \le m < p,\; 0 \le n < q} X[i \cdot p + m,\; j \cdot q + n,\; c]$$
where $Y[i, j, c]$ represents the output value at position $(i, j)$ in the pooled feature map, $X$ denotes the input feature map, $p$ and $q$ denote the pooling size in the height and width dimensions, respectively, $i$ and $j$ iterate over the height and width dimensions of the output feature map, and $c$ represents the channel dimension. The max operation is applied over the $p \times q$ region of the input feature map associated with position $(i, j)$, and it is applied independently to each channel of the input feature map.
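To make the pooling arithmetic concrete, the following minimal NumPy sketch (with arbitrary values) applies a (2, 2) max pool to a 4 × 4 single-channel feature map; each output element is the maximum of one non-overlapping 2 × 2 window, halving both spatial dimensions.

```python
import numpy as np

# A 4x4 single-channel feature map (arbitrary values).
X = np.array([[1, 3, 2, 0],
              [4, 6, 5, 1],
              [7, 2, 8, 3],
              [0, 9, 4, 4]], dtype=float)

p = q = 2  # pooling window size
H, W = X.shape
Y = np.zeros((H // p, W // q))
for i in range(H // p):
    for j in range(W // q):
        # Max over the (p x q) window anchored at (i*p, j*q).
        Y[i, j] = X[i*p:(i+1)*p, j*q:(j+1)*q].max()

print(Y)  # [[6. 5.]
          #  [9. 8.]]
```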
After the third convolutional layer, we use two dot-product attention layers (dot_product_attention) to calculate attention scores between the feature maps. The dot-product attention mechanism calculates the attention weights from the query and key vectors and then applies a softmax function to obtain the final weights. Given a collection of query vectors $Q$, key vectors $K$, and value vectors $V$, the attention weights $\alpha$ are computed as

$$\alpha = \mathrm{softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d_k}}\right)$$

where $\cdot$ denotes the dot product operation, $Q \cdot K^{T}$ computes the dot product between the query vectors $Q$ and the transposed key vectors $K$, and $d_k$ is the dimensionality of the query and key vectors. The softmax function normalises the attention weights $\alpha$ so that they sum to one and represent the importance of each key vector relative to the query vector. Once the attention weights $\alpha$ have been calculated, a weighted sum of the value vectors $V$ is computed:

$$\mathrm{Attention}(Q, K, V) = \alpha V$$
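This computation can be sketched in a few lines of NumPy. The sketch below is a stand-alone illustration of the equations above, not the exact dot_product_attention layer used in the model; the toy shapes are arbitrary.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention as in the equations above."""
    d_k = Q.shape[-1]
    # Similarity scores between queries and (transposed) keys.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension so each row of alpha sums to one.
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors.
    return alpha @ V, alpha

# Toy shapes: 4 queries and 6 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
attended, alpha = dot_product_attention(Q, K, V)
print(attended.shape, alpha.sum(axis=-1))  # (4, 8) [1. 1. 1. 1.]
```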
Values linked to key vectors with larger attention weights have a greater influence on the final attended output, which is the result of this weighted sum. Many attention-based models, such as transformer architectures, use the dot-product attention mechanism because it effectively captures correlations between query and key vectors. Subsequently, two further convolutional layers are used, one with 256 filters and the other with 512 filters, each followed by a max-pooling layer (max_pooling2d_3 and max_pooling2d_4). The feature maps are then aggregated globally through a global average pooling layer (global_average_pooling2d). A 1280-dimensional vector is created by merging (concatenate) the outputs of the attention layers and the global average pooling layer.
Average pooling uses the average of the values in each pooling window. Given an input feature map $X$ of size $H \times W \times C$ (height $H$, width $W$, and number of channels $C$) and a pooling window of size $p \times q$, the average pooling operation produces an output feature map $Y$ of size $\frac{H}{p} \times \frac{W}{q} \times C$. For each channel $c$ of the input feature map, the value of each element $Y[i, j, c]$ in the output feature map is computed as the average of the values within the corresponding pooling window in the input feature map:

$$Y[i, j, c] = \frac{1}{p \cdot q} \sum_{m=0}^{p-1} \sum_{n=0}^{q-1} X[i \cdot p + m,\; j \cdot q + n,\; c]$$
where $Y[i, j, c]$ is the value of the element in the output feature map at position $(i, j)$ and channel $c$, $X[i \cdot p + m,\; j \cdot q + n,\; c]$ is the corresponding value in the input feature map, and $p \times q$ is the size of the pooling window. Average pooling reduces the spatial dimensions of the input feature map while preserving the number of channels, yielding down-sampled feature maps with reduced spatial resolution. The combined 1280-dimensional vector is then passed through two fully connected dense layers of 512 units each, with ReLU activation. The Rectified Linear Unit (ReLU) is a simple and popular non-linear activation function for neural networks, expressed mathematically as

$$\mathrm{ReLU}(x) = \max(0, x)$$
where $x$ is the input to the ReLU function and $\mathrm{ReLU}(x)$ is its output. The function returns its input if it is greater than zero and returns zero otherwise. Geometrically, this corresponds to a piecewise linear function with unit slope for positive inputs and zero slope for negative inputs. The ReLU activation introduces non-linearity into the network, enabling the model to learn intricate patterns and correlations in the data. Its effectiveness in mitigating the vanishing gradient problem during training, together with its simplicity, makes it a popular choice for many neural network topologies. Finally, a dense layer (dense_1) with 2 units and a softmax activation function generates the output probabilities for the two classes. In total, the model has 2,220,474 trainable parameters.
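A Keras sketch of an architecture along these lines is given below. It is an approximation assembled from the description above, not the exact Table 2 configuration: the input resolution, the precise attention wiring, and therefore the exact concatenated dimensionality (1280 in the paper) are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def attention_branch(feature_maps):
    # Dot-product attention between each spatial position (keys/values)
    # and the global average-pooled feature vector (query), per Section 2.4.
    c = feature_maps.shape[-1]
    keys = layers.Reshape((-1, c))(feature_maps)                # (batch, H*W, C)
    query = layers.Reshape((1, c))(
        layers.GlobalAveragePooling2D()(feature_maps))          # (batch, 1, C)
    attended = layers.Attention()([query, keys])                # (batch, 1, C)
    return layers.Flatten()(attended)                           # (batch, C)

def build_acnn(input_shape=(64, 64, 3), n_classes=2):
    inp = layers.Input(shape=input_shape)
    x = inp
    # Three (3, 3) conv blocks (64, 32, 128 filters), each followed by
    # (2, 2) max pooling, as described in the text.
    for filters in (64, 32, 128):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    att1 = attention_branch(x)      # first attention output
    # Two deeper conv blocks with 256 and 512 filters.
    for filters in (256, 512):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    att2 = attention_branch(x)      # second attention output
    gap = layers.GlobalAveragePooling2D()(x)
    merged = layers.Concatenate()([att1, att2, gap])
    # Two fully connected layers of 512 units, then the softmax output.
    for _ in range(2):
        merged = layers.Dense(512, activation="relu")(merged)
    out = layers.Dense(n_classes, activation="softmax")(merged)
    return Model(inp, out)
```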
2.5. Stacked Bi-LSTM Architecture (SBLSTM)
Accurate diagnosis of SLE depends on identifying sequential patterns and temporal relationships in the clinical and laboratory data, which the SBi-LSTM model makes possible. An extension of the original LSTM design, the Bi-Directional LSTM (Bi-LSTM) architecture allows the model to capture both past and future context around a given event. Whereas a unidirectional LSTM processes the input sequence in a single direction, the BiLSTM processes the sequence in both directions simultaneously. This bidirectional processing lets the BiLSTM grasp context from both ends of the sequence, making it well suited to tasks that require a thorough understanding of the input data. The forward LSTM analyses the input sequence from past to future, starting at the first time step and finishing at the last, iteratively computing hidden states and cell states as it proceeds.
A variant of the forward model, the backward LSTM reads the inputs in reverse, starting at the most recent time step and working backwards, which allows information from later in the sequence to inform the interpretation of the current step. Because it examines the sequence from the end to the beginning, the backward LSTM may pick up different patterns and dependencies than the forward LSTM; its hidden states and cell states are computed in the same manner. At each time step, the hidden states of the forward and backward LSTMs are concatenated, and this concatenation forms the BiLSTM's final hidden state, which is fed into the output layer for prediction or further processing. The Stacked BiLSTM architecture is illustrated in Figure 4. By combining context from both past and future time steps, which is essential for many tasks, the Bi-Directional LSTM improves the SBLSTM's ability to capture long-range dependencies and detect complex patterns in the incoming data.
For the SBLSTM, the equations for the forward and backward passes at a given time step $t$ are as follows.

The forward LSTM equations are:

Input gate: $\overrightarrow{i}_t = \sigma(\overrightarrow{W}_i x_t + \overrightarrow{U}_i \overrightarrow{h}_{t-1} + \overrightarrow{b}_i)$

Forget gate: $\overrightarrow{f}_t = \sigma(\overrightarrow{W}_f x_t + \overrightarrow{U}_f \overrightarrow{h}_{t-1} + \overrightarrow{b}_f)$

Candidate cell state: $\widetilde{C}_t = \tanh(\overrightarrow{W}_c x_t + \overrightarrow{U}_c \overrightarrow{h}_{t-1} + \overrightarrow{b}_c)$

Cell state: $\overrightarrow{C}_t = \overrightarrow{f}_t \odot \overrightarrow{C}_{t-1} + \overrightarrow{i}_t \odot \widetilde{C}_t$

Output gate: $\overrightarrow{o}_t = \sigma(\overrightarrow{W}_o x_t + \overrightarrow{U}_o \overrightarrow{h}_{t-1} + \overrightarrow{b}_o)$

Hidden state: $\overrightarrow{h}_t = \overrightarrow{o}_t \odot \tanh(\overrightarrow{C}_t)$

where $x_t$ is the input at time step $t$, $\overrightarrow{h}_{t-1}$ is the hidden state of the previous time step, $\overrightarrow{C}_{t-1}$ is the cell state of the previous time step, $\sigma$ is the sigmoid activation function, and $\tanh$ is the hyperbolic tangent activation function. $\overrightarrow{W}_i$, $\overrightarrow{W}_f$, $\overrightarrow{W}_c$, $\overrightarrow{W}_o$, $\overrightarrow{U}_i$, $\overrightarrow{U}_f$, $\overrightarrow{U}_c$, and $\overrightarrow{U}_o$ are the weight matrices, and $\overrightarrow{b}_i$, $\overrightarrow{b}_f$, $\overrightarrow{b}_c$, and $\overrightarrow{b}_o$ are the bias vectors. While both LSTMs use comparable equations, the weight matrices and bias vectors of the forward and backward versions are distinct. The backward LSTM equations are:
Input gate: $\overleftarrow{i}_t = \sigma(\overleftarrow{W}_i x_t + \overleftarrow{U}_i \overleftarrow{h}_{t+1} + \overleftarrow{b}_i)$

Forget gate: $\overleftarrow{f}_t = \sigma(\overleftarrow{W}_f x_t + \overleftarrow{U}_f \overleftarrow{h}_{t+1} + \overleftarrow{b}_f)$

Candidate cell state: $\widetilde{C}'_t = \tanh(\overleftarrow{W}_c x_t + \overleftarrow{U}_c \overleftarrow{h}_{t+1} + \overleftarrow{b}_c)$

Cell state: $\overleftarrow{C}_t = \overleftarrow{f}_t \odot \overleftarrow{C}_{t+1} + \overleftarrow{i}_t \odot \widetilde{C}'_t$

Output gate: $\overleftarrow{o}_t = \sigma(\overleftarrow{W}_o x_t + \overleftarrow{U}_o \overleftarrow{h}_{t+1} + \overleftarrow{b}_o)$

Hidden state: $\overleftarrow{h}_t = \overleftarrow{o}_t \odot \tanh(\overleftarrow{C}_t)$

where $x_t$ is the input at time step $t$, $\overleftarrow{h}_{t+1}$ is the hidden state of the backward LSTM from the next time step, and $\overleftarrow{C}_{t+1}$ is the cell state of the backward LSTM from the next time step. $\sigma$ is the sigmoid activation function and $\tanh$ is the hyperbolic tangent activation function. $\overleftarrow{W}_i$, $\overleftarrow{W}_f$, $\overleftarrow{W}_c$, $\overleftarrow{W}_o$, $\overleftarrow{U}_i$, $\overleftarrow{U}_f$, $\overleftarrow{U}_c$, and $\overleftarrow{U}_o$ are the weight matrices for the gates, and $\overleftarrow{b}_i$, $\overleftarrow{b}_f$, $\overleftarrow{b}_c$, and $\overleftarrow{b}_o$ are the bias vectors for the backward input gate, forget gate, candidate cell state, and output gate, respectively. The final hidden state at each time step is obtained by concatenating the hidden states from the forward and backward passes:

$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$
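In Keras, both the stacking and the forward/backward concatenation are handled by the Bidirectional wrapper. The sketch below is illustrative only: the layer sizes, dropout rates, and input shape are placeholders rather than the tuned values from Table 3.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_sblstm(timesteps=24, n_features=16, n_classes=2):
    inp = layers.Input(shape=(timesteps, n_features))
    # First Bi-LSTM layer returns the full sequence so a second
    # bidirectional layer can be stacked on top of it.
    x = layers.Bidirectional(
        layers.LSTM(64, return_sequences=True, dropout=0.2))(inp)
    # Second Bi-LSTM layer: the final forward and backward hidden states
    # are concatenated, i.e. h_t = [h_forward ; h_backward].
    x = layers.Bidirectional(layers.LSTM(32, dropout=0.2))(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)
```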
The hyperparameter tuning outcomes for both the attention-based CNN and the Stacked Bi-LSTM models are shown in
Table 3. For the attention-based CNN, we experimented with several configurations of convolutional layers, filter numbers and sizes, and dropout regularisation inside the attention mechanism.
The parameters used for the models in this study were determined through hyperparameter tuning. We employed a systematic approach to optimise the performance of our models by adjusting key hyperparameters. This process involved exploring various combinations of hyperparameters and identifying the ones that produced the most favourable outcomes according to predetermined evaluation metrics. Specifically, we used grid search and cross-validation techniques to explore the hyperparameter space efficiently. Early stopping was also implemented to prevent overfitting by halting the training process when the model’s performance on the validation set ceased to improve. Detailed information about the hyperparameters and their optimal values is provided in
Table 3.
We also fine-tuned training performance by adjusting the learning rate and batch size. For the Stacked Bi-LSTM model, tuning involved the dropout rates for both regular and recurrent connections, the number of units in each layer, and the total number of Bi-LSTM layers. As with the attention-based CNN, we improved performance by optimising the learning rate, batch size, and related parameters. These hyperparameters were selected based on empirical testing and domain knowledge, aiming for a balance between model complexity and generalisation capability. The goal was to make both models as accurate and as fast to converge as possible so that they could be used effectively for diagnosing SLE.
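A minimal version of this tuning loop is sketched below, combining an exhaustive grid over two hyperparameters with early stopping. `X_train`, `y_train`, `X_val`, and `y_val` are placeholders for the prepared data, `build_acnn` refers to the builder sketched in Section 2.4, and the grid values are illustrative rather than those of Table 3.

```python
import itertools
import tensorflow as tf

# Illustrative grid; the tuned search space and optimal values are in Table 3.
param_grid = {"learning_rate": [1e-3, 1e-4], "batch_size": [16, 32]}

# Stop training when validation loss stops improving, keeping the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

best_params, best_loss = None, float("inf")
for lr, bs in itertools.product(param_grid["learning_rate"],
                                param_grid["batch_size"]):
    model = build_acnn()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                        batch_size=bs, epochs=100,
                        callbacks=[early_stop], verbose=0)
    val_loss = min(history.history["val_loss"])
    if val_loss < best_loss:
        best_params, best_loss = (lr, bs), val_loss
```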
2.6. Meta Classifiers
In this section, we present a meta-classifier that integrates the predictions of the attention-based CNN and Stacked Bi-LSTM models using a Voting Classifier. The Voting Classifier aggregates each model's predictions and produces a final decision by majority vote or weighted average. Let $\hat{y}_1$ denote the prediction of the attention-based Convolutional Neural Network and $\hat{y}_2$ the prediction of the Stacked Bi-LSTM; these predictions can be represented as class labels or as class probabilities.
When using hard voting, the Voting Classifier considers the predictions from all base classifiers and chooses the class label with the most votes:

$$\hat{y} = \operatorname{mode}\left(\hat{y}_1, \hat{y}_2\right)$$

where $\hat{y}_1$ and $\hat{y}_2$ are the predicted class labels from the attention-based CNN and the Stacked Bi-LSTM model, respectively. For soft voting, the Voting Classifier instead averages the class probabilities predicted by all base classifiers and chooses the class with the highest average probability:

$$\hat{y} = \arg\max_{c \in \{1, \dots, N\}} \frac{1}{2}\left(p_1(c) + p_2(c)\right)$$

where $p_1(c)$ and $p_2(c)$ are the predicted probabilities of class $c$ from the attention-based CNN and the Stacked Bi-LSTM models, respectively, and $N$ is the total number of classes. In both settings, the Voting Classifier improves SLE diagnostic performance by combining the base classifiers' predictions into a final decision that draws on the knowledge of several models, improving the overall accuracy and reliability of the diagnostic process.
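A compact sketch of both voting rules is given below. The (n_models, …) array layout and the use of scipy's mode are implustration choices for this sketch, not taken from the paper; the commented usage lines assume `acnn` and `sblstm` are the two trained base models.

```python
import numpy as np
from scipy import stats

def hard_vote(labels):
    # labels: (n_models, n_samples) array of predicted class labels.
    # Returns the per-sample majority label (the mode across models).
    return stats.mode(np.asarray(labels), axis=0, keepdims=False).mode

def soft_vote(probabilities):
    # probabilities: (n_models, n_samples, n_classes) predicted probabilities.
    # Average across models, then take the most probable class per sample.
    return np.argmax(np.mean(np.asarray(probabilities), axis=0), axis=-1)

# Usage with the two base models (placeholders for trained models):
# p1, p2 = acnn.predict(X_test), sblstm.predict(X_test)
# y_hard = hard_vote([p1.argmax(-1), p2.argmax(-1)])
# y_soft = soft_vote([p1, p2])
```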
Implementing the SDLC model in clinical settings requires a series of important steps. Firstly, the model can be implemented as a software tool within hospital information systems, necessitating robust infrastructure to handle extensive datasets and conduct real-time analysis. It is important for healthcare professionals, such as physicians and laboratory technicians, to undergo training in the use of the SDLC tool. This training should include instructions on data input procedures, understanding the model’s outputs, and effectively incorporating the results into clinical decision-making. Integrating with existing Electronic Health Records (EHRs) systems can greatly enhance workflow efficiency by automatically retrieving pertinent patient data for analysis. Prior to being widely implemented, the SDLC model needs to go through extensive validation in clinical trials to prove its effectiveness and safety. It is crucial to obtain regulatory approvals from organisations like the FDA or EMA to ensure compliance with healthcare standards. Regular monitoring and updates using the latest data and advancements in the field will ensure that the model remains accurate and reliable.