1. Introduction
Coronary Heart Disease (CHD) is the most common cause of death worldwide [1], mainly characterized by a partial narrowing of the coronary artery due to the formation of an adipose plaque [2]. This condition, also called coronary stenosis, reduces the supply of oxygenated blood reaching the heart muscle, ultimately leading to a heart attack [3]. Generally, manual stenosis detection requires exhaustive visual inspection of coronary images, and its efficacy can be degraded by differing clinical standards and levels of expertise among physicians. For this reason, Computer-Aided Diagnosis (CAD) supports the medical expert and tends to reduce the diagnostic workload for stenosis detection.
Although various coronary imaging techniques exist, such as ultrasound, magnetic resonance, and computed tomography, X-ray coronary angiography (XCA) remains the gold standard for CHD diagnosis [4]. Furthermore, physicians prefer the XCA screening test as a simultaneous coronary artery bypass surgery renders a reliable solution [5].
Moreover, the XCA screening test obtains high-resolution images of the main coronary arteries and their branches [6]. However, automatic stenosis detection is not easy due to the specific characteristics of XCA images, mainly background noise, the presence of a coronary stent, non-coronary vascular structures (i.e., ribs), and multiple superposed branching points [7,8,9], as shown in Figure 1.
In the last decade, convolutional neural networks (CNNs) have achieved outstanding performance gains in classification and segmentation tasks in the medical image domain compared with traditional machine learning (ML)-based methods [10,11]. The core strength of a CNN is its capability to extract, select, and classify features jointly during the optimization step, while in ML methods each of these steps is conducted independently. Different methods have been introduced to improve CNN capabilities, such as attention mechanisms that adaptively recalibrate the intermediate feature maps by weighting their inter-channel and inter-spatial relationships; however, this increases the number of parameters of the network.
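The channel recalibration performed by an attention module can be illustrated with a minimal sketch in the style of a Squeeze-and-Excitation block. It uses plain Python lists rather than a deep learning framework, omits biases, and the toy weights and feature maps are hypothetical values chosen only for illustration; it is not the implementation used in this work.

```python
import math


def se_recalibrate(feature_maps, w_reduce, w_expand):
    """Channel attention sketch (biases omitted).

    feature_maps: list of C channels, each an H x W nested list.
    w_reduce: (C // r) x C weights of the bottleneck layer.
    w_expand: C x (C // r) weights of the expansion layer.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w_reduce]
    # Excitation, step 2: expansion layer followed by a sigmoid gate in (0, 1).
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Recalibration: every channel is multiplied by its own gate value.
    recalibrated = [
        [[s * value for value in row] for row in ch]
        for s, ch in zip(scales, feature_maps)
    ]
    return recalibrated, scales


# Hypothetical toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2.
feature_maps = [[[1.0, 3.0], [1.0, 3.0]], [[2.0, 2.0], [2.0, 2.0]]]
w_reduce = [[0.5, 0.5]]
w_expand = [[1.0], [-1.0]]
recalibrated, scales = se_recalibrate(feature_maps, w_reduce, w_expand)
print(scales)
```

The two fully connected layers are the source of the extra parameters mentioned above: their size grows with the number of channels and shrinks as the reduction ratio grows.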
This paper proposes a Lightweight Residual Squeeze-and-Excitation Network (LRSE-Net) for stenosis detection. The proposed LRSE-Net model relies on Depthwise Separable Convolutions (DSC) [12], which have been shown to learn rich features efficiently with a reduced parameter set. Moreover, individual reduction ratios for the attention modules improve the baseline architecture further.
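The parameter saving of a depthwise separable convolution can be checked with a short count. The sketch below compares the number of weights in a standard convolution with that of its depthwise separable counterpart; the layer size (3 x 3 kernel, 64 input and 128 output channels) is a hypothetical example, not a layer taken from LRSE-Net, and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dsc_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise separable convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels.
standard = conv_params(64, 128, 3)
separable = dsc_params(64, 128, 3)
print(standard, separable, round(standard / separable, 2))  # 73728 8768 8.41
```

For a fixed kernel size, the ratio approaches k * k as the number of output channels grows, which is why replacing 3 x 3 convolutions gives close to a ninefold reduction per layer.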
2. Related Work
Machine learning techniques have been proposed to automatically detect stenosis in XCA images [13,14,15]. These studies first extract discriminative features based on texture and shape information. Then, a feature selection process is performed to choose the most suitable features to feed a classifier. Finally, different classifiers, such as Naive Bayes and Support Vector Machine, accomplish stenosis detection. However, features extracted in a hand-crafted manner limit the effectiveness of feature selection and, consequently, the classification performance.
Recently, deep learning methods have been able to tackle feature extraction, selection, and classification within the optimization procedure in an end-to-end manner, showing outstanding performance compared to methods based on hand-crafted features. Wu et al. [16] proposed a deep learning framework consisting of two stages. First, candidate frames are selected from the full raw XCA based on the segmentation results produced by a U-Net [17]. Subsequently, an object detection network employing a VGG (Visual Geometry Group) network [18] as its backbone classifies the stenosis regions. Following the same idea, Pang et al. [19] detected stenotic regions by including prior coronary artery displacement information. They used a Residual Network (ResNet) [20] as the backbone of the object detection network. Later, Danilov et al. [21] evaluated different object detection network configurations, including a Single Shot multi-box Detector (SSD) [22], Faster Region-Based Convolutional Neural Networks (Faster-RCNN) [23], and Region-based Fully Convolutional Networks (R-FCN) [24]. In their networks, distinct backbone networks were employed, such as MobileNet-v2 [25], ResNet (50, 101) [20], and Inception-v4 [26].
However, the previous methods require the whole angiographic test and assume that a single stenosis region is present in the image. Another approach to solving this task is to use a patch-based classification network. In this way, the full-size XCA image is divided into n patches, each of which is classified as a positive or negative stenosis case. In this context, Antczak and Liberadzki [27] employed a VGG-based model of only five convolutional layers to classify XCA image patches into the stenosis and no stenosis categories. To improve the results, the network was pre-trained on synthetic data produced by a Bezier-based generative model. Further, Ovalle-Magallanes et al. [28] proposed a novel hierarchical Bezier-based generative model to generate more realistic synthetic XCA patches. The dataset was evaluated on different ResNet configurations (18, 34, 50), including the Convolutional Block Attention Module (CBAM) [29]. Later, Ovalle-Magallanes et al. [30] performed an exhaustive evaluation of the impact of three attention mechanisms (Squeeze-and-Excitation [31], Convolutional Block Attention Module [29], and Efficient Channel Attention [32]). They demonstrated that a Trimmed ResNet18 with a Squeeze-and-Excitation attention module achieved the best trade-off between classification performance and computational cost. The methods mentioned above employed only a subset of the negative samples of the dataset released by Antczak and Liberadzki [33] to create a balanced training and test dataset; thus, only 125 negative and 125 positive cases were selected. This can lead to a biased classification when a large dataset is tested.
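The patch-based formulation can be sketched as a sliding window over the image. The function below is only an illustration on a toy grid: the patch size and stride are hypothetical, and the cited works may tile or sample patches differently.

```python
def extract_patches(image, patch_size, stride):
    """Slide a square window over a 2D image (nested lists).

    Returns a list of (top, left, patch) tuples, where patch is a
    patch_size x patch_size nested list. Border pixels that do not
    fill a complete window are skipped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append((top, left, patch))
    return patches


# Toy 6 x 6 "image" whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = extract_patches(image, patch_size=3, stride=3)
print(len(patches))   # 4 non-overlapping patches
print(patches[0][2])  # [[0, 1, 2], [6, 7, 8], [12, 13, 14]]
```

Each patch would then be passed to the classifier independently, and the (top, left) offsets allow a positive prediction to be mapped back to its location in the full image.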
As discussed in previous paragraphs, different deep learning approaches have been used to develop strategies to detect stenosis in XCA images, through either object-based or patch-based models. These methods have shown notable performance; nevertheless, object-based approaches are limited to detecting a single stenosis case in the whole image. Meanwhile, patch-based methodologies are restricted to detecting small stenotic regions (i.e., based on the size of the patch). Moreover, both approaches take as their backbone network architectures designed for the ImageNet dataset, changing only the top of the model. Hence, redundant kernels may exist, limiting the classification performance.
This study presents a Lightweight Residual Squeeze-and-Excitation Network (LRSE-Net) for patch-based stenosis classification, built on two compression methods that reduce the model size: (1) deletion of redundant kernels and (2) tensor decomposition by Depthwise Separable Convolutions. Additionally, independent reduction ratios are used for each attention module to improve feature extraction and generalization. The proposed LRSE-Net is up to
smaller (in number of parameters) than previous models employed for this task. The network’s performance is evaluated employing two public datasets: (1) the full dataset from Antczak and Liberadzki [33], consisting of 1519 images with 125 positive cases of stenosis and the remainder as negative; and (2) a patch-based version of the dataset released by Danilov et al. [34], which includes 6769 positive patches and 26,699 negative patches. The main contributions of this research are as follows:
An LRSE-Net model is proposed by replacing vanilla convolutions with Depthwise Separable Convolutions, drastically reducing the number of parameters;
Independent reduction ratios for each attention module are selected to enhance the network performance;
Redundant kernels in the convolutional layers are removed to obtain a smaller model;
A data augmentation policy is introduced to mitigate the imbalance of the dataset;
A new patch-based dataset is released to validate the model performance.
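The class imbalance of the two datasets described above can be quantified with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the helper name `imbalance_summary` is introduced here for illustration.

```python
def imbalance_summary(positive: int, negative: int) -> dict:
    """Summarize the class balance of a binary dataset."""
    total = positive + negative
    return {
        "total": total,
        "positive_fraction": positive / total,
        "negatives_per_positive": negative / positive,
        # Accuracy of a trivial classifier that always predicts "no stenosis".
        "all_negative_accuracy": negative / total,
    }


# Counts as stated in the text: 1519 images with 125 positives, and
# 6769 positive versus 26,699 negative patches.
first = imbalance_summary(positive=125, negative=1519 - 125)
second = imbalance_summary(positive=6769, negative=26699)
for name, summary in (("first", first), ("second", second)):
    print(name, summary["total"], round(summary["all_negative_accuracy"], 3))
```

On the first dataset, always predicting the negative class already gives an accuracy of about 0.92, which is why Sensitivity, Precision, and the F1-score are more informative than Accuracy alone in this setting.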
4. Results
The proposed LRSE-Net model was evaluated through multiple comparisons with different architectures employed for stenosis detection. The performance analysis was conducted using the P-ADSD and A-DSSS datasets described above. First, the evaluation metrics are defined. Second, the implementation details for training the model are explained. Finally, numerical results are shown.
4.1. Evaluation Metrics
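For the evaluation of the proposed approach, five metrics are considered: Accuracy, Sensitivity, Specificity, Precision, and F1-score, which are defined as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \]

\[ \text{Sensitivity} = \frac{TP}{TP + FN}, \]

\[ \text{Specificity} = \frac{TN}{TN + FP}, \]

\[ \text{Precision} = \frac{TP}{TP + FP}, \]

\[ F_1\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}, \]

where TP refers to the number of true positives, TN is the number of true negatives, FP denotes the number of false positives, and FN represents the number of false negatives.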
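As an illustration, the five metrics can be computed directly from the confusion-matrix counts. The counts in the example are hypothetical and do not correspond to any result reported in this paper; the function does not guard against empty classes (division by zero).

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced test set.
metrics = classification_metrics(tp=20, tn=270, fp=8, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, Accuracy and Specificity are both above 0.95 while Precision is only about 0.71, which shows how the majority class can dominate Accuracy.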
4.2. Implementation Details
The training process employs the Stochastic Gradient Descent with Momentum (SGDM) optimizer [38] with a learning rate of
and a momentum of
. The model was trained with a batch size of 32 for 100 epochs, minimizing the cross-entropy loss. The model was implemented using the PyTorch framework, and the experiments ran on Google’s cloud servers, including a Tesla P4 GPU with 2560 CUDA cores and 8 GB of RAM.
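The SGDM update can be sketched in a few lines. The formulation below follows the common convention in which the velocity accumulates the gradients and the parameters move against it; the learning rate and momentum used in the example are hypothetical placeholders, not the values used in the experiments, and the toy objective is a one-dimensional quadratic rather than a network loss.

```python
def sgdm_step(params, grads, velocity, lr, momentum):
    """One SGD-with-momentum update.

    v <- momentum * v + g
    p <- p - lr * v
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
# lr = 0.1 and momentum = 0.9 are hypothetical values for illustration.
w, v = [1.0], [0.0]
for _ in range(200):
    grad = [2.0 * w[0]]
    w, v = sgdm_step(w, grad, v, lr=0.1, momentum=0.9)
print(abs(w[0]) < 1e-3)  # True: the iterate has converged close to the minimum
```

In a framework such as PyTorch the same rule is applied by the built-in optimizer to every trainable tensor after each mini-batch.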
To fairly compare the proposed method with other models, all the experiments followed the same hyperparameters and were initialized using the same seed. Moreover, a k-fold cross-validation (5-fold) was set following an 80:20 ratio from the validation subset. The validation step allows the best weights found during the training process to be saved.
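A seeded 5-fold split, in which each fold holds out 20% of the samples, can be sketched as follows. This is a minimal, non-stratified illustration using only the standard library; the seed value and the sample count are hypothetical, and the released partitions should be used to reproduce the reported results.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation.

    The shuffle is driven by a fixed seed so that every model sees
    exactly the same partitions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


splits = list(k_fold_indices(100, k=5, seed=0))
for train, val in splits:
    print(len(train), len(val))  # 80 20 for every fold
```

With strongly imbalanced data, a stratified split that preserves the class proportions in each fold is usually preferable; it is omitted here to keep the sketch short.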
Table 2 summarizes the dataset partition distribution. Both datasets and their train–validation–test partitions are freely available at:
https://github.com/eovallemagallanes/LRSE-Net (accessed: 30 October 2022).
4.3. Ablation Study
An ablation study over the A-DSSS dataset, demonstrating the impact of the DSC and the SE module, is reported in
Table 3. All configurations were trained from scratch employing the hyperparameters presented in the previous subsection. The comparative analysis evaluates four main groups of configurations: (1) without DSC and SE, (2) without DSC but with SE, (3) with DSC but without SE, and (4) with DSC and SE. For configurations using the SE module, two variants were tested: (1) with default reduction ratios (
) and (2) with independent ratios
. As mentioned before, the TPE algorithm was employed to find the model configuration minimizing the validation loss of the first fold.
Numerical results indicate that incorporating SE attention modules with individual reduction ratios increased Specificity and Precision compared with both the model without attention and the model with default SE ratios, while adding fewer parameters. The exclusive use of DSC showed very competitive results in Accuracy, Sensitivity, and Specificity with respect to the baseline model (with vanilla convolution operations). Still, it drastically reduced the number of parameters by around . The DSC with SE, including default reduction ratios, achieved the best Specificity and Precision. In particular, including DSC and SE with individual reduction ratios yielded the highest Accuracy, Sensitivity, and F1-score and the second-lowest number of parameters, reducing the number of parameters by around compared to the baseline model. Therefore, this last model configuration was selected as the default model for subsequent comparison.
4.4. Stenosis Classification Performance Comparison
The performance of the LRSE-Net was evaluated on two public datasets (see Table 2). All models were trained from scratch with the same hyperparameters to ensure a fair comparison.
For the A-DSSS dataset, the results are shown in
Table 4. It can be seen that the proposed LRSE-Net achieved the best mean Accuracy (
), Sensitivity (
), Precision (
), and F1-score (
). On the other hand, Vanilla ResNet18 achieved the best Specificity (
). Even though LRSE-Net achieved
less in Specificity than Vanilla ResNet18, it attained a gain of
,
,
and
in Accuracy, Sensitivity, Precision, and F1-score, respectively. Compared with the other attention models, Vanilla SE-ResNet18 obtained higher Specificity than the LRSE-Net, around
; however, the LRSE-Net clearly surpassed it in Sensitivity, Precision, and F1-score. The training and validation curves are shown in
Figure 6 and
Figure 7, where it can be seen that the proposed model attained the highest accuracy curves and the lowest loss. The second-best accuracy and validation curves are those of the CBAM-ResNet34. After 50 epochs, all models began to overfit, and the validation losses fluctuated because of the class imbalance within each fold. Note that the validation subset is not augmented. The Trim ResNet18 achieved the most stable validation accuracy curve over the epochs.
The performance employing the P-ADSD dataset is shown in
Table 5. In this case, the proposed model achieved the best mean Accuracy, Sensitivity, Precision, and F1-score with
,
,
, and
, respectively; and the second-best Specificity with
(only
below). Comparing the models with an attention mechanism, the proposed model had a gain in four evaluation metrics; CBAM-ResNet34 obtained the best Specificity, while Trim SE-ResNet performed poorly in Sensitivity (
) and F1-score (
). Their corresponding training and validation curves are shown in
Figure 8 and
Figure 9, confirming that the proposed model attained the lowest validation loss and higher validation accuracy than Trim-ResNet18 and Vanilla SE-ResNet18. The training curves were smoother than the validation curves, and in them the LRSE-Net displayed lower accuracy and greater loss. Nevertheless, this is accompanied by better generalization performance.
Numerical results on both datasets demonstrate the efficacy of the proposed approach and indicate that SE modules with independent reduction ratios can enhance the feature representation, thus learning more discriminative features. Further, the LRSE-Net performed better than the model using the CBAM mechanism, which combines channel and spatial attention.
4.5. Class Activation Maps Comparison
The Gradient-weighted Class Activation Map (GradCAM) [39] retrieves a visual explanation of the most important regions in the image for the model’s decision. Figure 10 illustrates the GradCAM for the test set of the A-DSSS dataset. Highly discriminative regions for stenosis detection are colored in hot tones (red), and less informative regions (i.e., those where the gradient contributes in a minor way) in cold tones (purple). In the model without attention (a) and the model including the CBAM module (d), the GradCAM focused on corner regions more than on blood vessel zones. For instance, the Vanilla ResNet18 showed two false negative cases in the last two test images; the CBAM-ResNet34 had one false positive (third row) and four false negative cases. When the model includes the SE block, as in (b), (c), and (e), the GradCAM pays greater attention to blood vessel regions. The Vanilla SE-ResNet18 (b) produced a false positive case (first test image), and the Trim SE-ResNet18 (c) an extra false negative (sixth column). In particular, the LRSE-Net presented greater attention over the blood vessel, with no false positive or false negative cases.
As can be seen in Figure 11 for the P-ADSD dataset, the GradCAM featured more isolated high-attention regions in all the cases. These regions are located over blood vessel pixels for the Vanilla ResNet18 and the ResNets including the SE block. In addition, for the positive stenosis cases, the CBAM-ResNet34 (d) showed high attention in the background zones of the image.
The test images can include different blood vessel widths, background artifacts, and blood vessel bifurcations that affect the gradient activation regions. However, the GradCAM produced proper attention over the blood vessel for test cases with visible major blood vessels.
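The core GradCAM computation can be sketched without a deep learning framework. Given the activations of the last convolutional layer and the gradients of the class score with respect to them, each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, and a ReLU keeps only positive evidence. The toy tensors below are hypothetical, and the final normalization to [0, 1] is a common display step assumed here rather than taken from this paper.

```python
def grad_cam(activations, gradients):
    """Compute a GradCAM heat map.

    activations, gradients: lists of K feature maps, each an H x W nested list.
    Returns an H x W nested list scaled to [0, 1] when the map is not all zero.
    """
    # Channel weights: global average of the gradients of each feature map.
    weights = [sum(map(sum, g)) / (len(g) * len(g[0])) for g in gradients]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels followed by ReLU.
    cam = [
        [max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
         for j in range(width)]
        for i in range(height)
    ]
    # Normalize for display.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Hypothetical toy example with K = 2 feature maps of size 2 x 2.
activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(activations, gradients))  # [[0.5, 0.0], [0.0, 1.0]]
```

In practice the resulting low-resolution map is upsampled to the input size and overlaid on the image with a color map, which produces the hot and cold tones discussed above.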
5. Discussion
The performance results validate the capability of the proposed method to classify stenosis cases in XCA image patches across datasets of different sizes in which negative cases predominate. Moreover, it was demonstrated that individual selection of the reduction ratios for the SE modules boosts the network performance. As the model goes deeper, the selected ratios become smaller; this suggests that deeper features require an SE module with additional parameters to recalibrate the features. Similarly, the inclusion of DSC and the redundant kernel removal drastically reduced the network’s complexity (in terms of the number of parameters) up to compared with a vanilla ResNet18, concerning a vanilla SE-ResNet18, and smaller than the CBAM-ResNet34.
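The link between a smaller ratio and additional parameters can be verified with a quick count of the two fully connected layers in an SE module. The channel width of 512 and the candidate ratios below are hypothetical values for illustration; they are not the ratios selected for LRSE-Net.

```python
def se_params(channels: int, reduction: int, bias: bool = True) -> int:
    """Parameters of an SE module: C -> C // r -> C fully connected layers."""
    hidden = max(1, channels // reduction)
    weights = 2 * channels * hidden
    biases = (hidden + channels) if bias else 0
    return weights + biases


# Hypothetical stage with 512 channels and several candidate ratios.
for ratio in (2, 4, 8, 16):
    print(ratio, se_params(512, ratio))
# 2 262912
# 4 131712
# 8 66112
# 16 33312
```

Halving the ratio roughly doubles the size of the module, so assigning small ratios only where they help keeps the overall parameter budget low.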
By visualizing the training and validation curves, it can be seen that the network performance is directly affected by the quality and quantity of the training data. For example, the first dataset (A-DSSS) showed poor performance and rapid overfitting, even when data augmentation was performed. This behavior is not observed with the P-ADSD dataset, where around 33K images are available.
The GradCAM recovered a reasonable visual explanation over blood vessel regions, highlighting discriminative regions in hot tones and those with lower contributions in cold tones. Moreover, it supported the importance of incorporating an attention mechanism to improve both the numerical performance and the explainability of the model.
6. Conclusions
This paper proposed an LRSE-Net to classify stenosis cases from XCA images. The model consists of two main elements, a DSC and an SE module, which yield high classification rates with lower computational requirements in terms of the number of parameters. The proposed model is smaller than Vanilla SE-ResNet18 and smaller than CBAM-ResNet34. The experimental results demonstrate that LRSE-Net consistently outperformed Residual models with or without attention mechanisms. Additionally, the individual selection of reduction ratios for the SE blocks improved the classification performance, with smaller ratios selected as the network goes deeper. In particular, greater boosts were achieved when the dataset was small, with a gain of , , , and in Accuracy, Sensitivity, Precision, and F-score, respectively. Moreover, the LRSE-Net GradCAM maps retrieved a refined region proposal of the stenosis location, which could support the physician’s decision-making process.
Although the recognition rates are high, there is still a need for further improvements, such as evaluating the proposed model as the backbone of an object-based recognition system and detecting stenosis cases from the full XCA test. A future direction of this work concerning model compression may be to analyze other approaches, such as quantization, other low-rank tensor decompositions, and knowledge distillation. Another research direction to address the limited training data could be the generation of artificial data with deep generative models.