1. Introduction
The heart is an essential organ whose main function is to pump blood around the human body. It is the central part of the cardiovascular system, which also comprises the blood vessels that form the circulation [1]. Cardiovascular diseases (CVDs) account for a large share of the worldwide death toll, which highlights the importance of early diagnosis of such diseases. According to the World Health Organization (WHO), CVD is the leading cause of death in the world, taking 17.9 million lives each year [2].
According to the authors in [3], CVD is an illness that affects the heart and the blood vessels. The authors in [4] highlighted that, in their study of worldwide deaths caused by CVDs, almost half of the deaths (48.5%) were associated with coronary heart disease, while strokes accounted for 20.8% and the remainder was attributed to other conditions. This indicates the importance of preventing the progression of coronary heart disease.
In addition, some of the risk factors for CVDs are high blood pressure and high cholesterol. These can lead to a buildup of inflammatory deposits known as plaques in the artery wall, a phenomenon known as atherosclerosis, which limits blood flow to the heart and lowers oxygen supply. Early detection of this condition could therefore help slow the advancement of atherosclerosis as well as heart failure. As shown in Figure 1, plaque buildup is usually seen in the common carotid artery (CCA) and the internal carotid artery (ICA).
One approach to discovering plaques in the arterial wall is to examine the carotid artery, a pair of blood vessels each comprising common, internal, and external segments. Plaques can form in the internal segment of the carotid artery as well as in the common segment. These plaques thicken the vessel walls, and this thickening is quantified as the intima-media thickness (IMT). Thus, the distance between the lumen-intima (LI) and media-adventitia (MA) interfaces can be measured to determine the carotid IMT (cIMT) as a risk marker for early detection of heart disease [5]. According to a review in [6], cIMT measures have been shown to predict CVD events independently of other risk variables; in fact, according to a study published in [7], cIMT is a stronger predictor of stroke than of other vascular disorders.
The carotid IMT test detects the IMT and diagnoses atherosclerosis; it is carried out in clinics using an ultrasound instrument and is mostly performed by doctors. Once the ultrasound image is obtained, the physician delineates the IMT measurements manually. Another option is semi-automatic detection and segmentation, in which a physician locates the area of interest and the artery walls are then segmented automatically. Finally, fully automated systems can detect and calculate the cIMT without any physician intervention. This highlights one challenge: such systems must be accurate in their calculations in order to provide a reasonable evaluation of the IMT measurement. A fully automated model removes the need for physician intervention and thus encourages deployment on portable devices.
Furthermore, detecting and measuring the intima-media complex (IMC) has proven difficult in some cases, since locating the artery walls and determining their boundaries depends on the quality of the B-mode ultrasound images. In addition, ground truth points generated by physicians may contain errors due to inter- and intra-observer variability.
Using deep learning techniques to diagnose such a disease can be beneficial in many ways: models can be deployed on portable devices, helping patients self-diagnose, and they can reduce the load on doctors who would otherwise examine and diagnose every patient, including those at no risk. Many applications of deep learning and machine learning techniques to cIMT segmentation and identification have been reported; however, the accuracy of their cIMT estimation remains debatable.
In this paper, we focus on evaluating the encoder-decoder model for IMC segmentation and on finding the best hyper-parameters for the model. An encoder network compresses the data into a latent representation, which commonly captures the features of the image, and a decoder network decompresses it back into an image. We train and test the encoder-decoder architecture on a dataset from [8] with pre-processing and post-processing techniques. The main aim of this research is to perform segmentation of B-mode ultrasound images using a deep learning encoder-decoder architecture.
In this work, the main purpose is to develop a system that can segment the IMT in the arterial walls using deep learning models; specifically, encoder-decoder models are investigated. The main contributions of this research are summarized as follows:
Provide a comprehensive review of convolutional autoencoder (CAE) applications as well as IMT segmentation applications.
Develop a convolutional autoencoder model for carotid intima-media complex (IMC) segmentation and IMT measurement on B-mode ultrasound images.
Evaluate the effectiveness of CAEs under variations of the hyper-parameters.
Find an optimal architecture for CAEs by comparing the effectiveness of the models against state-of-the-art methods.
In addition, we focus on the main research questions that allow us to evaluate the outcome of our solution: how does the encoder-decoder model improve carotid IMT segmentation, and how is it unique from previous solutions? Finally, we examine whether the encoder-decoder model remains effective with data augmentation on the given dataset with its limited number of images.
The rest of the paper is organized as follows. In Section 2, we highlight recent work on carotid IMT segmentation and classification, including encoder-decoder applications. In Section 3, we propose our solution and present the architecture of the deep learning model along with the data preparation process. In Section 4, we present the experimental setup along with the evaluation metrics and the results of the model. In Section 5, the results are discussed and the main challenges are pointed out. Finally, we conclude and outline future work in Section 6.
3. Proposed Method
Carotid artery detection and segmentation using medical imaging techniques can be an important healthcare solution. The lack of large-scale labeled datasets is one of the challenges faced in this task; in the case of the carotid artery, one dataset was found, with no labeling. Since segmentation of regions of a medical image is very helpful for diagnosis, a pre-processing technique is applied to the original dataset to prepare it for segmentation. The carotid artery is then segmented using the proposed deep-learning-based model. Finally, a post-processing technique based on morphological operations is applied to remove falsely segmented pixels. In this section, each step is described in detail.
3.1. Data Preparation
The dataset used for this paper is that of [8]. It contains 100 carotid IMT B-mode ultrasound images with ground truth points determined by two clinical experts. Loizou et al. [8] note that the longitudinal ultrasound images were taken from 42 female and 58 male symptomatic patients aged between 26 and 95.
The images were obtained from an ATL HDI-3000 scanner (Advanced Technology Laboratories, Seattle, WA, USA) and were logarithmically compressed to produce images with a resolution of 768 × 576 pixels and 256 grey levels. The scanner has a multi-element ultrasound scan head with an operating frequency range of 4–7 MHz, an acoustic aperture of 10 × 8 mm, and a transmission focal range of 0.8–11 cm, with a 64-element, fine-pitch, high-resolution, 38 mm broadband array. Furthermore, the bicubic method was used to resize the digital images to a standard pixel density of 16.66 pixels/mm.
Figure 2 illustrates three sample images from the dataset that we work with in this paper. Given that only the ultrasound image itself is required, the frames in the samples that contained patient information were removed as part of the pre-processing step.
The authors of [8] obtained the IMT measurements from both experts using a number of techniques, applying speckle reduction as well as normalization as pre-processing steps. In this research, we focus only on normalized images; hence, only the IMT measurements for normalized images are used. Table 2 shows the experts' IMT measurements for the 100 normalized images.
The experts' readings were taken at two time points, one at 0 months and the other at 12 months. According to the authors of [8], this was done to test the intra-observer variability of each expert: the experts delineated the carotid walls twice, at different times, for the same image in order to assess observer errors.
3.1.1. Pre-Processing
First of all, we took the raw images and removed the frames that were not of interest. The images were then normalized and processed with the Sobel and Prewitt gradient methods, which are available as built-in functions in MATLAB. We also examined other filters such as the Canny filter; however, since it is an edge detector, it identified the IMT region as small, disconnected circles and therefore did not retrieve the features accurately. We then experimented with gradient images and found that they distinguished the IMT region properly. Both methods produce a gradient magnitude as well as a gradient direction; for this implementation, we stored only the normalized gradient direction of each method. Among the filters tried, the gradient direction images showed the cIMT most clearly. A sketch of this step is given below.
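To make this step concrete, a minimal Python sketch is given below. The paper uses MATLAB's built-in gradient functions, so this OpenCV-based version, including the 3 × 3 kernel sizes and the normalization to [0, 1], is only our assumption of an equivalent, not the exact settings used.

```python
import cv2
import numpy as np

def gradient_direction(image, method="sobel"):
    """Normalized gradient direction of a grayscale image.

    A rough Python analogue of the MATLAB gradient step used in this
    paper; the 3x3 kernels and the [0, 1] rescaling are assumptions.
    """
    img = image.astype(np.float64)
    if method == "sobel":
        gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
    else:  # Prewitt: no OpenCV built-in, so convolve explicit kernels
        kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], np.float64)
        gx = cv2.filter2D(img, cv2.CV_64F, kx)
        gy = cv2.filter2D(img, cv2.CV_64F, kx.T)
    direction = np.arctan2(gy, gx)             # radians in [-pi, pi]
    return (direction + np.pi) / (2 * np.pi)   # rescale to [0, 1]
```

Both the Sobel and Prewitt direction images can then be produced from one pre-cropped ultrasound frame by calling this function with each method name.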
As for the ground truth points, we converted the points to binary images in order to use them as labels in the deep learning model and to compare them with the predicted images in the testing phase. In Figure 3, the original image is shown along with the gradient images and the produced ground truth image. Additionally, in order to train the model with data augmentation, new ground truth mask images were generated by drawing lines that connect the manual ground truth points given in the dataset, without using a threshold; a sketch of this conversion follows.
3.1.2. Data Augmentation
Data augmentation is mainly used when we have a small dataset and would like to increase the number of images [29]. It provides small operations to rotate, flip, shift, zoom, or translate a given image without changing its content, so the augmented image keeps the features of the original. Moreover, since augmentation changes the geometry of an image, the corresponding binary mask must go through the same transformations.
In this stage, we apply selected augmentation operations, namely rotation, width and height shift, and zoom. Table 3 shows the values used for the augmentation. The augmentation was done using the ImageDataGenerator class from Keras in Python, which was used to augment each image together with its binary mask; a sketch is shown below.
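A minimal sketch of this paired image/mask augmentation is shown below. The arrays and augmentation ranges are illustrative placeholders (the actual ranges are those listed in Table 3); the key idea is that sharing a seed keeps the geometric transforms identical for an image and its mask.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder arrays standing in for the training images and masks;
# the augmentation ranges below are illustrative, not the Table 3 values.
x_train = np.random.rand(80, 192, 256, 1).astype("float32")
y_train = (np.random.rand(80, 192, 256, 1) > 0.5).astype("float32")

aug_args = dict(rotation_range=10, width_shift_range=0.1,
                height_shift_range=0.1, zoom_range=0.1,
                fill_mode="nearest")

image_gen = ImageDataGenerator(**aug_args)
mask_gen = ImageDataGenerator(**aug_args)

# The same seed keeps every geometric transform identical for an image
# and its binary mask, so the labels stay aligned with the inputs.
seed = 42
image_flow = image_gen.flow(x_train, batch_size=8, seed=seed)
mask_flow = mask_gen.flow(y_train, batch_size=8, seed=seed)
train_flow = zip(image_flow, mask_flow)  # yields (images, masks) batches
```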
3.2. Encoder-Decoder Architecture
The introduction of deep learning techniques such as convolutional neural networks improved the image segmentation task in terms of the quality and quantity of segmented parts in an image. Among the well-known architectures that use CNN layers for image segmentation are SegNet [30] and U-Net [31], encoder-decoder-based models that achieved good results in semantic segmentation. Both architectures are capable of binary and multi-class segmentation, where binary segmentation is much easier than multi-class (colored) segmentation. We were inspired by the SegNet architecture in designing the proposed model for segmenting the carotid artery from ultrasound images. The proposed architecture uses two inputs instead of the single input used by U-Net and SegNet; multiple inputs can assist in the extraction of useful information while allowing for multi-feature learning. The proposed model includes two encoders for feature extraction, a fusion layer, and a decoder with upsampling and convolutional layers that produces the final result.
Each encoder is built on a VGG-19 [32] backbone, as in SegNet: a series of (Conv+BN+PReLU) blocks with pooling layers and batch normalization (BN) [33]. The two encoders are merged by concatenating the feature maps of their outputs. Like SegNet and U-Net, the proposed decoder employs blocks of upsampling (unpooling) and convolutional layers (Upsampling+Conv+BN+PReLU). The decoder converts the encoder's feature maps into the final label while taking spatial restoration into account. The same structure used in the encoder is also used in the decoder, with the pooling layers replaced by upsampling layers. The final architecture is shown in
Figure 4. For the loss function, the softmax function is applied to the network output and the loss is computed as the pixel-wise cross-entropy:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{k} y_{i,c}\,\log \hat{y}_{i,c},$$

where $N$ is the number of pixels in the input image, $k$ is the number of classes, and, for a specified pixel $i$, $y_i$ denotes its label and $\hat{y}_i$ the prediction vector.
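A heavily simplified Keras sketch of this two-encoder design is given below. The real model uses VGG-19-depth encoders; the depth, filter counts, input size, and the sigmoid head (the two-class softmax reduces to a sigmoid for a single-channel binary mask) are all our simplifications for illustration.

```python
from tensorflow.keras import layers, Model

def encoder_branch(inp, filters=(64, 128)):
    """One encoder branch: (Conv+BN+PReLU) blocks followed by pooling.

    The paper's encoders follow a VGG-19-style backbone; the depth and
    filter counts here are reduced for illustration.
    """
    x = inp
    for f in filters:
        x = layers.Conv2D(f, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.PReLU(shared_axes=[1, 2])(x)
        x = layers.MaxPooling2D(2)(x)
    return x

def build_dual_encoder_model(input_shape=(192, 256, 1)):
    sobel_in = layers.Input(shape=input_shape, name="sobel_direction")
    prewitt_in = layers.Input(shape=input_shape, name="prewitt_direction")

    # Fusion layer: concatenate the feature maps of the two encoders.
    fused = layers.Concatenate()([encoder_branch(sobel_in),
                                  encoder_branch(prewitt_in)])

    # Decoder: (Upsampling+Conv+BN+PReLU) blocks restore the resolution.
    x = fused
    for f in (128, 64):
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(f, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.PReLU(shared_axes=[1, 2])(x)

    # Single-channel binary mask; the 2-class softmax reduces to a sigmoid.
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model([sobel_in, prewitt_in], out)

model = build_dual_encoder_model()
model.compile(optimizer="adam", loss="binary_crossentropy")
```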
For this paper, we use two inputs, namely the Sobel gradient direction image and the Prewitt gradient direction image; these images went through the pre-processing steps discussed in Section 3.1.1.
The segmentation process used 80% of the final dataset, with each image converted to both its Sobel gradient and Prewitt gradient representations. These images were fed to the encoder-decoder architecture described in Figure 4. During the training phase, we trained the model multiple times in order to tune the hyper-parameters for better accuracy. The final training used 50 epochs with 10 steps per epoch, i.e., 10 freshly augmented batches were drawn in each epoch; a training sketch follows.
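Continuing the earlier sketches, training under these settings might look as follows. Here sobel_flow and prewitt_flow are assumed to be ImageDataGenerator flows built exactly like image_flow in the Section 3.1.2 sketch, one for each gradient input, all created with identical augmentation arguments and the same seed as mask_flow.

```python
def paired_flow(sobel_flow, prewitt_flow, mask_flow):
    """Yield ([sobel, prewitt], mask) batches for the two-input model.

    Assumes the three flows share augmentation arguments and seed,
    so the geometric transforms stay aligned across inputs and labels.
    """
    for s, p, m in zip(sobel_flow, prewitt_flow, mask_flow):
        yield [s, p], m

# 50 epochs, 10 augmented batches per epoch, as described above.
model.fit(paired_flow(sobel_flow, prewitt_flow, mask_flow),
          epochs=50, steps_per_epoch=10)
```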
Furthermore, post-processing techniques were applied, since the final segmented image contained noise that needed to be reduced and specific regions of interest (ROI) needed to be highlighted. We used morphological opening, which removes small noise in the image and can detect disconnected blobs, and morphological closing, which avoids a discontinuous segmentation of the carotid artery IMT. The structuring element for the morphological operations was rectangular, with a different size each time. First, a morphological close with a size of [2, 30] is applied to close the gaps in the discontinuous IMT. Then, a morphological open with a size of [2, 30] is applied to remove small noise from the image. Finally, a morphological close with a size of [3, 30] is applied. A sketch of this sequence is given below.
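The following Python sketch mirrors this close-open-close sequence. MATLAB structuring-element sizes are given as [rows, cols], while OpenCV kernels take (width, height); mapping [2, 30] to (30, 2) is our assumption.

```python
import cv2
import numpy as np

def postprocess(pred):
    """Morphological clean-up mirroring the close-open-close sequence."""
    mask = (pred > 0.5).astype(np.uint8)
    close1 = cv2.getStructuringElement(cv2.MORPH_RECT, (30, 2))
    open1 = cv2.getStructuringElement(cv2.MORPH_RECT, (30, 2))
    close2 = cv2.getStructuringElement(cv2.MORPH_RECT, (30, 3))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, close1)  # bridge IMT gaps
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, open1)    # drop small noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, close2)  # reconnect breaks
    return mask
```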
4. Experimental Results
This section presents the experimental setup and evaluation metrics, followed by the results and their analysis.
4.1. Experimental Setup
In order to implement the architecture and train the model described in Section 3.2, we used the Python programming language, building and training the model on an NVIDIA GeForce GPU. The implementation was done on the Windows 10 operating system, and the environment was managed with Anaconda. Python version 2.7 was used for the training process, and the Keras and TensorFlow software packages were used to build the model and evaluate the results.
4.2. Evaluation Metrics
The evaluation of the deep learning model performance in the testing phase was based on segmentation metrics [34,35]. These metrics are defined as follows:
Precision: the fraction of predicted positive pixels that are truly positive, Precision = TP/(TP + FP).
Recall: also known as sensitivity; the fraction of true positive pixels that are correctly predicted, Recall = TP/(TP + FN).
F1 Measure: the harmonic mean of precision and recall, giving an overall view of the performance of the system.
Sorensen-Dice Coefficient: measures the similarity of two samples and is mainly used to validate image segmentation algorithms; it expresses the percentage of overlap between two images, Dice = 2TP/(2TP + FP + FN).
Jaccard Index: the percentage of similarity of two images, Jaccard = TP/(TP + FP + FN). It is similar to the Dice coefficient; however, the Jaccard index counts the true positives only once, while the Dice coefficient counts them twice.
In this work, we focus mainly on the F1 measure, the Dice coefficient, and the Jaccard index, as they primarily evaluate the similarity between prediction and ground truth and thus the efficiency of the model and segmentation algorithm. A sketch of how these metrics can be computed is given below.
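The following self-contained Python sketch computes all five metrics from a pair of binary masks. Note that for a single pair of binary masks the Dice coefficient equals the F1 measure; the two can differ when metrics are averaged per image or per fold.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for two binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),  # intersection over union
    }
```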
4.3. Evaluation
During the implementation of the deep learning model, we trained extensively, running the experiments several times with different numbers of epochs, changing the batch size, and experimenting with the input images and pre-processing techniques.
Firstly, training was done on the Prewitt and Sobel images as inputs with batch sizes of 32 and 8, with data augmentation included. A batch size of 32 clearly segmented more than the desired part, while a batch size of 8 was more accurate; thus, we trained the model with a batch size of 8 and augmentation in the second phase. We also varied the input images: in one experiment, the inputs were the Prewitt and Sobel images, while in another they were the original image and the Sobel image. The two gradient images gave more accurate results. The first results were obtained with the architecture without batch normalization layers; we then added the batch normalization layers, which gave better results than the outputs previously examined.
Table 4 illustrates the trials that were done and how the parameters were changed. The last two trials included the batch normalization layer and achieved the highest percentages in the final results listed in Table 5. In addition, Table 6 provides the performance of each trial.
After tuning the hyper-parameters and decreasing the learning rate, we obtained the results shown in Figure 5. The results show better similarity with the ground truth, with only a slight extension of the segmented line. During the testing phase, we evaluated the model using the three metrics discussed in Section 4.2; the metric results are shown in Table 6.
The Jaccard index and the Dice coefficient show the similarity of the tested data with the binary masks. The highest percentage is the F1 measure, which gives an overview of the performance of the system.
The results show that the system performs reasonably well; however, it can be improved further, in particular through better pre-processing or post-processing techniques. The accuracy may also be limited by the fact that the dataset is not very clean and was hard to work with.
Moreover, we performed pixel calculations to obtain the thickness of the predicted IMT. The calculations measured the distance from the upper boundary to the lower boundary using the MATLAB function bwdist(), which computes the distance transform of a binary image. The local maximum value was taken, and then the mean over all tested images was computed, giving a thickness of 2.989 pixels. Converting pixels to mm, we obtain a mean IMT measurement of 0.54 mm. A sketch of this calculation is given below.
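One plausible Python reading of this step is sketched below using scipy's Euclidean distance transform as an analogue of bwdist(). The two boundary masks are hypothetical inputs, and the exact MATLAB pipeline may differ; the pixel-to-mm conversion uses the 16.66 pixels/mm density stated for this dataset.

```python
import numpy as np
from scipy import ndimage

PIXELS_PER_MM = 16.66  # standard pixel density reported for this dataset

def imt_thickness_mm(upper_mask, lower_mask):
    """Distance from the upper to the lower IMC boundary, in mm.

    upper_mask and lower_mask are hypothetical binary masks of the two
    boundary contours. The distance transform of the upper boundary's
    complement gives, at every pixel, the Euclidean distance to the
    nearest upper-boundary pixel; sampling it on the lower boundary and
    taking the local maximum follows the description above.
    """
    dist = ndimage.distance_transform_edt(~upper_mask.astype(bool))
    thickness_px = dist[lower_mask.astype(bool)].max()
    return thickness_px / PIXELS_PER_MM
```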
In comparison to the work done in [8], as well as to the ground truth, Table 7 lists the error found in both the dataset method and the proposed method relative to the ground truth. As explained in Table 2, the ground truth IMT was determined by two experts at given times. We observe from the table that the minimum IMT measurement of the proposed solution is smaller than that of the ground truth. This is due to predictions on images in which the IMT was not clear, so the model could only distinguish a small part of the IMT. Furthermore, the median value of the proposed model is closer to the reading of Expert 2 than that of [8]. Regarding the maximum value, the proposed model was also able to achieve a thickness similar to that of Expert 2. It should also be pointed out that the method applied to this dataset in [8] was semi-automated; comparing our model to theirs, our results were better given the full automation.
5. Discussion and Challenges
Given the results discussed in Section 4, we were able to achieve a segmented region for the carotid IMT, which was then used to estimate the thickness. In this paper, we trained and tested on the images and compared the predictions to the ground truth points.
During the implementation of this solution, several other architectures were investigated, including U-Net segmentation using MATLAB; these models were trained for more than 50 epochs without good results. In contrast, the encoder-decoder architecture produced segmented output with good performance. The results of this model look promising and leave room for future expansion.
One of the main challenges faced in this research was finding a well-segmented and annotated dataset. We faced many issues in obtaining a dataset before receiving the one we worked on. Moreover, the dataset was not clean enough to be processed directly; hence, it was time-consuming to work on these images, and in some cases the IMT was not very clear. For such images, the model outputs disconnected parts of sections around the IMT. Additionally, there are no recent studies on this dataset, which makes it hard to compare results.
Returning to the research questions described in Section 1, we can conclude that the model was able to segment the IMC fairly well. However, due to the lack of variety of images in the given dataset, it is not clear whether the model can improve IMT segmentation in general; further research is needed regarding the dataset. As for the second question, the chosen model has not been used before for IMT segmentation, has shown good results on this dataset, and is open to further improvement. We also observed from the trials that a batch size of 8 with augmentation produced better segmentation than the same setting without augmentation; hence, data augmentation was effective on this dataset together with the encoder-decoder model.
In general, after comparing with the results found in [8], we find that the proposed method is robust, fast, and fully automated compared to their semi-automated snake segmentation.
6. Conclusions and Future Direction
In conclusion, CVDs take millions of lives every year, so it is important to provide people with ways for the early diagnosis of such diseases. Many implementations have addressed this problem using computer vision techniques on B-mode ultrasound images, and we reviewed recent work on carotid intima-media thickness segmentation and encoder-decoder applications. In this research, we investigated a deep learning model, specifically a convolutional autoencoder with two inputs feeding two encoders, and identified the optimal hyper-parameters and architecture, producing results similar to the dataset's provided ground truth. We trained the encoder-decoder architecture on 80% of the dataset using 50 epochs with 10 steps per epoch, obtaining 79.92%, 74.23%, and 60.24% for the F1 measure, Dice coefficient, and Jaccard index, respectively. We also calculated the IMT thickness as 0.54 mm; the model showed good performance, with a lowest error of 0.03 mm compared to the ground truth data.
Further enhancement could come from experimenting with the optimized model on other B-mode carotid ultrasound datasets, which would give an overview of the generality of the system and its performance on other images. Furthermore, different modern filters could be used as model inputs and their performance evaluated. The proposed system is well suited for use with a portable device that acquires ultrasound images and processes them, giving patients the ability to diagnose themselves early.