2.2. Dataset
This study employed the EyePACS dataset, encompassing 35,155 images, each classified according to the severity of diabetic retinopathy (DR), from level 0 (no retinopathy) to level 4 (proliferative retinopathy). In this research, the integer labels 0, 1, 2, 3, and 4 were used to denote DR severity. Each grade represents a distinct stage of DR, allowing for an organized and efficient analysis of the disorder’s progression.
Level 0: No apparent retinopathy, 25,849 images.
Level 1: Mild non-proliferative retinopathy, 2,438 images: In the earliest stage of diabetic retinopathy, the walls of the blood vessels in the retina weaken. Tiny bulges protrude from the vessel walls, sometimes leaking fluid and blood into the retina. Nerve fibers in the retina may begin to swell, and central vision may be affected.
Level 2: Moderate non-proliferative retinopathy, 5,288 images: As the disease progresses, blood vessels that nourish the retina may swell and distort. They may also lose their ability to transport blood. Both conditions cause characteristic changes to the appearance of the retina and may contribute to DME (diabetic macular edema).
Level 3: Severe non-proliferative retinopathy, 872 images: Many more blood vessels are blocked, depriving blood supply to areas of the retina. These areas secrete growth factors that signal the retina to grow new blood vessels.
Level 4: Proliferative retinopathy, 708 images: At this advanced stage, the signals sent by the retina trigger the growth of new blood vessels. This process is called neovascularization. However, these new blood vessels are abnormal and fragile. They grow along the retina and the surface of the clear gel that fills the inside of the eye. By themselves, these blood vessels do not cause symptoms or vision loss. However, they have thin, fragile walls. If they leak blood, severe vision loss and even blindness can result.
2.3. Preprocessing Techniques
In this research, three key preprocessing techniques were implemented across all three models to prepare the image data: grayscale conversion via OpenCV, data augmentation, and image background removal using REMBG.
Grayscale conversion: This process was conducted using OpenCV, converting the original-colored images into grayscale. The objective of this conversion was to lessen the computational burden and to avoid inconsistencies arising from the color variance in the images. By standardizing the images to grayscale, the models could focus on distinguishing patterns and features without the additional complexity of color variations.
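As an illustration, a minimal sketch of this step using OpenCV is shown below; the file paths and helper name are hypothetical, as the study does not detail its I/O pipeline.

```python
import cv2

def to_grayscale(input_path: str, output_path: str) -> None:
    """Convert a color fundus image to grayscale and save it (illustrative paths)."""
    image = cv2.imread(input_path)                    # OpenCV loads images as BGR
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # collapse to one intensity channel
    cv2.imwrite(output_path, gray)

# Hypothetical usage:
# to_grayscale("retina_0001.jpeg", "retina_0001_gray.jpeg")
```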
Data augmentation: To enhance the robustness of the models and combat potential overfitting, data augmentation techniques were applied. These included image manipulations such as rotation, zooming, and horizontal flipping. By introducing these transformations, the model was exposed to a wider variety of data scenarios, helping it generalize better to unseen data and improve its predictive power.
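A minimal sketch of such an augmentation pipeline, assuming the Keras preprocessing layers in TensorFlow (the paper does not name the exact library, and the transformation factors below are illustrative), might look as follows:

```python
import tensorflow as tf

# Horizontal flip, rotation, and zoom, as described above.
# The factor values (0.1) are assumptions, not the study's settings.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
```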
Image background removal: The third preprocessing step was the use of REMBG to remove the backgrounds from the images. By doing so, the model could concentrate solely on the crucial features of the retina, potentially improving its accuracy in predicting the severity of diabetic retinopathy.
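A minimal sketch of this step with the REMBG library is shown below; the paths are hypothetical, and the output is saved as PNG to preserve the transparent background.

```python
from rembg import remove
from PIL import Image

def strip_background(input_path: str, output_path: str) -> None:
    """Remove the background with REMBG, leaving only the retinal disc."""
    with Image.open(input_path) as img:
        result = remove(img)   # returns the image with its background removed
    result.save(output_path)

# Hypothetical usage:
# strip_background("retina_0001.jpeg", "retina_0001_nobg.png")
```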
These preprocessing steps were instrumental in enhancing the effectiveness and performance of the models in distinguishing and categorizing various levels of diabetic retinopathy severity. The outcome of these preprocessing steps can be visualized in Figure 3, Figure 4, Figure 5 and Figure 6.
2.4. Model Development
In this study, we implemented and analyzed three main deep learning models for diagnosing the severity of diabetic retinopathy (DR) from retinal images. The models share the same architecture, employing convolutional neural networks (CNNs), a well-established method for image-based machine learning tasks.
Training in this research was meticulously set up to ensure the integrity and reliability of the results. Training was executed with a batch size of 128 images, with each image resized to a uniform 180 × 180 pixels. To ensure robust model training and validation, the dataset was split into two parts: 80% for training and 20% for validation. Importantly, this division was conducted before any image transformations or preprocessing steps were applied, to avoid the risk of data leakage. Data leakage, where the model obtains access to information it should not have during training, is a common issue in machine learning and can lead to overly optimistic performance estimates. In this study, the strict separation of training and validation data means that the model never sees the validation data during training, eliminating any chance of it being exposed to the answers in advance.
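A minimal sketch of this setup, assuming TensorFlow's `image_dataset_from_directory` loader and a directory with one subfolder per DR grade (both assumptions, as the paper does not specify them), is given below:

```python
import tensorflow as tf

IMG_SIZE = (180, 180)   # uniform resize, as described above
BATCH_SIZE = 128

# 80/20 split performed on the raw images, before any augmentation,
# so the validation subset never leaks into training.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "eyepacs/",              # hypothetical dataset root
    validation_split=0.2,
    subset="training",
    seed=42,                 # a fixed seed keeps the two subsets disjoint
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "eyepacs/",
    validation_split=0.2,
    subset="validation",
    seed=42,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
)
```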
Moreover, this study adopted a compact model design to expedite the training process given the computational resources at hand. Utilizing a smaller model not only facilitated rapid training but also allowed for the effective management of resources. This strategy made it possible to conduct extensive experiments without overtaxing the available computational capacity. The resultant model, despite its size, was robust and capable of accurately diagnosing diabetic retinopathy from image data, highlighting the feasibility and efficiency of smaller model architectures for tasks of this nature. The model architecture was built using several layers, each tailored to a specific task, optimizing the network for image feature extraction and classification.
Input Layer: This layer receives the preprocessed images and passes them into the network.
Conv2D Layers: The first two layers in the architecture are 2D convolutional layers, each with 16 filters. These layers use learned filters to conduct convolution operations on the input data, aiming to capture local features within the image such as edges and corners.
Max Pooling Layer: Following the Conv2D layers is a max pooling layer, which reduces the spatial dimensions of the input by selecting the maximum value within each pooling window.
More Conv2D and Max Pooling Layers: The architecture continues with an alternating pattern of Conv2D and MaxPooling2D layers. The number of filters in these layers progressively increases (from 16 to 32 and then to 64), each time reducing the spatial dimensions.
Flatten Layer: The 2D matrix produced by the preceding layer is transformed into a 1D vector by a flatten layer, preparing it for input to the subsequent dense layers.
Dense Layers: The flattened output is then passed to a fully connected layer with 128 neurons, which performs classification based on the features extracted by the previous layers.
Dropout Layer (0.5): Following the dense layer is a dropout layer with a dropout rate of 0.5. This means that during each training update, 50% of the neurons are randomly set to zero. We chose a dropout probability of 0.5 because it is widely recognized as an optimal value for preventing overfitting without significantly under-training the model. A dropout rate of 0.5 provides a balance between retaining sufficient network capacity for learning and introducing enough regularization to improve generalization. Compared to lower dropout rates (e.g., 0.1 or 0.2), a 0.5 rate introduces more noise during training, which helps the model avoid becoming too reliant on any particular set of neurons, thus enhancing its ability to generalize to new, unseen data [31].
Output Layer: The final layer in the architecture is another dense layer, serving as the output layer of the model. This layer has five neurons, representing the five classes of DR severity, and outputs the probabilities of the input image belonging to each class.
The complete model, therefore, alternates layers of convolution and max pooling for feature extraction, followed by a flattening operation and dense layers for classification based on these features. In total, the model has 3,991,605 trainable parameters and no non-trainable parameters, which are explained in Figure 7 and Table 2.
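As an illustration of the layer sequence described above, a minimal Keras sketch follows. Kernel sizes, padding, and activations are assumptions (the paper does not state them), so the printed parameter count may differ slightly from the 3,991,605 reported.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5  # DR severity grades 0-4

model = tf.keras.Sequential([
    layers.Input(shape=(180, 180, 3)),                        # preprocessed input images
    layers.Conv2D(16, 3, padding="same", activation="relu"),  # first two conv layers, 16 filters each
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                                      # 50% of neurons zeroed per update
    layers.Dense(NUM_CLASSES, activation="softmax"),          # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints layer shapes and parameter counts
```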
This study was conducted to experimentally build and compare three distinct models to identify the one with superior performance. While all three models share the same architecture, their primary difference lies in the type of input data they handle: the first model uses grayscale image data, the second uses normal color image data, and the third employs normal color image data but with fewer epochs during training.
Model 1 (Grayscale Image Data): This model was trained for 200 epochs using grayscale image data. Subsequently, the same model was retrained for another 200 epochs with data augmentation applied to the original dataset. Finally, the background was removed from the same dataset, and training of the model continued for 10 more epochs. The resulting model and its experimental results comprise Model 1.
Model 2 (Normal Image Data): This model was constructed in the same way as Model 1, but with normal color image data instead of grayscale images. The model was first trained for 200 epochs, then retrained with augmented data for another 200 epochs. Subsequently, the background was removed from the same dataset, and training continued for 10 additional epochs. The resulting model and its experimental results comprise Model 2.
Model 3 (Normal Image Data with Fewer Epochs): The procedure was again similar to Models 1 and 2, but with fewer epochs. The model was first trained for 40 epochs, then retrained with augmented data for another 40 epochs. After the background was removed from the same dataset, training continued for an additional 10 epochs. The resulting model and its experimental results comprise Model 3.
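A minimal sketch of this staged protocol for Model 3, continuing from the `model` and dataset objects sketched earlier (the augmented and background-removed dataset objects below are hypothetical stand-ins), might look as follows:

```python
# Stage 1: train on the original color images (40 epochs for Model 3).
model.fit(train_ds, validation_data=val_ds, epochs=40)

# Stage 2: continue training the same model on the augmented dataset.
model.fit(train_aug_ds, validation_data=val_ds, epochs=40)

# Stage 3: continue training on the background-removed dataset.
model.fit(train_nobg_ds, validation_data=val_nobg_ds, epochs=10)
```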
Upon completion of training the three models, we performed a comparative analysis to evaluate their performance based on specific metrics such as validation accuracy and loss values. Further details and results from this comparison can be seen in Figure 7.
Please note that while Models 1 and 2 were trained and tested with the same number of epochs for fairness, the number of epochs was reduced in Model 3 to investigate the impact of fewer training iterations on model performance. Furthermore, the variation in data preprocessing techniques (grayscale conversion and background removal) provides a more comprehensive understanding of their effect on the model’s performance for DR diagnosis.
2.5. Evaluation: Confusion Matrix
In this study, accuracy served as the principal metric for performance evaluation. However, we also utilized the confusion matrix, a robust tool that provides a comprehensive representation of a classification model’s performance. The confusion matrix offers a detailed overview of the model’s predictions in comparison to the actual labels.
For the purpose of this research, a “positive instance” refers to an image that actually depicts diabetic retinopathy (DR), whereas a “negative instance” refers to an image that does not. With these definitions in mind, the confusion matrix allows us to derive the following quantities:
True Positive (TP): The count of positive instances correctly identified as positive.
True Negative (TN): The count of negative instances correctly identified as negative.
False Positive (FP): The count of negative instances incorrectly identified as positive (Type I error).
False Negative (FN): The count of positive instances incorrectly identified as negative (Type II error).
Using these values, we calculated the following performance measures:
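The exact list is assumed here to be the standard set derived from these four counts, with accuracy, the study’s principal metric, defined alongside precision, recall, specificity, and F1-score:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$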
By considering these metrics, we obtained a comprehensive understanding of the model’s performance, taking into account both positive and negative predictions. The confusion matrix is crucial in unveiling the model’s strengths and weaknesses in distinguishing between different classes. For the purposes of this research, both the confusion matrix and validation accuracy were used to interpret the results.
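As a closing illustration, a minimal sketch of computing the confusion matrix and accuracy with scikit-learn is shown below; the label arrays are hypothetical placeholders for the validation ground truth and the model’s predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# y_true: ground-truth DR grades (0-4) for the validation set.
# y_pred: predicted grades, e.g. np.argmax(model.predict(val_ds), axis=1).
y_true = np.array([0, 0, 1, 2, 3, 4, 2, 0])   # illustrative placeholders
y_pred = np.array([0, 0, 1, 2, 4, 4, 1, 0])

# A 5x5 matrix: rows are actual grades, columns are predicted grades.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])
print(cm)
print("Accuracy:", accuracy_score(y_true, y_pred))
```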