The experiments were carried out on an Intel Core i7 CPU with 64 GB of RAM and an NVIDIA RTX 3070 Ti graphics card with 8 GB of memory.
3.3. You Only Look Once (Version 8 and 9) Architecture
YOLOv8, a cutting-edge convolutional neural network (CNN) model for object detection, offers a promising blend of speed and accuracy. It addresses the problem of detecting multiple eye signs and characterizing DR as a segmentation task, facilitating the identification of different stages or severity levels of the disease. The network architecture has three main components: the backbone, the neck, and the head, as shown in Figure 2. While sharing a similar backbone with YOLOv5, YOLOv8 introduces changes in the cross-stage partial connections (CSP) layer, now known as the C2f module. This module, a cross-stage partial bottleneck with two convolutions, merges high-level features with contextual information, thereby enhancing detection accuracy [38]. YOLOv8 adopts an anchor-free model with a decoupled head, allowing the objectness, classification, and regression tasks to be processed independently. This design, which enables each branch to concentrate on its specific task, improves the model's overall accuracy. In the output layer of YOLOv8, the sigmoid function is used as the activation for the objectness score, indicating the likelihood that a bounding box contains an object, while the softmax function is employed for the class probabilities, indicating the likelihood of the object belonging to each possible class [38]. YOLOv8 leverages the CIoU [39] and DFL [40] loss functions for the bounding-box loss and binary cross-entropy for the classification loss. These losses improve object detection performance, particularly for smaller objects [38].
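Since the CIoU term drives the bounding-box regression, a small worked example may help make it concrete. The following Python snippet is a minimal sketch of the CIoU computation for two axis-aligned boxes; the (x1, y1, x2, y2) box format and the helper name are chosen for illustration and do not reproduce the Ultralytics implementation.

```python
import math

def ciou(box1, box2, eps=1e-7):
    """Complete IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    w1, h1 = x2 - x1, y2 - y1
    w2, h2 = X2 - X1, Y2 - Y1
    # Intersection and union areas
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # Squared distance between the box centers
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing box
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight
    v = (4.0 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

# The bounding-box loss is then 1 - CIoU, so perfectly aligned boxes give zero loss
print(1.0 - ciou((10, 10, 50, 50), (12, 14, 48, 52)))
```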
The backbone [42] is responsible for extracting rich feature representations from the input image I, which is defined as:

$$I \in \mathbb{R}^{H \times W \times 3},$$

where H and W are the height and width of the input image, respectively. A series of convolutional layers are applied to the input image to extract features:

$$F = \mathrm{BatchNorm}(I * K),$$

where K is the convolutional kernel, s is the stride, p is the padding, * denotes the convolution operation (applied with stride s and padding p), and BatchNorm denotes the batch normalization. Residual blocks help to learn deeper features:

$$Y = X + \big((X * K_1) * K_2\big),$$

where X is the input of the residual block, and K_1 and K_2 are the kernels of the convolutional layers within the block.
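As a concrete illustration of these building blocks, the following is a minimal PyTorch sketch of a convolution, batch-normalization, and activation unit together with a residual block; the layer widths, the SiLU activation, and the class names are assumptions made for illustration and do not reproduce the exact Ultralytics modules.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution followed by batch normalization and SiLU activation."""
    def __init__(self, c_in, c_out, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=s, padding=p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualBlock(nn.Module):
    """Two stacked convolutions with an identity shortcut: Y = X + conv2(conv1(X))."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = ConvBlock(c, c)
        self.conv2 = ConvBlock(c, c)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

# Example: a 640x640 RGB fundus image passed through one backbone stage
x = torch.randn(1, 3, 640, 640)          # I in R^(H x W x 3), batch of 1
stem = ConvBlock(3, 64, k=3, s=2, p=1)   # downsampling stem convolution
stage = ResidualBlock(64)
features = stage(stem(x))                # feature map of shape (1, 64, 320, 320)
```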
The neck [42] aggregates feature maps from different stages of the backbone to enhance feature representation. The Feature Pyramid Network (FPN) combines feature maps from different scales:

$$P_l = C_l + \mathrm{UpSample}(C_{l+1}),$$

where P_l is the feature map at level l, C_{l+1} is the feature map from the previous (coarser) layer, and UpSample denotes the up-sampling operation. The Path Aggregation Network (PAN) enhances the feature pyramid by combining feature maps in both top–down and bottom–up pathways:

$$U_l = \mathrm{Concat}\big(P_l, \mathrm{DownSample}(U_{l-1})\big),$$

where U_l is the output feature map at level l, Concat denotes the concatenation operation, and DownSample denotes the down-sampling operation of the bottom–up path.
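The top–down and bottom–up merging can be sketched in a few lines of PyTorch; the channel counts, strides, and layer choices below are assumptions for illustration rather than the exact YOLOv8 neck configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical backbone feature maps at strides 16 and 32 for a 640x640 input
c4 = torch.randn(1, 256, 40, 40)
c5 = torch.randn(1, 512, 20, 20)

lateral5 = nn.Conv2d(512, 256, kernel_size=1)                 # reduce C5 channels to match C4
downsample = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

# Top-down (FPN-style): up-sample the coarser map and add it to the finer one
p5 = lateral5(c5)                                             # (1, 256, 20, 20)
p4 = c4 + F.interpolate(p5, scale_factor=2, mode="nearest")   # (1, 256, 40, 40)

# Bottom-up (PAN-style): down-sample the finer map and concatenate with the coarser one
u5 = torch.cat([p5, downsample(p4)], dim=1)                   # (1, 512, 20, 20)
print(p4.shape, u5.shape)
```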
The head [42] predicts the bounding boxes, objectness scores, and class probabilities for the detected objects. The bounding box prediction is defined as:

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h},$$

where (c_x, c_y) is the center of the anchor box, (p_w, p_h) are the dimensions of the anchor box, and t_x, t_y, t_w, and t_h are the predicted offsets. The sigmoid function σ ensures that the outputs are within a valid range. The objectness score is defined as:

$$o = \sigma(t_o),$$

and the class probabilities are obtained with the softmax function:

$$p(c) = \frac{e^{t_c}}{\sum_{k} e^{t_k}},$$

where t_o is the raw objectness score and t_c is the raw class score for class c.
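A minimal NumPy sketch of this decoding step is given below; the anchor values and raw network outputs are made up solely to show how the sigmoid, exponential, and softmax map the predictions into a valid box, objectness score, and class distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical anchor (grid-cell center and prior dimensions) and raw head outputs
cx, cy = 12.0, 7.0                               # anchor center, in grid units
pw, ph = 3.5, 2.0                                # anchor width and height
tx, ty, tw, th, to = 0.2, -0.4, 0.1, 0.3, 1.5    # predicted offsets and raw objectness
tc = np.array([2.0, 0.5, -1.0])                  # raw class scores (3 classes)

# Bounding-box decoding: b_x = sigmoid(t_x) + c_x, ..., b_w = p_w * exp(t_w)
bx, by = sigmoid(tx) + cx, sigmoid(ty) + cy
bw, bh = pw * np.exp(tw), ph * np.exp(th)

objectness = sigmoid(to)                          # probability that the box contains an object
class_probs = np.exp(tc) / np.exp(tc).sum()       # softmax over class scores

print(bx, by, bw, bh, objectness, class_probs)
```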
The loss function used to train YOLOv8 combines multiple components to ensure accurate predictions. The localization loss is defined as:

$$\mathcal{L}_{loc} = \sum_{i} \mathrm{smooth}_{L1}(B_i, B'_i),$$

where B_i is the predicted bounding box, B'_i is the ground truth bounding box, and smooth_{L1} is the smooth L1 loss. The objectness loss is defined as:

$$\mathcal{L}_{obj} = \sum_{i} \mathrm{BCE}(o_i, o'_i),$$

where BCE is the binary cross-entropy loss, o_i is the predicted objectness score, and o'_i is the ground truth objectness score. The classification loss is defined as:

$$\mathcal{L}_{cls} = \sum_{i} \sum_{c} \mathrm{BCE}(p_i(c), p'_i(c)),$$

where p_i(c) is the predicted class probability and p'_i(c) is the ground truth class label. The total loss is then defined as:

$$\mathcal{L} = \lambda_{loc}\,\mathcal{L}_{loc} + \lambda_{obj}\,\mathcal{L}_{obj} + \lambda_{cls}\,\mathcal{L}_{cls},$$

where λ_loc, λ_obj, and λ_cls are weighting factors for each loss component [42].
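The weighted combination can be sketched in a few lines of PyTorch; the weighting factors below are placeholders rather than the values used in training, and the per-box reductions are simplified for readability.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_boxes, gt_boxes, pred_obj, gt_obj, pred_cls, gt_cls,
               lam_loc=1.0, lam_obj=1.0, lam_cls=0.5):
    """Weighted sum of localization, objectness, and classification losses."""
    loss_loc = F.smooth_l1_loss(pred_boxes, gt_boxes)                 # box regression
    loss_obj = F.binary_cross_entropy_with_logits(pred_obj, gt_obj)   # objectness
    loss_cls = F.binary_cross_entropy_with_logits(pred_cls, gt_cls)   # per-class BCE
    return lam_loc * loss_loc + lam_obj * loss_obj + lam_cls * loss_cls

# Toy tensors: 4 predicted boxes, 3 classes
loss = total_loss(torch.rand(4, 4), torch.rand(4, 4),
                  torch.randn(4), torch.randint(0, 2, (4,)).float(),
                  torch.randn(4, 3), torch.randint(0, 2, (4, 3)).float())
print(loss.item())
```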
YOLOv9 marks a significant advancement in real-time object detection, introducing techniques such as Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN) [25]. The new version, developed upon the robust codebase of YOLO version 7, shows remarkable improvements in efficiency, accuracy, and adaptability. Information loss in deep neural networks is a critical challenge that YOLOv9's advancements try to address. The core innovations of version 9 lie in the Information Bottleneck Principle (IBP) and Reversible Functions (RFs). Figure 3 shows the architecture diagram of YOLOv9.
The IBP highlights a crucial challenge in deep learning: as data pass through multiple layers of a network, the information loss increases. This phenomenon is mathematically represented as:

$$I(X, X) \geq I(X, f_{\theta}(X)) \geq I(X, g_{\phi}(f_{\theta}(X))),$$

where I denotes mutual information, and f and g are transformation functions with parameters θ and φ, respectively. This loss can lead to unreliable gradients and poor model convergence. One solution is to increase the model's size to retain more information. YOLOv9 counters this challenge by implementing PGI, which helps preserve essential data across the network's depth, ensuring more reliable gradient generation, convergence, and performance. PGI comprises a main branch for inference, an auxiliary reversible branch for reliable gradient calculation, and multi-level auxiliary information to tackle deep supervision issues effectively without adding extra inference costs.
A function is defined as reversible if it can be inverted without any loss of information, as expressed by:

$$X = v_{\zeta}(r_{\psi}(X)),$$

where ψ and ζ are the parameters of the reversible function r and its inverse v, respectively. This ensures no information loss during data transformation, enabling the network to retain all the input data through all the layers and provide more accurate updates to the model's parameters. YOLOv9 incorporates RFs within its architecture to mitigate the risk of data degradation and preserve critical information for object detection tasks.
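To illustrate the idea of a reversible transformation (not YOLOv9's actual layers), the sketch below uses an additive coupling step, a standard construction whose forward pass can be inverted exactly, so no information about the input is lost.

```python
import numpy as np

def f(x):
    """Arbitrary (not necessarily invertible) sub-function; here a simple nonlinearity."""
    return np.tanh(2.0 * x + 1.0)

def forward(x1, x2):
    # Additive coupling: y1 keeps x1 untouched, y2 mixes in a function of x1
    y1 = x1
    y2 = x2 + f(x1)
    return y1, y2

def inverse(y1, y2):
    # Exact inversion: recover the original inputs from the outputs
    x1 = y1
    x2 = y2 - f(y1)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True: no information lost
```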
GELAN represents a unique design that fits the PGI framework, enabling YOLOv9 to achieve superior parameter utilization and computational efficiency. It allows for the flexible integration of various computational blocks, making version 9 adaptable to a wide range of applications without sacrificing speed or accuracy. For more detailed information on YOLOv9 and the YOLO family, see [25,45].
Table 3 shows the available variants of YOLO versions 8 and 9, which are accessible online for project development, highlighting the input image size (in pixels), the number of parameters (in millions), and the floating-point operations (FLOPs), which indicate the computational cost of each variant. YOLOv8 comes in nano (n), small (s), medium (m), large (l), and extra-large (x) model sizes. YOLOv9 offers model variants from tiny (t) to small (s), medium (m), compact (c), and extensive (e). We used the c and e versions of YOLOv9 for this research, the only two variants available at the beginning of this work [45,46].
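As an example of how such variants can be loaded and trained, the snippet below uses the Ultralytics Python API; the weight file names and the data configuration path (dataset.yaml) are assumptions based on the package's usual naming, not a record of the exact training commands used in this work.

```python
from ultralytics import YOLO

# Load pretrained variants (weight names assumed from the Ultralytics releases)
model_v8 = YOLO("yolov8m.pt")   # YOLOv8 medium
model_v9 = YOLO("yolov9c.pt")   # YOLOv9 compact

# Fine-tune on a custom fundus-lesion dataset described by a YAML file
model_v9.train(data="dataset.yaml", imgsz=640, epochs=100, batch=16)

# Run inference on a fundus image and read back boxes, classes, and confidences
results = model_v9("fundus_image.jpg", conf=0.25, iou=0.5)
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)
```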
3.5. Performance Metrics
The performance metrics [32,47] used to analyze the research outcomes include:
Average Precision (AP): AP computes the area under the Precision × Recall curve, providing a single value that encapsulates the model's precision and recall performance;
Mean Average Precision (mAP): this extends the concept of AP by calculating the average of the AP values across multiple object classes, as shown in Equation (14). It provides a comprehensive evaluation of the model's performance at a glance and is commonly used in computer vision research to compare both different models on the same task and different versions of the same model;
Precision (P) and Recall (R): the former quantifies the proportion of true positives among all positive predictions, assessing the model's capability to avoid false positives. The latter calculates the proportion of true positives among all actual positives, measuring the model's ability to detect all instances of a class. Precision and Recall are calculated for each class by applying the formulas to each image, as shown in Equations (15) and (16), respectively;
Accuracy (Acc): Acc measures how often a model correctly predicts the outcome, i.e., the number of correct predictions divided by the total number of predictions made, as shown in Equation (17);
F1 score: this is the harmonic mean of Precision and Recall, providing a balanced assessment of a model's performance that considers both false positives and false negatives, as shown in Equation (18);
Intersection over Union (IoU): this estimates the similarity between two sets of samples and is obtained as the ratio between the area of overlap and the area of union of the predicted and ground truth bounding boxes.
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i \quad (14)$$

$$P = \frac{TP}{TP + FP} \quad (15)$$

$$R = \frac{TP}{TP + FN} \quad (16)$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \quad (17)$$

$$F1 = 2 \times \frac{P \times R}{P + R} \quad (18)$$

where TP is true positive, TN is true negative, FP is false positive, FN is false negative, N is the number of classes, and AP_i is the Average Precision of class i.
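As a quick illustration of Equations (15)–(18), the helper below computes the metrics from raw TP, FP, FN, and TN counts; it is a generic sketch rather than part of the evaluation code used in this study.

```python
def detection_metrics(tp, fp, fn, tn=0, eps=1e-9):
    """Precision, Recall, Accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp + eps)                                  # Eq. (15)
    recall    = tp / (tp + fn + eps)                                  # Eq. (16)
    accuracy  = (tp + tn) / (tp + tn + fp + fn + eps)                 # Eq. (17)
    f1        = 2 * precision * recall / (precision + recall + eps)   # Eq. (18)
    return precision, recall, accuracy, f1

# Example with made-up counts
print(detection_metrics(tp=40, fp=10, fn=20))
```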
We provide a Precision–Recall curve example, a useful tool in model performance evaluation. From this curve, we can calculate the AP as the weighted mean of the Precisions achieved at each threshold, with the increase in Recall from the previous threshold used as the weight, and the mAP as the average of the per-class AP values. An IoU of 0.5 is selected to calculate the proposed model's performance and to compare the results with other works from the literature.
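This weighted-mean definition of AP can be written directly as a short function; the sketch below assumes the precision and recall arrays are already sorted by increasing recall and is meant only to make the formula explicit.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP as the sum of precision values weighted by the increase in recall."""
    recalls = np.concatenate(([0.0], recalls))   # recall starts at 0
    deltas = np.diff(recalls)                    # increase in recall per threshold
    return float(np.sum(precisions * deltas))

# Toy PR points (sorted by increasing recall)
p = np.array([1.0, 0.9, 0.75, 0.6])
r = np.array([0.2, 0.5, 0.8, 1.0])
print(average_precision(p, r))   # 0.2*1.0 + 0.3*0.9 + 0.3*0.75 + 0.2*0.6 = 0.815
```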
Figure 4 shows the PR curves from the YOLOv8-s model used to detect ODs, MAs, and HEMOs. The x-axis of the PR curve represents the Recall, while the y-axis shows the Precision. In this space, the goal is to be in the upper right corner (1, 1), meaning that the predictor classified all positives as positive (Recall = 1) and that everything classified as positive is a true positive (Precision = 1) [32]. The summary table in the top right corner reports the performance achieved by the model on each class, showing the AP per class and the mAP calculated at an IoU of 0.5.
Figure 5 shows an example of a confusion matrix, a table containing data from experiments with the adopted approach. It summarizes the performances achieved and lets us compare them with other work. As an example, we show the confusion matrix from the same experiment with the YOLOv8-s model as before. To illustrate how a confusion matrix is used, we apply Equations (15) and (16) to calculate the Precision and Recall for the OD class; the computation is presented below.
The confusion matrix resulting from the detection of objects presents the numbers of false positives (FPs) and false negatives (FNs): respectively, image background detected as a lesion without any corresponding label in the ground truth, and genuine objects not detected by the proposed method and therefore considered as background. True positives (TPs) and true negatives (TNs) are found in the confusion matrix as well: respectively, a lesion with a corresponding label in the ground truth detected as an actual lesion by the model, and a result that correctly indicates the absence of a lesion or a feature. The confidence limit established for detecting objects in these images directly impacts the number of background FPs and background FNs. A confidence limit is applied to filter the candidate bounding boxes, eliminating those with low confidence scores, and a Non-Maximum Suppression algorithm then removes duplicate detections of the same object according to the defined IoU threshold [32]. We calculated the results presented in the confusion matrix using a fixed confidence limit of 0.25, which aligns with the default inference configurations of YOLOv8 and v9. Lower confidence limits, such as our default value, improve the mAP results but produce a larger number of background FPs, which appear in the confusion matrix [32]. Squares with darker shades of blue indicate a larger number of samples. The confusion matrix presents the correct predictions of fundus lesions on the main diagonal, while the values off the main diagonal correspond to prediction errors.
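A simplified sketch of this confidence filtering and NMS step is shown below; the box format, the 0.25 confidence limit, and the greedy suppression loop follow the usual conventions and are not taken from the YOLOv8/v9 source code.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_detections(boxes, scores, conf_limit=0.25, iou_thresh=0.5):
    """Drop low-confidence boxes, then greedily suppress overlapping duplicates."""
    keep_mask = scores >= conf_limit
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)            # highest confidence first
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return boxes[kept], scores[kept]

boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [100, 100, 140, 150]], dtype=float)
scores = np.array([0.9, 0.6, 0.2])
print(filter_detections(boxes, scores))    # duplicate and low-confidence boxes removed
```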
We show and compare the results using only the variants of the models with the highest mAPs, choosing among the n, s, m, l, and x options for YOLOv8 and between the c and e options for YOLOv9. For each selected variant, we calculate and provide the Precision, Recall, Accuracy, and F1 score.
To calculate the Precision and Recall for the OD class according to Equations (15) and (16), we need the TPs, FPs, and FNs for the OD class. As we can see from the confusion matrix in Figure 5, the cell connecting the true Optic_disc on the x-axis and the predicted Optic_disc on the y-axis contains the number 30, which represents the TPs for the OD class. Similarly, to find the FPs and FNs for the OD class, we look at the cells connecting the background on the x-axis to the Optic_disc on the y-axis and the Optic_disc on the x-axis to the background on the y-axis, respectively. Both the FPs and the FNs for the OD class are equal to 1. Applying Equations (15) and (16), the Precision and Recall are both equal to 0.968. If we calculate the same performance metrics for another class, such as MA, the TPs are 25, the FPs are 6, and the FNs are 118; the Precision and Recall for MA are 0.806 and 0.175, respectively. Repeating this procedure, it is possible to calculate the performance metrics for all the desired classes.
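The arithmetic of this worked example can be checked directly in Python; the TP, FP, and FN counts are those read from the confusion matrix in Figure 5.

```python
# OD class: TP = 30, FP = 1, FN = 1
precision_od = 30 / (30 + 1)    # ~0.968
recall_od    = 30 / (30 + 1)    # ~0.968

# MA class: TP = 25, FP = 6, FN = 118
precision_ma = 25 / (25 + 6)    # ~0.806
recall_ma    = 25 / (25 + 118)  # ~0.175

print(round(precision_od, 3), round(recall_od, 3), round(precision_ma, 3), round(recall_ma, 3))
```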