**1. Introduction**

Diabetes is a leading global health problem. One of its serious complications is diabetic retinopathy (DR), which has a worldwide prevalence of 34.6% and is considered a primary cause of blindness among middle-aged diabetic patients [1,2]. A patient is at high risk of DR if he or she has had diabetes for a long time or if the diabetes is poorly managed. Treating DR at an early stage slows down the retinal microvascular degeneration process. Graders manually screen fundus images to detect DR, which is time-consuming and subjective [3–5]. Moreover, screening a large number of diabetic patients for the possible presence of DR puts a heavy load on graders and reduces their efficiency. This necessitates intelligent systems for DR screening, and many ML-based systems have been proposed that show good results on public datasets. However, their performance is uncertain in real DR screening programs, where patients come from different ethnicities and the retinal fundus images are captured using different cameras. These factors affect the systems' performance and remain a challenge to their widespread use [6].

**Citation:** Saeed, F.; Hussain, M.; Aboalsamh, H.A.; Al Adel, F.; Al Owaifeer, A.M. Designing the Architecture of a Convolutional Neural Network Automatically for Diabetic Retinopathy Diagnosis. *Mathematics* **2023**, *11*, 307. https://doi.org/10.3390/math11020307

Academic Editors: Xiang Li, Shuo Zhang and Wei Zhang

Received: 3 November 2022 Revised: 25 December 2022 Accepted: 28 December 2022 Published: 6 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Deep CNNs have shown remarkable results in many applications [7–11] and have been employed in DR screening [2,12–15]. A CNN model usually involves a large number of parameters and needs a large amount of data for training. A brute-force approach, which has been widely used for DR screening, is to adopt a highly complex CNN model designed for object recognition and pre-trained on the ImageNet dataset and to fine-tune it using fundus images [16–19]. As the ImageNet dataset consists of natural images, whose structural patterns are entirely different from those of fundus images, the architectures of the fine-tuned models do not adequately encode fundus images. In addition, the complexity of pre-trained models is very high and is not customized to DR screening from fundus images.

Another approach is to design CNN models manually from scratch. The design process starts with a CONV layer of a small width (i.e., the number of filters) and increases the widths of the CONV layers by a fixed ratio as the network goes deeper [16–18]. There is no principled way to know what the depth (i.e., the number of layers) of a CNN model should be; a trial-and-error strategy is used to fix it. In addition, CNN models are trained using iterative optimization algorithms such as stochastic gradient descent, and their convergence depends heavily on the initial guess of the learnable parameters. Different data-independent [20,21] and data-dependent [22,23] approaches have been proposed to initialize them.

Alternatively, automated machine learning (AutoML) has developed into a significant area of research due to the widespread application of machine learning techniques [24]. The purpose of AutoML is to make machine learning models accessible to those with limited prior knowledge of machine learning. Several commonly used AutoML systems are readily available and can be used with just one or two lines of code; they include Auto-WEKA, Hyperopt-Sklearn, TPOT, Auto-Sklearn, and Auto-Keras [25–32]. Efforts have been made to automate model selection, hyper-parameter tuning, and so forth. In the context of deep learning, neural architecture search (NAS) [33], which aims to determine the optimal neural network architecture for a given learning task and dataset, has evolved into a highly effective computational tool for AutoML [34,35]. It achieved competitive performance on the CIFAR-10 and Penn Treebank benchmarks by utilizing a reinforcement learning-based search strategy; consequently, NAS became a mainstream research topic in the machine learning community. However, NAS is prohibitively expensive and time-consuming in terms of computation [36]; Zoph and Le [33] used massive computational resources (800 GPUs for three to four weeks) to achieve their result.

The preceding discussion demonstrates that developing an AutoML-customized lightweight CNN model for DR screening that uses a small subset of the target dataset and consumes fewer resources in a variety of clinical settings is difficult; it entails answering three design questions: (i) what should be the depth of the model, (ii) what should be the width of each of its convolutional (CONV) layers, i.e., the number of its kernels, and (iii) how should the learnable parameters be initialized. To address these questions, we propose a constructive data-dependent approach for designing CNN models for DR screening under diverse clinical settings that automatically determines the depth of the model and the width of each CONV layer and initializes the learnable parameters. A custom-designed model takes a fundus image as input and grades it as normal or as one of the DR levels. We validated the proposed approach on three datasets: a local DR dataset from King Saud University Medical City, Saudi Arabia, and two benchmark Kaggle datasets, EyePACS [37] and APTOS2019 [38]. Specifically, the main contributions of the paper are as follows:

• We proposed a constructive data-dependent AutoML approach to design lightweight CNN models customized to DR screening under various clinical settings. It automatically determines the depth of the model and the width of each CONV layer, and initializes the learnable parameters using the fundus image dataset.


The rest of the paper is organized as follows: the literature review is presented in Section 2, the datasets are described in Section 3, the details of the proposed method are given in Section 4, the experiments and results are presented in Section 5, and finally, Section 6 concludes the paper.

#### **2. Previous Work**

Different methods have been introduced for automatic DR screening; an extensive literature review is given in [40–43]. There have been some efforts to compress and reduce the complexity of existing pre-trained CNN models by weight pruning [44,45] or filter pruning [46–48]. We first provide an overview of the previous work on building a deep model and initializing its weights and then give an overview of the state-of-the-art techniques for DR diagnosis.

#### *2.1. Data-Dependent and Auto-Deep Models*

Different researchers have employed principal component analysis (PCA) in various ways to build deep networks. Chan et al. [49] created an unsupervised two-layer model (PCANet). It is not an end-to-end model and is used only for feature extraction. Philipp et al. [22] used PCA to re-initialize pre-trained CNN models to avoid vanishing or exploding gradient problems. Suau et al. [23] used PCA and correlation to compress the filters of pre-trained CNN models. Seuret et al. [50] employed PCA to initialize the layers of stacked autoencoders (SAEs). The above PCA-based methods have been employed for designing a CNN-like model for feature extraction, for data-dependent re-initialization of pre-trained models, or for compressing their weights to reduce their complexity, but not for the data-dependent design of end-to-end CNN models.

Zhong et al. [51] introduced a method that automatically builds BlockQNN modules using a block-wise setup, the Q-learning paradigm, and epsilon-greedy exploration, and stacks them to obtain the CNN model automatically. They evaluated their method on CIFAR-10, CIFAR-100, and ImageNet. The method needs substantial computational resources: using 32 GPUs, they obtained the best CNN model with BlockQNN after three days and with Faster BlockQNN after 20 h.

The initial AutoML efforts were led by academia and machine learning practitioners, followed by startups. Auto-WEKA (2013) [52] was developed at the University of British Columbia (UBC). Following that, the University of Freiburg published Auto-Sklearn (2014) [53], and TPOT was created at the University of Pennsylvania in 2015 [27]. Following the success of Zoph and Le [33] in achieving competitive performance on the CIFAR-10 and Penn Treebank benchmarks, other recent efforts to develop NAS have been made [54,55]; they incorporate modern design elements previously associated with handcrafted architectures, such as skip connections, which enable the construction of complex, multi-branch networks. To maximize efficiency, state-of-the-art systems employ cell-search spaces [56], which involve configuring only repeated cell architectures rather than the global architecture, and employ gradient-based optimization [57]. Since 2013, Bayesian optimization has achieved several early successes in NAS, resulting in state-of-the-art vision architectures [58]. Google Cloud AutoML, which is based on NAS, is one of the well-known systems for automatically generating deep learning models [59]. It utilizes transfer learning and NAS to determine the network architecture and hyper-parameter configuration that minimize the model's loss function [60]. Another NAS-based AutoML method for generating deep learning models is Auto-Keras (2017) [29] from Texas A&M University, which runs on top of Keras, TensorFlow, and Scikit-learn.

#### *2.2. DR Screening Methods*

Clinical DR screening categorizes a patient, based on fundus images, into different grades: level 0 (normal), level 1 (mild), level 2 (moderate), level 3 (severe), and level 4 (proliferative). In the state of the art on DR screening, various deep learning-based methods have been proposed, which mainly address three image-level DR grading scenarios: (i) scenario 1 (SC1): normal and different levels of DR severity, a multi-class problem; (ii) scenario 2 (SC2): normal (level 0) vs. DR (levels 1–4), a two-class problem; and (iii) scenario 3 (SC3): non-referral (levels 0 and 1) vs. referral (levels 2–4), a two-class problem. In the following paragraphs, we give an overview of the best state-of-the-art methods.

Islam et al. [61] built a hand-designed CNN model consisting of 18 layers and 8.9 million learnable parameters; on the EyePACS dataset, they obtained a sensitivity of 94.5% and a specificity of 90.2% for SC2, and a sensitivity of 98% and a specificity of 94% for SC3. Li et al. [62] introduced two hand-designed CNN models with 11 and 14 layers for feature extraction from the EyePACS dataset. The features from both models are fused and classified using an SVM classifier. They achieved an accuracy of 86.17% for SC1 and an accuracy of 91.05%, a sensitivity of 89.30%, and a specificity of 90.89% for SC2 using five-fold cross-validation. Challa et al. [63] built a CNN model consisting of 10 layers for the EyePACS dataset and obtained an accuracy of 86% for SC1. Tymchenko et al. [64] built an ensemble of 20 CNN models. The ensemble used five versions of each of four pre-trained models: SE-ResNeXt50 with input sizes of 380 × 380 and 512 × 512, EfficientNet-B4, and EfficientNet-B5. It was fine-tuned using the APTOS2019 dataset. They obtained an accuracy of 91.9%, a sensitivity of 84%, a specificity of 98.1%, and a Kappa of 96.9% for SC1 on the APTOS2019 dataset. Sikder et al. [65] used an ensemble learning algorithm called the ET classifier to classify the color information of the fundus images from the APTOS2019 dataset. They filtered the dataset by removing many noisy samples and achieved an accuracy of 91% and a recall of 89.43% for SC1. Sikder et al. (2021) [66] performed DR categorization with manually extracted features: they applied extensive preprocessing to the fundus images and then extracted histogram and GLCM features. The APTOS2019 dataset was used to validate the procedure, with 75% used for training and 25% for testing. The XGBoost algorithm was used to fine-tune and pick the best features for optimal performance. The classification accuracy for DR (five classes) was 94.20% (95% CI: 93.88–94.51%) for the whole set of features and 93.70% (95% CI: 93.48–93.93%) for the subset.
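The three grading scenarios differ only in how the five clinical levels are grouped. A minimal Python sketch of that grouping is given here; the function name `scenario_label` and the 0-based output encoding are illustrative assumptions and are not taken from any of the cited methods.

```python
def scenario_label(level: int, scenario: str) -> int:
    """Map a clinical DR level (0-4) to the class label used in a scenario.

    The 0-based output encoding is an assumption made for illustration.
    """
    if level not in range(5):
        raise ValueError("DR level must be an integer in 0..4")
    if scenario == "SC1":  # multi-class: keep all five levels
        return level
    if scenario == "SC2":  # normal (level 0) vs. DR (levels 1-4)
        return 0 if level == 0 else 1
    if scenario == "SC3":  # non-referral (levels 0-1) vs. referral (levels 2-4)
        return 0 if level <= 1 else 1
    raise ValueError(f"unknown scenario: {scenario}")


levels = [0, 1, 2, 3, 4]
sc2 = [scenario_label(g, "SC2") for g in levels]
sc3 = [scenario_label(g, "SC3") for g in levels]
```

Under this encoding, a mild case (level 1) counts as DR in SC2 but as non-referral in SC3, which is why results reported for the two binary scenarios are not directly comparable.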

The above overview of the state-of-the-art methods shows that some used hand-designed CNN models, and others employed pre-trained models and fine-tuning. For hand-designed models, the architectures of the CNN models were fixed empirically using a trial-and-error approach. In the case of fine-tuning, the complexity of the pre-trained models is very high and is not customized to the structures of fundus images.

#### **3. Materials**

We developed and validated custom-designed CNN models using two Kaggle challenge datasets, EyePACS [37] and APTOS2019 [38], and one local dataset collected at King Saud University Medical City (KSU-DR). Each dataset was preprocessed and augmented using the procedure described in Section 4.2.1. KSU-DR and EyePACS were divided into training (80%), validation (10%), and testing (10%) sets. APTOS2019 consists of two sets, public training and public testing; 90% of the public training data was used for training and the remaining 10% for validation, and the public testing set was used for testing.
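As a rough illustration of the 80/10/10 division, the following Python sketch splits sample indices after a seeded shuffle. The shuffling, the seed, and the function name `split_indices` are assumptions; the text does not state whether the split was random or stratified by class.

```python
import random


def split_indices(n, train=0.8, val=0.1, seed=0):
    """Split range(n) into train/validation/test index lists.

    Assumption: a plain shuffled split; the remaining samples form the test set.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 1750 is the number of graded KSU-DR images reported in Section 3.1.
train_idx, val_idx, test_idx = split_indices(1750)
```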

#### *3.1. KSU-DR*

The data were collected after obtaining approval from King Saud University Medical City's local Institutional Review Board committee. The samples were collected randomly from fundus images of diabetic patients acquired during their routine endocrinologist's appointment at the funduscopic screening clinic. Fundus images were captured with a non-mydriatic fundus camera (3D-OCT-1-Maestro non-mydriasis fundus camera); a 45-degree fundus photo was captured from each eye. All patients were from Saudi Arabia; 44% were males, and 56% were females. The mean patient age was 53 years; 17% had type 1 diabetes, and 83% had type 2 diabetes. The mean duration of diabetes was 18 years (ranging from 4 to 42 years). Random samples of 1750 images were selected and graded by two expert ophthalmologists; 1024 were graded as normal, 477 as mild non-proliferative DR, 222 as moderate non-proliferative DR, 20 as severe non-proliferative DR, and 7 as proliferative DR (PDR).

#### *3.2. EyePACS*

EyePACS [37] consists of 88,702 color retinal fundus images with varying resolutions up to about 3000 × 2000 pixels [63], collected from 44,351 subjects, but only 35,126 labeled images are available in the public domain; most researchers have used this labeled set to develop new algorithms [61,67]. We also used the 35,126 labeled images to design and evaluate the custom-designed CNN model. The images are graded into normal and four DR classes: mild, moderate, severe, and proliferative.

#### *3.3. APTOS2019*

The APTOS2019 dataset [38] was published by the Asia Pacific Tele-Ophthalmology Society on the Kaggle competition website. Clinical experts graded the images into normal and four DR levels (mild, moderate, severe, and proliferative). The public domain version of this database contains 3662 fundus images for training and 1928 fundus images for testing.

#### **4. Proposed Method**

#### *4.1. Problem Formulation*

The problem is to predict whether a subject has normal vision or is suffering from DR (at different levels of severity) using his or her retinal fundus images. Formally, let $R^{h \times w \times 3}$ be the space of color retinal fundus images with resolution $h \times w$ and $P = \{1, 2, \ldots, C\}$ be the set of labels, where $C$ is the number of classes, which represent different DR grades. In the case of two grades (i.e., normal and DR), $C = 2$, such that $c = 1$ means normal and $c = 2$ stands for DR; when there are five grades, $C = 5$, and $c = 1, 2, 3, 4, 5$ are the labels for normal, mild, moderate, severe, and proliferative DR, respectively. To predict the grade of a patient, we need to define a mapping $\varphi : R^{h \times w \times 3} \to P$ that takes a fundus image $x \in R^{h \times w \times 3}$ and associates it with a label $c \in P$, i.e., $\varphi(x) = c$. We model the mapping function $\varphi$ using a custom-designed CNN model.

#### *4.2. Custom-Designed CNN Model*

The main constituent layer of a CNN model is the CONV layer, and the widely adopted CNN models contain a large number of CONV layers, e.g., VGGNet [68] contains 13 CONV layers. The number of layers (depth) and the number of filters in each layer (width) are fixed manually, keeping in view the ImageNet challenge dataset [69], without following any formal procedure. Retinal fundus images have complex small-scale structures, which form discriminative patterns and are entirely different from those of the natural images in the ImageNet dataset. We design an AutoML CNN model for the DR problem by drawing its architecture from the fundus images; we determine the depth of the model and the widths of its CONV layers in a customized way using the discriminative information in fundus images specific to different DR levels. The first design decision concerns specifying the search space and extracting discriminatory information. For this purpose, we first reduce the search space by selecting the most representative fundus images from the available DR dataset using the K-medoids clustering algorithm [70] and then apply PCA [71] to determine the widths of the CONV layers and initialize them. The next design decision concerns the depth (i.e., the number of CONV layers). We control the depth using the ratio of the between-class scatter matrix $S_b$ to the within-class scatter matrix $S_w$. Finally, motivated by the design strategy of ResNet [17], we add global pooling layers after the last CONV layer; their outputs are fused and fed directly to a softmax layer. These layers control the drastic increase in the number of learnable parameters (which causes overfitting). The design process is described in detail below, and its overview is shown in Figure 1.

**Figure 1.** Design procedure of DeepPCANet.

#### 4.2.1. Preprocessing

The retinal fundus images are usually not calibrated and are surrounded by a black area, as shown in Figure 1a. To center the retina and remove the black area around it, the retina circle is first cropped and the background is removed using the method presented in [65], and the image is then resized to 512 × 512 pixels. DR datasets are usually imbalanced, i.e., the numbers of images of different classes are significantly different; we increase the data of the minority classes using data augmentation. We apply affine transformations to randomly rotate the image by an angle *θ* ∈ (−180°, 180°).
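One way to read the balancing step is sketched below in plain Python: each minority class receives enough randomly rotated copies to match the largest class. The balancing target (the size of the majority class), the function name `rotation_plan`, and the use of the KSU-DR grade counts from Section 3.1 as an example are assumptions made for illustration; the text states only that minority classes are augmented with random rotations.

```python
import random


def rotation_plan(class_counts, seed=0):
    """Return, per class, the rotation angles for the synthetic images.

    Assumption: every class is topped up to the size of the largest class.
    """
    rng = random.Random(seed)
    target = max(class_counts.values())
    plan = {}
    for label, count in class_counts.items():
        # One angle (in degrees, between -180 and 180) per synthetic image.
        plan[label] = [rng.uniform(-180.0, 180.0) for _ in range(target - count)]
    return plan


counts = {"normal": 1024, "mild": 477, "moderate": 222, "severe": 20, "PDR": 7}
plan = rotation_plan(counts)
```

Each angle in `plan[label]` would then be applied to a randomly chosen image of that class by whatever image library is in use.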

#### 4.2.2. Selection of Representative Fundus Images

Using the EyePACS training dataset, we select the most representative fundus images with three selection methods (K-means [72], K-medoids [70], and random sampling), customize a DeepPCANet model with each selection, and then test it. The discriminative features used for clustering are extracted from the training fundus images using the efficient LGDBP descriptor proposed in [73]. The number K of clusters for K-means and K-medoids is specified using the gap statistic method [74].

As indicated in Table 1, K-medoids gave the best results. Because K-means returns mean feature vectors as cluster centers, which do not correspond to actual images and are sensitive to outliers, it is inadequate for selecting representative fundus images. In contrast, the K-medoids algorithm selects actual fundus images as cluster centers, which makes it appropriate for this purpose. Both K-medoids and K-means outperform the random selection of fundus images.
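The property that makes K-medoids suitable here, namely that every cluster center is an actual sample, can be seen in a small stdlib-only Python sketch. It uses a simple alternating (assign, then update) variant of K-medoids with Euclidean distance on toy 2-D vectors; the function name `k_medoids`, the explicit `init_medoids` argument, and the toy data are assumptions, and real inputs would be the LGDBP descriptors of the training images.

```python
import math


def k_medoids(features, init_medoids, n_iter=100):
    """Alternating K-medoids; returns sorted indices of the medoid samples."""
    medoids = list(init_medoids)
    for _ in range(n_iter):
        # Assignment step: attach every sample to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, f in enumerate(features):
            nearest = min(medoids, key=lambda m: math.dist(f, features[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to all members of that cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid whose cluster came out empty
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(math.dist(features[c], features[j]) for j in members),
            ))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


# Toy 2-D "descriptors": two well-separated groups of three samples each.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
representatives = k_medoids(features, init_medoids=[0, 1])
```

The returned values are indices into `features`, so each representative can be traced back to a concrete fundus image, which a K-means centroid does not allow.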


**Table 1.** Comparison between clustering methods based on EyePACS (SC1).

#### 4.2.3. Designing the Main DeepPCANet Architecture

The design of the AutoML-customized architecture of DeepPCANet needs to address two questions: (i) what should be the depth of the model, and (ii) what should be the width of each CONV layer? These questions are addressed by an iterative algorithm that incrementally adds CONV layers and stops when a specific criterion is satisfied. It is based on the idea of exploiting the discriminative information in fundus images to select the number of kernels in a CONV layer and to initialize them. It takes the representative fundus images $RI_j$, $j = 1, 2, \ldots, K$, as input and divides them into patches of size 7 × 7. The patches are vectorized and used to determine the number of kernels and initialize them. One possible idea is to cluster the patches and select the cluster centers as kernels, but the issue is choosing the number of clusters. We opt for a simple and effective procedure: we employ PCA because it reduces redundancy and helps to determine the kernels and their number by exploiting the discriminative information in the patches. The principal components (PCs), i.e., the eigenvectors along which the maximum energy is preserved, serve as the kernels of the first layer. After computing the PCs, DeepPCANet is initialized with an input layer and a CONV block (BN+ReLU+CONV) whose number of kernels equals the number of PCs; the kernels are initialized by reshaping the PCs. Note that we fix the patch size to 7 × 7 so that the kernels of the first CONV layer are of size 7 × 7, following the convention of most existing CNN models such as Inception [16], ResNet [17], and DenseNet [18].

Using the current architecture of DeepPCANet, the activations $a_j$, $j = 1, 2, \ldots, K$, of the representative fundus images $RI_j$ are calculated. Inspired by the Fisher ratio [75], these activations are used to compute the ratio ($TR$) of the trace of the between-class scatter matrix $S_b$ to the trace of the within-class scatter matrix $S_w$,

$$TR = \frac{\mathrm{Trace}(S_b)}{\mathrm{Trace}(S_w)},$$

which is used to decide whether to stop or add another CONV block. New CONV blocks continue to be added as long as $TR$ continues to increase. This criterion ensures that the features generated by DeepPCANet have large inter-class variation and small intra-class scatter. To add a CONV block, the above procedure is repeated with the activations $a_j$, $j = 1, 2, \ldots, K$. To reduce the size of the feature maps for computational efficiency, pooling layers are added after the first and second CONV blocks. As the kernels and their number are determined from the fundus images, each layer can have a different number of filters. The details of the design procedure are given in Algorithm 1.

It is to be noted that the PCs ($u_i$), which are used to specify the kernels of a CONV layer, are orthogonal and capture most of the variability in the input fundus images, without redundancy, in the form of uncorrelated features. The PCs are selected so that the maximum energy is preserved. The energy is measured in terms of the corresponding eigenvalues, i.e.,

$$Energy = \frac{\sum_{l=1}^{L} \lambda_l}{\sum_{j=1}^{D} \lambda_j}$$

[23,76], and a threshold value is used to ensure that a certain percentage of the energy (e.g., 99%) is preserved. With a threshold of 99%, 209 ($L$) PCs are retained for CONV1 on the EyePACS dataset, as shown in Figure 2.

The depth of the AutoML CNN model and the width of each layer are important factors determining the model complexity. Step 7 of Algorithm 1 adaptively determines the number of kernels that ensures the preservation of the maximum energy of the input image. Step 9 initializes the kernels to be suitable for the DR domain. The selected kernels extract features from fundus images (five classes) so that the variability of the structures in fundus images is maximally preserved. It is also essential that the features be discriminative, i.e., have large inter-class variance and small intra-class scatter as we go deeper in the network; this is ensured using the trace ratio $TR$: the larger the value of the trace ratio, the larger the inter-class variance and the smaller the intra-class scatter [75]. Step 13 of Algorithm 1 allows adding CONV layers as long as $TR$ increases and thereby determines the data-dependent depth of DeepPCANet. As shown in Figure 3, the maximum ratio is at layers 4, 5, and 16 for KSU-DR, APTOS2019, and EyePACS, respectively. It means that the suitable depth of the DeepPCANet model for the KSU-DR dataset is four layers (Figure 3a), for APTOS2019 it is five layers (Figure 3b), and for the EyePACS dataset it is sixteen layers (Figure 3c). The model for EyePACS is deeper because the dataset contains many poor-quality fundus images, and there is the possibility of label noise because only one expert graded each image in this dataset. Each dataset was collected from a different region and under different conditions using different cameras, so the architecture of the DeepPCANet model is different for each dataset.
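The trace-ratio criterion can be illustrated with a small stdlib-only Python sketch. It uses the identities Trace(S_b) = Σ n_i‖μ_i − μ‖² and Trace(S_w) = Σ_i Σ_j ‖x_j − μ_i‖², so the scatter matrices never need to be formed explicitly. The function name `trace_ratio`, the assumption that each activation has been flattened into one feature vector, and the toy data are illustrative and not part of the original method description.

```python
def trace_ratio(features, labels):
    """Trace(S_b) / Trace(S_w) for feature vectors grouped by class label."""
    dim = len(features[0])
    n = len(features)
    overall = [sum(f[d] for f in features) / n for d in range(dim)]
    by_class = {}
    for f, c in zip(features, labels):
        by_class.setdefault(c, []).append(f)
    tr_b = tr_w = 0.0
    for members in by_class.values():
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        # Trace of n_i (mu_i - mu)(mu_i - mu)^T equals n_i * ||mu_i - mu||^2.
        tr_b += len(members) * sum((m - o) ** 2 for m, o in zip(mean, overall))
        # Trace of sum_j (x_j - mu_i)(x_j - mu_i)^T equals sum_j ||x_j - mu_i||^2.
        tr_w += sum((x - m) ** 2 for f in members for x, m in zip(f, mean))
    return tr_b / tr_w


# Two classes whose means differ along the first axis only.
toy_features = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
toy_labels = [0, 0, 1, 1]
tr = trace_ratio(toy_features, toy_labels)  # Trace(S_b) = 16, Trace(S_w) = 4
```

Moving the two class means further apart raises the value, while spreading the samples within each class lowers it, which is the behavior the stopping rule relies on.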

**Figure 2.** Selecting the best threshold. The appropriate threshold is 0.99 and the number of eigenvectors is 209.

**Figure 3.** Trace ratio between class scatter and within class scatter. The depth for (**a**) APTOS2019 dataset is 4 layers, (**b**) KSU-DR dataset is 5 layers, and (**c**) EyePACS dataset is 16 layers.

#### **Algorithm 1.** To design the main DeepPCANet Architecture.

**Input**: Representative fundus images: *RI<sup>j</sup>* , *j* = 1, 2, . . . , *K* of size *W* × *H* and the class labels *c* = 1, 2, . . . , *C*; Energy threshold ε

**Output**: The main architecture of DeepPCANet Architecture

#### **Processing**

**Step 1:** Initialize DeepPCANet with an input layer and set *w* = 7, *h* = 7, *d* = 3, *m* = 0 (*number of layers*)

**Step 2:** Set *a<sup>j</sup>* = *RI<sup>j</sup>* , *j* = 1, 2, 3, . . . , *K*, and *TRP* (*previous TR*) = 0.

**Step 3:** Divide *a<sup>j</sup>* , *j* = 1, 2, 3, . . . , *K*, into blocks *bij*, *i* = 1, 2, 3, . . . , *B*, *j* = 1, 2, 3, . . . , *K*, of size *w* × *h* × *d*, where *d* is the number of channels (feature maps) in *a<sup>j</sup>* and *B* is the number of blocks created from each *a<sup>j</sup>* .

**Step 4:** Flatten $b_{ij}$ into vectors $x_i \in \mathbb{R}^{D}$, $i = 1, 2, \ldots, M$, where $M = K \times B$ and $D = w \times h \times d$.

**Step 5:** Compute zero-centered vectors $\phi_i$, $i = 1, 2, \ldots, M$, such that $\phi_i = x_i - \bar{x}$, where $\bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i$.

**Step 6:** Compute the covariance matrix $C = AA^{T}$, where $A = [\phi_1\ \phi_2\ \ldots\ \phi_M]$. Calculate the eigenvalues $\lambda_j$ and eigenvectors $u_j$ ($j = 1, 2, \ldots, D$) of the covariance matrix $C$.

**Step 7:** Select $L$ eigenvectors $u_i$, $i = 1, 2, \ldots, L$ ($L < D$), corresponding to the $L$ largest eigenvalues such that $\frac{\sum_{l=1}^{L} \lambda_l}{\sum_{j=1}^{D} \lambda_j} \geq \varepsilon$, where $\varepsilon$ determines the level of energy to be preserved (e.g., $\varepsilon = 0.99$ for 99% energy preservation).

**Step 8:** The eigenvectors not selected in Step 7 (those beyond the energy level $\varepsilon$) are summed up to form a single eigenvector, which is then stacked at the end of the $L$ eigenvectors.

**Step 9:** Reshape $u_i$, $i = 1, 2, \ldots, L + 1$, to kernels of size $W \times H \times D$ and add the CONV block to DeepPCANet; update $m = m + 1$.

**Step 10:** If $m = 1$ or $2$, add a max pool layer with a pooling window of size 2 × 2 and stride 2 to DeepPCANet.

**Step 11:** Compute the activations $a_j$, $j = 1, 2, 3, \ldots, K$, of the representative fundus images $RI_j$, $j = 1, 2, 3, \ldots, K$, such that $a_j = \text{DeepPCANet}(RI_j)$.

**Step 12:** Compute the trace ratio between the between-class scatter matrix ($S_b$) and the within-class scatter matrix ($S_w$) as

$$TR = \frac{\operatorname{Trace}(S_b)}{\operatorname{Trace}(S_w)},$$

where $S_w = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (x_j - \mu_i)(x_j - \mu_i)^{T}$ and $S_b = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^{T}$.

**Step 13:** If $TRP$ (the previous $TR$) $\leq TR$, set $TRP = TR$, $W = 3$, $H = 3$, $D = L$, and go to Step 3; otherwise, stop.
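The PCA-based kernel derivation (Steps 4 to 9) and the trace-ratio stopping criterion (Step 12) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pca_kernels` and `trace_ratio` are hypothetical, the covariance matrix is left unnormalised as in Step 6, and the trace ratio is computed directly from squared distances, which equals the traces of $S_b$ and $S_w$.

```python
import numpy as np


def pca_kernels(patches, energy=0.99):
    """Derive CONV kernels from image patches via PCA (Steps 4 to 9).

    patches: array of shape (M, w, h, d).
    Returns kernels of shape (L + 1, w, h, d) when some eigenvectors are
    left over after reaching the energy threshold, else (L, w, h, d).
    """
    M, w, h, d = patches.shape
    X = patches.reshape(M, -1).astype(float)   # Step 4: x_i in R^D
    Phi = X - X.mean(axis=0)                   # Step 5: zero-centre
    A = Phi.T                                  # columns are phi_i, shape (D, M)
    C = A @ A.T                                # Step 6: D x D matrix C = A A^T
    eigvals, eigvecs = np.linalg.eigh(C)       # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    # Step 7: smallest L whose cumulative energy ratio reaches the threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    L = min(int(np.searchsorted(ratio, energy)) + 1, len(eigvals))
    rows = [eigvecs[:, :L].T]

    # Step 8: sum the leftover eigenvectors into one extra kernel.
    if L < eigvecs.shape[1]:
        rows.append(eigvecs[:, L:].sum(axis=1, keepdims=True).T)

    # Step 9: reshape every vector back to a w x h x d kernel.
    return np.vstack(rows).reshape(-1, w, h, d)


def trace_ratio(features, labels):
    """Step 12: Trace(S_b) / Trace(S_w) for labelled feature vectors."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    mu = features.mean(axis=0)
    tr_b = tr_w = 0.0
    for c in np.unique(labels):
        Xc = features[labels == c]
        mu_c = Xc.mean(axis=0)
        tr_w += ((Xc - mu_c) ** 2).sum()            # trace of within-class scatter
        tr_b += len(Xc) * ((mu_c - mu) ** 2).sum()  # trace of between-class scatter
    return tr_b / tr_w
```

In a loop over CONV blocks, one would call `pca_kernels` on the patches of the current activations, append the block, and keep going only while `trace_ratio` on the new activations does not decrease (Step 13).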

#### 4.2.4. Addition of Global Pool and Softmax Layers

The dimension of the activation of the last CONV block is *W* × *H* × *L*. If it is flattened and passed to a fully connected (FC) layer, the number of learnable weights and biases of the FC layer becomes excessively large, which leads to overfitting. To overcome this issue, the activation of the last CONV block is passed simultaneously to global average pooling (GAP) and global max-pooling (GMP) layers [77], which extract the mean and the maximum value of each feature map, respectively, and these features are fused using a concatenation layer. Both GAP and GMP help to reduce the number of learnable parameters and extract discriminative features from the activation. Finally, a softmax layer is introduced as a classification layer, and the output of the concatenation layer is passed to this layer, as shown in Figure 1c. A dropout layer is also added after the last CONV layer to reduce overfitting.
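The pooling-and-fusion head described above can be illustrated with a short NumPy sketch. The function names, the channels-last layout, and the example sizes (8 × 8 × 32 activation, five classes) are assumptions made for illustration; the dropout layer is omitted.

```python
import numpy as np


def gap_gmp_softmax(activation, weights, bias):
    """Fuse GAP and GMP features and classify them with a softmax layer.

    activation: (W, H, L) output of the last CONV block.
    weights: (2 * L, n_classes) and bias: (n_classes,) of the classifier.
    """
    gap = activation.mean(axis=(0, 1))      # mean of each feature map, shape (L,)
    gmp = activation.max(axis=(0, 1))       # maximum of each feature map, shape (L,)
    fused = np.concatenate([gap, gmp])      # concatenation layer, shape (2L,)
    logits = fused @ weights + bias
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return fused, probs


def head_param_counts(W, H, L, n_classes):
    """Classifier parameters with a flattened input vs. with GAP + GMP."""
    flattened = W * H * L * n_classes + n_classes
    pooled = 2 * L * n_classes + n_classes
    return flattened, pooled
```

For the illustrative sizes above, `head_param_counts(8, 8, 32, 5)` gives 10,245 parameters for the flattened variant against 325 for the pooled one, which is the reduction the paragraph refers to.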

#### 4.2.5. Finetuning the DeepPCANet Model

After the architecture of the AutoML custom-designed DeepPCANet is determined, the model is fine-tuned using the training and validation sets. Fine-tuning involves various hyper-parameters: the optimization algorithm, learning rate, batch size, activation function, and dropout probability. We employed the Optuna optimization algorithm [78] to determine the best values of the hyper-parameters. We tested three optimizers (Adam, SGD, and RMSprop), a learning rate between 1e-5 and 1e-1, four batch sizes (5, 10, 15, 20), three activation functions (ReLU, LReLU, and Sigmoid), and a dropout probability between 0.25 and 0.50. After training for ten epochs, Optuna returned the best hyper-parameters for each dataset, as shown in Table 2. The number of kernels in each layer of each model is based on an energy threshold of 0.99. The models for the APTOS2019, KSU-DR, and EyePACS datasets (five classes each) are DeepPCANet-4, DeepPCANet-5, and DeepPCANet-16, respectively, and their specifications are shown in Figure 4.

Each dataset has a different AutoML architecture because the datasets come from different ethnicities (EyePACS is from the USA, APTOS2019 is from India, and KSU-DR is from KSA) and the retinal images were captured using different cameras. To confirm that distinct architectures are needed for the three DR datasets, we combined the extracted K-medoids fundus images into a single dataset, generated a custom DeepPCANet, and tested it on the three datasets. As illustrated in Table 3, the outcome is not as good as that obtained using the customized DeepPCANet for each DR dataset (Tables 4 and 5). After fixing the hyper-parameters, each model is fine-tuned using the training and validation sets for 100 epochs. The fine-tuned model is tested using the testing set.
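The search space described in this section can be written down with Optuna's trial API roughly as follows. This is a sketch under assumptions: the log-scale sampling of the learning rate, the number of trials, the use of validation accuracy as the objective, and the callable `train_and_validate` are all hypothetical and not stated in the text.

```python
def suggest_hyperparameters(trial):
    """Sample one configuration from the search space listed in the text."""
    return {
        "optimizer": trial.suggest_categorical("optimizer", ["Adam", "SGD", "RMSprop"]),
        # log=True is an assumption; the text only gives the range 1e-5 to 1e-1.
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [5, 10, 15, 20]),
        "activation": trial.suggest_categorical("activation", ["ReLU", "LReLU", "Sigmoid"]),
        "dropout": trial.suggest_float("dropout", 0.25, 0.50),
    }


def run_search(train_and_validate, n_trials=20):
    """Run the search and return the best configuration.

    train_and_validate(params, epochs) is a hypothetical user-supplied callable
    that trains DeepPCANet and returns a validation score to maximise.
    """
    import optuna  # imported lazily so the search space can be read without Optuna

    def objective(trial):
        params = suggest_hyperparameters(trial)
        return train_and_validate(params, epochs=10)  # ten epochs, as in the text

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Keeping the search space in its own function makes it easy to reuse the same ranges for every dataset while swapping only the training callable.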
**Table 2.** The best hyper-parameters found using the Optuna algorithm (SC1).

**Figure 4.** DeepPCANet architecture for (**a**) APTOS2019, (**b**) KSU-DR, and (**c**) EyePACS datasets.

**Table 3.** DeepPCANet customized by combining the extracted K-medoids fundus images to confirm the distinct architectures for the three DR datasets.


**Table 4.** Comparison between DeepPCANet models and the pretrained models for the SC1 scenario; M and K stand for millions and thousands.

| Dataset | Model | #FLOPs | #Parameters | ACC % | SE % | SP % | Kappa % |
|---|---|---|---|---|---|---|---|
| APTOS2019 | ResNet152 | 5.6 M | 60.19 M | 95.25 | 88.22 | 96.97 | 88.15 |
| | DenseNet121 | 1.44 M | 7.98 M | 96.58 | 91.55 | 97.82 | 89.22 |
| | ResNeSt50 | 5.39 M | 27.5 M | 97.11 | 92.29 | 98.2 | 90.82 |
| | **DeepPCANet-4** | **1.36 M** | **63.7 K** | **98.21** | **95.29** | **98.9** | **94.32** |
| EyePACS | ResNet152 | 5.6 M | 60.19 M | 92.25 | 80.74 | 94.9 | 75.16 |
| | DenseNet121 | 1.44 M | 7.98 M | 91.14 | 80.07 | 95 | 74.84 |
| | ResNeSt50 | 5.39 M | 27.5 M | 93.12 | 82.33 | 95.21 | 78 |
| | **DeepPCANet-16** | **2.11 M** | **557.68 K** | **94.22** | **86.56** | **96.30** | **81.64** |

**Table 5.** Comparison between DeepPCANet model and the pretrained models for SC2 scenario.


### **5. Experiments and Results**

This section first describes the evaluation protocol and the experiments performed to evaluate the proposed method and then presents the results.

### *5.1. Evaluation Protocol*

We determined the architecture of the DeepPCANet for each DR dataset and finetuned it using the training set of the corresponding DR database; the detail is given in Section 3. After that, the performance of each model was evaluated using the test set of the related database. To validate the usefulness and the superiority of the model design technique, we compared custom-designed models with the widely used state-of-the-art pre-trained CNN models such as ResNet [17], DenseNet [18], and ResNeSt [39], which have shown outstanding performance for various computer vision applications. Additionally, we compared it with AutoML models (Google Cloud AutoML) and Auto-Keras. We finetuned the competing models using the same procedure employed for DeepPCANet on each dataset.

For evaluation, we adopted three scenarios, SC1 [62,63], SC2 [61,62], and SC3, as described in Section 5.1. We evaluated the AutoML custom-designed models using SC1 and SC2 on APTOS2019 and EyePACS, and SC3 on EyePACS. However, the KSU-DR dataset was evaluated using only SC2 because the number of images is not sufficient for five classes. In addition, we used four metrics commonly used in medical applications and deep learning: accuracy (ACC), sensitivity (SE), specificity (SP), and Kappa [13,79–82].
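For reference, the four metrics can be computed from a confusion matrix as in the sketch below. It assumes integer class labels, one-vs-rest sensitivity and specificity averaged over classes (macro averaging), and unweighted Cohen's Kappa; the text does not state which averaging or Kappa weighting was used, and the function name is hypothetical.

```python
from statistics import mean


def screening_metrics(y_true, y_pred, n_classes):
    """Return ACC, macro SE, macro SP, and unweighted Cohen's Kappa."""
    n = len(y_true)
    cm = [[0] * n_classes for _ in range(n_classes)]  # rows: true, columns: predicted
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1

    row = [sum(cm[k]) for k in range(n_classes)]
    col = [sum(cm[r][k] for r in range(n_classes)) for k in range(n_classes)]
    acc = sum(cm[k][k] for k in range(n_classes)) / n

    se, sp = [], []
    for k in range(n_classes):
        tp = cm[k][k]
        fn = row[k] - tp
        fp = col[k] - tp
        tn = n - tp - fn - fp
        se.append(tp / (tp + fn) if tp + fn else 0.0)  # sensitivity of class k
        sp.append(tn / (tn + fp) if tn + fp else 0.0)  # specificity of class k

    # Agreement expected by chance from the row and column marginals.
    pe = sum(row[k] * col[k] for k in range(n_classes)) / (n * n)
    kappa = (acc - pe) / (1 - pe) if pe != 1 else 0.0
    return {"ACC": acc, "SE": mean(se), "SP": mean(sp), "Kappa": kappa,
            "SE_per_class": se, "SP_per_class": sp}
```

For the two-class scenario, one would usually report the per-class values for the DR class rather than the macro average, which is why the per-class lists are returned as well.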

#### 5.1.1. Five Class Problem (SC1)

Using the APTOS2019 and EyePACS datasets, we built the DeepPCANet-4 and DeepPCANet-16 models, respectively, for SC1 using the respective training sets and fine-tuned them using the corresponding training and validation sets (see details in Section 3). After fine-tuning, the models were evaluated on the test sets of EyePACS and APTOS2019; the results are shown in Table 4. The results of the ResNet152, DenseNet121, and ResNeSt50 models, fine-tuned using the same training sets and evaluated using the same test sets from EyePACS and APTOS2019, are also shown in Table 4. The results show that DeepPCANet-4 and DeepPCANet-16 outperform ResNet152, DenseNet121, and ResNeSt50 on both datasets in terms of all metrics; in particular, in both cases, the sensitivity and Cohen's Kappa are higher than those of ResNet152, DenseNet121, and ResNeSt50. Cohen's Kappa is considered a more robust statistical measure than accuracy [83,84]. DeepPCANet-4 has the lowest number of FLOPs (1.36 M) and learnable parameters (63.7 K) among all competing models, as shown in Table 4.

DeepPCANet-16 has fewer learnable parameters than the pre-trained ResNet152, DenseNet121, and ResNeSt50 models and fewer FLOPs than ResNet152 and ResNeSt50, but slightly more FLOPs than DenseNet121. Nevertheless, it has the best performance in terms of all metrics on the EyePACS dataset. Among the pre-trained models, ResNeSt50 performs better than ResNet152 and DenseNet121.

To compare the AutoML DeepPCANet with the state-of-the-art AutoML methods, we used the most DR-intensive dataset available, EyePACS, under scenario SC1. Following the NAS method [24], we tested two AutoML methods: Google Cloud (Vision) AutoML [43] and Auto-Keras [23]. We set up Auto-Keras locally on the same device, generated a CNN model based on the representative set, fine-tuned it using the training and validation sets, and then evaluated it using the test set, as with DeepPCANet-16. For Google Cloud AutoML, we uploaded the representative, training, validation, and test sets to Google Cloud Storage and followed the same evaluation procedure. DeepPCANet-16 outperformed Google Cloud AutoML; Auto-Keras has fewer FLOPs, as shown in Table 6, but its performance is lower than that of both other models. The FLOPs and number of parameters of Google Cloud AutoML are hidden; it reports only the precision (PR) and recall (SE) metrics.

NAS algorithms are time-consuming and resource-intensive; they typically search for the cell structure, including the topology of the connections and the operation (transformation) that connects each cell. After that, the resulting cell is replicated to construct the neural network [85]. We used a simple basic cell structure throughout our AutoML DeepPCANet (LReLU, batch normalization layer, and CONV layer). The filters in the CONV layers are derived automatically from the lesions in fundus images, which requires less time. NAS algorithms optimize both the searched architecture and the hyper-parameters. In contrast, we first derived the optimal DeepPCANet architecture and then used Optuna to optimize the hyper-parameters, as shown in Table 2.

**Table 6.** Comparison between DeepPCANet-16 and AutoML methods.


#### 5.1.2. Two Class Problem (SC2)

We validated the DeepPCANet models' performance using the three datasets for SC2. The custom-designed models DeepPCANet-5, DeepPCANet-4, and DeepPCANet-16 for KSU-DR, APTOS2019, and EyePACS, respectively, which were designed and fine-tuned using only fundus images, outperform highly complex CNN models such as ResNet152, DenseNet121, and ResNeSt50, which were pre-trained on the ImageNet dataset and fine-tuned using fundus images, in terms of all metrics, as shown in Table 5. Though DenseNet121 outperforms ResNet152 and ResNeSt50 on the three datasets, its performance is not better than that of the custom-designed models. DeepPCANet-5 involves 1.375 M FLOPs, fewer than ResNet152, DenseNet121, and ResNeSt50. The number of learnable parameters of DeepPCANet-5 is 73.66 K, which is much smaller than those of the pre-trained models ResNet152 (60.19 M), DenseNet121 (7.98 M), and ResNeSt50 (27.5 M). Figure 5 shows the ROC curves on the three datasets for the four models (customized DeepPCANet, ResNet152, DenseNet121, and ResNeSt50). The curves indicate that the DeepPCANet models perform better than the three pre-trained models on the three datasets.

**DeepPCANet-16** 2.11 M **557.68 K 94.44 94.28 96.12**

**Figure 5.** ROC curve for custom-designed DeepPCANet and the pretrained models for SC2 scenario and datasets: (**a**) KSU-DR, (**b**) APTOS-2019, (**c**) EyePACS.

### *5.2. Visualization*

To understand the decision-making mechanism of the custom-designed CNN models, we created visual feature maps using the gradient-weighted class activation mapping (Grad-CAM) visualization method [86]. The visual feature maps of two random fundus images generated by the DeepPCANet-5 model customized for the local KSU-DR dataset are shown in Figure 6d,h. The same fundus images were given blindly to two expert ophthalmologists at King Khalid Hospital of KSU, who independently marked the lesion regions manually. Though there is a slight difference between the annotations of the two experts, they agreed on most of the lesions, as shown in Figure 6b,c for the fundus image in Figure 6a (class moderate) and in Figure 6f,g for the fundus image in Figure 6e (class PDR). The visual feature maps of the DeepPCANet-5 model highlight the lesions annotated by both experts, as shown in Figure 6d,h. The yellow and orange regions in Figure 6d,h indicate that the DeepPCANet-5 model makes decisions based on the features learned from the lesion regions.
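The core Grad-CAM computation behind such maps is short enough to sketch in NumPy. The sketch assumes that the activations of the chosen CONV layer and the gradients of the target-class score with respect to them have already been obtained from the deep learning framework; the function name and the channels-last layout are illustrative assumptions.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map for one image.

    activations: (H, W, K) feature maps of the chosen CONV layer.
    gradients:   (H, W, K) gradients of the target-class score with
                 respect to those feature maps.
    Returns an (H, W) map scaled to [0, 1].
    """
    alpha = gradients.mean(axis=(0, 1))                      # one weight per channel
    cam = np.maximum((activations * alpha).sum(axis=-1), 0)  # ReLU of the weighted sum
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

To overlay the result on a fundus image, the map is typically resized to the input resolution and rendered with a colour map, so that high values appear as the warm-coloured regions discussed above.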

**Figure 6.** Visualization of the decision-making mechanism of DeepPCANet-5 model. (**a**) Fundus image from class moderate, (**b**,**c**) lesions specified by experts 1 and 2, respectively, (**d**) DeepPCANet-5 map, (**e**) fundus image from class PDR, (**f**,**g**) lesions specified by experts 1 and 2, respectively, (**h**) DeepPCANet-5 map.

### **6. Discussions**

This study proposed a technique to auto-custom-design a DeepPCANet model for a target DR dataset. The depth of the model and the width of each layer are not specified randomly or by exhaustive experiments. The custom-designed DeepPCANet models for DR screening have small depths and varying widths of CONV layers and involve a small number of learnable parameters. The results of the AutoML DeepPCANet models customized for the KSU-DR, APTOS2019, and EyePACS datasets (presented in Tables 4–6) demonstrate that they outperform the well-known, highly complex pre-trained models ResNet152, DenseNet121, and ResNeSt50, as well as the AutoML models from Google and Auto-Keras that were fine-tuned using the same DR datasets. Generally, DeepPCANet achieved competitive performance with a small number of layers and parameters. As shown in Table 4, the custom-designed DeepPCANet models for the three datasets have parameters numbering in the thousands, against millions for ResNet152, DenseNet121, and ResNeSt50. DeepPCANet-4 and DeepPCANet-5 have fewer FLOPs than all pre-trained models and perform better. DeepPCANet-16 has fewer FLOPs than ResNet152 and ResNeSt50 and also performs better. Though DenseNet121 has fewer FLOPs than DeepPCANet-16, it has the lowest performance and a large number of parameters. The reason for the lightweight structures and superior performance of the custom-designed DeepPCANet models is that their architectures have been drawn directly from the fundus images, unlike the state-of-the-art CNN models, which are mainly designed for object recognition.

In addition to comparing the custom-designed DeepPCANet models with well-known pre-trained models, it is essential to validate their effectiveness in DR screening by comparing them to the state-of-the-art methods on two challenging datasets (APTOS2019 and EyePACS). DeepPCANet-4 generated for SC1 on the APTOS2019 dataset outperforms the state-of-the-art methods on the same dataset in terms of accuracy, sensitivity, specificity, and Kappa, as shown in Table 7. DeepPCANet-4 based on the five-class problem (SC1) and the APTOS2019 dataset outperforms the method presented in Sikder et al. (2021), which used handcrafted features and needs a long processing time for fundus image preprocessing and feature extraction. Though the method by Tymchenko et al., 2020 [64] outperforms DeepPCANet-4 in the Kappa score for the five-class problem (SC1), it has lower accuracy, sensitivity, and specificity, and it is based on a highly complex ensemble of 20 CNN models.

For the same scenario, the DeepPCANet-16 designed for EyePACS outperforms the existing methods in accuracy and specificity. The method by Islam et al., 2018 [61] obtained higher sensitivity, but their model is more complex, and it was tested on only 4% of the dataset, as shown in Table 7. For SC2 (normal vs. DR levels), DeepPCANet-4 outperforms the method by Tymchenko et al. [64] in all metrics on APTOS2019. In this scenario, on the EyePACS dataset, as shown in Table 7, DeepPCANet-16 is better than the other methods in accuracy, sensitivity, and specificity, with two exceptions. The method by Islam et al., 2018 [61] is slightly better than DeepPCANet-16 in sensitivity, but it was tested on only 4% of the EyePACS dataset. The method of Chetoui et al., 2020 [87] is also better than DeepPCANet-16, but it uses transfer learning based on Inception-ResNet-v2, which has high complexity and a large number of parameters: it consists of five convolutional layers, each followed by batch normalization, two pooling layers, forty-three inception modules, three residual connections, global average pooling, and two fully connected layers with the rectified linear unit (ReLU), whereas DeepPCANet-16 is a 16-layer structure that employs the basic CONV setup. On the EyePACS dataset for SC3 (0 and 1 vs. DR levels), DeepPCANet-16 obtained lower accuracy than Colas et al. [67] and Islam et al. [61] but higher sensitivity and specificity, which are more important and robust than accuracy in medical applications [88].


**Table 7.** Comparison between DeepPCANet models and state-of-the-art methods.



### **7. Conclusions**

We introduced an approach to automatically build an AutoML data-dependent CNN model (DeepPCANet) customized for DR screening. This approach tackles the limitations of the available annotated DR datasets and the problem of a vast search space and a huge number of parameters in a deep CNN model. It automatically builds a lightweight CNN model customized for a target DR dataset using k-medoid clustering, principal component analysis (PCA), and inter-class and intra-class variations. The DeepPCANet model is data-dependent, and each DR dataset has its appropriate AutoML architecture. The customized models, DeepPCANet-5 for the local KSU-DR dataset, DeepPCANet-4 for APTOS2019, and DeepPCANet-16 for the EyePACS dataset, outperform the pre-trained, very deep, and highly complex ResNet152, DenseNet121, and ResNeSt50 models fine-tuned using the same datasets and procedure. The complexity and number of parameters of the customized DeepPCANet models are significantly lower than those of ResNet152 and ResNeSt50. Though DenseNet121 has fewer FLOPs than DeepPCANet-16, it has the lowest performance and a large number of parameters. On the EyePACS dataset, compared to Google Cloud AutoML and Auto-Keras, DeepPCANet-16 based on SC1 obtained better performance with fewer parameters. On the same dataset, compared to the state-of-the-art methods (for SC2 and SC3), DeepPCANet-16 has lower complexity and fewer parameters and achieves competitive performance.

DeepPCANet fails to predict the DR grade from poor-quality fundus images. It could not correctly grade some poor-quality fundus images from the EyePACS dataset; each image in this dataset was graded by only one expert from the geographic region of California, which can potentially lead to annotation bias. How DeepPCANet can reliably predict the DR grade from poor-quality fundus images is a subject of future work. Additionally, how DeepPCANet can be generalized to different fundus datasets is a subject of future work.

**Author Contributions:** Conceptualization, F.S. and M.H.; data curation, F.S., F.A.A. and A.M.A.O.; formal analysis, F.S., F.A.A. and A.M.A.O.; funding acquisition, methodology, F.S. and M.H.; project administration, M.H. and H.A.A.; resources, M.H. and H.A.A.; software, F.S.; supervision, M.H. and H.A.A.; validation, F.S.; visualization, F.S.; writing original draft, F.S.; writing—review and editing, M.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors extend their appreciation to the Deputyship for Research and Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project no. (IFKSURG-2-108).

**Data Availability Statement:** Public domain datasets were used for experiments.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
