Article

A Wireless Sensor System for Diabetic Retinopathy Grading Using MobileViT-Plus and ResNet-Based Hybrid Deep Learning Framework

1 School of Information Engineering, Nanchang University, Nanchang 330031, China
2 Industrial Institute of Artificial Intelligence, Nanchang University, Nanchang 330031, China
3 Queen Mary College, Nanchang University, Nanchang 330031, China
4 School of Computer Science, Nanjing Audit University, Nanjing 211815, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(11), 6569; https://doi.org/10.3390/app13116569
Submission received: 19 April 2023 / Revised: 24 May 2023 / Accepted: 26 May 2023 / Published: 29 May 2023
(This article belongs to the Special Issue Recent Advances in Wireless Sensor Networks and Its Applications)

Abstract

Traditional fundus image-based diabetic retinopathy (DR) grading depends on the examiner's experience, requires manual annotation of the fundus image, and is time-consuming. Wireless sensor networks (WSNs) combined with artificial intelligence (AI) technology can provide automatic decision-making for DR grading applications. However, the diagnostic accuracy of the AI model is one of the challenges limiting the effectiveness of WSN-aided DR grading. To address this issue, we propose a WSN architecture for automatic DR grading and a parallel deep learning framework (HybridLG) that yields a fundus image-based deep learning model with superior classification performance. In particular, the framework constructs a convolutional neural network (CNN) backbone and a Transformer backbone in parallel. A novel lightweight deep learning model named MobileViT-Plus is proposed to implement the Transformer backbone of HybridLG, and a model training strategy inspired by ensemble learning is designed to improve the model's generalization ability. Experimental results demonstrate the state-of-the-art performance of the proposed HybridLG framework, which grades diabetic retinopathy accurately and generalizes well. Our work is significant for guiding studies of WSN-aided DR grading and provides evidence supporting the efficacy of AI technology in DR grading applications.

1. Introduction

Wireless sensor networks (WSNs) are distributed networks that use spatially distributed autonomous sensors to monitor physical or environmental conditions [1], such as temperature, pressure, and humidity, and transmit the data wirelessly to a central location [2]. WSNs are widely used in many areas, including environmental monitoring, industrial automation, and military surveillance [3]. Meanwhile, they are increasingly combined with artificial intelligence (AI) to provide automatic, intelligent decision-making for various medical applications. For example, in the field of ophthalmic disease diagnosis and treatment, DeBuc [4] claimed that cost-effective remote assessment of the diabetic eye using portable retinal cameras can be a useful tool to support the proactive prevention of diabetes. Das et al. [5] proposed a scalable cloud-based teleophthalmology architecture using the Internet of Medical Things (IoMT) for diagnosing age-related macular degeneration. These studies demonstrate the potential of using wearable sensors and wireless networks to support the clinical diagnosis and treatment of ophthalmic disease. Diabetic retinopathy (DR), the most common and serious complication of diabetes [6], is the leading cause of blindness in working-age people [7]. About 30% of people with diabetes suffer from DR [7]. By 2030, an estimated 191.0 million people worldwide will have DR, reflecting rapid growth in the prevalence of the disease [8]. However, due to the lack of qualified ophthalmologists in many regions, some patients cannot receive timely diagnosis and treatment, which eventually leads to blindness [9]. WSN-aided applications for DR diagnosis and treatment provide a promising way to make more efficient use of medical resources.
DR is mainly caused by high blood sugar and damage to the retinal blood vessels [10]. Fundoscopic examination reveals local pathological changes such as microaneurysms, haemorrhages, exudates, and cotton wool spots, and overall pathological changes such as vascular abnormalities, retinal oedema and atrophy, and neovascularisation [11,12]. DR is categorized into five stages by the degree of its pathology [13]: no DR, mild non-proliferative diabetic retinopathy (NPDR), moderate NPDR, severe NPDR, and proliferative diabetic retinopathy (PDR). Figure 1a–e show fundus images with no DR, mild NPDR, moderate NPDR, severe NPDR, and PDR, respectively. No DR means that a person with diabetes does not show any signs of DR. Mild NPDR refers to an early stage of DR characterized by microaneurysms, small haemorrhages, or hard exudates in the retina, but without the development of new blood vessels or structural changes in the retina. Moderate NPDR is characterized by microaneurysms, retinal haemorrhages, cotton wool spots, tiny yellow deposits [14], venous beading, intraretinal microvascular abnormalities (IRMA), or moderately severe intraretinal haemorrhages in at least one of the four quadrants. Severe NPDR is characterized by numerous and prominent intraretinal haemorrhages in all four quadrants of the retina, cotton wool spots, venous beading, and IRMA. PDR is a more severe form of DR, characterized by the growth of abnormal new blood vessels in the retina [15], which can cause vision loss and even blindness if left untreated. Early, rapid, and automatic grading of DR severity, together with timely treatment, is vital to prevent or slow the progression of DR [16].
Fundus photography is a common ophthalmic screening method for DR [17,18]. However, direct interpretation of fundus camera images depends heavily on the examiner's experience, suffers from limited diagnostic accuracy, and is time-consuming [19]. In recent years, DR grading systems based on machine learning and deep learning have been widely developed and applied [20,21]. Convolutional neural networks (CNNs) are deep learning models that detect and segment objects in medical images. CNNs can process two-dimensional (2D) data such as images and have good spatial feature extraction capability [22]. CNN models such as Inception V3, Xception, and ResNet50 can automatically diagnose and stage DR [23]. In addition, Vision Transformer (ViT) [24]-based image processing has attracted much attention because its self-attention mechanism captures the global dependencies among image patches and achieves superior classification performance.
Although deep learning-based medical image processing has achieved remarkable results, the DR fundus images collected by WSNs possess three characteristics: irregular region distribution, blurred pixels and low contrast, and image diversity, all of which challenge the performance of deep learning.
(1)
Irregular region distribution. The regional distribution of DR lesions is irregular, and there is individual variability in the vascular pathways of the fundus. Both the fundus lesions and complete fundus information are required to grade DR [25]. Further, the deep learning model's ability to capture global lesion locations is equally essential. A CNN focuses only on the presence or absence of the target to be detected and not on the location and relative spatial relationships between components. Chen et al. argued that local receptive field operations, such as convolution, limit the acquisition of long-range pixel relationships [26]. Yang et al. concluded that CNN algorithms ignore some valuable DR-related lesions [27]. The Transformer, a neural network architecture that relies on a self-attention mechanism to process sequential data, has a large effective receptive field and can learn global representations [28,29]. However, since it requires a large amount of training data, the time required for training is relatively long.
(2)
Blurred pixels and low contrast. The lesion area may have blurred pixels and low contrast with the surrounding area [30,31]. Lesion pixels that are too close in intensity to pixels in other areas appear as blurred pathological features in fundus images. In image classification tasks, "blurred pixels" refers to the situation where an image pixel has an unclear or ambiguous boundary, making it difficult for the model to classify the pixel's content accurately. "Low contrast" refers to images with a narrow range of brightness or colour values, resulting in reduced visual contrast between objects and making it difficult for the model to distinguish between them. The dataset used in this work contains images with various distortion types, including low contrast and blurred pixels, which can negatively impact the performance of the classification model. For example, the contrast between light and dark areas in Figure 1e is not apparent, meaning the image has low contrast and looks blurry.
(3)
Image diversity. The diversity of fundus images is mainly affected by the following factors: patient demographics, imaging equipment, and image acquisition settings. The age, race, gender, medical history, and other characteristics of patients affect the appearance of fundus images [32]. For example, elderly patients may show more signs of age-related macular degeneration, while diabetic patients may show characteristic signs of diabetic retinopathy. The imaging device's type, brand, model, and parameter settings also affect the quality and resolution of fundus images [33]. Image acquisition settings, such as lighting, contrast, and magnification, and image preprocessing steps, such as image enhancement, filtering, and normalization, also affect the appearance and characteristics of the images. An unbalanced number of images per category in a multi-class dataset also affects model training. In addition, fundus images are taken under different conditions, as reflected in differences in lighting, shooting angles, and varying degrees of noise. Additionally, the radius of the effective area of the image (the circular retinal area) varies depending on the image size.
To overcome the challenges mentioned above, we leverage WSNs and deep learning to construct an application for DR diagnosis and treatment services. The main contributions of this study are as follows:
(1)
To support health caregivers in the effective management and proactive prevention of diabetic retinopathy, we propose a wireless sensor system architecture to perform DR grading in a ubiquitous environment and provide clinical services for DR diagnosis and treatment. The architecture employs portable retinal cameras, a blood glucose monitor, and a tablet computer as the data collection nodes of the wireless sensor network. A database server implemented with AI algorithms is responsible for processing the data transmitted via the wireless network and making decisions for various clinical roles (e.g., clinicians, patients, and hospital staff).
(2)
To solve the problems of irregular region distribution, blurred pixels, and low contrast, we propose a parallel deep learning framework (HybridLG) for learning both the local and global information of 2D fundus images. The framework includes a CNN backbone, a Transformer backbone, a neck network for feature fusion, and a head network. The CNN and Transformer backbones are adopted to extract local and global information, respectively. The neck network is adopted to fuse the local and global information in preparation for the final grading decision. In addition, to deal with the problem of model performance being affected by image diversity, we propose a model training strategy inspired by ensemble learning, which aims to improve the generalization ability of the parallel framework. We use a fully connected layer to simulate this weighted voting process by adding weight parameters to each soft-voting model to identify the best result.
(3)
Considering the high computational complexity of the Transformer backbone, we propose a novel deep learning model named MobileViT-Plus and use it to implement the Transformer backbone of HybridLG. Specifically, the MobileViT-Plus model is constructed by introducing a light Transformer block (LTB) into the MobileViT model. In the original Transformer block, multi-head self-attention involves computing the pairwise similarity between all pairs of input elements, which results in quadratic computational complexity. To mitigate this computational overhead, in the LTB we use a k × k depth-wise convolution with stride k to reduce the spatial size of the key and value features before the attention operation. In addition, we use a pre-trained ResNet101 to implement the CNN backbone of the HybridLG framework.

2. Related Work

2.1. Clinical DR Grading Application

Diagnostic tools and therapies have been developed for DR, including optical coherence tomography (OCT), fluorescein angiography (FA), and laser treatment. These techniques allow the early detection and treatment of DR, preventing or delaying vision loss in diabetic patients. Virgili et al. [34] reported that OCT produces reliable, reproducible [35], and objective retinal images, especially in diabetic macular oedema, and provides information about vitreoretinal relationships that can only be detected with OCT. It enhances the ability to accurately diagnose diabetic macular oedema, epiretinal membranes, and vitreomacular and vitreoretinal traction. Rabiolo et al. [35] reviewed the usefulness of FA in the examination of patients suffering from DR. To expand the field of view, wide-field and ultra-wide-field imaging have been developed, allowing up to 200 degrees of the retinal surface to be viewed in a single shot; visualization of the peripheral retina is fundamental for assessing nonperfused areas, vascular leakage, microvascular abnormalities, and neovascularisation. Deschler et al. [36] introduced laser treatment for diabetic retinopathy as the first intraocular treatment to provide a highly effective means for preventing visual loss in patients with diabetes. Despite these advancements, DR remains a significant public health concern, particularly in low- and middle-income countries. Thus, there is a continued need for research to develop cost-effective, scalable, and accessible solutions for DR detection and management.

2.2. WSNs-Aided DR Grading

WSNs have been widely applied in various fields, including medical image processing, and have shown potential for improving the accuracy and efficiency of retinal image classification tasks. In particular, WSN-based systems have been utilized for the classification of diabetic retinopathy, glaucoma, and other ophthalmic diseases. DeBuc [4] reviewed the role of retinal imaging and mobile technologies in tele-ophthalmology applications for diabetic retinopathy screening and management, claiming that telemedicine is a worthy tool to support health caregivers in the effective management and prevention of diabetes. Das et al. [5] proposed a scalable cloud-based teleophthalmology architecture via the Internet of Medical Things (IoMT) for the diagnosis of AMD, in which patients wear a head-mounted camera that sends their retinal fundus images to secure and private cloud storage for personalized disease severity detection and predictive progression analysis. Mishra et al. [37] introduced a fully wearable, wireless soft electronic system that offers portable, highly sensitive tracking of eye movements via the combination of skin-conformal sensors and a virtual reality system.

2.3. Deep Learning-Based DR Grading

Deep learning has shown promising results in the field of diabetic retinopathy grading. The use of deep learning can help improve the accuracy and efficiency of DR grading, which is critical for the early detection and timely treatment of DR. For example, Gulshan et al. [38] proposed a deep learning system for the automated grading of DR using a large dataset of retinal images, showing that deep learning algorithms have high sensitivity and specificity for detecting diabetic retinopathy and macular edema in retinal fundus photographs. Wang et al. [39] introduced a deep learning system for the automated grading of DR using a multi-level approach; this technology offers the potential to increase the efficiency and accessibility of DR screening programs. Wu et al. [40] proposed a Vision Transformer-based method to recognize the grade of diabetic retinopathy, highlighting that an attention mechanism based on a Vision Transformer model is promising for the diabetic retinopathy grade recognition task. Achieving interesting early results using only a few colour features, Araújo et al. [41] introduced a novel Gaussian-sampling approach based on a multiple instance learning (MIL) framework which enables ophthalmologists to assess the reliability of the system's decision. Building upon this idea, Wang et al. [42] addressed the multiple instance classification problem in OCT images using MIL techniques and achieved significant success. This is particularly effective because the morphological changes induced by diabetic macular edema (DME) are sparsely distributed in the B-scan images of the OCT volume, and OCT data are commonly labeled at a coarse level, making MIL an appropriate approach. Vocaturo et al. [43] presented preliminary numerical results obtained from the classification of healthy eye fundus images versus those with severe diabetic retinopathy using a MIL algorithm. Zhu et al. [44] tackled the challenge of classifying referable diabetic retinopathy (rDR) and segmenting lesions using self-supervised equivariant learning and attention-based MIL.

3. Methodology

3.1. Methodology Architecture

Figure 2 shows the wireless sensor system architecture for diabetic retinopathy prevention and treatment, which aims to address existing problems in DR grading applications, such as tedious information entry, resource consumption, and the inadequate diagnosis of retinal lesions based on fundus images. The architecture includes three parts: the patient interface, the service interface, and the user interface. The left block represents the patient side of the system, which utilizes a portable retinal camera and a blood glucose monitor to capture retinal images and measure blood glucose concentration, respectively. A tablet computer is used to send the electronic medical record (EMR) of patients to the database server over the wireless network. The portable retinal camera can perform independent operations such as retinal illumination and focusing; it does not display retinal images itself but connects to a computer with a wireless card via a built-in WiFi module, displaying the fundus images collected by the camera sensor in real time on the computer screen. The middle block of Figure 2 shows the database server in our system. The database server is used for data storage, while the deep learning model deployed on the server analyses the signals online. In addition, the uploaded EMR information is transmitted to the hospital information system (HIS), supporting studies of diabetes prevention and its complications. The right block lists several user interfaces for clinical applications of our system, including the central monitoring station, workstations for doctors and nurses, and an automated screening interface for patients.

3.2. Data Preparation and Augmentation

The original data can be downloaded directly from Kaggle's official website [45]. Each sample in the dataset was resized to 224 × 224 pixels to facilitate its use with many pre-trained deep learning models. Aravind Eye Hospital in India hopes to detect and prevent this disease among people living in rural areas where medical screening is difficult; Aravind technicians travel to these areas to capture images and then rely on highly trained doctors to review the images and provide a diagnosis. The dataset is sorted by label into five classes representing no DR, mild NPDR, moderate NPDR, severe NPDR, and PDR [46], with sample sizes of 1805, 370, 999, 193, and 295, respectively.
The number of samples in the five categories is highly imbalanced. With such an imbalanced dataset, an optimal result could not be obtained because the model never adequately examines the minority categories. Data augmentation is therefore used to align each category with the category with the most samples and balance the data among the diabetic retinopathy severity categories. We employed data augmentation techniques, specifically rotation by multiples of 90 degrees, colour jittering, and contrast adjustment. Rotating by multiples of 90 degrees preserves the integrity of the image content: the image dimensions remain unchanged, and the rotated images are not stretched or compressed. This approach maintains the consistency of object shapes and proportions in the images, thus avoiding information loss. Colour jittering and contrast adjustment were employed to address the significant colour differences between samples and to improve the model's generalization; for example, in Figure 1c, elderly patients may exhibit more signs of age-related macular degeneration. Moreover, since some instances in the dataset were already blurry or had low contrast, introducing additional noise was not considered beneficial, as it could lead to underfitting during model training.
First, we randomly selected 200 images from the no DR samples and applied a 180-degree rotation. For the moderate NPDR samples, all images were rotated 180 degrees, and two images were randomly selected for a 90-degree rotation. The mild NPDR, severe NPDR, and PDR samples were subjected to 90-degree, 180-degree, and 270-degree rotations. Next, we randomly selected 260 images from the mild NPDR class, applied a 5% colour dither, and performed a 1.3× contrast adjustment once. For the severe NPDR class, we randomly selected 614 images and performed 5% colour dithering once. For the PDR class, we randomly selected 410 images and applied both a 5% colour dither and a 1.3× contrast adjustment once. Overall, we augmented the number of samples in each category to 2000, resulting in 10,000 images across the five categories.
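As an illustration of this pipeline, the following sketch shows how the class-balancing augmentation described above could be scripted with torchvision and PIL. The folder layout, file extension, single per-class target of 2000 images, and random choice of one augmentation per new sample are simplifying assumptions for illustration and do not reproduce the exact per-class recipe listed above.

```python
import random
from pathlib import Path

from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF

# Hypothetical folder layout: one sub-directory per DR grade (0-4).
DATA_ROOT = Path("diabetic-retinopathy-224x224")
TARGET_PER_CLASS = 2000  # balancing target described in Section 3.2

# Rotations by multiples of 90 degrees keep the 224 x 224 geometry intact;
# colour jitter and contrast adjustment address colour differences between samples.
rotate_90s = transforms.Lambda(lambda img: img.rotate(random.choice([90, 180, 270])))
colour_jitter = transforms.ColorJitter(brightness=0.05, saturation=0.05, hue=0.05)
contrast_boost = transforms.Lambda(lambda img: TF.adjust_contrast(img, 1.3))
augmentations = [rotate_90s, colour_jitter, contrast_boost]

for class_dir in sorted(p for p in DATA_ROOT.iterdir() if p.is_dir()):
    images = sorted(class_dir.glob("*.png"))
    for i in range(max(TARGET_PER_CLASS - len(images), 0)):
        src = random.choice(images)
        img = Image.open(src).convert("RGB").resize((224, 224))
        aug = random.choice(augmentations)(img)  # one augmentation per new sample
        aug.save(class_dir / f"aug_{i}_{src.name}")
```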

3.3. HybridLG Framework Construction

3.3.1. Framework Structure

Figure 3 shows the structure of the HybridLG framework. The simplest approach to simultaneously extract local and global information from retinal fundus images is processing the information separately through two networks, followed by a fusion head to combine the outputs. Thus, we propose the parallel deep learning framework HybridLG, mainly containing four components: a CNN backbone for extracting local information, a Transformer backbone for extracting global information, a feature fusion neck for integrating both local and global information of the DR in the 2D fundus image, and a head for outputting the result. The meaning of the modules will be explained in the following sections.
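A minimal PyTorch sketch of this parallel layout is given below. The module names (cnn_backbone, vit_backbone, neck) are illustrative, and the two backbones are treated as black boxes that each emit five class scores; the neck and head follow the description in Section 3.3.4.

```python
import torch
import torch.nn as nn

class HybridLG(nn.Module):
    """Parallel CNN + Transformer framework; attribute names are illustrative."""

    def __init__(self, cnn_backbone: nn.Module, vit_backbone: nn.Module, num_classes: int = 5):
        super().__init__()
        self.cnn_backbone = cnn_backbone   # e.g., ResNet101, emitting 5 class scores
        self.vit_backbone = vit_backbone   # e.g., MobileViT-Plus, emitting 5 class scores
        # Neck: one fully connected layer that learns a weighted combination
        # of the two sub-models' outputs (weighted soft voting).
        self.neck = nn.Linear(2 * num_classes, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_scores = self.cnn_backbone(x)    # local information
        global_scores = self.vit_backbone(x)   # global information
        return self.neck(torch.cat([local_scores, global_scores], dim=1))

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        # Head: softmax turns the fused scores into per-class probabilities.
        return torch.softmax(self.forward(x), dim=1)
```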

3.3.2. ResNet Backbone for Learning Local Information

The general structure of ResNet101 is shown in the ResNet [47] backbone of Figure 3. At the beginning of the network, a convolutional layer with a kernel size of 7 × 7 and a stride of 2 is followed by a max pooling layer with a kernel size of 3 × 3 and a stride of 2. The network then contains multiple residual blocks, each consisting of three convolutional layers with different numbers of filters. The residual blocks are sequentially connected, with a shortcut connection added to each block to bypass one or more layers. These shortcut connections can be identity functions or 1 × 1 convolutional layers, depending on the input and output dimensions.
The residual blocks are organized into a set of three blocks with 64 filters each, followed by a set of four blocks with 128 filters each, a set of 23 blocks with 256 filters each, and a set of three blocks with 512 filters each. The first residual block in each set uses a projection shortcut to adjust the input and output dimensions and, except for the first set, has a stride of 2. The residual blocks in ResNet101 are further grouped into four layers in the ResNet backbone of Figure 3 according to their filter numbers and stride values.
ResNet is a classical model of CNN, and CNNs are designed to be highly effective at learning spatial features in images by utilizing local receptive fields. The local receptive fields allow CNNs to extract spatial features that are increasingly complex and abstract by aggregating the information from the local regions. Additionally, the CNNs’ pooling layers further improve the ability to capture local information by reducing the dimensionality of the feature maps and, thus, enhancing the translation invariance of the model. Thus, CNNs are well-suited for grading DR, where local spatial features are crucial for achieving high accuracy.
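A brief sketch of how the pre-trained ResNet101 backbone can be prepared for five-grade classification with torchvision is given below; replacing only the final fully connected layer is an assumption based on common transfer-learning practice rather than a detail stated above.

```python
import torch.nn as nn
from torchvision import models

def build_resnet101_backbone(num_classes: int = 5) -> nn.Module:
    # Load ResNet101 with ImageNet pre-trained weights (Section 3.3.2);
    # newer torchvision versions use the `weights=` argument instead.
    model = models.resnet101(pretrained=True)
    # Replace the 1000-class ImageNet head with a 5-class DR grading head.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```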

3.3.3. MobileViT-Plus Backbone for Learning Global Information

Although the current dataset consists of 10,000 images, which is not a large number, it is expected to become more abundant and of higher quality as the user base grows. As a result, real-time updating of the database and models will also require careful consideration of computational resources as a cost factor. Compared with the classic Transformer (e.g., ViT), MobileViT-Plus offers not only a faster training speed but also a quicker inference speed. Furthermore, MobileViT-Plus demonstrates better adaptability and stronger generalization when trained and tested on smaller-scale datasets. However, the small model size of MobileViT-Plus may limit its ability to capture highly complex and abstract image features, potentially leading to inferior performance on other, more complex tasks. Thus, a balance between maintaining a lightweight model and meeting task requirements is crucial in practical applications.
As shown in the MobileViT-Plus backbone of the HybridLG framework in Figure 3, MobileViT-Plus is mainly composed of depthwise separable convolutions and MobileViT-Plus blocks. The depthwise separable convolutions used here serve mainly for feature extraction and for reducing computational costs [48]. The corresponding parameters are given next to the Dwise Conv module in Figure 3. The specific structure of MobileViT-Plus is shown in Figure 4.
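For reference, a depthwise separable convolution can be sketched in PyTorch as below; the normalization and activation layers (BatchNorm and SiLU) are assumptions for illustration rather than the exact configuration used in MobileViT-Plus.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution
    followed by a 1x1 pointwise convolution, which costs far fewer parameters
    and FLOPs than a standard convolution [48]."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)   # assumed normalization layer
        self.act = nn.SiLU()               # assumed activation

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```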
The Transformer can learn global information better than a CNN. Its input is linearly transformed into a query matrix Q, a key matrix K, and a value matrix V by the parameter matrices W_Q, W_K, and W_V. According to the structure in Figure 4, the Transformer's input is a sequence of flattened image patches fed into a Transformer encoder. The patches are arranged in a regular grid, each corresponding to a fixed-size region of the input image.
The process of multi-head self-attention often consumes considerable computing resources because of the large input feature size, which complicates the training and deployment of the network. In the lightweight Transformer block of Figure 4, the matching of Q and K can be understood as calculating the correlation between the two: the larger the correlation, the higher the weight of V. We use a k × k depth-wise separable convolution to down-sample K and V, respectively. With $d_k$ denoting the dimension of K (and $d_v$ that of V), the two resulting smaller features K′ and V′ are defined as:
$K' = \mathrm{DWConv}(K) \in \mathbb{R}^{\frac{n}{k^{2}} \times d_{k}},$
$V' = \mathrm{DWConv}(V) \in \mathbb{R}^{\frac{n}{k^{2}} \times d_{v}}.$
Then, a relative position bias B is added to each self-attention module. The lightweight multi-head self-attention is defined as:
$\mathrm{LightAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K'^{T}}{\sqrt{d_{k}}} + B\right) V'.$
We further adapted the MobileViT [49] model by incorporating a lightweight multi-head self-attention block [50]. The resulting model, MobileViT-Plus, is designed to address the limitation of the computational overhead, as shown in Figure 4.
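To make the LightAttn definition concrete, the following sketch implements a single-head version of the lightweight attention in PyTorch. The single-head simplification, the treatment of the relative position bias B as a freely learnable parameter, and the assumed square token grid are illustrative choices rather than the exact MobileViT-Plus implementation.

```python
import math
import torch
import torch.nn as nn

class LightAttention(nn.Module):
    """Single-head lightweight self-attention: K and V are spatially down-sampled
    by a k x k depth-wise convolution with stride k before attention, so the
    attention matrix is n x (n / k^2) instead of n x n."""

    def __init__(self, dim: int, k: int = 2, hw: int = 14):
        super().__init__()
        self.dim, self.k, self.hw = dim, k, hw
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Depth-wise convolutions with stride k for down-sampling K and V.
        self.dw_k = nn.Conv2d(dim, dim, kernel_size=k, stride=k, groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, kernel_size=k, stride=k, groups=dim)
        n, n_small = hw * hw, (hw // k) ** 2
        # Relative position bias B, modeled here as a learnable parameter.
        self.bias = nn.Parameter(torch.zeros(n, n_small))

    def _downsample(self, t: torch.Tensor, conv: nn.Module) -> torch.Tensor:
        b, n, d = t.shape
        t = t.transpose(1, 2).reshape(b, d, self.hw, self.hw)  # tokens -> feature map
        t = conv(t)                                            # spatial size / k
        return t.flatten(2).transpose(1, 2)                    # back to tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, n, dim), n = hw*hw
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        k = self._downsample(k, self.dw_k)                     # K'
        v = self._downsample(v, self.dw_v)                     # V'
        attn = q @ k.transpose(-2, -1) / math.sqrt(self.dim) + self.bias
        return attn.softmax(dim=-1) @ v
```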

3.3.4. Feature Fusion Network and Head Network

In addition to the training methods for the sub-models described above, a feature fusion network is required to combine their outputs, and a head network is required for the final classification. The neck part of Figure 3 shows the fully connected layer for fusing local and global information. Specifically, both ResNet and MobileViT-Plus have been trained to learn the parameters of their respective networks; the outputs of these networks are the probabilities of classifying the input images into the five categories. Training the fully connected layer in the neck part fuses the outputs of ResNet and MobileViT-Plus. Since the classification of each sample based on local and global information may differ, it is necessary to find a balanced classification boundary and learn the more essential features, for which a weighted sum is an excellent choice. Inspired by the idea of weighted soft voting [51] from ensemble learning, we created a weighted voting mechanism for the feature fusion network. This fully connected layer, which uses identity mapping as the activation function and a bias vector to adjust the output offset of the layer, combines the sub-models with learnable parameters. The fully connected layer maps the inputs to a vector in which each element represents a specific class's score; a higher score indicates that the model is more likely to predict the corresponding class.
A softmax layer is utilized in the head part to output the probability of classifying each sample into each category (Figure 3). The category with the highest probability is the final prediction of the model. The softmax is given by the following formula, where $z_i$ is the i-th element of the score vector and the sum in the denominator runs over all elements j:
$\mathrm{Softmax}(z_{i}) = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}$
During training, the model uses the backpropagation algorithm to update the weight parameters of the fully connected layer, minimising the loss function and improving the classification accuracy of the model.

3.4. Model Training Strategy

Before the training of MobileViT-Plus and ResNet is complete, the neck part is not involved in the training process; the results of the sub-models are used for individual validation. The model training strategy is divided into three steps (a code sketch follows the list):
(1)
We first trained ResNet101 on our dataset using the default hyperparameters, with the cross-entropy loss:
$\mathrm{CrossEntropyLoss} = -\sum \mathrm{label} \times \log(\mathrm{softmax}(\mathrm{input})).$
We then fine-tuned the hyperparameters of ResNet101 by conducting a grid search, arriving at a learning rate of 0.0001, 20 epochs, and a batch size of 4. Finally, we selected the best hyperparameters based on performance and used the trained ResNet101 as a sub-model in our fusion approach.
(2)
After ResNet101 had been trained, we trained MobileViT-Plus with a learning rate of 0.0001, a batch size of 4, and 50 epochs; the cross-entropy loss becomes stable after 50 epochs. Thanks to the parallel HybridLG framework, this step can also run while ResNet101 is being trained. In training ResNet and MobileViT-Plus, we utilized 10-fold cross-validation to evaluate the performance of the proposed model; the performance metrics were averaged across the ten folds to provide a more reliable estimate of the model's performance.
(3)
When the sub-models are connected, their parameters are frozen and only the parameters of the new fully connected layer are trained. This step therefore takes place after all the sub-models are trained; because only a few parameters can change, five epochs are sufficient to find the balanced classification boundary at a learning rate of 0.0001. The classification result is then output by the softmax layer of the head network.
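The three steps can be summarized in the following hedged sketch, which reuses the HybridLG and build_resnet101_backbone helpers sketched earlier; the Adam optimizer, the mobilevit_plus_model object, and the train_loader are assumptions introduced for illustration, while the learning rate, batch-driven epochs, and freezing scheme follow the values stated above.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train(model: nn.Module, loader, epochs: int, lr: float = 1e-4) -> nn.Module:
    # Only parameters with requires_grad=True are optimized (relevant in step 3).
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)  # optimizer choice is an assumption
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            opt.step()
    return model

# Steps (1) and (2): train the two sub-models independently; in the parallel
# HybridLG framework these two steps can run at the same time.
resnet = train(build_resnet101_backbone(), train_loader, epochs=20)
mobilevit_plus = train(mobilevit_plus_model, train_loader, epochs=50)

# Step (3): freeze the sub-models and train only the fusion (neck) layer.
hybrid = HybridLG(resnet, mobilevit_plus)
for p in hybrid.cnn_backbone.parameters():
    p.requires_grad = False
for p in hybrid.vit_backbone.parameters():
    p.requires_grad = False
train(hybrid, train_loader, epochs=5)
```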

4. Performance Evaluation

4.1. Experimental Setup

The proposed models were built in a hardware environment comprising an Intel (R) Core (TM) i7-10510U CPU @ 1.80 GHz, with Windows 10 as the workstation's operating system. The integrated development environment and the deep learning library were PyCharm (Python 3.6) and PyTorch 1.10.2, respectively. Graphics processing units (GPUs) are well suited to deep learning tasks due to their parallel computing capabilities, which allow them to perform matrix multiplications and other computationally intensive operations much faster than CPUs. Renting GPUs from public cloud providers, such as the RTX A4000 and RTX 2080 Ti, was essential for accelerating the training of the deep learning models.

4.2. Evaluation Metrics

The performance of the proposed model was evaluated using various metrics, including accuracy, precision, recall, F1-score, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). Accuracy is the ratio of correctly classified samples to the total number of samples. Precision measures the proportion of true positive predictions over the total number of positive predictions, while recall is the proportion of true positive predictions over the total number of actual positive samples. The F1 score [52] is the harmonic mean of precision and recall, indicating the balance between the two measures. The ROC curve [53] is a graphical representation of the true positive rate (TPR) against the false positive rate (FPR), which are defined as follows:
$\mathrm{FPR} = \frac{FP}{FP + TN},$
$\mathrm{TPR} = \frac{TP}{TP + FN},$
where TP refers to the number of correct positive predictions, FN refers to the number of incorrect negative predictions, TN refers to the number of correct negative predictions, and FP refers to the number of incorrect positive predictions. The AUC is the area under the ROC curve, indicating the ability of the model to distinguish between positive and negative samples.
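These metrics can be computed with scikit-learn as sketched below; y_true and y_score are assumed to be arrays of ground-truth grades and predicted class probabilities, and weighted averaging over the five classes is an assumption, since the averaging scheme is not specified above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """y_true: (N,) integer grades 0-4; y_score: (N, 5) class probabilities."""
    y_pred = y_score.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        # One-vs-rest AUC for the five-class problem.
        "auc": roc_auc_score(y_true, y_score, multi_class="ovr"),
    }
```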

4.3. Ablation Study

These experiments demonstrate the effectiveness of the sub-models and the information fusion, as our model outperforms the sub-models on all evaluation metrics. Table 1 shows the results of the ablation study, comparing the sub-models and our model on the various evaluation metrics. Our model achieves an accuracy of 93.67%, a precision of 93.71%, a recall of 93.67%, an F1-score of 0.9366, and an AUC of 0.994. MobileViT-Plus follows, also achieving decent performance, with an accuracy of 88.00%, a precision of 87.97%, a recall of 88.00%, an F1-score of 0.8797, and an AUC of 0.963. The accuracy, precision, recall, and F1-score of ResNet101 are slightly above 0.81, and its AUC is 0.943. The results indicate that our hybrid framework combines the strengths of both sub-models and effectively improves the network's generalization performance. Figure 5a–c show the training loss and test loss per epoch and the ROC curves of the models. The curve of our model is closest to the ideal point (0, 1), and its AUC is also the highest; MobileViT-Plus performs second best. The test performance of ResNet101 is better than its training performance because dropout was used during training to prevent overfitting, whereas all neurons participate in the classification task during testing.

4.4. Comparison Study

Our model is compared with other state-of-the-art models commonly used for classification, including ResNeXt [54], a variant of ResNet; fine-grained visual categorization models [55], which incorporate attention mechanisms; and MobileViT, the base model of MobileViT-Plus. None of them matches the performance of our model.
Table 2 shows the comparison results of models widely adopted for medical images on the various evaluation metrics. Among them, Resnext101 performed the worst on all indicators, even worse than ResNet101, indicating that the improvements in ResNeXt may not be suitable for medical image processing. Se_resnet101, Se_resnext50, and Senet154 are CNNs with added attention mechanisms, which show improvements relative to ResNet; Se_resnext50 and Senet154 also outperform MobileViT, which uses self-attention. The metrics of MobileViT and MobileViT-Plus are similar, but MobileViT-Plus is more economical in terms of computational resources. Figure 6 shows the training loss and test loss per epoch and the ROC curves of the comparison models. The disparity with our model's ROC curve in Figure 5c is still evident.

5. Discussion

WSNs-aided diabetic retinopathy grading can help clinicians to grade and diagnose retinopathy in real time. Specifically, the real-time monitoring system can help clinicians to diagnose the condition and take effective treatment measures in a timely manner, thus reducing the risk of further deterioration of the condition. Moreover, the WSNs-aided retinopathy grading management method can reduce the cost of treatment for diabetic patients. Through remote monitoring, clinicians can provide diagnosis and treatment recommendations to patients remotely, eliminating the need for patients to go to hospitals for complex examinations and treatments. By monitoring and analyzing the patient’s eye condition in real time, doctors can determine the condition more precisely and adopt targeted treatment plans for different conditions. This personalized treatment approach can better meet the needs of different patients and improve treatment outcomes and patient satisfaction.
To construct a superior deep learning-based DR grading model for the WSN system, we propose a parallel deep learning framework (HybridLG) for automatic DR grading. We compared our proposed method with recent classification techniques and well-known networks, and our final integrated model outperforms them. In the early days of image recognition, traditional machine learning methods were used in the absence of mainstream neural networks; the results were unsatisfactory, and ensemble learning was employed to enhance accuracy. With the advent of deep learning, image classification algorithms shifted towards this methodology, and progress in classification accuracy can be attributed to the advancement of deep learning techniques. Significant performance improvements have been achieved from simple models to deeper ResNets, attention mechanisms, and Transformer-related models. The HybridLG framework draws inspiration from both old and new techniques, attempting to extract the best of both worlds. The proposed integrated model combines the strengths of CNNs and Transformers, resulting in a more informative representation of the image data without compromising the respective strengths of each algorithm.
There is still scepticism surrounding the proposal of such frameworks in the medical field, primarily due to the perception of deep learning models as black boxes, which acts as a deterrent for specialists. Medical choices affect patients' lives and entail risk and responsibility for clinicians. In order to meet the requirement of transparent and explainable AI, two main methods are pursued in eXplainable AI: transparent design methods and post hoc explanation methods [56]. Regarding transparent design, CNNs cannot use a coordinate system, which humans typically rely on to recognize shapes, and they cannot handle rotations of shapes. Conversely, the self-attention mechanism lacks prior knowledge of an image's scale, translation invariance, and feature localization; as a result, it can only learn these properties through large amounts of data. Consequently, the self-attention mechanism can only establish accurate global relationships from extensive data and may not perform as effectively as CNNs on small datasets. The HybridLG framework effectively combines the strengths of both models. Regarding post hoc explanation, we conducted ablation experiments to demonstrate the effectiveness of the HybridLG framework. HybridLG is a parallel training framework inspired by ensemble learning, which provides several benefits. Firstly, from a statistical perspective, learning tasks with large hypothesis spaces are more likely to suffer from poor generalization, and a combination strategy can mitigate this risk by combining multiple models. Secondly, from a computational perspective, combining multiple models reduces the likelihood of becoming stuck in bad local minima. Finally, from a representational standpoint, the true hypothesis of some learning tasks may lie outside the algorithm's current hypothesis space; a single learner is typically inadequate to approximate such hypotheses effectively, whereas multiple learners can improve the approximation.
The proposed model is highly adaptable, as different sub-models can be combined to generate hybrid models tailored to specific tasks. Moreover, the ensemble learning or combination strategy can employ other models or tasks, such as Vision GNN (ViG) [57], graph neural networks [58], and RNN-CNN hybrid models [59], using alternative voting strategies or stacking methods. Consequently, combining hybrid models with ensemble learning leaves ample room for further exploration. There is also room for improvement in our work. For instance, Grad-CAM can be employed to examine the information extracted by the individual sub-models and the complete model, providing supporting evidence for the relevant theories via transparent design methods. In future work, the system can be expanded to include other diseases or incorporate additional features; the model itself can be replaced or modified, and the model parameters can be retrained on updated data to facilitate version updates and enhancements. Because the WSN-aided technology involves a large amount of personal and private information (e.g., EMR data), it is important to ensure data security and confidentiality. To achieve this, effective encryption and security measures are needed to protect the data from unauthorized access. In addition, fundus images may become blurry owing to limitations of device quality, influencing the accuracy and precision of the data transmitted via the wireless sensor network, and professional technicians are needed to maintain and manage the network to ensure its proper operation. For the deep learning-based DR grading model, although the HybridLG framework effectively addresses DR grading, some concerns remain. Firstly, when extending the framework to integrate other models, the time required to train multiple sub-models without pre-trained parameters on different GPUs is still higher than using a single model; although the performance is improved, this may not be cost-effective. Secondly, although the HybridLG framework demonstrates good generalization on the dataset, in practice the fundus images acquired during diagnosis should be of high quality to prevent misdiagnosis and to ensure accurate grading and treatment of patients.

6. Conclusions

WSN-aided DR grading, together with advanced models and methods in computer vision, may facilitate the accurate diagnosis and treatment of diabetic patients. The proposed HybridLG framework demonstrates excellent performance in grading diabetic retinopathy with strong generalization, making it effective for detecting retinal diseases related to glucose metabolism. Our findings indicate that the strategy of synthetically analysing local and global information significantly enhances the model's generalization compared to a single model. This model can therefore be utilized for fundus examination analysis and will help to mitigate the risk of irreversible blindness.

Author Contributions

Methodology, J.W. (Jiachen Wan) and Y.Y.; Investigation, W.C.; Writing—original draft, Z.W.; Writing—review & editing, H.T. and J.W. (Jianhua Wu); Visualization, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The institutional review board statement was waived because the dataset used was publicly available.

Informed Consent Statement

Patient consent was waived because the dataset used was publicly available.

Data Availability Statement

A publicly available dataset was used in this study, which can be found here: https://www.kaggle.com/datasets/sovitrath/diabetic-retinopathy-224x224-2019-data (accessed on 24 May 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Akyildiz, I.F.; Su, W.; Sankarasubramaniam, Y.; Cayirci, E. Wireless sensor networks: A survey. Comput. Netw. 2002, 38, 393–422. [Google Scholar] [CrossRef]
  2. Juang, P.; Oki, H.; Wang, Y.; Martonosi, M.; Peh, L.S.; Rubenstein, D. Energy-efficient computing for wildlife tracking: Design tradeoffs and early experiences with ZebraNet. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, USA, 5–9 October 2002; pp. 96–107. [Google Scholar]
  3. Aminian, M.; Naji, H.R. A hospital healthcare monitoring system using wireless sensor networks. J. Health Med. Inf. 2013, 4, 121. [Google Scholar] [CrossRef]
  4. DeBuc, D.C. The role of retinal imaging and portable screening devices in tele-ophthalmology applications for diabetic retinopathy management. Curr. Diabetes Rep. 2016, 16, 132. [Google Scholar] [CrossRef]
  5. Das, A.; Rad, P.; Choo, K.-K.R.; Nouhi, B.; Lish, J.; Martel, J. Distributed machine learning cloud teleophthalmology IoT for predicting AMD disease progression. Future Gener. Comput. Syst. 2019, 93, 486–498. [Google Scholar] [CrossRef]
  6. Lin, K.Y.; Hsih, W.H.; Lin, Y.B.; Wen, C.Y.; Chang, T.J. Update in the epidemiology, risk factors, screening, and treatment of diabetic retinopathy. J. Diabetes Investig. 2021, 12, 1322–1325. [Google Scholar] [CrossRef] [PubMed]
  7. Yau, J.W.; Rogers, S.L.; Kawasaki, R.; Lamoureux, E.L.; Kowalski, J.W.; Bek, T.; Chen, S.J.; Dekker, J.M.; Fletcher, A.; Grauslund, J.; et al. Global prevalence and major risk factors of diabetic retinopathy. Diabetes Care 2012, 35, 556–564. [Google Scholar] [CrossRef]
  8. Long, S.; Huang, X.; Chen, Z.; Pardhan, S.; Zheng, D. Automatic Detection of Hard Exudates in Color Retinal Images Using Dynamic Threshold and SVM Classification: Algorithm Development and Evaluation. BioMed Res. Int. 2019, 2019, 3926930. [Google Scholar] [CrossRef] [PubMed]
  9. Ruamviboonsuk, P.; Tiwari, R.; Sayres, R.; Nganthavee, V.; Hemarat, K.; Kongprayoon, A.; Raman, R.; Levinstein, B.; Liu, Y.; Schaekermann, M.; et al. Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: A prospective interventional cohort study. Lancet. Digit. Health 2022, 4, e235–e244. [Google Scholar] [CrossRef] [PubMed]
  10. Henriques, J.; Vaz-Pereira, S.; Nascimento, J.; Rosa, P.C. Diabetic eye disease. Acta Med. Port. 2015, 28, 107–113. [Google Scholar] [CrossRef]
  11. Chaudhary, S.; Zaveri, J.; Becker, N. Proliferative diabetic retinopathy (PDR). Disease-a-Month 2021, 67, 101140. [Google Scholar] [CrossRef]
  12. Wang, W.; Lo, A.C.Y. Diabetic Retinopathy: Pathophysiology and Treatments. Int. J. Mol. Sci. 2018, 19, 1816. [Google Scholar] [CrossRef]
  13. Liu, Y.; Wu, N. Progress of Nanotechnology in Diabetic Retinopathy Treatment. Int. J. Nanomed. 2021, 16, 1391–1403. [Google Scholar] [CrossRef]
  14. Wilkinson, C.P.; Ferris, F.L., III; Klein, R.E.; Lee, P.P.; Agardh, C.D.; Davis, M.; Dills, D.; Kampik, A.; Pararajasegaram, R.; Verdaguer, J.T. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology 2003, 110, 1677–1682. [Google Scholar] [CrossRef] [PubMed]
  15. Ghanchi, F. The Royal College of Ophthalmologists’ clinical guidelines for diabetic retinopathy: A summary. Eye 2013, 27, 285–287. [Google Scholar] [CrossRef]
  16. American Diabetes Association. Microvascular Complications and Foot Care: Standards of Medical Care in Diabetes-2020. Diabetes Care 2020, 43, S135–S151. [Google Scholar] [CrossRef] [PubMed]
  17. Kuwayama, S.; Ayatsuka, Y.; Yanagisono, D.; Uta, T.; Usui, H.; Kato, A.; Takase, N.; Ogura, Y.; Yasukawa, T. Automated Detection of Macular Diseases by Optical Coherence Tomography and Artificial Intelligence Machine Learning of Optical Coherence Tomography Images. J. Ophthalmol. 2019, 2019, 6319581. [Google Scholar] [CrossRef] [PubMed]
  18. Monemian, M.; Rabbani, H. Red-lesion extraction in retinal fundus images by directional intensity changes’ analysis. Sci. Rep. 2021, 11, 18223. [Google Scholar] [CrossRef]
  19. Wu, Z.; Shi, G.; Chen, Y.; Shi, F.; Chen, X.; Coatrieux, G.; Yang, J.; Luo, L.; Li, S. Coarse-to-fine classification for diabetic retinopathy grading using convolutional neural network. Artif. Intell. Med. 2020, 108, 101936. [Google Scholar] [CrossRef]
  20. Li, X.; La, R.; Wang, Y.; Hu, B.; Zhang, X. A Deep Learning Approach for Mild Depression Recognition Based on Functional Connectivity Using Electroencephalography. Front. Neurosci. 2020, 14, 192. [Google Scholar] [CrossRef]
  21. Hazra, D.; Byun, Y.C. SynSigGAN: Generative Adversarial Networks for Synthetic Biomedical Signal Generation. Biology 2020, 9, 441. [Google Scholar] [CrossRef]
  22. Russo, V.; Lallo, E.; Munnia, A.; Spedicato, M.; Messerini, L.; D’Aurizio, R.; Ceroni, E.G.; Brunelli, G.; Galvano, A.; Russo, A.; et al. Artificial Intelligence Predictive Models of Response to Cytotoxic Chemotherapy Alone or Combined to Targeted Therapy for Metastatic Colorectal Cancer Patients: A Systematic Review and Meta-Analysis. Cancers 2022, 14, 4012. [Google Scholar] [CrossRef] [PubMed]
  23. Bhimavarapu, U.; Battineni, G. Deep Learning for the Detection and Classification of Diabetic Retinopathy with an Improved Activation Function. Healthcare 2022, 11, 97. [Google Scholar] [CrossRef] [PubMed]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  25. Tseng, V.S.; Chen, C.L.; Liang, C.M.; Tai, M.C.; Liu, J.T.; Wu, P.Y.; Deng, M.S.; Lee, Y.W.; Huang, T.Y.; Chen, Y.H. Leveraging Multimodal Deep Learning Architecture with Retina Lesion Information to Detect Diabetic Retinopathy. Transl. Vis. Sci. Technol. 2020, 9, 41. [Google Scholar] [CrossRef]
  26. Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. TransMorph: Transformer for unsupervised medical image registration. Med. Image Anal. 2022, 82, 102615. [Google Scholar] [CrossRef] [PubMed]
  27. Yang, Y.; Shang, F.; Wu, B.; Yang, D.; Wang, L.; Xu, Y.; Zhang, W.; Zhang, T. Robust Collaborative Learning of Patch-Level and Image-Level Annotations for Diabetic Retinopathy Grading From Fundus Image. IEEE Trans. Cybern. 2022, 52, 11407–11417. [Google Scholar] [CrossRef] [PubMed]
  28. Zhang, T.H.; Hasib, M.M.; Chiu, Y.C.; Han, Z.F.; Jin, Y.F.; Flores, M.; Chen, Y.; Huang, Y. Transformer for Gene Expression Modeling (T-GEM): An Interpretable Deep Learning Model for Gene Expression-Based Phenotype Predictions. Cancers 2022, 14, 4763. [Google Scholar] [CrossRef]
  29. Chefer, H.; Gur, S.; Wolf, L. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 782–791. [Google Scholar]
  30. Li, C.F.; Xu, Y.D.; Ding, X.H.; Zhao, J.J.; Du, R.Q.; Wu, L.Z.; Sun, W.P. MultiR-Net: A Novel Joint Learning Network for COVID-19 segmentation and classification. Comput. Biol. Med. 2022, 144, 105340. [Google Scholar] [CrossRef]
  31. Albahli, S.; Ahmad Hassan Yar, G.N. Automated detection of diabetic retinopathy using custom convolutional neural network. J. X-Ray Sci. Technol. 2022, 30, 275–291. [Google Scholar] [CrossRef]
  32. Mookiah, M.R.; Acharya, U.R.; Koh, J.E.; Chandran, V.; Chua, C.K.; Tan, J.H.; Lim, C.M.; Ng, E.Y.; Noronha, K.; Tong, L.; et al. Automated diagnosis of Age-related Macular Degeneration using greyscale features from digital fundus images. Comput. Biol. Med. 2014, 53, 55–64. [Google Scholar] [CrossRef]
  33. Shen, D.; Wu, G.; Suk, H.I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef] [PubMed]
  34. Virgili, G.; Menchini, F.; Casazza, G.; Hogg, R.; Das, R.R.; Wang, X.; Michelessi, M. Optical coherence tomography (OCT) for detection of macular oedema in patients with diabetic retinopathy. Cochrane Database Syst. Rev. 2015. [Google Scholar] [CrossRef] [PubMed]
  35. Rabiolo, A.; Parravano, M.; Querques, L.; Cicinelli, M.V.; Carnevali, A.; Sacconi, R.; Centoducati, T.; Vujosevic, S.; Bandello, F.; Querques, G. Ultra-wide-field fluorescein angiography in diabetic retinopathy: A narrative review. Clin. Ophthalmol. 2017, 11, 803–807. [Google Scholar] [CrossRef] [PubMed]
  36. Deschler, E.K.; Sun, J.K.; Silva, P.S. Side-effects and complications of laser treatment in diabetic retinal disease. Semin. Ophthalmol. 2014, 29, 290–300. [Google Scholar] [CrossRef]
  37. Mishra, S.; Kim, Y.-S.; Intarasirisawat, J.; Kwon, Y.-T.; Lee, Y.; Mahmood, M.; Lim, H.-R.; Herbert, R.; Yu, K.J.; Ang, C.S. Soft, wireless periocular wearable electronics for real-time detection of eye vergence in a virtual reality toward mobile eye therapies. Sci. Adv. 2020, 6, eaay1729. [Google Scholar] [CrossRef]
  38. Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M.C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016, 316, 2402–2410. [Google Scholar] [CrossRef]
  39. Wang, X.-N.; Dai, L.; Li, S.-T.; Kong, H.-Y.; Sheng, B.; Wu, Q. Automatic grading system for diabetic retinopathy diagnosis using deep learning artificial intelligence software. Curr. Eye Res. 2020, 45, 1550–1555. [Google Scholar] [CrossRef]
  40. Wu, J.; Hu, R.; Xiao, Z.; Chen, J.; Liu, J. Vision Transformer-based recognition of diabetic retinopathy grade. Med. Phys. 2021, 48, 7850–7863. [Google Scholar] [CrossRef]
  41. Araújo, T.; Aresta, G.; Mendonça, L.; Penas, S.; Maia, C.; Carneiro, Â.; Mendonça, A.M.; Campilho, A. DR|GRADUATE: Uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images. Med. Image Anal. 2020, 63, 101715. [Google Scholar] [CrossRef]
  42. Wang, X.; Tang, F.; Chen, H.; Cheung, C.Y.; Heng, P.-A. Deep semi-supervised multiple instance learning with self-correction for DME classification from OCT images. Med. Image Anal. 2023, 83, 102673. [Google Scholar] [CrossRef]
  43. Vocaturo, E.; Zumpano, E. Diabetic retinopathy images classification via multiple instance learning. In Proceedings of the 2021 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), Washington, DC, USA, 16–18 December 2021; pp. 143–148. [Google Scholar]
  44. Zhu, W.; Qiu, P.; Lepore, N.; Dumitrascu, O.M.; Wang, Y. Self-supervised equivariant regularization reconciles multiple-instance learning: Joint referable diabetic retinopathy classification and lesion segmentation. In Proceedings of the 18th International Symposium on Medical Information Processing and Analysis, Valparaiso, Chile, 9–11 November 2022; pp. 100–107. [Google Scholar]
  45. Rath, S.R. Diabetic Retinopathy 224x224 2019 Data. Available online: https://www.kaggle.com/datasets/sovitrath/diabetic-retinopathy-224x224-2019-data (accessed on 16 April 2023).
  46. Wang, Z.; Xin, J.; Wang, Z.; Yao, Y.; Zhao, Y.; Qian, W. Brain functional network modeling and analysis based on fMRI: A systematic review. Cogn. Neurodyn. 2021, 15, 389–403. [Google Scholar] [CrossRef] [PubMed]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  48. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  49. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  50. Timmerman, V.; Strickland, A.V.; Züchner, S. Genetics of Charcot-Marie-Tooth (CMT) disease within the frame of the human genome project success. Genes 2014, 5, 13–32. [Google Scholar] [CrossRef]
  51. Cao, J.; Kwong, S.; Wang, R.; Li, X.; Li, K.; Kong, X.J.N. Class-specific soft voting based multiple extreme learning machines ensemble. Neurocomputing 2015, 149, 275–284. [Google Scholar] [CrossRef]
  52. Lipton, Z.C.; Elkan, C.; Narayanaswamy, B. Thresholding classifiers to maximize F1 score. arXiv 2014, arXiv:1402.1892. [Google Scholar]
  53. Hajian-Tilaki, K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Casp. J. Intern. Med. 2013, 4, 627–635. [Google Scholar]
  54. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  55. Huang, S.; Xu, Z.; Tao, D.; Zhang, Y. Part-stacked CNN for fine-grained visual categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1173–1182. [Google Scholar]
  56. Caroprese, L.; Vocaturo, E.; Zumpano, E. Argumentation approaches for explainable AI in medical informatics. Intell. Syst. Appl. 2022, 16, 200109. [Google Scholar] [CrossRef]
  57. Han, K.; Wang, Y.; Guo, J.; Tang, Y.; Wu, E. Vision GNN: An image is worth graph of nodes. arXiv 2022, arXiv:2206.00272. [Google Scholar]
  58. Hu, Z.; Dong, Y.; Wang, K.; Chang, K.-W.; Sun, Y. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 1857–1867. [Google Scholar]
  59. Suganthi, M.; Sathiaseelan, J. An exploratory of hybrid techniques on deep learning for image classification. In Proceedings of the 2020 4th International Conference on Computer, Communication and Signal Processing (ICCCSP), Chennai, India, 28–29 September 2020; pp. 1–4. [Google Scholar]
Figure 1. Five clinical grades of DR depending on the fundus features: (a) no DR, (b) mild DR, (c) moderate DR, (d) severe DR, and (e) PDR.
Figure 2. The proposed WSN architecture for DR grading.
Figure 3. HybridLG framework.
Figure 4. MobileViT-Plus and Blocks.
Figure 5. The loss and ROC curves of three models for comparison. (a) MobileViT-Plus, (b) ResNet101, and (c) Our model.
Figure 6. The loss and ROC curves of five models for comparison. (a) Resnext101, (b) Se_resnet101, (c) Se_resnext50, (d) Senet154, and (e) MobileViT.
Table 1. Our model and ablation study.
Model            Accuracy   Precision   Recall    F1 Score   AUC
MobileViT-Plus   88.00%     87.97%      88.00%    0.8797     0.963
ResNet101        81.33%     81.31%      81.33%    0.8123     0.943
Our model        93.67%     93.71%      93.67%    0.9366     0.994
Table 2. Comparing base models and other classic models.
Model          Accuracy   Precision   Recall    F1 Score   AUC
Resnext101     79.83%     80.40%      79.83%    0.7943     0.947
Se_resnet101   86.50%     86.47%      86.50%    0.8640     0.963
Se_resnext50   90.17%     90.30%      90.16%    0.9020     0.973
Senet154       88.00%     88.17%      88.00%    0.8796     0.961
MobileViT      87.33%     87.54%      87.33%    0.8721     0.967

