Article

Convolution Network Enlightened Transformer for Regional Crop Disease Classification

1 College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
2 Engineering Practice Innovation Center, China Agricultural University, Beijing 100083, China
3 School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(19), 3174; https://doi.org/10.3390/electronics11193174
Submission received: 1 August 2022 / Revised: 26 September 2022 / Accepted: 28 September 2022 / Published: 2 October 2022
(This article belongs to the Section Computer Science & Engineering)

Abstract

The overarching goal of smart farming is to propose pioneering solutions for the future sustainability of humankind. Recognizing the images captured for monitoring plant growth and preventing diseases and pests is an important part of this goal. Automatic recognition of crop diseases is currently studied mainly with deep learning, but existing classifiers still have problems, for example, with accurately separating similar disease categories. Tomato was selected as the crop of this article, and tomato diseases are its main research focus. The vision transformer (ViT) has achieved good results on image tasks. For the image recognition task, tomato plant images serve as this article’s data source, and an architecture combining a global ViT branch with a local CNN (convolutional neural network) branch is built to diagnose disease images. The features of plant images can therefore be extracted precisely and efficiently, which is more convenient than traditional manual recognition. The proposed architecture’s efficiency was evaluated on three image sets from three tomato-growing areas, acquired by drone and camera. The results show that the proposed method achieves an average classification accuracy of 96.30%. It provides scientific support and a reference for the decision-making process of precision agriculture.

1. Introduction

Precision and smart agriculture for managing crop planting land is currently the basic agronomic practice for improving food productivity, security, and environmental protection in the context of sustainable agriculture [1,2]. The advanced methods of artificial intelligence help agriculturists and experts to use innovative technology rather than traditional site monitoring, which is a time-consuming process [3]. Therefore, the development of intelligent applications and efficient data processing methods plays an important role in precision agriculture. In general, precision agriculture includes optimizing farm inputs, involving many technologies in sensor deployment, data collection, data processing, data analysis, and disease recognition [4]. The intelligence technologies based on deep learning help farmers move towards smart agriculture with innovative technologies to improve the quality of products.
In recent years, with the innovation of visual algorithms, it has become possible to analyze leaf diseases through computer vision without farmers’ intervention, improving the accuracy and timeliness of crop protection. Tomato is one of the most important vegetables for humanity, and it faces the challenge of achieving high-yield production. However, tomato diseases can cause severe yield reductions, so the correct classification of different diseases is an important issue. Tomato supplies are also under pressure in large cities such as Beijing, which was therefore chosen as the region for this study of tomato diseases. Tomato growth is environmentally dependent, so studying tomato diseases by geographical area within the same city has important guiding significance. Tomato disease datasets were collected from multiple districts of Beijing, and an automated tomato disease classification algorithm was studied on them, as shown in Figure 1.
Manual classification of different diseases and pests is difficult for crop growers without training in plant pathology. With the rapid development of information technology, artificial intelligence is widely used in agriculture to solve practical production problems. In particular, the excellent performance of deep learning methods and convolutional neural networks solves complex problems in computer vision, and they have been widely applied to object recognition. For precision agriculture, most studies can be divided into ML-based (machine learning) and DL-based (deep learning) approaches. The ML-based approaches [5,6,7] often require complex, disease-specific feature extraction before diseases can be analyzed. Deep learning image recognition methods have therefore gradually attracted particular attention for crop disease recognition. Deep convolutional neural networks (DCNNs) perform strongly in image-based disease diagnosis, so a machine vision system can help farmers analyze plant diseases and protect crops in time. To address the characteristics of tomato disease images, many DCNN models have been exploited with high accuracy for tomato disease diagnosis and high-precision plant disease detection, such as NasNet [8], Faster-RCNN [9], SSD [10], Mask-RCNN [11], and EfficientNet [12]. However, DCNN models have advantages in local feature extraction but difficulty in capturing global features.
This restriction stimulated the exploitation of the vision transformer (ViT) [13], which extracts global features more effectively than previous models. In ViT, an encoder unit is applied directly to sequences of image patches to perform image classification [14]. The superiority of ViT lies in acquiring global contextual information, which enables long-distance dependencies between the target features.
Therefore, this article proposes a fusion of CNN and transformer, named ShuFormer, to strengthen the capability of the network to focus on both local and global features with contextual information. Tomato disease datasets covering 10 disease categories were collected from many vegetable greenhouses in Beijing; a UAV and mobile phones were used to collect the tomato disease images and provide data support for the algorithm research. The datasets comprehensively reflect the growth status and disease status of tomatoes across the whole region. The recognition accuracy of ShuFormer is better than that of traditional models and reaches 96.30%. The major contributions of this research are:
  • We processed the disease datasets based on traditional data augmentation.
  • We exploited the fusion relationship between CNN and transformer to strengthen the capability of the network, focusing on both local and global features with contextual information, namely ShuFormer. Channel shuffle was used to establish associations between the groups of feature maps with grouped convolutions.
  • The results show that the method is suitable for many vegetable fields in Beijing and is generally suitable for studying plant diseases in similar geographical environments.

2. Related Work

With the continuous development of deep learning, many researchers have worked on optimizing deep neural networks for the diagnosis of tomato diseases. For instance, Agarwal et al. [15] proposed a deep-learning-based approach for nine tomato diseases with an accuracy of 91.2%. Bhujel et al. [16] improved the performance of tomato leaf disease classification with a convolutional block attention module, and the average accuracy reached 99.69%. Sembiring et al. [17] developed a simpler CNN architecture with four convolutional layers to diagnose nine target classes of tomato diseases, with a validation accuracy of 97.15%. Combined with local geographical conditions, such methods can more effectively detect tomato crop diseases and their severity and distinguish diseased from healthy crops. Hettiarachchi et al. [18] utilized the deep learning YOLOv3 model to help Sri Lankan urban farmers detect and control common tomato diseases and achieved an average accuracy of 92%. Gao et al. [19] realized joint sensing of environmental information by ground sensors and a UAV, where the Internet of things and the UAV monitor the occurrence of diseases and insect pests from the ground-level micro perspective and the aerial macro perspective, respectively.
Transformer is a novel attention structure, first applied in natural language processing (NLP) to capture the semantics and global features of context. In recent years, with the development of the transformer technique, many researchers have tried to extract global features for image recognition tasks with this method, and transformer has provided a new way of constructing neural networks for machine vision. As an innovative work, ViT proved the capability of transformer models for machine vision. To exploit long-distance dependencies, transformer units have been joined as independent blocks to convolutional neural networks (CNNs) for object recognition [20], semantic segmentation [21], data augmentation [22], and image translation [23]. CvT [24] introduced convolutions into ViT to enhance its ability on vision tasks. T2T-ViT [25] applied a tokens-to-token module that recursively reorganizes the image into tokens by aggregating neighboring pixels. Wu et al. [26] proposed a feature extraction model based on the vision transformer combined with image blocks for tomato leaf disease recognition, with an accuracy of 88.1%. Thai et al. [27] exploited the vision transformer in place of a convolutional neural network for classifying cassava leaf diseases and showed that the transformer model can obtain competitive accuracy, at least 1% higher than popular CNN models. Hirani et al. [28] compared transformer methods with token embedding representations of different dimensions in the task of plant disease detection; the accuracy of a large transformer network with 256 dimensions was 97.98%.
The rest of this paper is organized as follows: Section 2 has reviewed related work on plant disease diagnosis. Section 3 presents the UAV-based monitoring architecture. Section 4 describes the dataset and its preprocessing. Section 5 details the proposed ShuFormer disease diagnosis model. Section 6 reports the experiments and the performance of the proposed tomato disease diagnosis method. Finally, the summary and conclusions are given in Section 7.

3. Architecture

The Internet of things (IoT) promotes agricultural informatization and automation, which is an indispensable factor to improve the quantity and quality of agriculture at a low cost. In the next few years, the use of smart solutions driven by the Internet of things will increase in agriculture. Air drones are widely used to monitor crop health, such as tomato plant monitoring, and field analysis in smart agriculture. This paper presents a UAV-based architecture for detecting diseases, as shown in Figure 2. In this paper, the collected dataset was used to study the tomato disease identification algorithm, which uses transformer and CNN to realize local feature and global feature optimization to improve the accuracy of the model.

3.1. Drone Sensor Node

A wireless sensor node is an important part of a traditional Internet of things (IoT) monitoring system [29]. The drone serves as an image acquisition sensor node that is systematically deployed over the monitored field to collect growth-environment data. In the plant disease monitoring system arranged in Beijing (China), a drone sensor node was placed in the field, with the drone carrying a camera, wireless GPS (Global Positioning System), and a CMOS (complementary metal oxide semiconductor) sensor to monitor the tomato growth state. In the framework of IoT systems based on edge computing, the wireless sensor nodes are mainly used to monitor crop growth and communicate with the edge computing nodes. This not only reduces the demand on the wireless sensor nodes but also takes full advantage of the data processing power of the edge computing node [30]. A DJI Mini 2 was selected as the image collector, whose parameters are as follows:
The UAV type is Mavic Mini, the maximum flight altitude is 2000 m, the speed ranges from 6 m/s to 16 m/s, the maximum wind speed is between 8.5 and 10.5 m/s, the maximum flight time is 31 min, the weight is 249 g, the camera has 12 million effective pixels, and the maximum bit rate of real-time image transmission is 8 Mbps. During image acquisition, the drone flew at an altitude of 6 m and operated from 8 a.m. to 10 a.m. Tomato diseases mainly occur on fruit and leaves, which are localized parts of the plant, so the drone collected images at low altitude. Tomato disease samples were manually screened from the drone-collected images, and the tomato disease datasets were then annotated manually from these samples.

3.2. Edge Node

The base stations in IoT sensing systems are deployed near the monitoring nodes and are known as edge computing nodes. Edge computing nodes play a crucial role in the proposed disease identification method. In the initial stage of the system, the DL model trained on the cloud computing platform is deployed to the edge computing nodes [31]. The collected data are identified during the normal operating phase of the system. When an abnormality is detected, the edge computing node promptly reports the exception to the drone cloud platform and drives the controller to supply a corresponding resolution.

3.3. Cloud Platform

Training a DL network demands a large amount of computation that is difficult for edge computing nodes. The model is therefore pre-trained on the cloud platform [32], and the generated model is deployed to the edge computing nodes. Based on these pre-trained parameters, the edge nodes can run the proposed algorithm, which significantly reduces runtime and improves forecasting accuracy. ShuFormer is a tomato disease identification network with reduced computation and improved feature extraction capability, which facilitates disease identification on cloud platforms.

4. Datasets Preprocessing

4.1. Dataset

The tomato disease datasets came from drone collection and mobile phone collection, gathered from 8 a.m. to 10 a.m. in each area. For example, blossom end rot is widespread in the Changping and Tongzhou districts, and those tomato disease images were obtained by mobile phone photography. The UAV images are mainly top-down views of the tomatoes and contain very few disease images; the phone camera has 4K resolution and collected a large number of side and bottom views of the tomatoes. To show the effectiveness of the approach, this paper conducts comprehensive experiments on three different tasks: the Changping, Shunyi, and Tongzhou datasets. The details of these datasets are listed in Table 1. Tomatoes in different regions are not in the same growth period, so the diseases vary. Each region has 7 categories of tomato disease, and the categories are not identical across the three regions, so the tomato disease datasets contain a total of 10 categories of tomato disease.
The tomato disease datasets have a total of 10 classes, which are bacterial spot, gray mold, striped rot, cracked fruit, blossom end rot, late blight disease, anthracnose, citrus canker, early blight disease, and leaf mold.
The tomato disease datasets in Changping district have 7 classes, which are bacterial spot, gray mold, striped rot, blossom end rot, anthracnose, citrus canker, and leaf mold.
The tomato disease datasets in Shunyi district have 7 classes, which are bacterial spot, gray mold, striped rot, anthracnose, citrus canker, early blight disease, and leaf mold.
The tomato disease datasets in Tongzhou district have 7 classes, which are gray mold, striped rot, cracked fruit, blossom end rot, late blight disease, anthracnose, and citrus canker.
Here, this paper divides the disease into the training set and the test set based on the collected information.
Images were selected from the three regions of tomato leaf disease, where each disease is caused by a different type of pathogen. Comparisons between all disease categories were then completed by humans, and representative maps were generated to obtain spatial information about visually interpretable hotspots. No metric has yet been devised to evaluate how well computer-generated visualization maps agree with human knowledge. Therefore, even though gradient-based methods provide meaningful insight into how a network learns to classify tomato diseases, a barrier remains: humans with professional knowledge are still required to evaluate the computer-generated visualization results.

4.2. Data Augmentation

Data augmentation is a method of neural network optimization that uses augmented data to increase the accuracy of neural network identification [33]. With more image data, the model gains robustness and generalization ability. In this paper, the tomato leaf disease datasets of the experiment are expanded by applying data augmentation to the acquired tomato images. The image augmentation methods used in the study include Gaussian filtering, median filtering, cutout, mixup, brightness adjustment, and rotation.
According to experience and experiments, the parameters used for the image augmentation methods mentioned are shown in Table 2. An example of the images obtained from data augmentation is shown in Figure 3.
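As a concrete illustration, the following Python sketch shows how such an augmentation pipeline could be assembled with PIL and NumPy, using the parameter values listed in Table 2; the function names and the exact interpretation of the cutout and mixup parameters are illustrative assumptions rather than the authors' implementation.

```python
import random
import numpy as np
from PIL import Image, ImageFilter, ImageEnhance

def gaussian_filter(img, sigma_range=(0.5, 1.5)):
    # Blur with a sigma drawn from the configured range (Table 2).
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(*sigma_range)))

def median_filter(img, kernel_sizes=(3, 5)):
    # Median filtering with a randomly chosen odd kernel size.
    return img.filter(ImageFilter.MedianFilter(size=random.choice(kernel_sizes)))

def cutout(img, delta=0.5):
    # Zero out a square patch; interpreting Delta as the fraction of the shorter
    # image side is an assumption for illustration.
    arr = np.array(img)
    h, w = arr.shape[:2]
    size = int(min(h, w) * delta)
    y, x = random.randint(0, h - size), random.randint(0, w - size)
    arr[y:y + size, x:x + size] = 0
    return Image.fromarray(arr)

def mixup(img_a, img_b, alpha=0.4):
    # Blend two images with a Beta(alpha, alpha) coefficient; in training the
    # class labels would be mixed with the same lam value.
    lam = np.random.beta(alpha, alpha)
    mixed = lam * np.array(img_a, np.float32) + (1 - lam) * np.array(img_b, np.float32)
    return Image.fromarray(mixed.astype(np.uint8)), lam

def brightness(img, gamma=2.0):
    # Brightness adjustment with the configured factor.
    return ImageEnhance.Brightness(img).enhance(gamma)

def rotate(img, angle_range=(-150, 150)):
    # Random rotation within the configured angle range.
    return img.rotate(random.uniform(*angle_range))
```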

5. Proposed Method

5.1. Overall

Computer vision focuses on image feature mining, and both local and global features are important features of images. Local features, formed by neighboring feature vectors and representing the texture of objects, are an important part of computer vision research, while global features represent correlations between object contours through long-distance dependencies. In this research, a bi-directional fusion structure is established between a transformer branch and a CNN branch to fuse global and local features. This parallel, decoupled structure of transformer and convolution takes advantage of their complementary feature-extraction abilities. The transformer has become the main direction for exploring global features in computer vision. This paper exploits an improved convolutional neural network module, a transformer attention module, and a feature fusion module to extract and attend to local and global image features, retain them as fully as possible, and fuse them interactively. Figure 4 describes the convolution network enlightened transformer attention and feature fusion model, named ShuFormer, consisting of the following units: group shuffle depth-wise block (GSDW), transformer block, fusion block, and detection block.
Here, T indicates the token representations extracted for the transformer branch, F denotes the shallow features extracted from the input image, and ⊕ denotes element-wise summation. In particular, for the fusion of local and global features, the CNN branch and the transformer branch complement each other through the fusion module.
The tomato image is fed to the model constructed with the CNN branch and the transformer branch. The CNN branch mainly extracts local features, the transformer branch mainly extracts global representations, and the two branches actively exchange beneficial feature information during feature fusion. The final classification result is acquired by aggregating both representations. The shallow feature is extracted from the Head of a given image $I$, where $H_{\mathrm{head}}$ comprises two residual blocks:

$$ F = H_{\mathrm{head}}(I) \tag{1} $$

Then, the image-like shallow feature $F \in \mathbb{R}^{H \times W \times C}$ is tokenized into $N$ non-overlapping tokens $T \in \mathbb{R}^{N \times D}$, where $D$ represents the dimension of each token vector. In particular, $N = \frac{H}{t} \times \frac{W}{t}$ and $D = C \cdot t^{2}$, where $t$ is the token size.
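A minimal PyTorch sketch of this tokenization step is given below; it assumes a token size t and uses unfold to cut the shallow feature map into non-overlapping patches, which is one plausible realization of the operation described above.

```python
import torch

def tokenize(feature: torch.Tensor, t: int) -> torch.Tensor:
    """Split an image-like feature map (B, C, H, W) into non-overlapping
    t x t patches and flatten each patch into a token of dimension C * t**2."""
    # unfold extracts non-overlapping patches: (B, C*t*t, N) with N = (H/t)*(W/t)
    patches = torch.nn.functional.unfold(feature, kernel_size=t, stride=t)
    return patches.transpose(1, 2)   # (B, N, D) with D = C * t**2

# Example: a 48-channel 56x56 shallow feature with an assumed token size t = 4
F = torch.randn(1, 48, 56, 56)
T = tokenize(F, t=4)
print(T.shape)  # torch.Size([1, 196, 768])
```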

5.2. Local Representations Branch Based on Depthwise Separable Convolutional Network

In the body block, as shown in Figure 5, the convolution kernels of the local block in the CNN branch slide over the image features, with the CNN branch compensating for the lack of intrinsic inductive bias in the transformer branch. The block exploits group convolution (Gconv), channel shuffle, and depth-wise separable convolution (DWSC) for local feature extraction. The depth-wise convolutions aggregate local spatial information with sparse connections across channels. The image structure is exploited to learn convolutional features, and the feature map size decreases while the number of channels increases.
The computing resources of edge devices are limited, so to optimize the network structure, depth-wise separable convolution (DWSC) is adopted to improve the local network relative to standard convolution and reduce the number of parameters of the model's feature extractors. DWSC is the main structure of lightweight neural networks; it compresses the computation, improves the operation speed of the network, and makes full use of the feature information. A traditional CNN convolves the input image with convolution kernels of the same depth to obtain image features, whereas DWSC consists of a depth-wise and a point-wise stage; the structure is shown in Figure 6.
First, this paper employs variable group convolution to design the network and establishes associations between the feature groups with the channel shuffle technique. Depth-wise convolution is implemented by two kernels, 3 × 1 and 1 × 3, processing the channel and spatial features, respectively. For the DWSC bottleneck, the stride is set to 2 to generate a down-sampled feature map. The CNN block at layer $l$, $g_l(\cdot)$, of the depth-wise separable encoding path takes the feature $F_{l-1} \in \mathbb{R}^{H_{l-1} \times W_{l-1} \times C_{l-1}}$, of height $H_{l-1}$ and width $W_{l-1}$ with $C_{l-1}$ channels, as input and outputs an intermediate feature map $F_l \in \mathbb{R}^{H_l \times W_l \times C_l}$, given in Equation (2). These features are also passed along skip connections to enrich the features at each stage, similar to the ResNet architecture in the CNN encoding path:

$$ F_l = g_l(F_{l-1}), \quad F_l \in \mathbb{R}^{H_l \times W_l \times C_l} \tag{2} $$
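The following PyTorch sketch illustrates one possible realization of this local block, combining a grouped 1 × 1 convolution, channel shuffle, and the factorized 3 × 1 / 1 × 3 depth-wise plus 1 × 1 point-wise convolutions; the channel counts and group number are illustrative and do not reproduce the exact ShuFormer configuration.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels across groups so grouped convolutions exchange information.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class LocalBlock(nn.Module):
    """Group convolution -> channel shuffle -> depth-wise separable convolution."""
    def __init__(self, in_ch: int, out_ch: int, groups: int = 4, stride: int = 1):
        super().__init__()
        self.groups = groups
        self.gconv = nn.Sequential(                      # grouped 1x1 convolution
            nn.Conv2d(in_ch, out_ch, 1, groups=groups, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.dwsc = nn.Sequential(
            # factorized 3x1 and 1x3 depth-wise convolutions; stride 2 down-samples
            nn.Conv2d(out_ch, out_ch, (3, 1), stride=(stride, 1), padding=(1, 0),
                      groups=out_ch, bias=False),
            nn.Conv2d(out_ch, out_ch, (1, 3), stride=(1, stride), padding=(0, 1),
                      groups=out_ch, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, 1, bias=False),    # point-wise 1x1 convolution
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.gconv(x)
        x = channel_shuffle(x, self.groups)
        return self.dwsc(x)

# F_l = g_l(F_{l-1}): 56x56x48 -> 28x28x96 with stride 2
block = LocalBlock(48, 96, groups=4, stride=2)
print(block(torch.randn(1, 48, 56, 56)).shape)  # torch.Size([1, 96, 28, 28])
```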

5.3. Global Representations Branch Based on Transformer Structure

In the body block, the acquired tokens $T \in \mathbb{R}^{N \times D}$ are fed into the blocks of the transformer branch; each block is symmetric to a block of the CNN branch:

$$ T_l = h_l(T_{l-1}), \quad T_l \in \mathbb{R}^{N_l \times D_l} \tag{3} $$

where $T_l$ represents the tokens extracted by the $l$-th transformer block $h_l(\cdot)$. Figure 7 depicts each transformer block, which includes two sequential attention operations, window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), as in Equation (4):

$$ \hat{T}_{l-1} = \mathrm{W\text{-}MSA}(\mathrm{LN}(T_{l-1})) + T_{l-1} $$
$$ \tilde{T}_{l-1} = \mathrm{MLP}(\mathrm{LN}(\hat{T}_{l-1})) + \hat{T}_{l-1} $$
$$ \hat{T}_{l} = \mathrm{SW\text{-}MSA}(\mathrm{LN}(\tilde{T}_{l-1})) + \tilde{T}_{l-1} $$
$$ T_{l} = \mathrm{MLP}(\mathrm{LN}(\hat{T}_{l})) + \hat{T}_{l} \tag{4} $$
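A simplified PyTorch sketch of this block is shown below. It follows the pre-norm residual pattern of Equation (4), but uses the standard nn.MultiheadAttention module as a stand-in for the windowed W-MSA and shifted SW-MSA operations, so the window partitioning itself is not implemented.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm residual block following Equation (4):
    attention -> MLP -> shifted attention -> MLP, with LayerNorm and residuals."""
    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)        # W-MSA stand-in
        self.shift_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # SW-MSA stand-in
        self.mlp1 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, tokens):                          # tokens: (B, N, D)
        x = self.norm1(tokens)
        tokens = self.attn(x, x, x)[0] + tokens         # W-MSA + residual
        tokens = self.mlp1(self.norm2(tokens)) + tokens
        x = self.norm3(tokens)
        tokens = self.shift_attn(x, x, x)[0] + tokens   # SW-MSA + residual
        tokens = self.mlp2(self.norm4(tokens)) + tokens
        return tokens

T = torch.randn(1, 196, 96)
print(TransformerBlock(dim=96, heads=3)(T).shape)  # torch.Size([1, 196, 96])
```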

5.4. Feature Fusion Block

As shown in Figure 4, the intermediate features extracted by the two independent branches are fused in the feature fusion block. Because the features and computational mechanisms of the two branches are very different, the fusion block is used to merge the convolutional and transformer representations so that both branches complement each other. Concretely, given the intermediate features $T_l$ from the global representation branch and $F_l$ from the local representation branch, the aggregated feature maps of the fusion block $b_l(\cdot)$ are:

$$ M_l = b_l(\mathrm{rearrange}(T_l) \,\|\, F_l), \quad M_l \in \mathbb{R}^{H_l \times W_l \times 2C_l} \tag{5} $$

where $M_l$ denotes the fused features, and $\mathrm{rearrange}$ and $\|$ represent image-like rearrangement and channel concatenation, respectively. Next, the fusion block is built with 1 × 1 convolutional blocks to focus on channel-wise fusion. Except for the last fusion block, the fused features $M_l$ are split into two features along the channel dimension and then processed by MLP blocks and convolutional blocks, respectively.
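The sketch below shows one way such a fusion block could be written in PyTorch, assuming for simplicity that the token dimension D equals the CNN channel count C; the rearrangement, channel concatenation, 1 × 1 convolution, and channel split follow Equation (5).

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Rearrange tokens to an image-like map, concatenate with the CNN features,
    fuse channel-wise with a 1x1 convolution, then split back into two streams."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, tokens, feat):
        # tokens: (B, N, D) with N = H*W and D == C (assumption); feat: (B, C, H, W)
        b, c, h, w = feat.shape
        t_map = tokens.transpose(1, 2).reshape(b, c, h, w)    # "rearrange"
        fused = self.fuse(torch.cat([t_map, feat], dim=1))    # M_l, shape (B, 2C, H, W)
        t_half, f_half = torch.chunk(fused, 2, dim=1)         # split along channels
        tokens_out = t_half.flatten(2).transpose(1, 2)        # back to (B, N, D)
        return tokens_out, f_half

tokens, feat = torch.randn(1, 196, 96), torch.randn(1, 96, 14, 14)
t_out, f_out = FusionBlock(96)(tokens, feat)
print(t_out.shape, f_out.shape)  # torch.Size([1, 196, 96]) torch.Size([1, 96, 14, 14])
```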
We analyze the detailed architecture hyper-parameters of ShuFormer in Table 3. Supposing the input feature is of size $H \times W \times C$: $k \times k$ is the convolutional kernel size, $G$ is the number of groups, $C$ is the number of output channels of the CNN branch, $D$ is the output dimension of the transformer branch, $H$ is the number of heads of the transformer branch’s multi-head attention, and $W = 7$ is the window size of W-MSA. Params and FLOPs refer to the number of parameters and the number of floating-point operations, respectively, assessed on ImageNet-1K.

6. Experiments

The methodology of the research can be divided into four steps; a flowchart of the crop disease and insect pest detection system is shown in Figure 8. First, based on transfer learning, the cloud computing platform uses the ImageNet-1K database to train the ShuFormer network and obtain pre-trained model parameters. Second, the pre-trained parameters are used to train the disease recognition network model, computing the error by forward propagation and updating the network parameters by back propagation; the program runs a preset number of iterations and stops once the maximum number of iterations is reached. Third, the trained network is deployed to an embedded mobile platform. Finally, when a relevant crop disease is found, the data and information are sent back to the servers through the node so that the monitoring platform can handle the crop growth situation in time. The tomato disease datasets number 21,400 images across 10 classes. Here, the tomato disease dataset is divided into a training set and a test set based on the collected information.

6.1. Fine Tuning with Transfer Learning

Fine-tuning is the process of training a pre-trained model on a new dataset while replacing the last layer with the prediction module of the new model. This method is faster and more accurate than training the whole model from scratch [34]. The ImageNet dataset contains approximately 1.2 million images in 1000 class categories. In this experiment, pre-trained models trained on the ImageNet dataset were used. The tomato disease dataset used in this research contains 21,400 images categorized into 10 classes. As this is a small dataset for deep learning, ShuFormer is initialized with the weights of the model trained on ImageNet.
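A minimal fine-tuning sketch is shown below. Because the ShuFormer weights are not public, a torchvision ResNet-50 pre-trained on ImageNet stands in for the pre-trained backbone; the 1000-way ImageNet head is replaced by a 10-class head for the tomato diseases, and freezing the early layers is an optional, illustrative choice.

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 10  # the ten tomato disease categories

# Load a backbone pre-trained on ImageNet (a stand-in for the pre-trained ShuFormer weights).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet head with a new 10-class prediction layer.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optionally freeze the early stages so that only the last stage and the new head are fine-tuned.
for name, param in model.named_parameters():
    if not (name.startswith("layer4") or name.startswith("fc")):
        param.requires_grad = False
```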

6.2. Experimental Environment and Evaluation Metrics

During the training procedure, this paper resized the input images to 224 × 224 and performed image augmentations such as random filling, cropping, and flipping. The Adam optimizer was used to train the network for 300 epochs with a batch size of 64. The initial learning rate was $6 \times 10^{-5}$ ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) and was halved every 50 epochs, with training carried out on four NVIDIA RTX 2080 Ti (11 GB) GPUs. The Params and FLOPs were based on the ImageNet-1K evaluation; the model was pre-trained on ImageNet-1K, fine-tuned on the Changping, Shunyi, and Tongzhou datasets, and then used for prediction.
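The training loop below sketches this configuration (Adam with lr = 6 × 10⁻⁵ halved every 50 epochs, batch size 64, 300 epochs, cross-entropy loss); `model` is the fine-tuned network from the previous sketch and `train_loader` is an assumed DataLoader yielding augmented 224 × 224 crops and labels.

```python
import torch

EPOCHS, BATCH_SIZE = 300, 64

# `model` comes from the fine-tuning sketch above; `train_loader` is an assumed DataLoader.
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)  # halve lr every 50 epochs
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```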
In the collected datasets, each image has been manually classified into one of the categories bacterial spot, gray mold, striped rot, cracked fruit, blossom end rot, late blight, anthracnose, citrus canker, early blight, and leaf mold; these labels serve as the ground truth. The predicted classes were obtained by running the classifier on the test set and collecting the labels assigned to the test images. Classification performance is evaluated by comparing the true labels with the predicted labels, yielding counts of true positives (TP), false positives (FP), and false negatives (FN). The metrics used during evaluation were Precision, Recall, and the $F_1$-Score [35], which balances false positives and false negatives as the harmonic mean of Precision and Recall:

$$ F_1\text{-}Score = \frac{2 \times Recall \times Precision}{Recall + Precision} \tag{6} $$
Cross-validation is used to evaluate the performance of the model, which requires calculating the mean $\mu$ across folds:

$$ \mu_{F_1\text{-}Score} = \frac{\sum_{i=1}^{N} F_1\text{-}Score_i}{N} \tag{7} $$

where $N$ represents the number of folds of the dataset generated by the cross-validation process, as shown in Figure 9. As the loss index, this study uses the cross-entropy loss between the ground-truth class and the predicted class.
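For reference, per-fold metrics and the cross-fold mean can be computed as in the sketch below, which uses scikit-learn's precision_recall_fscore_support as a stand-in for the evaluation code; `folds` is an assumed list of (ground-truth, prediction) pairs, one per validation fold.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def fold_metrics(y_true, y_pred):
    # Macro-averaged precision, recall, and F1 over the 10 disease classes for one fold.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return precision, recall, f1

# mu_F1 (Equation (7)): mean of the per-fold F1-scores over the N folds.
# `folds` is an assumed list of (ground_truth_labels, predicted_labels) pairs.
fold_scores = [fold_metrics(y_true, y_pred)[2] for y_true, y_pred in folds]
mu_f1 = np.mean(fold_scores)
```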
The model was first pre-trained on the ImageNet-1K dataset and then evaluated on the tomato disease datasets collected from the different regions.

6.3. Results

The experimental results of our method were compared with several state-of-the-art network architectures: the CNN-based ResNet-152 [36], Inception-v4 [37], and EfficientNet-B5 [38], and the transformer-based CeiT-S [39], TNT-S [40], and DeiT-S [41], all trained alongside ShuFormer to compare their performance on our custom datasets comprising 10 classes. All models were trained using five-fold cross-validation with the leave-one-fold-out setting. In this setting, the models were trained on 12,900 samples, validated on 4300, and tested on 4200 image samples. The accuracy and loss of the models flatten out after the 30th epoch. This paper also reports the number of Params and FLOPs of each model. The average $F_1$-Scores and losses of the considered models are reported in Table 4; the $F_1$-Score was calculated using Equations (6) and (7) with $N = 5$ and $K = 1$.
The experimental results show that ShuFormer-Base was the most effective method, with strong recognition ability across regions: its $F_1$-Score was 96.69%, 95.83%, and 96.39% in the Changping, Shunyi, and Tongzhou districts, respectively. ShuFormer outperforms the other methods in each region and has strong feature extraction power. ShuFormer-Small has a small number of parameters and a low computation cost while matching the accuracy of EfficientNet-B5, CeiT-S, and TNT-S; among all compared models it achieves the best results in the Shunyi and Tongzhou districts and is 0.09 below the $F_1$-Score of DeiT-S in the Changping area. Despite the reduced model size, ShuFormer has the best accuracy performance for plant disease identification.
We then used smaller training datasets to examine the effect on performance. We carried out five-fold cross-validation leaving K folds out, where K is varied from 1 to 4, and the test datasets were used to evaluate the performance of the model. Changing the number of training images has a direct influence on the performance of the trained ShuFormer model, as shown in Table 5. The results demonstrate the behavior of the five-fold cross-validation in disease classification.
The five-fold scheme divides the dataset of 17,200 images into 5 folds of 3440 images each. When K = 1, one fold is left out as the validation set and the training set contains 12,900 images; when K = 2, two folds (6880 images) are left out and the training set retains 10,320 images. Leaving one fold out gives the highest $F_1$-Score evaluation, with the average $F_1$-Score reaching 96.3%; in the Changping and Tongzhou districts it was 96.69% and 96.39%, respectively, at K = 1. Although the $F_1$-Score of the model decreases for K > 1, the overall reduction is within 1%, which indicates that the model is robust at different training data levels and can address the problem of small-sample identification.
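The leave-K-out splitting itself can be sketched as follows, using scikit-learn's KFold to create five folds and holding out K of them for validation in each round; the pool size of 17,200 follows the description above, while the rotation scheme is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

def leave_k_out_splits(num_samples: int, k: int, n_splits: int = 5, seed: int = 0):
    """Split the sample indices into n_splits folds, then for each round hold out
    k folds for validation and train on the remaining n_splits - k folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = [val_idx for _, val_idx in kf.split(np.arange(num_samples))]
    splits = []
    for start in range(n_splits):
        held_out = [folds[(start + i) % n_splits] for i in range(k)]
        kept = [folds[(start + i) % n_splits] for i in range(k, n_splits)]
        splits.append((np.concatenate(kept), np.concatenate(held_out)))
    return splits  # list of (train_indices, val_indices) pairs

# Example: five folds with one fold held out per round (K = 1)
for train_idx, val_idx in leave_k_out_splits(num_samples=17200, k=1):
    pass  # train on train_idx, evaluate on val_idx, then average the per-fold F1-scores
```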
These experimental results show the performance of vision transformer models with small datasets and transfer learning.
Furthermore, this paper used five-fold cross-validation with K = 1, using one fold as the test set, to predict each type of disease in multiple regions and verify the robustness of the model across diseases and regions, as shown in Figure 10. Changping district mainly covers early- and middle-stage tomato planting; Shunyi district mainly covers middle-stage tomato diseases and also contains some early and late diseases; Tongzhou district mainly covers late-stage tomato diseases. As shown in Table 6, ShuFormer exceeded 95% $F_1$-Score in each region and for each disease, verifying the robustness of the model. Where geographical location and geology are similar, plant disease characteristics are also similar, so the disease identification model ShuFormer can be extended to similar geographical locations and geologies.
The performance of ShuFormer was contrasted with the CNN-based models ResNet-152, Inception-v4, and EfficientNet-B5 and the transformer-based CeiT-S, TNT-S, and DeiT-S for a decreasing number of training images. The experimental results are shown in Figure 11. The $F_1$-Scores of ResNet-152, Inception-v4, and EfficientNet-B5 decrease as K increases. In contrast, the ShuFormer model maintained high performance as the number of training images shrank: its $F_1$-Score was 95.35%, against a peak of 96.3%, a reduction of only about 1%. The EfficientNet-B5 performance was reduced by 4.51% (from 94.73% at K = 1 to 90.22% at K = 4). CeiT-S, TNT-S, and DeiT-S also decreased as K increased, with an average decrease of approximately 2%. The experimental results demonstrate that ShuFormer is better suited to the image classification of plant diseases than the current CNN-based and transformer-based models when dealing with small training datasets.

7. Conclusions

In this study, a multi-grained feature extraction model based on the vision transformer was proposed for the task of identifying tomato leaf diseases in Beijing. Taking planting bases in different districts as examples, similar disease characteristics were demonstrated under a similar geographical environment and climate and applied to regional disease identification. This paper learns and classifies tomato disease images with visual models. The results obtained on the datasets collected from different regions show that applying vision transformer and CNN learning to agricultural problems is a promising direction. Compared with current state-of-the-art CNN-based models, such as ResNet and EfficientNet, the ShuFormer model, with its high recognition accuracy, is the preferred classification model.
To efficiently obtain important features, we leveraged the fusion of CNN and transformer blocks to strengthen the network’s ability to utilize contextual information based on both local and global features. Depth-wise separable convolution was used to establish associations between feature map groups and to reduce the number of parameters and the computation of the model’s feature encoders. Our method identifies fine-grained differences between plant diseases more precisely than traditional CNN and ViT models, thus greatly improving the accuracy of disease identification.
Moreover, the high performance of the ShuFormer models was demonstrated especially on small training datasets, on which high accuracy was still achieved. In this regard, we conclude that combining the visual transformer with classical CNN-based models can change the approach to handling visual tasks in agricultural recognition. Despite these promising results, problems remain, such as the reduced reliability of visual transformers in image recognition tasks after significant changes in image acquisition conditions (resolution, illuminance, plant growth stages, etc.) in the field. To address this problem, a highlight-removal method (HRGAN) is proposed as future work, which would learn to remove highlights from unpaired training data and to capture the relationship between the highlighted and non-highlighted image domains.
As a future development direction, we will deploy the model on edge-embedded devices to realize an intelligent agricultural crop pest identification system and a real-time detection model. Once implemented, it can be applied in smart agriculture, and follow-up control and monitoring applications can be developed to help farmers monitor crop growth and pest activity in time. Ordinary farmers only need to know how to use a smartphone, so the learning cost is very low.

Author Contributions

Conceptualization, Y.W. and Y.C.; methodology, Y.W.; software, Y.W. and D.W.; validation, Y.W. and D.W.; formal analysis, Y.W. and Y.C.; investigation, Y.W. and Y.C.; resources, Y.C.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.C.; visualization, D.W.; supervision, Y.C.; project administration, Y.W.; funding acquisition, Y.W. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This project is supported by the national key science and technology infrastructure project “National Research Facility for Phenotypic and Genotypic Analysis of Model Animals”, grant number “4444-10099609”.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The image data used in this study are available from the email: [email protected] upon request.

Acknowledgments

The authors also thank the editor and anonymous reviewers for providing helpful suggestions for improving the quality of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ramankutty, N.; Mehrabi, Z.; Waha, K.; Kremen, C.; Herrero, M.; Rieseberg, L.H. Trends in global agricultural land use: Implications for environmental health and food security. Annu. Rev. Plant Biol. 2018, 69, 789–815. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Sishodia, R.P.; Ray, R.L.; Singh, S.K. Applications of remote sensing in precision agriculture: A review. Remote Sens. 2020, 12, 3136. [Google Scholar] [CrossRef]
  3. Ayaz, M.; Ammad-Uddin, M.; Sharif, Z.; Mansour, A.; El-Hadi, M.A. Internet-of-Things (IoT)-based smart agriculture: Toward making the fields talk. IEEE Access 2019, 7, 129551–129583. [Google Scholar] [CrossRef]
  4. Davis, R.L.; Greene, J.K.; Dou, F.; Jo, Y.K.; Chappell, T.M. A practical application of unsupervised machine learning for analyzing plant image data collected using unmanned aircraft systems. Agronomy 2020, 10, 633. [Google Scholar] [CrossRef]
  5. Seetharaman, K.; Mahendran, T. Leaf Disease Detection in Banana Plant using Gabor Extraction and Region-Based Convolution Neural Network (RCNN). J. Inst. Eng. Ser. A 2022, 103, 501–507. [Google Scholar] [CrossRef]
  6. Shah, D.; Trivedi, V.; Sheth, V.; Shah, A.; Chauhan, U. ResTS: Residual deep interpretable architecture for plant disease detection. Inf. Process. Agric. 2021, 9, 212–223. [Google Scholar] [CrossRef]
  7. Bhimte, N.R.; Thool, V.R. Diseases Detection of Cotton Leaf Spot using Image Processing and Svm Classifier. In Proceedings of the 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 14–15 June 2018; pp. 340–344. [Google Scholar]
  8. Adedoja, A.; Owolawi, P.A.; Mapayi, T. Deep Learning Based on Nasnet for Plant Disease Recognition using Leave Images. In Proceedings of the 2019 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD), Bucharest, Romania, 5–6 August 2019; pp. 1–5. [Google Scholar]
  9. Murugeswari, R.; Anwar, Z.S.; Dhananjeyan, V.R.; Karthik, C.N. Automated Sugarcane Disease Detection Using Faster RCNN with an Android Application. In Proceedings of the 2022 6th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 28–30 April 2022; pp. 1–7. [Google Scholar]
  10. Ghoury, S.; Sungur, C.; Durdu, A. Real-Time Diseases Detection of Grape and Grape Leaves using Faster R-CNN and SSD MobileNet Architectures. In Proceedings of the International Conference on Advanced Technologies, Computer Engineering and Science (ICATCES 2019), Antalya, Turkey, 26–28 April 2019; pp. 39–44. [Google Scholar]
  11. Afzaal, U.; Bhattarai, B.; Pandeya, Y.R.; Lee, J. An instance segmentation model for strawberry diseases based on mask R-CNN. Sensors 2021, 21, 6565. [Google Scholar] [CrossRef] [PubMed]
  12. Atila, Ü.; Uçar, M.; Akyol, K.; Akyol, K.; Uçar, E. Plant leaf disease classification using EfficientNet deep learning model. Ecol. Inform. 2021, 61, 101182. [Google Scholar] [CrossRef]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  14. Qian, X.; Zhang, C.; Chen, L.; Ke, L. Deep learning-based identification of maize leaf diseases is improved by an attention mechanism: Self-Attention. Front. Plant Sci. 2022, 13, 864486. [Google Scholar] [CrossRef] [PubMed]
  15. Agarwal, M.; Singh, A.; Arjaria, S.; Sinha, A.; Gupta, S. ToLeD: Tomato leaf disease detection using convolution neural network. Procedia Comput. Sci. 2020, 167, 293–301. [Google Scholar] [CrossRef]
  16. Bhujel, A.; Kim, N.E.; Arulmozhi, E.; Basak, J.K.; Kim, H.T. A lightweight Attention-based convolutional neural networks for tomato leaf disease classification. Agriculture 2022, 12, 228. [Google Scholar] [CrossRef]
  17. Sembiring, A.; Away, Y.; Arnia, F.; Muharar, R. Development of concise convolutional neural network for tomato plant disease classification based on leaf images. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2021; Volume 1845, p. 012009. [Google Scholar]
  18. Hettiarachchi, D.; Fernando, V.; Kegalle, H.; Halloluwa, T. UrbanAgro: Utilizing Advanced Deep Learning to Support Sri Lankan Urban Farmers to Detect and Control Common Diseases in Tomato Plants. In Application of Machine Learning in Agriculture; Academic Press: Cambridge, MA, USA, 2022; pp. 263–282. [Google Scholar]
  19. Gao, D.; Sun, Q.; Hu, B.; Zhang, S. A framework for agricultural pest and disease monitoring based on internet-of-things and unmanned aerial vehicles. Sensors 2020, 20, 1487. [Google Scholar]
  20. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer Neural Network for Weed and Crop Classification of High Resolution UAV Images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
  21. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
  22. Hirose, S.; Wada, N.; Katto, J.; Sun, H. ViT-GAN: Using Vision Transformer as Discriminator with Adaptive Data Augmentation. In Proceedings of the 2021 3rd International Conference on Computer Communication and the Internet (ICCCI), Nagoya, Japan, 25–27 June 2021; pp. 185–189. [Google Scholar]
  23. Torbunov, D.; Huang, Y.; Yu, H.; Huang, J.; Yoo, S.; Lin, M.; Viren, B.; Ren, Y. UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation. arXiv 2022, arXiv:2203.02557. [Google Scholar]
  24. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 22–31. [Google Scholar]
  25. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.; Tay, F.E.H.; Feng, J.; Yan, S. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 558–567. [Google Scholar]
  26. Wu, S.; Sun, Y.; Huang, H. Multi-granularity Feature Extraction Based on Vision Transformer for Tomato Leaf Disease Recognition. In Proceedings of the 2021 3rd International Academic Exchange Conference on Science and Technology Innovation (IAECST), Guangzhou, China, 10–12 December 2021; pp. 387–390. [Google Scholar]
  27. Thai, H.T.; Tran-Van, N.Y.; Le, K.H. Artificial Cognition for Early Leaf Disease Detection using Vision Transformers. In Proceedings of the 2021 International Conference on Advanced Technologies for Communications (ATC), Ho Chi Minh, Vietnam, 14–16 October 2021; pp. 33–38. [Google Scholar]
  28. Hirani, E.; Magotra, V.; Jain, J.; Bide, P. Plant Disease Detection Using Deep Learning. In Proceedings of the 2021 6th International Conference for Convergence in Technology (I2CT), Mumbai, India, 2–4 April 2021; pp. 1–4. [Google Scholar]
  29. Shit, R.C.; Sharma, S.; Puthal, D.; Zomaya, A.Y. Location of Things (LoT): A Review and Taxonomy of Sensors Localization in IoT Infrastructure. IEEE Commun. Surv. Tutor. 2018, 20, 2028–2061. [Google Scholar]
  30. Chen, J.; Wang, S.; Ouyang, M.; Xuan, Y.; Li, K.-C. Iterative Positioning Algorithm for Indoor Node Based on Distance Correction in WSNs. Sensors 2019, 19, 4871. [Google Scholar] [CrossRef] [Green Version]
  31. Akhtar, M.N.; Shaikh, A.J.; Khan, A.; Awais, H.; Bakar, E.A.; Othman, A.R. Smart sensing with edge computing in precision agriculture for soil assessment and heavy metal monitoring: A review. Agriculture 2021, 11, 475. [Google Scholar] [CrossRef]
  32. Kalyani, Y.; Collier, R. A systematic survey on the role of cloud, fog, and edge computing combination in smart agriculture. Sensors 2021, 21, 5922. [Google Scholar] [CrossRef]
  33. Jin, H.; Li, Y.; Qi, J.; Feng, J.; Tian, D.; Mu, W. GrapeGAN: Unsupervised image enhancement for improved grape leaf disease recognition. Comput. Electron. Agric. 2022, 198, 107055. [Google Scholar] [CrossRef]
  34. Ganatra, N.; Patel, A. Performance analysis of fine-tuned convolutional neural network models for plant disease classification. Int. J. Control. Autom. 2020, 13, 293–305. [Google Scholar]
  35. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2020, 17, 168–192. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  37. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  38. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  39. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 579–588. [Google Scholar]
  40. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  41. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
Figure 1. Beijing district map. The district number consists of two parts: the location region and the functional area. The four location regions are numbered 1–4, in order from the city center to the suburbs. The upper right part shows the map appendix for Beijing regions and affiliated districts; the lower right part shows the tomato diseases. The * marks indicate the sampled districts on the map.
Figure 2. The monitoring agriculture scenario for tomato plants.
Figure 3. Traditional data augmentation methods used to expand the tomato disease images.
Figure 4. Overall flow of ShuFormer.
Figure 5. Structure of the block of convolutional neural network branch.
Figure 6. Structure of the block of depth-wise separable convolution.
Figure 7. Structure of the block of transformer branch.
Figure 8. Flow chart of the tomato disease detection system.
Figure 9. Five-fold cross-validation, leaving one for validation and the remaining four for training.
Figure 10. Multiple diseases in different regions. The * sign marks the tomato data collection areas in Beijing.
Figure 11. Comparison between ShuFormer, Inception-v4, EfficientNet-B5, CeiT-S, TNT-S, DeiT-S, and ResNet-152 on their respective performance with different number of K .
Table 1. Datasets used for image classification tasks.
Dataset     Train Size   Test Size   Classes
Changping   6720         1580        7
Shunyi      4800         1200        7
Tongzhou    5680         1420        7
Table 2. Data augmentation techniques and parameters used in this experiment.
No.   Data Augmentation Technique   Parameters
1     Cutout adjustment             Delta = 0.5
2     Mixup adjustment              Alpha = 0.4
3     Gaussian filtering            Sigma_range = [0.5, 1.5]
4     Median filtering              Kernel_size_range = [3, 5]
5     Brightness adjustment         Gamma = 2.0
6     Rotation adjustment           Angle_range = [−150, 150]
Table 3. Network configurations of ShuFormer (the parameters of building blocks are shown in brackets, with the numbers of blocks stacked).
Stage     Output          CNN Branch                          Fusion Block                         Transformer Branch
Head      56 × 56 × 48    K = 3, S = 3, C = 48                –                                    –
Stage 1   56 × 56 × 48    [K1 = 3, G1 = 48, C1 = 96] × b1     Rearrange, Channel shuffle, Split    [W1 = 7, H1 = 3, D1 = 96] × b1
Stage 2   28 × 28 × 96    [K1 = 3, G1 = 48, C1 = 96] × b2     Rearrange, Channel shuffle, Split    [W1 = 7, H1 = 6, D1 = 96] × b2
Stage 3   14 × 14 × 192   [K1 = 3, G1 = 48, C1 = 96] × b3     Rearrange, Channel shuffle, Split    [W1 = 7, H1 = 12, D1 = 24] × b3
Stage 4   7 × 7 × 384     [K1 = 3, G1 = 48, C1 = 96] × b4     Rearrange, Channel shuffle, Split    [W1 = 7, H1 = 48, D1 = 96] × b4
FC        1 × 1 × 100     Linear 1028-d, Softmax, Linear 1000-d

Variant                        ShuFormer-Small   ShuFormer-Base
Block_num [b1, b2, b3, b4]     [2, 2, 6, 2]      [2, 2, 18, 2]
Params                         7.81 M            13.21 M
FLOPs                          1.13 G            2.19 G
Table 4. Comparison of agricultural image classification models based on state-of-the-art CNN-based and transformer models.
Model                    Params    FLOPs     Changping   Shunyi   Tongzhou
ResNet-152 [36]          60.19 M   11.56 G   94.3        92.6     93.5
Inception-v4 [37]        41.1 M    16.1 G    93.8        91.5     93.1
EfficientNet-B5 [38]     28.0 M    9.9 G     95.5        93.1     95.6
CeiT-S [39]              24.2 M    4.5 G     94.8        92.8     94.2
TNT-S [40]               23.8 M    17.3 G    94.5        93.1     94.3
DeiT-S [41]              22.34 M   4.26 G    94.7        93.8     93.8
ShuFormer-Small (ours)   7.81 M    1.13 G    93.6        94.5     95.9
ShuFormer-Base (ours)    13.21 M   2.19 G    96.69       95.83    96.39
Table 5. The classification reports generated from 5-fold. K = 1 represents the highest number of training images (12,900), and K = 4 represents the lowest number of training images (3440). The average F 1 S c o r e is reported for each district obtained with the ShuFormer-base model.
Area        μF1-Score (K = 1)   μF1-Score (K = 2)   μF1-Score (K = 3)   μF1-Score (K = 4)
Changping   96.69               96.48               96.09               95.74
Shunyi      95.83               95.64               95.39               95.08
Tongzhou    96.39               95.92               95.74               95.22
Avg         96.30               96.01               95.74               95.35
Table 6. Comparison of plant disease classification reports generated from 5-fold cross-validation. The average precision, recall, and F 1 S c o r e are reported for each class obtained with the ShuFormer-base model.
                       Changping                     Shunyi                        Tongzhou
Disease                μPrec    μRec    μF1-Score    μPrec    μRec    μF1-Score    μPrec    μRec    μF1-Score
bacterial spot         96.1     97.3    96.79        95.1     96.3    95.70        –        –       –
gray mold              96.8     95.5    96.14        94.5     96.2    95.34        95.6     97.3    96.44
striped rot            96.2     97.4    96.80        95.4     96.6    96.00        96.8     95.1    95.94
cracked fruit          –        –       –            –        –       –            96.6     97.8    97.20
blossom end rot        96.6     97.4    97.00        –        –       –            95.8     96.4    96.10
late blight disease    –        –       –            –        –       –            97.3     95.6    96.44
anthracnose            97.4     95.6    96.49        95.4     96.8    96.09        96.5     97.2    96.85
citrus canker          96.8     97.8    97.30        95.3     96.4    95.85        95.4     96.1    95.75
early blight disease   –        –       –            96.4     95.6    96.00        –        –       –
leaf mold              96.2     96.6    96.40        95.5     96.2    95.85        –        –       –
avg                    96.59    96.8    96.69        95.37    96.3    95.83        96.29    96.50   96.39
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
