*Article* **Deep Learning Application for Crop Classification via Multi-Temporal Remote Sensing Images**

**Qianjing Li 1,2, Jia Tian 1,3,\* and Qingjiu Tian 1,2**


**Abstract:** The combination of multi-temporal images and deep learning is an efficient way to obtain accurate crop distributions and has thus drawn increasing attention. However, few studies have compared deep learning models with different architectures, so it remains unclear how a deep learning model should be selected for multi-temporal crop classification and what the best achievable accuracy is. To address this issue, the present work compares and analyzes crop classification based on deep learning models and different time-series data to explore the possibility of improving crop classification accuracy. Using multi-temporal Sentinel-2 images as source data, time-series classification datasets are constructed based on vegetation indexes (VIs) and spectral stacking, respectively, following which we compare and evaluate crop classification based on these time-series datasets and five deep learning architectures: (1) one-dimensional convolutional neural networks (1D-CNNs), (2) long short-term memory (LSTM), (3) two-dimensional CNNs (2D-CNNs), (4) three-dimensional CNNs (3D-CNNs), and (5) two-dimensional convolutional LSTM (ConvLSTM2D). The results show that the accuracy of both 1D-CNN (92.5%) and LSTM (93.25%) is higher than that of random forest (~91%) when using a single temporal feature as input. The 2D-CNN model integrates temporal and spatial information and is slightly more accurate (94.76%), but fails to fully utilize the multi-spectral features. The accuracy of the 1D-CNN and LSTM models integrating temporal and multi-spectral features is 96.94% and 96.84%, respectively; however, neither model can extract spatial information. The accuracy of the 3D-CNN and ConvLSTM2D models is 97.43% and 97.25%, respectively. The experimental results show limited accuracy for crop classification based on single temporal features, whereas combining temporal features with multi-spectral or spatial information significantly improves classification accuracy. The 3D-CNN and ConvLSTM2D models are thus the best deep learning architectures for multi-temporal crop classification. However, the ConvLSTM architecture, which combines recurrent neural networks and CNNs, should be further developed for multi-temporal image crop classification.

**Keywords:** crop type classification; deep learning; multi-temporal; remote sensing

#### **1. Introduction**

Detailed and accurate information on crop-type cultivation is essential for developing economically and ecologically sustainable agricultural strategies in a changing climate, and for satisfying human food demands [1]. Multi-temporal remote sensing (RS) images acquired throughout the growing season provide an effective method for acquiring crop cover information over large areas [1,2]. Multi-temporal images can be used to distinguish crop growth states and the phenological characteristics of crops. In addition, they provide enriched features that allow more complex and stable crop classification tasks. They have thus seen wide use in the field of agricultural RS [3,4].

Two main strategies are available for multi-temporal crop classification. The first strategy is to stack multi-temporal images by time sequence and classify them with classifiers such as support vector machine (SVM), random forest, and maximum likelihood [5,6].

**Citation:** Li, Q.; Tian, J.; Tian, Q. Deep Learning Application for Crop Classification via Multi-Temporal Remote Sensing Images. *Agriculture* **2023**, *13*, 906. https://doi.org/ 10.3390/agriculture13040906

Academic Editor: Roberto Alves Braga Júnior

Received: 1 April 2023 Revised: 17 April 2023 Accepted: 19 April 2023 Published: 20 April 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

However, this approach does not model temporal correlations and uses features independently, ignoring possible temporal dependencies [6,7]. Most classifiers such as SVM rely heavily on features that are not designed for time-series data, making it difficult to exploit any inherent time-series variability. In addition, the stacked images increase redundancy, and the curse of dimensionality worsens with increasing time-series length, which negatively affects classification performance [6,8]. The second strategy is to derive new images from reflectance images by using spectral indices, such as the normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI), and then construct time-series data to reveal the temporal patterns of the different features. With this method, crops and other vegetation are classified with high accuracy. However, the classification results of this method are limited strictly by the number of images in the time series: if the number is too small, the temporal pattern has little effect on classification performance [8]. In addition, manual feature engineering based on human experience and prior knowledge is essential with this approach, which increases the complexity of processing and computation [7,9]. Moreover, constructing VIs from specific spectral features ignores other spectral bands, which in turn affects the classification performance.

Current multi-temporal RS images are multi-spectral, multi-temporal, and multi-spatial. In multi-temporal images, crops are represented via variations in temporal, spectral, and spatial features. These features can be comprehensively included in four-dimensional (4D: time, height, width, and band) data that require classification models to learn and represent temporal, spectral, and spatial features. Multi-temporal images thus pose new challenges to the models used for data processing, so integrating multi-temporal images and continuously improving crop classification accuracy requires continued attention.

Deep learning is a breakthrough technique in machine learning that outperforms traditional algorithms in terms of feature extraction and representation [5–7], which has led to its application in numerous RS classification tasks [8–10]. Convolutional neural networks (CNNs) produce more accurate results than other models in most RS image classification problems [8,9,11]. The one-dimensional CNN (1D-CNN) model is commonly used to extract spectral features from hyperspectral images or temporal features from time-series images, providing an effective and efficient method for crop identification in time-series RS images [12]. The CNN learning process is computationally efficient and insensitive to data shifts such as image translation, allowing CNN models to recognize image patterns in two dimensions (2D) [13]. Three-dimensional (3D) CNN models use the spatial, temporal, and spectral information in multi-temporal images, and therefore are widely used in multi-temporal crop classification [11,14]. Long short-term memory (LSTM), a variant of recurrent neural networks (RNNs), is a natural candidate to represent temporal dependency over various temporal periods with gated recurrent connections [9,15]. LSTM models have been widely used for multi-temporal crop classification because they can analyze sequential data [9,16,17]. For multi-temporal crop classification, both CNN and RNN provide more accurate results than machine learning and traditional classification [5,9,11]. However, various deep learning architectures produce different results when applied to multi-temporal crop classification and differ in how they learn and represent crop spectral, spatial, and temporal information.

Convolutional LSTM (ConvLSTM) is a type of RNN with internal matrix multiplication replaced by convolution operations [18]. ConvLSTM, integrating both LSTM and CNN structures, shows unexpected adaptability to multi-temporal images [19–21]. However, due to the prevalence of CNNs and RNNs and the requirement for higher data dimensions, the ConvLSTM model is less commonly used in multi-temporal crop classification. Nevertheless, the potential of the ConvLSTM model deserves further exploration.

To summarize, multi-temporal images pose a new challenge to classification models in terms of data processing and feature extraction, but also open new opportunities for using data-driven deep learning to classify RS images. In this work, we use multi-temporal Sentinel-2 RS images as input data, and analyze the advantages of using such data and the structural advantages of various deep learning models. This research investigates (1) the possibility of using multi-temporal images for more accurately classifying crops; (2) the contribution of spectral, temporal, and spatial information to multi-temporal crop classification; and (3) the potential and requirements of using deep learning for multi-temporal crop classification. We also (4) search for a feasible and suitable deep learning model that provides optimum classification accuracy from multi-temporal images. Although such deep learning models have long been used for RS applications, this work compares and analyzes multi-temporal crop classification based on the deep learning architectures of CNN, LSTM, and ConvLSTM.

#### **2. Materials**

#### *2.1. Study Area*

The study area, Norman County, is located in northwestern Minnesota (Figure 1), a highly productive agricultural state in the United States. Minnesota lies in the Great Plains of the central United States, and agricultural land covers the vast majority of the study area. The continental climate of the region is cold in the winter and hot and humid in the summer, with 600 mm/year of precipitation. The highest temperatures occur in July and the lowest in January, with an average of 197 sunny days per year. These climatic and temperature conditions make single-season crop cultivation the main cropping system. The major crops in this region are corn, soybeans, sugar beets, and spring wheat, which are planted in about 89% of the study area. Corn planting begins at the end of April, and the crop matures in September and is harvested through October. Soybeans are planted in May and harvested from mid-September through the end of October. Spring wheat is sown in early April and harvested from mid-July through August. Sugar beets are planted in mid-April, mature in September, and are harvested by the end of October.

**Figure 1.** False color image and the Cropland Data Layer (CDL) of the study area.

#### *2.2. Data*

#### 2.2.1. Remote Sensing Images

Sentinel-2 images were downloaded from the Sentinel Hub (https://www.sentinelhub.com/ (accessed on 28 October 2022)). Cloud-free images from April 2021 to October 2021 were selected to cover the entire crop growing season. A total of 13 Sentinel-2 images (Tables 1 and 2) were selected as the main input data for the experiment. Data preparation, performed with the Sentinel Application Platform (SNAP), involved stacking the bands, resampling the 20 m spectral bands to 10 m, and removing the coastal aerosol, water vapor, and cirrus bands.


**Table 1.** Spectral bands of Sentinel-2 images.

**Table 2.** Acquisition time of Sentinel-2 images.


2.2.2. Training and Validation Samples

The Cropland Data Layer (CDL) is a crop-type distribution product published by the United States Department of Agriculture and the National Agricultural Statistics Service. The 2021 CDL (Figure 1) for Norman County has a spatial resolution of 30 m and was obtained from the CropScape website portal (https://nassgeodata.gmu.edu/CropScape/ (accessed on 20 October 2022)). Although the CDL is not absolute ground truth, it is the most accurate crop-type product available, especially for corn and soybeans, with over 95% accuracy [22]. In Minnesota, the accuracies for several major crop types are close to or above 95% [23]. Therefore, crop samples for training and testing our crop classification model were selected by visual interpretation of the multi-temporal Sentinel-2 images, guided by the CDL data.

Based on the CDL, crop types in the study area were classified as corn, soybeans, sugar beets, spring wheat, and "other." The latter category ("other") includes all surface cover types except the four major crops. To ensure sample representativeness and satisfy the data-size requirements of the deep learning models, the samples were selected so that the sample points are distributed throughout the study area, the central sample pixel type is consistent with that of the surrounding pixels, and the sample pixel type is the dominant type in the local neighborhood. The sample points were generated with a random-point function and labeled by visual interpretation. Table 3 details the samples used for training the classification model and evaluating the accuracy. To train the model, the training and validation samples in Table 3 are randomly divided into training and validation subsets at a ratio of 7:3.

**Table 3.** The five categories used in the present study for classification and the number of samples.
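A minimal sketch of the 7:3 random split described above, using scikit-learn; the placeholder arrays, the stratification by class, and the random seed are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder sample set: 1000 single-VI time series of length 13 with 5 class labels
samples = np.random.rand(1000, 13)
labels = np.random.randint(0, 5, size=1000)

# 7:3 split into training and validation subsets (stratification is an added assumption)
x_train, x_val, y_train, y_val = train_test_split(
    samples, labels, test_size=0.3, stratify=labels, random_state=42
)
print(x_train.shape, x_val.shape)
```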


#### **3. Methodology**

#### *3.1. Methodological Overview*

The overall workflow of this study is shown in Figure 2. First, we selected samples as described in Section 2.2.2. Next, different time-series datasets were constructed for the subsequent classification experiments (Section 3.2). Multiple deep learning models were then constructed (Section 3.5), with random forest used as the benchmark model. Details of the experiments are given in Section 3.6. Finally, all classification results were validated, compared, and analyzed.

**Figure 2.** General workflow of this study.

#### *3.2. Temporal Phenological Patterns*

Two main strategies are available to represent the temporal patterns of crops for multi-temporal image crop classification: (1) time-series VIs constructed from spectral characteristics, and (2) time-series multi-spectral bands based on spectral stacking [5], which means stacking multi-temporal images by time sequence. Both strategies have been used to construct time-series data to represent the temporal characteristics of crops. Given the sensitivity of the NDVI [24] and EVI [25] to the physiological state of vegetation and their wide application [5,9], these indices have been used to construct time-series data. Their formulas are as follows:

$$\text{NDVI} = (NIR - RED) / (NIR + RED) \tag{1}$$

$$\text{EVI} = G \times (NIR - RED) / (NIR + C_1 \times RED - C_2 \times BLUE + L), \tag{2}$$

where *G* = 2.5, *C*<sub>1</sub> = 6.0, *C*<sub>2</sub> = 7.5, and *L* = 1.0. *NIR*, *RED*, and *BLUE* represent the spectral reflectance of bands B8 (NIR), B4 (Red), and B2 (Blue) of Sentinel-2 (Table 1).
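As a worked illustration of Equations (1) and (2), the following Python sketch computes NDVI and EVI from Sentinel-2 reflectance arrays; the function and variable names, the use of NumPy arrays, and the small ε added to the denominators are assumptions for illustration, not part of the authors' processing chain.

```python
import numpy as np

def compute_ndvi_evi(b8_nir, b4_red, b2_blue):
    """Compute NDVI and EVI (Eqs. 1-2) from Sentinel-2 reflectance arrays.

    b8_nir, b4_red, b2_blue: NumPy arrays of surface reflectance (0-1 scale),
    e.g. one array per acquisition date when building a time series.
    """
    G, C1, C2, L = 2.5, 6.0, 7.5, 1.0
    eps = 1e-10  # avoid division by zero over water/shadow pixels

    ndvi = (b8_nir - b4_red) / (b8_nir + b4_red + eps)
    evi = G * (b8_nir - b4_red) / (b8_nir + C1 * b4_red - C2 * b2_blue + L + eps)
    return ndvi, evi

# Example: stack per-date NDVI images into a (time, height, width) series
# ndvi_series = np.stack([compute_ndvi_evi(nir[d], red[d], blue[d])[0] for d in dates])
```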

#### *3.3. Deep Learning Models*

A CNN is a multilayer feed-forward neural network. The advantages of local connectivity and weight sharing not only decrease the number of parameters but also reduce the complexity of the model, making CNNs well suited to processing large numbers of images [9,26]. CNNs may be one-dimensional (1D-CNN), two-dimensional (2D-CNN), or three-dimensional (3D-CNN), depending on the dimensionality of their convolution kernels. Sequence data are fed into 1D-CNNs for learning and representing sequence relationships. Patch-based 2D-CNNs can be used for learning and representing spatial and spectral features in images. Cube-based 3D-CNNs correspond to the spectral, spatial, and temporal features in multi-temporal images [12,14]. The LSTM solves the problems of vanishing gradients, exploding gradients, and deficiencies in long-term dependency representation that appear in RNNs. In LSTM, the gate mechanisms, which include the input gate, output gate, and forget gate, enhance or weaken the state of the data in the cell for information protection and control [16,17]. The ConvLSTM model is an improvement and extension of the LSTM model, wherein matrix multiplication in LSTM is replaced by a convolution at each gate [20]. The ConvLSTM model combines the structural advantages of LSTM and CNN: it not only captures the spatial context of the image but also models the long-term dependencies in the spectral domain. In addition, inter- and intra-layer data transfer enables the ConvLSTM to extract features more efficiently than a CNN or LSTM [18,19].
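For orientation, the Keras fragment below shows how a Conv1D layer and an LSTM layer each consume the same (time, feature) sequence discussed above; the layer widths and stacking are arbitrary illustrative choices, not the tuned models evaluated later.

```python
from tensorflow.keras import layers, models

seq_len, n_features = 13, 1  # 13 acquisition dates, one VI value per date

# 1D-CNN branch: convolution along the temporal axis
cnn = models.Sequential([
    layers.Input(shape=(seq_len, n_features)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),
])

# LSTM branch: gated recurrence over the same sequence
rnn = models.Sequential([
    layers.Input(shape=(seq_len, n_features)),
    layers.LSTM(32),
    layers.Dense(5, activation="softmax"),
])
```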

#### *3.4. Sample Dimensions*

Limited by the size and dimensions of samples in multi-temporal RS images, classification samples contain different spectral, temporal, and spatial information. This study uses various deep learning models to learn and represent spectral, temporal, and spatial information from multi-temporal images. The time-series classification data constructed from a VI have only temporal characteristics [9], and their samples are one-dimensional vectors (Figure 3a). The time-series data constructed directly from multi-spectral, multi-temporal images are two-dimensional matrices with the shape (band, time) (Figure 3b). The time-series data constructed from VIs including the spatial neighborhood are three-dimensional matrices (Figure 3c) with the shape (height, width, time). The multi-spectral features combined with the spatial neighborhood in multi-temporal images produce four-dimensional matrices with the shape (time, height, width, band) (Figure 3d). The "time" dimension in the three- or four-dimensional matrices is the number of time steps in the time series.

**Figure 3.** Time-series samples with different dimensions. (**a**) 1-D time-series, (**b**) 2-D time-series, (**c**) 3-D time-series, (**d**) 4-D time-series.
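To make the four sample layouts in Figure 3 concrete, the sketch below builds NumPy arrays with the corresponding shapes; the specific sizes (13 dates, a 9 × 9 neighborhood, 10 bands) follow the settings reported in Section 3.5 and are otherwise illustrative.

```python
import numpy as np

T, H, W, B = 13, 9, 9, 10  # time steps, patch height/width, spectral bands

# (a) 1-D time series of a single VI for one pixel: shape (time,)
vi_series = np.zeros(T)

# (b) 2-D time series stacking all bands of one pixel: shape (band, time)
spectral_series = np.zeros((B, T))

# (c) 3-D time series of a single VI over a spatial neighborhood: (height, width, time)
vi_patch_series = np.zeros((H, W, T))

# (d) 4-D time series with spatial and spectral context: (time, height, width, band)
full_series = np.zeros((T, H, W, B))

print(vi_series.shape, spectral_series.shape, vi_patch_series.shape, full_series.shape)
```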

#### *3.5. Deep Learning Architectures*

The main deep learning classification models used in the study are 1D-CNN, LSTM, 2D-CNN, 3D-CNN, and ConvLSTM2D. The temporal, spectral, and spatial information of multi-temporal images can be learned and represented by different deep learning models corresponding to different types of samples. Both 1D-CNN and LSTM models can represent temporal features, and their inputs correspond to the 1D and 2D samples (Figure 3a,b). 1D-CNN (Conv1D) models acquire the temporal patterns of sequence data through 1D convolution: Conv1D layers learn local features when stacked in a shallow network, whereas a deeper network synthesizes more pattern features within a larger receptive field. The ability of LSTM models to represent sequence patterns at different temporal frequencies is advantageous for analyzing the temporal characteristics within a crop growing season. 3D time-series samples (Figure 3c) are used as 2D-CNN input, and the Conv2D layer captures crop temporal and spatial variations through convolution over the spatial domain and over the time sequence of the multi-temporal images. The 3D-CNN convolves multi-temporal images along different dimensions and represents shallow and deep temporal, spatial, and spectral features of crops by stacking convolutional layers (Conv3D). Like LSTM, ConvLSTM2D is sensitive to temporal patterns, and the convolutional operations inside the ConvLSTM2D cell efficiently capture spatial information. This structure (ConvLSTM2D) learns and represents temporal, spectral, and spatial information similarly to the 3D-CNN models. Both 3D-CNN and ConvLSTM2D models use 4D time-series samples (Figure 3d) as model input. Figure 4 shows the different network architectures.

**Figure 4.** Architectures of (**a**) LSTM, (**b**) 1D-CNN, (**c**) 2D-CNN, (**d**) 3D-CNN, and (**e**) ConvLSTM2D.

Because of the versatility and complexity of deep learning architectures, no standard procedure exists for finding the optimal combination of hyperparameters and associated layers [18,19]. As a result, an extremely large number of potential network architectures must be considered, making it impossible to try them all. In this paper, the hyperparameter settings and model optimization are based on strategies from the literature [8,9]. The hyperparameters of the deep learning models include the type and number of hidden layers and the number of neurons in each layer. The candidate layer channel widths are 16, 32, 64, 128, and 256, and the candidate sample (patch) sizes are 3, 5, 7, and 9. The learning rate is 0.01 or 0.05. The length of the time series is 13. The convolution kernel width is 3 [26,27]. Pooling layers are fixed as max-pooling with a window size of 2. Dropout, applied with probabilities of 0.3, 0.5, and 0.8, is a regularization technique that randomly drops neurons in a layer during training to prevent the output of the layer from relying on only a few neurons. Each model contains two fully connected layers at the output end, and the last layer contains five neurons corresponding to the probabilities of the five classes.

The hyper-parameters are selected and determined step-by-step based on numerous training experiments, and each deep learning model (Figure 4) is determined by stepwise optimization and adjustment [9]. A large number of training experiments showed that 400 epochs are sufficient for model training. All deep learning architectures are trained by the backpropagation algorithm, with stochastic gradient descent used as the optimizer. The parameters of the stochastic gradient descent are decay = 10<sup>−5</sup> and momentum = 0.99. The sample (patch) size of the architectures is 9. The learning rate and batch size are 0.01 and 32, respectively. The dropout probability in LSTM is 0.8. Binary cross entropy serves as the loss function. The deep learning models were built using the Keras library and TensorFlow. Finally, the confusion matrix and kappa coefficient from Scikit-learn are used as metrics for evaluating the accuracy of crop classification. The calculation of VIs and the construction of time-series data are implemented in Python.
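As a rough illustration of how such a network can be assembled with Keras and TensorFlow, the sketch below builds a small ConvLSTM2D classifier using several of the reported settings (13 time steps, a 9 × 9 patch, 10 bands, SGD with learning rate 0.01 and momentum 0.99, dropout, five output classes); the layer widths, the two-layer stacking, and the use of categorical cross entropy are assumptions and do not reproduce the authors' tuned architectures.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_convlstm2d(time_steps=13, size=9, bands=10, n_classes=5):
    """Minimal ConvLSTM2D classifier for (time, height, width, band) samples."""
    model = models.Sequential([
        layers.Input(shape=(time_steps, size, size, bands)),
        layers.ConvLSTM2D(64, kernel_size=3, padding="same", return_sequences=True),
        layers.ConvLSTM2D(128, kernel_size=3, padding="same", return_sequences=False),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),  # five crop classes
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.99),
        # The paper reports binary cross entropy; categorical cross entropy is the
        # usual choice for a five-class softmax output and is used here instead.
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# model = build_convlstm2d()
# model.fit(x_train, y_train, epochs=400, batch_size=32, validation_data=(x_val, y_val))
```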

#### *3.6. Experiment Design*

The multi-temporal images are divided into different experimental groups based on the multiple sample types presented in Section 3.2, and the different deep learning models are used to classify the crops based on the multi-temporal images. Additionally, random forest is used as a benchmark model in E1, E2, E3, and E6 (see Table 4 for details). The notation B2348 (Table 4) refers to the four corresponding spectral bands in Table 1, and the same convention applies to the other feature combinations (Table 4).



E1 and E2 are time-series VI datasets with temporal features constructed from a single VI. E3–E6 are time-series datasets constructed from different spectral-band combinations of the multi-temporal images. E3 is a conventional spectral combination of red–green–blue and near-infrared bands. E4 and E5 add shortwave infrared (SWIR) and red-edge spectral bands to E3, respectively. E6 contains the 10 spectral bands of Sentinel-2 images. 1D-CNN and LSTM models are used for crop classification with the different spectral combinations and to analyze how multi-spectral and temporal information affect classification accuracy. E7 and E8 are used to classify crops with a 2D-CNN model, and the comparison with E1 and E2 is designed to quantify the contribution of spatial information in multi-temporal crop classification. E9 and E10 are used to classify crops with 3D-CNN and ConvLSTM2D models; E9 uses conventional spectral bands as input and E10 uses the 10 spectral bands of Sentinel-2 images. The comparison and analysis of crop classification with the different experimental groups show how temporal, spectral, and spatial information affect classification accuracy.

#### **4. Results**

The accuracy of crop classification via multi-temporal images mainly depends on three factors: time-series data construction, feature extraction, and the classification method. Our experiments verify the contributions of the time-series data and the deep learning models. Various time-series datasets are constructed based on the strategy presented in Section 3.2 and fed into the deep learning architectures (Figure 4) of Section 3.5 for the different experiments. The classification results and accuracies are given in the subsequent sections.
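Accuracy assessment relies on the confusion matrix and kappa coefficient from Scikit-learn, as stated in Section 3.5; a minimal sketch of that evaluation step is given below, using placeholder labels and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score, accuracy_score

# Placeholder reference labels and model predictions for the five classes
y_true = np.random.randint(0, 5, size=300)
y_pred = np.random.randint(0, 5, size=300)

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Overall accuracy:", accuracy_score(y_true, y_pred))
print("Kappa coefficient:", cohen_kappa_score(y_true, y_pred))
```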

#### *4.1. Classification Based on VI Time Series*

E1 and E2 in Figure 5 and Table 5 show the results of time-series crop classification based on NDVI and EVI. The classification accuracies produced by the 1D-CNN (Figure 4b) and LSTM (Figure 4a) models for E1 and E2 exceed 92%, and the kappa coefficient is greater than 0.9. The highest overall accuracy (OA) for E2 (LSTM) is close to 94%. Compared with random forest, deep learning models based on 1D-CNN and LSTM have higher accuracy (Table 5) and better performance in local regions (Figure 5). These results show that the 1D-CNN and LSTM models constructed herein are suitable for multi-temporal crop classification based on VI. Compared with E1, the OA for E2 increases by 0.26% and 0.69% for the 1D-CNN and LSTM models, respectively. This reflects the variability of different VIs and the similarity of time-series VI for crop classification. Compared with the 1D-CNN model, the LSTM model is more accurate; the OA improves by 0.75% and 1.18% for E1 and E2, respectively. These results show that both the LSTM and 1D-CNN models can capture temporal features, although the LSTM model is more accurate.

**Figure 5.** Crop classification results based on VI time-series (see red boxes for more detail).


**Table 5.** Classification accuracy produced by various models with VI time series.

Differences in architecture also affect classification accuracy. Compared with the other results in Figure 5, the RF-based results (Figure 5a,e) are locally worse, whereas almost no salt-and-pepper noise appears in Figure 5c,h. Compared with E1 and E2, the accuracy of E7 and E8 improves by 0.82% to 2.24%, and the improvement over RF exceeds 3.5%. E7 and E8, classified by the 2D-CNN model (Figure 4c), produce a favorable overall classification accuracy above 94.7% and a kappa coefficient of 0.934, which is attributed to the effective learning and representation of temporal and spatial information in the patch-based time-series VI data by the 2D-CNN.

Figure 5 and Table 5 also show that the classification results based on deep learning outperform those of random forest. However, the misclassification of crop types in Figure 5 indicates that further optimization is still needed. For the same model, there is no significant accuracy difference between E1 and E2, which indicates that it is difficult to improve accuracy solely with time-series data (temporal features) constructed from a single VI. However, the addition of spatial information not only improves crop classification accuracy but also eliminates salt-and-pepper noise. In addition, the 1D-CNN and LSTM architectures limit the possibility of exploiting spatial information in multi-temporal crop classification, whereas the 2D-CNN model produces more accurate crop classification based on single-VI time-series data.

#### *4.2. Classification Based on Multi-Spectral Time Series*

Figure 6 and Table 6 show the classification results of E3–E6 based on the time-series data constructed from the multi-spectral, multi-temporal images. For E3–E6, the crop classification accuracy of the 1D-CNN model is slightly lower than, but very close to, that of the LSTM model. Therefore, hereinafter, we consider only the crop classification results based on LSTM.

The input data in E3–E6 have both multi-spectral and multi-temporal features, differing only in the number of spectral bands, as explained in Section 3.4. Table 6 shows that the accuracy of the RF-based classification is lower than that of the deep learning models, and Figure 6 also shows that the deep learning results are better in local areas. The OA of E3–E6 is 95.31%, 96.72%, 96.37%, and 96.94%, respectively. Compared with E3, the addition of spectral bands, especially red-edge bands (E5) or SWIR bands (E4), improves the crop classification accuracy, with SWIR bands contributing slightly more than red-edge bands. Using the LSTM model, E6 remains the most accurate configuration, with the crop classification accuracy improving by 1.63% with respect to E3. This indicates that the advantage offered by the number of spectral bands in multi-spectral images cannot be neglected. With the addition of spectral bands, salt-and-pepper noise is reduced to varying degrees, with the least salt-and-pepper noise coinciding with the most accurate crop classification (Figure 6f), indicating that the salt-and-pepper phenomenon is weakened but hardly eliminated by using multi-spectral bands alone. Combined with the presentation in Section 4.1, these results further demonstrate how spatial information affects multi-temporal crop classification.

**Figure 6.** Crop classification results based on multi-spectral time series.

Furthermore, the addition of different spectral bands in E3–E6 increases the diversity of the input classification data. Within the same experimental group, the accuracy difference between 1D-CNN and LSTM varies from 0.1% to 0.44%, with the minimum difference of 0.1% occurring in E6. However, across the different experimental groups, the accuracy difference for the same model varies from 1.06% to 1.95%, with E6 showing an accuracy improvement of nearly 2% compared to E3. Between E9 and E3, the spatial information causes differences in the input data. The accuracy difference between different deep learning models with the same input data is small, ranging from 0.21% to 0.42%. In contrast, the accuracy difference between the same model with different input data is larger, ranging from 1.25% to 1.88%. This indicates that increasing the diversity of the input data is more important for improving crop classification accuracy than choosing between these deep learning models.


**Table 6.** Classification accuracy produced by various models and multi-spectral time-series data.

Figure 7 and Table 6 present the classification results of E9 and E10 using the 3D-CNN (Figure 4d) and ConvLSTM2D (Figure 4e) models. The OA of 3D-CNN in E9 and E10 was 96.77% and 96.56%, respectively, with kappa coefficients of 0.960 and 0.957. The OA of ConvLSTM2D in E9 and E10 was 97.43% and 97.25%, respectively, with kappa coefficients of 0.968 and 0.966. The accuracy is slightly greater when using the 3D-CNN model than when using the ConvLSTM2D model. The use of the 3D-CNN model on E10 produces the greatest crop classification accuracy of 97.43%, which translates into an OA improved by 3.69%, 2.67%, 0.49%, and 4.93% with respect to E2 (LSTM), E8 (2D-CNN), E6 (LSTM), and E1 (1D-CNN), respectively. Compared with the E6 (LSTM), the salt-and-pepper noise is eliminated in E9 and E10 (Figure 7b,d), although the improvement in accuracy is not obvious. E10 produces more accurate results than E9 because it contains more spectral bands in the input data.

The classification results of the different experiments verify the feasibility of the models constructed herein (Figure 4) for multi-temporal crop classification. The comparison of the results of the different experiments shows that both the construction of the time-series data and that of the classification model influence the crop classification accuracy. The LSTM model produces more accurate crop classification results than the 1D-CNN model. However, when using time-series data constructed from VIs, the 2D-CNN model produces more accurate results than the 1D-CNN and LSTM models after the elimination of the salt-and-pepper noise. When using time-series data constructed by stacking spectral bands, increasing the number of bands in the input data improves the crop classification accuracy while somewhat reducing the salt-and-pepper noise. Additionally, the LSTM model again produces slightly more accurate crop classifications than the 1D-CNN model, which indicates that the LSTM model is better able to capture temporal features.

E10 treated by the 3D-CNN and ConvLSTM2D models (Figure 4) produces the most accurate crop classification of all experiments. In addition, the architectures of the 3D-CNN and ConvLSTM2D models lead to better learning and representation of multi-temporal crop features, making these models more suitable for crop classification from multi-temporal images.

**Figure 7.** Crop classification results based on temporal, spectral, and spatial information.

Combined with the previous analysis of classification accuracy, VI time-series data that contain only temporal information offer only limited crop classification accuracy. The addition of multi-spectral data on top of the temporal information improves crop classification accuracy, and the salt-and-pepper noise is more easily alleviated as the number of spectral bands increases. As the number of input features increases, the contribution of spatial information to improving classification accuracy decreases. However, the elimination of salt-and-pepper noise through the use of spatial information remains a clear advantage in crop mapping. Therefore, making full use of the temporal, spectral, and spatial information is the more feasible strategy for multi-temporal crop classification. A deep learning architecture fed with 4D data built from multi-temporal images is thus the best choice for accurate crop classification based on multi-temporal images.

#### **5. Discussion**

#### *5.1. Analysis of Time-Series Profile*

Figure 8 shows the temporal profiles of crops produced by the VIs and spectral bands. The buffer areas of the crop profiles overlap throughout the growing season, despite differences in average reflectance or VI values. In the middle of the growing season (~DOY 200–220), the spectral overlap between crops becomes smaller than in the early or late growing season. During this period, the temporal curves of the crops with one standard deviation are more stable and distinguishable, which indicates that this feature should be useful for differentiating between crops. In addition, such temporal windows are often used for single-date crop classification [28]. However, the similarity and overlap of the profiles over the whole growing season make it difficult to distinguish crops such as corn and soybeans based solely on single images [1,2].

**Figure 8.** Time-series spectral bands and vegetation indices aggregated over crop fields. The buffers indicate one standard deviation calculated from the fields.
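A sketch of how per-crop mean and one-standard-deviation profiles such as those in Figure 8 can be computed from labeled time-series samples is shown below; the array layout and sample counts are placeholders, and field averaging and plotting details are omitted.

```python
import numpy as np

# Placeholder: 500 pixel time series (13 dates) with crop labels 0-4
series = np.random.rand(500, 13)
labels = np.random.randint(0, 5, size=500)

profiles = {}
for crop in np.unique(labels):
    crop_series = series[labels == crop]
    mean = crop_series.mean(axis=0)   # average profile over the season
    std = crop_series.std(axis=0)     # one-standard-deviation buffer
    profiles[crop] = (mean - std, mean, mean + std)
```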

The differences in the time-series curves (Figure 8) between crops in different spectral ranges and time periods make it possible to distinguish between crops [29]. For example, the gap in B8 (Figure 8) during the middle of the growing season (≈DOY 180–220) makes it possible to distinguish between spring wheat and sugar beets. Figure 8 shows that almost no spectral overlap occurs between corn and soybeans in B11 and B12 during the period ≈DOY 170–200. The gap observed in the profiles of sugar beets and the other crops in bands B6–B8 and B8A, as shown in Figure 8, occurs during two periods, around DOY 180–220 and DOY 250–270. Spring wheat can be directly distinguished from the profiles in B2–B5 (Figure 8) around DOY 225 and in B11 and B12 in the period DOY 210–240. Corn and soybeans can be differentiated with greater probability in the period DOY 170–200 in B11 and B12. In addition, the overlap in the temporal profile based on the NDVI is similar to that of the other spectra in Figure 8. The profiles of corn and soybeans almost overlap over the entire growing season, which explains the difficulty of distinguishing between these two crops [3,4]. The profiles of sugar beets and spring wheat clearly differ between DOY 170 and 260.

As previously mentioned, time-series images based on a single VI or band are insufficient to accurately distinguish between different crops. However, different crops exhibit spectral differences in the time-series curves of each spectral band (Figure 8), indicating the potential of each spectral band to distinguish between different crops. Better exploiting the advantages of multi-spectral bands thus offers greater potential to improve the accuracy of crop classification [30]. The addition of different types of spectral bands, such as red-edge and SWIR, reinforces this conclusion in the classification experiments [9].

#### *5.2. Effects of Temporal, Spectral, and Spatial Feature*

The effects of temporal, spectral, and spatial information on crop classification are revealed by the different time-series data. The crop classification results for the different time-series classification data are shown in Figures 5–7 and Tables 5 and 6. Using only temporal features may not be sufficient for accurate crop classification because of salt-and-pepper noise (Figures 5 and 6), which can affect pixel-based classification. Fully exploiting the abundant spectral and spatial information in multi-temporal images can be challenging when using only VIs, but it provides more possibilities for improving accuracy. References [5,31] pointed out that spatial features such as texture can lead to good classification performance, and a similar result occurs for the 2D-CNN classification (Figure 5). In addition, based on the analysis in the previous sections, the contribution to accuracy of spatial information such as texture [9,32] decreases as the number of input features increases. Moreover, the spatial information contributes significantly to the classification accuracy when the feature input is a single VI. Reference [8] also suggested that more information-dense data are required to improve crop classification accuracy based on multi-temporal images. The diversity of information and the differences in the time-series data depicted in Figure 8 provide more possibilities for accurate classification and can alleviate the salt-and-pepper phenomenon. Nevertheless, spatial information remains a vital ingredient for eliminating salt-and-pepper noise.

#### *5.3. Comparison of Deep Learning Models*

The temporal dependencies in multi-temporal images are long term and complex, and crops have unique temporal, spectral, and spatial features (Figure 8). Sufficient model complexity and automated feature learning and representation satisfy the data-processing needs of models for multi-temporal crop classification [9,12]. In contrast to the previous finding that 1D-CNN accuracy is higher than that of LSTM [9], increasing the number of spectral bands in this work brings the accuracy of the 1D-CNN close to that of the LSTM. This indicates that the input features and the application scenario (more crop types) may also affect classification accuracy. The 2D-CNN models are limited by their structure, meaning that they can only accept time-series data constructed from a single VI or spectral band as input. This prevents 2D-CNN models from exploiting multi-spectral information. The analysis in Section 4 also shows that 2D-CNN models are less accurate than 1D-CNN and LSTM models that use multiple spectral bands. In contrast with 2D-CNN models, both 3D-CNN and ConvLSTM2D models require 4D data that fit the temporal, spectral, and spatial features perfectly. The classification results (Figures 7 and 9) of the 3D-CNN and ConvLSTM2D models are also significantly more accurate and stable than those of the other comparative models. Reference [30] also pointed out that models such as 3D-CNN should be considered for crop classification from multi-temporal images.

**Figure 9.** The OA of different deep learning models.

As described in Section 3.5, each model is trained extensively to achieve the best classification results. Therefore, the parameters of deep learning models in this work will likely need to be adjusted to achieve satisfactory accuracy for other classification tasks. Additionally, numerous model training experiments are necessary in this process.

#### *5.4. Potential of 3D-CNN and ConvLSTM2D for Crop Classification from Multi-Temporal Images*

Crop classification from multi-temporal RS images often suffers a time lag due to data acquisition [5,6]. However, time-series data can alleviate the problem whereby different objects share the same spectrum and the same objects exhibit different spectra against the background of relatively complex crop cultivation. The previous analyses also revealed that fully exploiting the temporal, spectral, and spatial information in multi-temporal images should be a major avenue for improving classification accuracy. 3D-CNN and ConvLSTM2D models can integrate multi-temporal information and have structural advantages not found in other models such as 2D-CNN and SVM [11,19]. The best classification accuracies are provided by the 3D-CNN and ConvLSTM2D models and exceed 97% (Table 6). Figure 10 shows the strong correlation between the results obtained herein and the CDL for the area ratios of the different crops. It also shows the potential of crop classification based on multi-temporal images for practical applications.

**Figure 10.** Correlation of crop-area ratios. ((**a**–**d**) correspond to the four experiments indicated by the vertical labels. The scatter points show the fraction of each crop over the study area. The red line reflects the consistency of crop area between the classification results and the CDL.)
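A minimal sketch of the consistency check behind Figure 10, comparing the fraction of each crop in a classification map with the corresponding CDL fraction; the class maps used here are hypothetical placeholders.

```python
import numpy as np

# Hypothetical class maps (values 0-4) for the classification result and the CDL
classified = np.random.randint(0, 5, size=(1000, 1000))
cdl = np.random.randint(0, 5, size=(1000, 1000))

classes = np.arange(5)
frac_classified = np.array([(classified == c).mean() for c in classes])
frac_cdl = np.array([(cdl == c).mean() for c in classes])

# Pearson correlation between crop-area fractions of the two maps
r = np.corrcoef(frac_classified, frac_cdl)[0, 1]
print("Correlation of crop-area ratios:", r)
```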

Different network structures in deep learning models, such as inception [33], dropout [8], and transformer [34] modules, all enhance the feature learning and representation capabilities of a network. The deep learning models in this work (Figure 4) are constructed by simply stacking modules, so they lack a special design for multi-temporal images and cannot treat scale effects [35] in the images. In addition, information redundancy (Figure 8) due to high inter-band similarity must be considered. Both the 3D-CNN and ConvLSTM2D architectures have inherent advantages for processing multi-temporal images. Although ConvLSTM2D has seen fewer applications in multi-temporal image crop classification than 3D-CNN [14], the results of this study show that this model approaches the classification capability of the 3D-CNN. References [13,36] pointed out that 3D-CNN is not well suited to establishing long-term dependencies in time-series data because its convolutions are computed locally, whereas ConvLSTM2D combines the sequence-processing capability of LSTM with the structure of a CNN, which facilitates the addition of special structures and modules, so it can be further exploited to classify crops from multi-temporal images.

#### **6. Conclusions**

This paper constructs various time-series datasets based on Sentinel-2 multi-temporal images by VI or spectral stacking, and develops deep learning models with different structures for classifying crops from multi-temporal images. The results lead to the following conclusions:

(1) Greater data diversity (temporal, spectral, and spatial information) is effective in improving crop classification accuracy. Temporal features alone provide only a limited improvement in the accuracy of crop classification from multi-temporal images. As more spectral information is added, the accuracy can be further improved and the impact of salt-and-pepper noise can be alleviated. The inclusion of spatial information can eliminate salt-and-pepper noise, although its contribution to accuracy decreases as the number of input features increases.


In this paper, a relatively small area and simple crop types are used to study the application of deep learning to multi-temporal crop classification. In future research, crop classification based on deep learning still needs to be studied for large-scale areas and complex planting systems, such as crop rotation and more crop types. In addition, the impact of clouds on image acquisition is difficult to avoid. Synthetic aperture radar (SAR) acquisition, by contrast, is not affected by clouds and can also increase the diversity of the classification data. Therefore, research into crop classification using synergistic SAR and optical images with different acquisition frequencies will be carried out. Additionally, the ConvLSTM model will be used as the classification model to explore its potential for multi-source image crop classification.

**Author Contributions:** Conceptualization, Q.L. and J.T.; Methodology, Q.L. and J.T.; Software, Q.L.; Validation, Q.L.; Formal analysis, Q.L. and J.T.; Writing—original draft preparation, Q.L.; Writing review and editing, Q.L., J.T. and Q.T.; Visualization, Q.L.; Project administration, Q.T.; Funding acquisition, J.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant number: 42101321 and 41771370) and the Open Fund of State Key Laboratory of Remote Sensing Science (grant number: OFSLRSS202119).

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the author.

**Acknowledgments:** The authors acknowledge the support provided by their respective institutions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

*Article* **Real-Time Detection of Apple Leaf Diseases in Natural Scenes Based on YOLOv5**

**Huishan Li, Lei Shi, Siwen Fang and Fei Yin \***

College of Information and Management Science, Henan Agricultural University, Zhengzhou 450046, China **\*** Correspondence: yin.fei@henau.edu.cn

**Abstract:** Aiming at the problem of accurately locating and identifying multi-scale and differently shaped apple leaf diseases against complex backgrounds in natural scenes, this study proposed an apple leaf disease detection method based on an improved YOLOv5s model. Firstly, the model utilized the bidirectional feature pyramid network (BiFPN) to achieve efficient multi-scale feature fusion. Then, the transformer and convolutional block attention module (CBAM) attention mechanisms were added to reduce interference from invalid background information, improving the expression of disease characteristics and increasing the accuracy and recall of the model. Experimental results showed that the proposed BTC-YOLOv5s model (with a model size of 15.8M) can effectively detect four types of apple leaf diseases in natural scenes, with a mean average precision (mAP) of 84.3%. With an octa-core CPU, the model could process 8.7 leaf images per second on average. Compared with the classic detection models SSD, Faster R-CNN, YOLOv4-tiny, and YOLOx, the mAP of the proposed model was higher by 12.74%, 48.84%, 24.44%, and 4.2%, respectively, and it offered higher detection accuracy and faster detection speed. Furthermore, the proposed model demonstrated strong robustness, with mAP exceeding 80% under strong noise conditions such as bright light, dim light, and blurred images. In conclusion, the new BTC-YOLOv5s was found to be lightweight, accurate, and efficient, making it suitable for application on mobile devices. The proposed method could provide technical support for the early intervention and treatment of apple leaf diseases.

**Keywords:** smart agriculture; detection of apple leaf diseases; YOLOv5; transformer; CBAM

#### **1. Introduction**

As one of the top four most popular fruits in the world, the apple is highly nutritious and provides significant medicinal value [1]. In China, apple production has expanded, making the country the world's largest apple producer. However, a variety of diseases hamper the healthy growth of apple trees, seriously affecting the quality and yield of apples and causing significant economic losses. According to statistics, there are approximately 200 types of apple diseases, most of which occur on apple leaves. Therefore, to ensure the healthy development of the apple planting industry, accurate and efficient leaf disease identification and control measures are needed [2].

In traditional disease identification techniques, fruit farmers and experts rely on visual examination based on their experience, a method which is inefficient and highly subjective. With the advance of computer and information technology, image recognition technology has gradually been applied in agriculture. Many researchers have applied machine vision algorithms to extract features such as color, shape, and texture from disease images and input them into specific classifiers to accomplish plant disease recognition tasks [3]. Zhang et al. [4] processed apple disease images using HSI, YUV, and gray models; then, the authors extracted features using genetic algorithms and correlation-based feature selection, and ultimately discriminated apple powdery mildew, mosaic, and rust diseases using an SVM classifier with an identification accuracy of more than 90%. However, the complex image background and the feature extraction, dominated by strong experience, make the labor and time costs much higher and make the system difficult to promote and popularize.

**Citation:** Li, H.; Shi, L.; Fang, S.; Yin, F. Real-Time Detection of Apple Leaf Diseases in Natural Scenes Based on YOLOv5. *Agriculture* **2023**, *13*, 878. https://doi.org/10.3390/ agriculture13040878

Academic Editor: Filipe Neves Dos Santos

Received: 15 March 2023 Revised: 10 April 2023 Accepted: 14 April 2023 Published: 15 April 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


In recent years, deep learning convolutional neural networks have been widely used in intelligent agricultural detection, offering faster detection speeds and higher accuracy than traditional machine vision techniques [5]. There are two types of target detection models. The first is the two-stage detection algorithm, represented by R-CNN [6] and Faster R-CNN [7]. Xie et al. [8] used an improved Faster R-CNN detection model for the real-time detection of grape leaf diseases, introducing three modules (Inception v1, Inception-ResNet-v2, and SE) into the model and achieving a mean average precision (mAP) of 81.1%. Deng et al. [9] proposed a method for the large-scale detection and localization of pine wilt disease using unmanned remote sensing and artificial intelligence technology, and applied a series of optimizations to improve the detection accuracy to 89.1%. Zhang et al. [10] designed a Faster R-CNN (MF3R-CNN) model with multiple feature fusion for soybean leaf disease detection, achieving an average accuracy of 83.34%. Wang et al. [11] used the RFCN ResNet101 model to detect potato surface defects and achieved an accuracy of 95.6%. These two-stage detection models are capable of identifying crop diseases, but their large network models and slow detection speeds make them difficult to apply in the real planting industry.

The other type of target detection algorithm is the one-stage algorithm, represented by the SSD [12] and YOLO [13–16] series. Unlike the two-stage detection algorithms, it does not require the generation of candidate frames: by converting the boundary problem into a regression problem, features extracted from the network are used directly to predict the location and class of lesions. Due to its high accuracy, fast speed, short training time, and low computational requirements, it is more suitable for agricultural applications. Wang et al. [17] used the SSD-MobileNet V2 model for the detection of scratches and cracks on the surface of litchi, which ultimately achieved 91.81% mAP at 102 frames per second (FPS). In the experiments of Chang-Hwan et al. [18], a new attention-enhanced YOLO model was proposed for identifying and detecting plant foliar diseases. Li et al. [19] improved the CSP, feature pyramid network (FPN), and non-maximum suppression (NMS) modules in YOLOv5 to detect five vegetable diseases and obtained 93.1% mAP, effectively reducing the missed and false detections caused by complex backgrounds. In complex orchard environments, Jiang et al. [20] proposed an improved YOLOX model to detect sweet cherry fruit ripeness; with the improved model, mAP and recall improved by 4.12% and 4.6%, respectively, which effectively addressed the interference caused by fruit overlaps and shaded branches and leaves. Li et al. [21] used an improved YOLOv5n model to detect cucumber diseases in natural scenes and achieved higher detection accuracy and speed. While the development of intelligent crop disease detection using one-stage detection algorithms has matured, less research has been carried out on apple leaf disease detection, and small datasets and simple image backgrounds pose problems for most existing studies. Consequently, it is crucial to develop an apple leaf disease detection model with high recognition accuracy and fast detection speed for mobile devices with limited computing power.

Considering the complex planting environment in apple orchards and the various shapes of lesions, this study proposed the use of an improved target detection algorithm based on YOLOv5s. The proposed algorithm aimed to reduce false detections caused by multi-scale lesions, dense lesions, and inconspicuous features in apple leaf disease detection tasks. As a result, the accuracy and efficiency of the model could be enhanced to provide essential technical support for apple leaf disease identification and intelligent orchard management.

#### **2. Materials and Methods**

*2.1. Materials*

2.1.1. Data Acquisition and Annotation

In this study, three datasets were used to train and evaluate the proposed model: the Plant Pathology Challenge 2020 (FGVC7) [22] dataset, the Plant Pathology Challenge 2021 (FGVC8) [23] dataset, and the PlantDoc [24] dataset.

FGVC7 and FGVC8 [22,23] consist of apple leaf disease images used in the Plant Pathology Fine-Grained Visual Categorization competition hosted by Kaggle. The images were captured by Cornell AgriTech using Canon Rebel T5i DSLR and smartphones, with a resolution of 4000 × 2672 pixels for each image. There are four kinds of apple leaf diseases, namely rust, frogeye leaf spot, powdery mildew, and scab. These diseases occur frequently and cause significant losses in the quality and yield of apples. Sample images of the dataset are shown in Figure 1.

**Figure 1.** FGVC7 and FGVC8 disease images. (**a**) Frogeye leaf spot; (**b**) Rust; (**c**) Scab; (**d**) Powdery mildew.

PlantDoc [24] is a dataset of non-laboratory images constructed by Davinder Singh et al. in 2020 for visual plant disease detection. It contains 2598 images of plant diseases in natural scenes, involving 13 species of plants and as many as 17 diseases. Most of the images in PlantDoc have low resolution, large noise, and an insufficient number of samples, making detection more difficult. In this study, apple rust and scab images were used to enhance and validate the generalization of the proposed model. Examples of disease images are shown in Figure 2.

**Figure 2.** PlantDoc disease images. (**a**) Rust; (**b**) Scab.

From the collected datasets, we selected (1) images with light intensity varying with the time of day, (2) images captured from different shooting angles, (3) images with different disease intensities, and (4) images from different disease stages to ensure the richness and diversity of the dataset. In total, 2099 apple leaf disease images were selected. LabelImg software was used to annotate the images, recording the disease type and the center coordinates, width, and height of each disease spot. In total, we annotated 10,727 lesion instances, as detailed in Table 1. The labeled dataset was randomly divided into training and test sets at a ratio of 8:2. This dataset, called ALDD (apple leaf disease data), was used to train and test the model.



#### 2.1.2. Data Enhancement

An actual apple orchard in a complex environment contains many disturbances, and the currently selected data are far from sufficient. To enrich the image dataset, mosaic image enhancement [16] and online data enhancement were chosen to expand the dataset. Mosaic image enhancement involves the random selection of four images from the training set, which are combined into one image after rotation, scaling, and hue adjustment. This approach not only enriches the image background and increases the number of instances, but also indirectly boosts the batch size, which accelerates model training and helps improve small-target detection performance. Online augmentation is the use of data augmentation during model training, which keeps the sample size invariant while maintaining the diversity of the overall sample and improves the model's robustness by continuously expanding the sample space. It mainly includes alterations to hue, saturation, and brightness, as well as translation, rotation, flipping, and other operations. The total size of the dataset is constant; however, the data input to each epoch change, which is more conducive to fast convergence of the model. Examples of enhanced images are shown in Figure 3.

**Figure 3.** Original and enhanced images. (**a**) Original; (**b**) Flip horizontal; (**c**) Rotation transformation; (**d**) Hue enhancement; (**e**) Saturation enhancement; (**f**) Mosaic enhancement.
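To make the online augmentation step concrete, the following is a minimal sketch of such a pipeline assuming torchvision transforms; the specific operations and parameter ranges are illustrative and are not taken from the authors' implementation.

```python
# Minimal sketch of an online augmentation pipeline (illustrative parameters,
# not the authors' exact settings). Applied inside a Dataset's __getitem__,
# it produces a different random variant of every image in each epoch while
# the total number of images stays constant.
import torchvision.transforms as T

online_augment = T.Compose([
    T.ColorJitter(brightness=0.4, saturation=0.7, hue=0.015),  # hue/saturation/brightness jitter
    T.RandomAffine(degrees=10, translate=(0.1, 0.1)),          # rotation and translation
    T.RandomHorizontalFlip(p=0.5),                             # horizontal flip
    T.ToTensor(),
])
```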

#### *2.2. Methods*

#### 2.2.1. YOLOv5s Model

Depending on the network depth and feature map width, YOLOv5 can be divided into YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x [25]. As the depth and width increase, the number of network layers grows and the structure becomes more complex. To meet the requirements of lightweight deployment and real-time detection, reduce the storage space occupied by the model, and improve identification speed, YOLOv5s was selected as the baseline model in this study.

The YOLOv5s model was composed of four parts: input, backbone, neck, and prediction. The input section included mosaic data enhancement, adaptive calculation of the anchor boxes, and adaptive scaling of images. The backbone module performed feature extraction and consisted of four parts: focus, CBS, C3, and spatial pyramid pooling (SPP). There were two types of C3 [26] modules in YOLOv5s, used in the backbone and neck, as shown in Figure 4. The first used residual units at the backbone layer, while the second did not. SPP [27] performed maximum pooling of the feature maps with convolutional kernels of different sizes in order to fuse multiple receptive fields and generate semantic information. The neck layer used a combination of feature pyramid networks (FPN) [28] and path aggregation networks (PANet) [29] to fuse the image features. The prediction head included three detection layers, corresponding to 20 × 20, 40 × 40, and 80 × 80 feature maps for detecting large, medium, and small targets, respectively. Finally, the distance between the predicted boxes and the true boxes was measured using the complete intersection over union (CIOU) [30] loss function, and non-maximum suppression (NMS) was applied to remove redundant boxes and retain the detection boxes with the highest confidence. The YOLOv5s network model is shown in Figure 4.

**Figure 4.** YOLOv5s method architecture diagram.
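As a brief illustration of the confidence filtering and NMS step described above, the following sketch uses torchvision's nms operator; the thresholds and the per-class offset trick are illustrative, not the exact YOLOv5s post-processing code.

```python
# Illustrative post-processing sketch: filter low-confidence predictions,
# then apply class-aware NMS so only the highest-confidence box is kept
# among overlapping detections of the same class.
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, labels, conf_thres=0.25, iou_thres=0.45):
    """boxes: (N, 4) in xyxy format; scores: (N,); labels: (N,) class indices."""
    keep = scores > conf_thres
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    # Offsetting boxes by class index makes NMS effectively per-class.
    offsets = labels.float().unsqueeze(1) * 4096.0
    idx = nms(boxes + offsets, scores, iou_thres)
    return boxes[idx], scores[idx], labels[idx]
```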

#### 2.2.2. Bidirectional Feature Pyramid Network

The YOLOv5s combines FPN and PANet for multi-scale feature fusion, with FPN enhancing semantic information in a top-down fashion and PANet enhancing location information from the bottom up. This combination enhances the feature fusion capability of the neck layer. However, when input features at different resolutions are fused, they are simply summed, even though their contributions to the fused output are usually unequal. To address this problem, Tan et al. [31] developed the BiFPN based on efficient bidirectional cross-scale connections and weighted multi-scale feature fusion. The BiFPN introduced learnable weights in order to learn the importance of different input features, while top-down and bottom-up multi-scale feature fusion was applied iteratively. The structure of BiFPN is shown in Figure 5.

**Figure 5.** BiFPN network structure diagram, where (**a**) FPN introduces a top-down path to fuse multi-scale features from P3 to P6; (**b**) PANet adds an additional bottom-up path on top of the FPN; (**c**) BiFPN removes redundant nodes and adds additional connections on top of PANet.

The BiFPN removes nodes with only one input edge: such a node performs no feature fusion, so its contribution to a network that aims to fuse different features is minimal, and removing it simplifies the bidirectional network. Additionally, an extra edge is added between the input and output nodes at the same layer so that higher-level fused features are obtained through iterative stacking. The BiFPN also introduces a simple and efficient weighted feature fusion mechanism by adding learnable weights that assign different degrees of importance to feature maps of different resolutions. The formulas are shown in (1) and (2):

$$P\_i^{td} = Conv\left(\frac{w\_1 \cdot P\_i^{in} + w\_2 \cdot Resize\left(P\_{i+1}^{in}\right)}{w\_1 + w\_2 + \epsilon}\right) \tag{1}$$

$$P\_i^{out} = Conv\left(\frac{w\_1' \cdot P\_i^{in} + w\_2' \cdot P\_i^{td} + w\_3' \cdot Resize\left(P\_{i-1}^{out}\right)}{w\_1' + w\_2' + w\_3' + \epsilon}\right) \tag{2}$$

where $P\_i^{in}$ is the input feature of layer $i$, $P\_i^{td}$ is the intermediate feature on the top-down pathway of layer $i$, $P\_i^{out}$ is the output feature on the bottom-up pathway of layer $i$, $w$ denotes the learnable weights, $\epsilon = 0.0001$ is a small value added to avoid numerical instability, $Resize$ is a downsampling or upsampling operation, and $Conv$ is a convolution operation.
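The fast normalized fusion of Equation (1) can be sketched as a small PyTorch module; the convolution block and channel sizes below are placeholders rather than the exact BiFPN layers used in BTC-YOLOv5s.

```python
# Sketch of BiFPN-style weighted fusion for one node (cf. Eq. (1)):
# non-negative learnable weights balance inputs that have already been
# resized to a common resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, channels, num_inputs=2, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # learnable fusion weights
        self.eps = eps
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, feats):
        w = F.relu(self.w)  # keep weights non-negative, as in fast normalized fusion
        fused = sum(wi * f for wi, f in zip(w, feats)) / (w.sum() + self.eps)
        return self.conv(fused)
```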

The neck layer with BiFPN fused multi-scale features to provide stronger semantic information to the network, which helped to detect apple leaf diseases of different sizes and alleviated the network's inaccurate identification of overlapping and fuzzy targets.

#### 2.2.3. Transformer Encoder Block

Lesions on apple leaves were densely distributed. After mosaic data enhancement, the number of lesions and the amount of background information increased, which made it harder to accurately locate the diseased areas; to avoid this problem, the transformer [32] attention mechanism was added to the end of the backbone layer. The transformer module was employed to capture global contextual information and establish long-range dependencies between feature channels and disease targets. The transformer encoder module used a self-attention mechanism to improve the feature representation capability and performed excellently in highly dense scenarios [33]. The self-attention mechanism was designed based on the principles of human vision and allocated resources according to the importance of visual objects. It had a global receptive field, modeled long-range contextual information, captured rich global semantic information, and assigned different weights to different semantic information to make the network focus more on key information [34]. It was calculated as in (3) and contained three basic elements: query, key, and value, denoted by *Q*, *K*, and *V*, respectively.

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d\_k}}\right)V\tag{3}$$

where $d\_k$ is the dimension of the input feature map channel sequence; dividing by $\sqrt{d\_k}$ normalizes the dot products and avoids excessively large values that would destabilize the gradients.

Each transformer encoder is composed of a multi-head attention module and a feed-forward neural network. The structure of the multi-head attention mechanism is shown in Figure 6. It differs from single-head self-attention in that, instead of one set of *Q*, *K*, and *V* values, it uses multiple sets of *Q*, *K*, and *V* values, computes attention for each, and concatenates the resulting matrices. The different linear transformations project the features into different vector spaces, which helps the current encoding focus on the current pixels while acquiring contextual semantic information [35]. The multi-head attention mechanism enhances the ability to extract disease features by capturing long-distance dependencies without increasing computational complexity, improving the model's detection performance.
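The following is a conceptual sketch of a transformer encoder block (multi-head self-attention plus a feed-forward network) built from PyTorch's nn.MultiheadAttention; the token dimension and head count are illustrative, and the actual C3TR module in YOLOv5 differs in its surrounding convolutions.

```python
# Conceptual transformer encoder sketch: the feature map is flattened into a
# token sequence, multi-head self-attention (Q = K = V, Eq. (3) per head) is
# applied, and a feed-forward network refines each token.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):             # x: (batch, tokens, dim)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)  # residual connection around attention
        return self.norm2(x + self.ffn(x))

# Example: a 20 x 20 feature map with 256 channels becomes 400 tokens of dim 256.
tokens = torch.randn(1, 400, 256)
out = EncoderBlock(256)(tokens)       # shape (1, 400, 256)
```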

**Figure 6.** Structure of multi-headed attention mechanism.

#### 2.2.4. Convolutional Block Attention Module

Determining the disease species relies more on local information in the feature map, while lesion localization is more concerned with position information. This model therefore used the CBAM [36] attention mechanism in the improved YOLOv5s to weight features across space and channels and enhance the model's attention to local and spatial information.

As shown in Figure 7, the CBAM contained two sub-modules: the channel attention module (CAM) and the spatial attention module (SAM), which apply channel and spatial attention, respectively. The input feature map $F \in \mathbb{R}^{C \times H \times W}$ was first passed through the one-dimensional convolution operation $M\_c \in \mathbb{R}^{C \times 1 \times 1}$ of the CAM, and the convolution result was multiplied with the input features. The output of the CAM was then used as input to the two-dimensional convolution operation $M\_s \in \mathbb{R}^{1 \times H \times W}$ of the SAM, and the result was multiplied with the CAM output to obtain the final result. The calculation formulas are given in (4) and (5).

$$F' = M\_c(F) \otimes F \tag{4}$$

$$F'' = M\_s(F') \otimes F' \tag{5}$$

where $F$ denotes the input feature map, $M\_c$ denotes the one-dimensional convolution operation of the CAM, $M\_s$ denotes the two-dimensional convolution operation of the SAM, and ⊗ denotes element-wise multiplication.

**Figure 7.** Convolutional block attention module (CBAM).

The CAM in CBAM focused on the weights of different channels and multiplied each channel with its corresponding weight to increase attention to important channels. The feature map *F* of size H × W × C was average pooled and max pooled to obtain two 1 × 1 × C channel descriptors, which were then passed through a shared two-layer multi-layer perceptron (MLP). The two outputs were summed element by element, and a sigmoid activation function was applied to produce the final result. The calculation process is shown in Equation (6).

$$M\_c(F) = \sigma(MLP(AvgPool(F)) + MLP(MaxPool(F))) \tag{6}$$

As shown in Equation (7), the SAM was more concerned with the location information of the lesions. The CAM output was average pooled and max pooled along the channel dimension to obtain two H × W × 1 spatial maps. The final result was obtained by concatenating the two feature maps, followed by a 7 × 7 convolution operation and a sigmoid activation function.

$$M\_s(F) = \sigma\left(f^{7 \times 7}([AvgPool(F); MaxPool(F)])\right) \tag{7}$$
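A simplified CBAM sketch corresponding to Equations (4)–(7) is given below, assuming standard PyTorch layers; the reduction ratio and layer choices are illustrative rather than the exact configuration used in the improved model.

```python
# Simplified CBAM sketch: channel attention (CAM, Eq. (6)) uses a shared
# two-layer MLP over average- and max-pooled descriptors; spatial attention
# (SAM, Eq. (7)) applies a 7x7 convolution over channel-pooled maps.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        # Channel attention, then multiply with the input (Eq. (4)).
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))
        f = torch.sigmoid(avg + mx) * f
        # Spatial attention on the refined map, then multiply again (Eq. (5)).
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.spatial(pooled)) * f
```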

#### 2.2.5. BTC-YOLOv5s Detection Model

Building on the original advantages of the YOLOv5s model, this study proposed an improved BTC-YOLOv5s algorithm for detecting apple leaf diseases. While maintaining detection speed, it improved the accuracy of identifying apple leaf diseases in a complex environment. The proposed algorithm mainly improved three parts: the BiFPN, the transformer, and the CBAM attention mechanism. Firstly, the CBAM module was added in front of the SPP in the YOLOv5s backbone layer to highlight useful information and suppress useless information in the disease detection task, thereby improving the model's detection accuracy. Secondly, the C3 module was replaced with the C3TR module containing the transformer encoder, which improved the ability to extract apple leaf disease features. Thirdly, we replaced the concat layer with the BiFPN layer, and a path from the 6th layer was added to the 20th layer. The features generated by the backbone at a given layer were bidirectionally connected with the features generated by the FPN and the PANet to provide stronger information representation capability. Figure 8 shows the overall framework of the BTC-YOLOv5s model used in this study.

**Figure 8.** BTC-YOLOv5s model structure diagram.

#### *2.3. Experimental Equipment and Parameter Settings*

The model was trained and tested on a Linux system with the PyTorch 1.10.0 deep learning framework, using the following hardware: an Intel(R) Xeon(R) E5-2686 v4 @ 2.30 GHz processor, 64 GB of memory, and an NVIDIA GeForce RTX 3090 graphics card with 24 GB of video memory. The software environment comprised CUDA 11.3, cuDNN 8.2.1, and Python 3.8.

During training, the initial learning rate was set to 0.01, and a cosine annealing strategy was employed to decrease the learning rate. The network parameters were optimized using stochastic gradient descent (SGD) with a momentum of 0.937 and a weight decay of 0.0005. The number of training epochs was 150, the batch size was set to 32, and the input image resolution was uniformly adjusted to 640 × 640. Table 2 lists the tuned training parameters.
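The optimizer and learning-rate schedule described above can be set up as follows; the model below is a stand-in placeholder, and only the hyperparameters (SGD, momentum 0.937, weight decay 0.0005, cosine annealing from 0.01 over 150 epochs) come from the text.

```python
# Sketch of the training configuration reported above; the model and loop body
# are placeholders.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for the BTC-YOLOv5s network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

for epoch in range(150):
    # ... one pass over the 640x640 training batches (batch size 32) goes here ...
    scheduler.step()  # cosine-annealed learning-rate decay
```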

**Table 2.** Model training parameters.


#### *2.4. Model Evaluation Metrics*

The evaluation metrics are divided into two aspects: performance assessment and complexity assessment. The model performance metrics include precision, recall, mAP, and F1 score. The model complexity metrics include model size, floating point operations (FLOPs), and frames per second (FPS), which evaluate the computational efficiency and image processing speed of the model.

Precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive and measures the classification ability of a model, while recall is the ratio of correctly predicted positive samples to the total number of positive samples. The AP is the area under the precision–recall curve, and the mAP is the mean of the AP values over all classes, which reflects the overall detection and classification performance of the model. The F1 score is the harmonic mean of precision and recall and evaluates model performance using both. The calculation formulas are shown in Equations (8)–(12).

$$Precision = \frac{TP}{TP + FP} \tag{8}$$

$$Recall = \frac{TP}{TP + FN} \tag{9}$$

where *TP* is the number of correctly detected positive samples, *FP* is the number of negative samples incorrectly detected as positive, and *FN* is the number of positive samples that were missed.

$$\text{AP} = \int\_0^1 P(R)dR \tag{10}$$

$$\text{mAP} = \frac{\sum\_{i=1}^{n} AP\_i}{n} \tag{11}$$

where *n* is the number of disease species.

$$\text{F1} = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{12}$$
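A small sketch of how these metrics can be computed from detection counts is given below; the counts in the usage example are illustrative and do not correspond to results reported in this paper.

```python
# Hedged sketch of Eqs. (8)-(12): precision, recall and F1 from TP/FP/FN, and AP
# approximated as the area under the precision-recall curve; mAP (Eq. (11)) is
# the mean AP over all disease classes.
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(recalls, precisions):
    order = np.argsort(recalls)                 # integrate P(R) over recall (Eq. (10))
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])

print(precision_recall_f1(tp=90, fp=10, fn=20))  # (0.900, 0.818, 0.857) for these counts
```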

The model size refers to the amount of memory required to store the model. FLOPs measures the complexity of the model as the total number of multiplication and addition operations it performs; the lower the FLOPs value, the less computation is required for inference and the faster the model runs. The formulas for FLOPs are shown in Equations (13) and (14). FPS indicates the number of images processed per second, which assesses processing speed and is crucial for real-time disease detection. To reflect deployment on mobile devices with low computational cost, an octa-core CPU without a graphics card was used for the speed test.

$$\text{FLOPs(Conv)} = \left(2 \times C\_{in} \times K^2 - 1\right) \times W\_{out} \times H\_{out} \times C\_{out} \tag{13}$$

$$\text{FLOPs(Linear)} = (2 \times C\_{in} - 1) \times C\_{out} \tag{14}$$

where *Cin* represents the input channel, *Cout* represents the output channel, *K* represents the convolution kernel size, and *Wout* and *Hout* represent the width and height of the output feature map, respectively.
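As a worked example of Equation (13), consider a hypothetical 3 × 3 convolution (the layer dimensions are illustrative and not taken from BTC-YOLOv5s) with $C\_{in} = 64$, $C\_{out} = 128$, and a 40 × 40 output feature map:

$$\text{FLOPs} = \left(2 \times 64 \times 3^2 - 1\right) \times 40 \times 40 \times 128 = 1151 \times 204{,}800 \approx 2.36 \times 10^8$$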

#### **3. Results**

#### *3.1. Performance Evaluation*

The proposed BTC-YOLOv5s model was validated on the constructed ALDD test set, and the results were compared with the YOLOv5s baseline model trained with the same optimized parameters. As shown in Table 3, the improved model achieved an AP score for frogeye leaf spot similar to the original model, while significantly improving the detection performance for the other three diseases. Notably, scab, with its irregular lesion shape, was the most difficult disease to detect, and the improved model achieved a 3.3% increase in AP, the largest improvement. These results indicate that the proposed model effectively detects all four diseases with improved accuracy.

**Table 3.** Comparison of detection results of YOLOv5s and BTC-YOLOv5s.


Figure 9 shows the evaluation results of precision, recall, mAP@0.5, and mAP@0.5:0.95 for the baseline model YOLOv5s and the improved model BTC-YOLOv5s trained for 150 epochs.

Figure 9 shows that the precision and recall curves fluctuated within a narrow range after 50 epochs, but that the BTC-YOLOv5s curves remained consistently above those of the baseline model. The mAP@0.5 curve of the improved model intersected that of the baseline model at around 60 epochs. Although the mAP@0.5 of the baseline model increased rapidly in the early stage, the BTC-YOLOv5s model improved steadily in the later stage and achieved better results. The mAP@0.5:0.95 curve showed similar behavior.

As apple leaf diseases were small and densely distributed, to further verify the accuracy of the BTC-YOLOv5s model, the test set was divided into two groups based on lesion density: sparse and dense lesion distributions. We compared the detection results of the baseline and improved models. The mAP@0.5 of the BTC-YOLOv5s model on images with sparse and dense lesions was 87.3% and 81.4%, respectively, which was 1.7% and 0.7% higher than that of the baseline model.

**Figure 9.** Evaluation metrics of different models, where (**a**) is a comparison of precision curves before and after model improvement; (**b**) comparison of recall curves before and after model improvement; (**c**) comparison of mAP@0.5 curves before and after model improvement; (**d**) comparison of mAP@0.5:0.95 curves before and after model improvement.

As shown in Figure 10, yellow circles represent missed detections and red circles represent false detections. Irrespective of whether the lesions are sparse or dense, the baseline model YOLOv5s missed small or blurred lesions (the first row of images in Figure 10a,b). The improved model resolved this issue and detected small lesions or diseases on leaves outside the focus range (the second row of images in Figure 10a,b). Additionally, the BTC-YOLOv5s model produced higher confidence levels. The baseline model also mistakenly detected non-diseased objects such as apples, background, and other irrelevant objects (Figure 10(a3,b1)), and in one case scab was mistakenly detected as rust (Figure 10(b5)). The improved model could concentrate more on the diseases and extract deeper discriminative features between different diseases, avoiding the above errors. Furthermore, the lesions of frogeye leaf spot, scab, and rust were small, dense, and distributed over different parts of the leaves, while powdery mildew typically affected the whole leaf, so the scale of the detection boxes varied from large to small; the proposed model adapted well to these scale changes between diseases.

Therefore, the BTC-YOLOv5s model could adapt not only to different lesion distributions but also to apple leaf diseases of different scales and characteristics, showing excellent detection results.

**Figure 10.** Comparison of detection effect of lesion (sparse and dense) before and after model improvement. (**a**) Sparse distribution; (**b**) Dense distribution. Where yellow circles represent missed detections and red circles represent false detections. Lines 1 and 3 are the YOLOv5s baseline model, and lines 2 and 4 are the improved BTC-YOLOv5s model. Numbers 1 and 2 are frogeye leaf spot, numbers 3 and 4 are rust, numbers 5 and 6 are scab, and numbers 7 and 8 are powdery mildew.

#### *3.2. Results of Ablation Experiments*

This study verified the effectiveness of different optimization modules via ablation experiments. We constructed several improved models by adding the BiFPN module (BF), transformer module (TR), and CBAM attention module sequentially to the baseline model YOLOv5s and compared the results on the same test data. The experimental results are shown in Table 4.

In Table 4, the precision and mAP@0.5 of the baseline model YOLOv5s were 78.4% and 82.7%, respectively. Adding the three optimization modules, namely the BiFPN module, transformer module, and CBAM attention module, individually improved both precision and mAP@0.5 over the baseline model: precision increased by 3.3%, 3.3%, and 1.1%, respectively, and mAP@0.5 increased by 0.5%, 1%, and 0.2%, respectively. The combination of all three optimization modules achieved the best results, with precision, mAP@0.5, and mAP@0.5:0.95 all reaching their highest values, which were 5.7%, 1.6%, and 0.1% higher than those of the baseline model, respectively. By fusing cross-channel information with spatial information, the CBAM attention mechanism focused on important features while suppressing irrelevant ones. The transformer module used the self-attention mechanism to establish long-range dependencies with the disease features, and the BiFPN module fused these features across scales to improve the identification of overlapping and fuzzy targets. As a result of combining the three modules, the BTC-YOLOv5s model achieved the best performance.


**Table 4.** Results of ablation experiments.

Where BF and TR represent the BiFPN module and transformer module, respectively.

#### *3.3. Analysis of Attention Mechanisms*

In order to assess the effectiveness of the CBAM attention mechanism module, the other structures and experimental parameter settings of the BTC-YOLOv5s model were retained, and only the CBAM module was replaced with other mainstream attention modules, namely SE [37], CA [38], and ECA [39], for comparison.

Table 5 shows that the attention mechanism could significantly improve the accuracy of the model. The mAP@0.5 of the SE, CA, ECA, and CBAM models reached 83.4%, 83.6%, 83.6%, and 84.3%, respectively, which was 0.4%, 0.6%, 0.6%, and 1.3% higher than that of the YOLOv5s + BF + TR model. Each attention mechanism improved the mAP@0.5 to varying degrees, with the CBAM model performing best at 84.3%, which was 0.9%, 0.7%, and 0.7% higher than the SE, CA, and ECA models, respectively; its mAP@0.5:0.95 was also the highest among the four attention mechanisms. The SE and ECA attention mechanisms only took channel information in the feature map into account, while the CA attention mechanism encoded channel relations using location information. In contrast, the CBAM attention mechanism combined spatial and channel attention, emphasizing the disease feature information in the feature map, which was more conducive to disease identification and localization.


**Table 5.** Performance comparison of different attention mechanisms.

Moreover, the attention module did not increase the model size or FLOPs, indicating that it was a lightweight module. The BTC-YOLOv5s model with the CBAM module achieved improved recognition accuracy while maintaining the same model size and computational cost.

#### *3.4. Comparison of State-of-the-Art Models*

The current mainstream two-stage detection model Faster R-CNN and the one-stage detection models SSD, YOLOv4-tiny, and YOLOx-s were selected for comparison experiments. The ALDD dataset was used for training and testing, with the same experimental parameters across all models. The experimental results are shown in Table 6.


**Table 6.** Performance comparison of mainstream detection models.

Among all models, the mAP@0.5 and F1 score of Faster R-CNN were lower than 50%, and its large model size and computational load resulted in only 0.16 FPS, making it unsuitable for real-time detection of apple leaf diseases. The one-stage detection model SSD had an mAP@0.5 of 71.56% and a model size of 92.1 MB, which did not meet the detection requirements in terms of accuracy or complexity. In the YOLO series, YOLOv4-tiny reached an mAP@0.5 of only 59.86%, which was too low, while YOLOx-s achieved 80.1% mAP@0.5 but required 26.64 G FLOPs and processed only 4.08 images per second; neither model was conducive to mobile deployment. The proposed BTC-YOLOv5s model had the highest mAP@0.5 and F1 score among all models, exceeding SSD, Faster R-CNN, YOLOv4-tiny, YOLOx-s, and YOLOv5s by 12.74%, 48.84%, 24.44%, 4.2%, and 1.6%, respectively. Its model size and FLOPs were similar to those of the baseline model, and its FPS reached 8.7 frames per second, meeting the requirements for real-time detection of apple leaf diseases in real scenarios.

As seen in Figure 11, the BTC-YOLOv5s model outperformed the other five models in terms of detection accuracy. Additionally, the BTC-YOLOv5s model exhibited comparable model size, computational effort, and detection speed to the other lightweight models. In summary, the overall performance of the BTC-YOLOv5s model was excellent and could accomplish accurate and efficient apple leaf disease detection tasks in real-world scenarios.

**Figure 11.** Performance comparison of different detection algorithms.

#### *3.5. Robustness Testing*

In actual production, the detection of apple leaf diseases may be disturbed by various environmental factors such as overexposure, dim light, and low-resolution images. In this study, these interferences were simulated by increasing brightness, reducing brightness, and adding Gaussian noise to the test set images, resulting in a total of 1191 images (397 images per case). We evaluated the robustness of the optimized BTC-YOLOv5s model under these interference conditions to determine its detection effectiveness. Additionally, we tested the model's ability to detect concurrent diseases by adding 50 images containing multiple diseases. The experimental results are shown in Figure 12.

**Figure 12.** Robustness test results under three extreme conditions. (**a**) Original; (**b**) Bright light; (**c**) Dim light; (**d**) Blurry. Where first to fifth rows show results for apple frogeye leaf spot, rust, scab, powdery mildew, and multiple diseases, respectively.

From the detection results, the model could accurately detect frogeye leaf spot, rust, and powdery mildew under all three interference conditions (bright light, dim light, and blurring), with few missed detections. Scab was also correctly identified, but a certain number of missed detections occurred under dim light and blurry conditions, mainly because scab lesions appear black and the overall background of the image has a similar color to the lesions under dim light. As shown in the fifth row of Figure 12, the model also detected images with concurrent diseases, although a few missed detections occurred in the blurry condition. The experimental results achieved an mAP of more than 80%. Overall, the BTC-YOLOv5s model still exhibited strong robustness under extreme conditions such as blurred images and insufficient light.

#### **4. Discussion**

#### *4.1. Multi-Scale Detection*

Multi-scale detection is a challenging task in apple leaf disease detection due to the varying sizes of the lesions. In this study, frogeye leaf spot, scab, and rust lesions are typically small and dense, while powdery mildew spreads over the whole leaf. The size of the spots to be detected relative to the whole image can vary widely between images or even within the same image. To address this issue, this study introduced the BiFPN into YOLOv5s, based on the idea of multi-scale feature fusion, to improve the model's fusion ability. The BiFPN stacks the feature pyramid structure multiple times, providing the network with strong feature representation capabilities, and performs weighted feature fusion, allowing the network to learn the significance of different input features. In the field of agricultural detection, multi-scale detection has been a popular research topic. For example, Li et al. [21] accomplished multi-scale cucumber disease detection by adding a set of anchors matching small instances, and Cui et al. [40] used a squeeze-and-excitation feature pyramid network to fuse multi-scale information, retaining only the 26 × 26 detection head for pinecone detection. However, the current study still faces significantly degraded detection accuracy for very large or very small targets. Future studies will explore how models can be adapted to disease spots of different scales.

#### *4.2. Attentional Mechanisms*

The attention mechanism assigns weights to the image features extracted by the model, enabling the network to focus on target regions with important information while suppressing irrelevant information and reducing the interference of irrelevant backgrounds on the detection results. Introducing an attention mechanism can effectively enhance a detection model's feature learning ability, and many researchers have incorporated it to improve model performance. For example, Liu et al. [41] added the SE attention module to YOLOX to enhance the extraction of cotton boll feature details. Bao et al. [42] added dual-dimensional mixed attention (DDMA) to the detection model's neck, which parallelizes coordinate attention with channel and spatial attention to reduce missed and false detections caused by dense leaf distribution. This study used the CBAM attention mechanism to enhance the BTC-YOLOv5s model's feature extraction ability. CBAM comprises two sub-modules, SAM and CAM; using these sub-modules alone yielded accuracies of 83.2% and 83.1%, respectively, inferior to the model using CBAM. Because SAM and CAM provide only spatial or only channel attention, whereas CBAM combines both, CBAM considers useful information from both the feature channels and the spatial dimensions, making it more beneficial for locating and identifying lesions.

#### *4.3. Outlook*

Although the proposed model can accurately identify apple leaf diseases, some issues deserve attention and further study. Firstly, the dataset used in this study only contains images of four disease types, whereas there are approximately 200 apple diseases in total; future research will therefore include images of more disease species and different disease stages. Secondly, the model's accuracy in the case of dense lesions is not good and decreases significantly compared with its performance in the sparse case. The detection results showed that scab had the highest error rate, mainly due to its irregular lesion shape and indistinct borders, which interfered with detection. In the future, scab will be considered as a separate research topic to improve the model's detection accuracy.

#### **5. Conclusions**

This study proposed an improved detection model, BTC-YOLOv5s, based on YOLOv5s, aimed at addressing the missed and false detections caused by the varied shapes, multiple scales, and dense distribution of apple leaf lesions. To enhance the overall detection performance of the original YOLOv5s model, the BiFPN module was introduced to increase the fusion of multi-scale features and provide richer semantic information, and the transformer and CBAM attention modules were added to improve the ability to extract disease features. The results indicate that the BTC-YOLOv5s model achieved an mAP@0.5 of 84.3% on the ALDD test set, with a model size of 15.8 MB and a detection speed of 8.7 FPS on an octa-core CPU device, and it maintained good performance and robustness under extreme conditions. The improved model offers high detection accuracy, fast detection speed, and low computational requirements, making it suitable for deployment on mobile devices for real-time monitoring and intelligent control of apple diseases.

**Author Contributions:** Conceptualization, H.L. and F.Y.; methodology, H.L.; software, H.L. and S.F.; validation, H.L., L.S. and S.F.; formal analysis, L.S.; investigation, H.L.; resources, H.L.; data curation, H.L. and S.F.; writing—original draft preparation, H.L.; writing—review and editing, H.L. and F.Y.; visualization, S.F.; supervision, L.S.; project administration, L.S.; funding acquisition, F.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Natural Science Foundation of Henan Province (No.222300420463); Henan Provincial Science and Technology Research and Development Plan Joint Fund (No.222301420113); the Collaborative Innovation Center of Henan Grain Crops, Zhengzhou and by the National Key Research and Development Program of China (No. 2017YFD0301105).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
