1. Introduction
Maize is a primary staple crop in China, a crucial source of feed for the livestock and aquaculture industries, and an indispensable raw material for the medical, hygiene, and chemical sectors, so ensuring both its yield and its quality is of significant importance [1]. Leaf diseases occur frequently during the growth and development of maize, and without timely prevention and control they reduce final yield and quality [2,3]. Between 2012 and 2015, the United States is estimated to have lost approximately 14 million metric tons of maize, equivalent to around 1.9 billion US dollars, to northern leaf blight (NLB) [4]. In China, diseases severely affect maize production and are a crucial factor constraining the healthy development of the maize industry: annual yield losses due to diseases range between 6% and 10% and exceed 30% under severe conditions [5]. This poses a formidable challenge to the industry and calls for effective preventive measures to keep production stable and sustainable. The common leaf diseases of maize include blight, rust, and gray leaf spot. Gray leaf spot and rust produce relatively small lesions, while blight lesions are similar in color to those of gray leaf spot [6,7]. Besides the usual case of a single disease on an individual leaf, several diseases may overlap and intermingle on one leaf, a phenomenon known as compound disease. Current research on maize leaf disease detection focuses primarily on single diseases, with limited attention given to compound diseases on a single leaf. However, compound diseases reduce the accuracy of existing deep learning-based disease detection algorithms. It is therefore imperative to develop new detection models that can identify compound leaf diseases in maize.
In recent years, an increasing number of scholars have applied object detection technology to agriculture. Object detection methods fall into two categories, two-stage and one-stage, according to the number of detection stages and the resulting speed. Representative two-stage models include R-CNN and Faster R-CNN, while the YOLO series is representative of one-stage detection. Wang et al. [8] addressed the difficulty of identifying apple leaf diseases by proposing an improved recognition method based on Faster R-CNN. They used ResNeSt as the backbone feature extraction network, incorporated the feature pyramid network (FPN) for multiscale feature fusion, and optimized the proposal generation mechanism using a cascaded approach, which increased the average precision of the improved Faster R-CNN model by 8.7%. Wang et al. [9] tackled missed detections in apple fruit recognition by introducing an improved small-target detection algorithm based on YOLOv5s. They enhanced the RFA module, DFP module, and Soft-NMS algorithm, achieving precise detection of small targets with improvements in accuracy, recall, and mAP of 3.6%, 6.8%, and 6.1%, respectively. Qiao et al. [10] aimed at accurate counting of red jujubes in orchards and proposed an improved YOLOv5s counting method. Using ShuffleNet V2 as the backbone, they introduced a novel data loading module (Stem) and replaced PANet with BiFPN to enhance feature fusion capability. Model parameters and model size decreased by 6.25% and 8.33%, respectively, while precision, recall, F1-score, AP, and FPS increased by 4.3%, 2.0%, 3.1%, 0.6%, and 3.6%, respectively. Zhou et al. [11] addressed the real-time detection of various apple bark diseases in orchards by training a YOLOv5s model for apple bark disease recognition. The lightweight model, deployed on the Android platform, diagnosed apple bark diseases rapidly while maintaining a stable average precision of 87.2%. These studies indicate that object detection technology has been applied widely and effectively in agriculture. Moreover, many scholars prefer the lightweight YOLOv5 model for crop disease detection, which provides valuable insights for applying YOLOv5 to the identification of compound leaf diseases in maize.
The attention mechanism mimics the human visual system, enabling computers to quickly identify important features or regions during image processing while disregarding irrelevant areas. Specifically, an attention mechanism typically consists of two components: an encoder and a decoder. The encoder generates an “attention map” that indicates which regions the model should focus on, while the decoder uses this attention map to weight the feature representations and generate the final output [12,13,14]. Wen et al. [15] proposed a large-scale, multiclass, fine-grained pest and disease recognition network (PD-Net) to address the low recognition accuracy caused by the diversity of pests and diseases. PD-Net incorporates a convolutional block attention module into the baseline network. By leveraging mixed cross-channel and spatial attention, the model enhances its capability to extract and represent key features across both channel and spatial dimensions. Additionally, a cross-layer nonlocal module is introduced to improve the fusion of multiscale features across multiple feature extraction layers. Experimental results demonstrated that PD-Net has an accuracy advantage in large-scale multiclass pest and disease recognition tasks. Li et al. [16] addressed the low efficiency and slow speed of plant leaf disease recognition. They improved the ResNet structure using the SE attention mechanism, achieving a recognition error rate of 1.52% on a self-built dataset of apple leaf diseases. Wang et al. [17] aimed to improve the accuracy of crop pest and disease recognition. They proposed an enhanced CBAM attention module called I_CBAM, which combines channel attention and spatial attention in parallel. This hybrid attention module demonstrated superior recognition performance for fine-grained classification of pests and diseases and was robust across different convolutional neural network models. Wang et al. [18] aimed to achieve precise detection of Guangfu Hand pests and diseases in complex backgrounds. They introduced the CBAM hybrid attention mechanism into the YOLOv5 network, strengthening the model’s focus on the feature information of Guangfu Hand pests and diseases. This improved recognition accuracy, with the final model reaching an average precision of 93.06%. These studies indicate that incorporating attention mechanisms into object detection models can effectively enhance overall model performance.
In the training of deep learning models, the quantity of data plays a crucial role. Sufficient data enable more comprehensive training, allowing the model to better learn the features of the target, so the size of the dataset has a significant impact on model accuracy. Han et al. [19] proposed a maize gray leaf spot image generation algorithm based on CycleGANs (cycle-consistent adversarial networks) to address the difficulty of collecting maize disease images, particularly given the high variability in gray leaf spot symptoms. Through disease image translation, the algorithm generates diseased maize images from healthy ones. The method first extracts features from healthy maize images and gray leaf spot images separately and then applies disease transfer to the healthy images to obtain the desired gray leaf spot images. Experimental results demonstrated that, compared with image transfer using a VAE (variational autoencoder) or a GAN (generative adversarial network), the CycleGAN approach produces better visual effects for gray leaf spot at different severity levels and more accurate generated images. Both the quantity and the quality of the dataset significantly affect model performance. Albert et al. [20] applied image synthesis techniques to alleviate the limitation that data quantity places on the accuracy of digital plant disease phenotyping. They used two classes of tomato data from the PlantVillage dataset, namely healthy data and bacterial spot disease data. Realistic data were synthesized with a deep convolutional generative adversarial network (DC-GAN), with 80% (1272 instances) and 20% (318 instances) of the data used for training and testing, respectively. The DC-GAN was trained on the original bacterial spot disease training dataset (A) at different time periods, and three batches of synthetic bacterial spot disease training data were generated using the selected model. Subsequently, the same DC-GAN algorithm was trained on the original healthy tomato training dataset (B), and three batches of synthetic healthy tomato training data were generated using the selected model. The classification accuracies obtained with the original dataset and the various synthetic datasets were then compared. The results indicated that the third DC-GAN-synthesized training dataset outperformed the original training dataset containing 1272 real samples. These studies indicate that generative adversarial networks can effectively address limited data quantity and thereby support the training of object detection models. Because image data for compound diseases in real maize fields are scarce, this study enriches the dataset of compound maize leaf diseases by using the CycleGAN to generate synthetic data for certain compound diseases, thereby augmenting the training data and enhancing the robustness of the model.
Due to the small size and inconspicuous features of gray leaf spot lesions, as well as their dispersed distribution, they are prone to missed detections and false alarms. Rust disease lesions, although having relatively simple features, are also dispersed, leading to missed detections. Moreover, the color characteristics of blight and gray leaf spot disease are quite similar, which can result in false alarms when they co-occur. To address the issues of missed detections and false alarms in maize leaf diseases, specifically gray leaf spot and rust diseases, and to improve the accuracy of identifying compound maize leaf diseases, this study enriches the dataset of compound maize leaf diseases using the CycleGAN. Furthermore, a maize leaf disease detection model named YOLOv5s-C3CBAM is proposed based on the YOLOv5s architecture, incorporating an attention mechanism. The aim is to enhance the detection capability of the model’s backbone network for compound diseases through the attention mechanism. This research provides valuable insights for improving the accuracy of maize leaf disease detection in real-world field conditions with compound disease occurrences.
2. Materials and Methods
2.1. Acquisition of Initial Experimental Data
The initial maize leaf data were obtained from publicly available datasets on platforms such as Kaggle (Kaggle, LLC, San Francisco, CA, USA), OpenDataLab (Shanghai AI Laboratory, LLC, Shanghai, China), and PaddlePaddle (Baidu Brain, Beijing, China). The images have a uniform size of 256 × 256 pixels and were captured at different times and from various orientations. After careful screening of the public datasets, a collection of 2107 original images was curated, covering five categories: gray leaf spot disease, rust disease, blight, compound diseases, and healthy leaves. Representative images of each category are shown in Figure 1.
2.2. Maize Leaf Image Generation Based on CycleGAN
In the original dataset, there is a limited number of maize leaf images belonging to the compound disease category. However, the inclusion of compound disease leaf data has a significant impact on the accuracy and generalization of the detection model. Obtaining image data for compound diseases in real maize fields is challenging. Therefore, the CycleGAN was employed to generate a portion of maize leaf images with compound disease.
To increase the diversity of training data and ensure a balanced source of data for model training, the generation of maize leaf images with compound diseases was accompanied by training the model to generate three types of single disease images as well as healthy leaf images. These generated images were used for the final model training.
The CycleGAN [21,22] is a variant of the generative adversarial network (GAN) [23,24,25] and offers several advantages over traditional GANs. First, it can learn the mapping between two different domains without paired data. Second, it enables bidirectional mapping, meaning it can convert images from one domain to the other and back. Additionally, it introduces a cycle consistency loss function, which enhances network stability. Moreover, the CycleGAN is capable of generating diverse images. It is therefore well suited for high-quality image translation and generation tasks.
The CycleGAN uses four deep neural networks, two generators and two discriminators, to achieve image generation. One generator transforms images from domain A to domain B, while the other transforms images from domain B to domain A. The two discriminators distinguish between real and generated images in domains A and B, respectively. The network architecture of the CycleGAN is illustrated in Figure 2 [26].
The CycleGAN trains the model using an adversarial loss function (Equations (1) and (2)) and a cycle consistency loss function (Equation (3)). The adversarial loss function ensures that the generators are capable of producing realistic images that deceive the discriminators. The cycle consistency loss function ensures that the images retain similar visual features after undergoing the reverse transformation. The overall loss of the network is shown in Equation (4).
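$$\mathcal{L}_{GAN}(G, D_B, A, B) = \mathbb{E}_{b \sim p_{data}(b)}[\log D_B(b)] + \mathbb{E}_{a \sim p_{data}(a)}[\log(1 - D_B(G(a)))] \tag{1}$$

$$\mathcal{L}_{GAN}(F, D_A, B, A) = \mathbb{E}_{a \sim p_{data}(a)}[\log D_A(a)] + \mathbb{E}_{b \sim p_{data}(b)}[\log(1 - D_A(F(b)))] \tag{2}$$

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{a \sim p_{data}(a)}[\lVert F(G(a)) - a \rVert_1] + \mathbb{E}_{b \sim p_{data}(b)}[\lVert G(F(b)) - b \rVert_1] \tag{3}$$

$$\mathcal{L}(G, F, D_A, D_B) = \mathcal{L}_{GAN}(G, D_B, A, B) + \mathcal{L}_{GAN}(F, D_A, B, A) + \lambda \mathcal{L}_{cyc}(G, F) \tag{4}$$

Here $G: A \rightarrow B$ and $F: B \rightarrow A$ denote the two generators, and $D_A$ and $D_B$ denote the discriminators for domains A and B.
$\mathcal{L}_{GAN}(G, D_B, A, B)$: The adversarial loss term involving the generator $G$ and discriminator $D_B$ ensures that the generator produces synthetic images that are challenging for the discriminator to distinguish from real images. The expectation terms, $\mathbb{E}_{b \sim p_{data}(b)}[\log D_B(b)]$ and $\mathbb{E}_{a \sim p_{data}(a)}[\log(1 - D_B(G(a)))]$, represent the log-probabilities of the discriminator’s predictions on real and generated images, respectively.
$\mathcal{L}_{GAN}(F, D_A, B, A)$: Similar to the previous term, this adversarial loss term involves the other generator $F$ and a different discriminator $D_A$. It ensures that the generator $F$ produces images challenging for discriminator $D_A$. The expectation terms $\mathbb{E}_{a \sim p_{data}(a)}[\log D_A(a)]$ and $\mathbb{E}_{b \sim p_{data}(b)}[\log(1 - D_A(F(b)))]$ represent the log-probabilities of the discriminator’s predictions on real and generated images.
$\mathcal{L}_{cyc}(G, F)$: The cycle consistency loss term is defined as $\mathbb{E}_{a \sim p_{data}(a)}[\lVert F(G(a)) - a \rVert_1] + \mathbb{E}_{b \sim p_{data}(b)}[\lVert G(F(b)) - b \rVert_1]$. It enforces the cycle consistency between the input and output of the generators $G$ and $F$. These terms measure the absolute differences between the reconstructed and original images.
$\lambda$: The hyperparameter $\lambda$ in the overall loss term serves as a weight for balancing the importance of adversarial losses and cycle consistency. Adjusting $\lambda$ allows for controlling the emphasis on adversarial training versus cycle consistency training.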
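As a numerical illustration of how the adversarial and cycle consistency terms combine, the following is a minimal sketch in plain Python rather than the training code used in this study. The function names, the flattened-list image representation, and the default weight `lam=10.0` are assumptions made for the example; the weight used in the experiments is not stated in the text.

```python
import math


def mean(values):
    return sum(values) / len(values)


def adversarial_loss(d_real, d_fake):
    """E[log D(real)] + E[log(1 - D(fake))] for one generator/discriminator pair.

    d_real and d_fake are discriminator outputs in (0, 1) for real and
    generated images, respectively.
    """
    real_term = mean([math.log(p) for p in d_real])
    fake_term = mean([math.log(1.0 - p) for p in d_fake])
    return real_term + fake_term


def l1_distance(images, reconstructions):
    """Absolute difference between images and reconstructions.

    Averaged over pixels and images, as is common in implementations
    (an assumption; the formula itself uses the L1 norm).
    """
    per_image = [
        mean([abs(o - r) for o, r in zip(orig, rec)])
        for orig, rec in zip(images, reconstructions)
    ]
    return mean(per_image)


def cycle_consistency_loss(a, a_rec, b, b_rec):
    """Sum of both round trips: A -> B -> A and B -> A -> B."""
    return l1_distance(a, a_rec) + l1_distance(b, b_rec)


def total_loss(adv_a_to_b, adv_b_to_a, cyc, lam=10.0):
    """Overall objective: both adversarial terms plus the weighted cycle term."""
    return adv_a_to_b + adv_b_to_a + lam * cyc
```

A larger `lam` places more weight on reconstructing the input faithfully, while a smaller value places more weight on fooling the discriminators.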
The process of using the CycleGAN to generate maize leaf images can be described in four steps: data preparation, generator model training, maize leaf image generation, and generated image selection. The entire process is illustrated in Figure 3.
2.2.1. Generation of Maize Leaf Images with a Single Disease
First, we set up the experiment to generate single-disease maize leaf images for blight, rust disease, and gray leaf spot disease. Training set A consisted of 100 healthy maize leaf images of 256 × 256 pixels. Training set B comprised three separate subsets, one each for blight, rust disease, and gray leaf spot disease, each containing 300 images of 256 × 256 pixels.
Blight lesions on maize leaves are elongated and typically distributed along the leaf veins. They can reach lengths of 5–10 cm and have a dry, shriveled surface. The color of the lesions is yellow-brown or gray-brown [27]. Rust disease lesions, on the other hand, are circular with a diameter of approximately 0.2–1 mm. The lesions appear raised and have a brittle texture. The color of rust disease lesions is yellow-brown or orange-yellow [28].
Because rust disease and blight lesions have distinct characteristics, the model was trained for 100 epochs with a batch size of 4 in the experiments transforming healthy maize leaves into leaves with blight and rust disease. The generated diseased images at different epochs during training are shown in Figure 4 and Figure 5.
The initial symptoms of gray leaf spot disease on maize leaves appear as water-soaked, light brown spots. Over time, these spots gradually expand into light brown stripes or irregular gray to brown elongated lesions [29].
In the experiment generating gray leaf spot disease images, the lesions are relatively small and their color characteristics are not prominent, which makes it difficult for the generator network to extract features accurately. The model was therefore trained for 150 epochs with a batch size of 4, and the additional training epochs improved the quality of the generated images. The generated images at different epochs during training are shown in Figure 6.
From Figure 4, Figure 5, and Figure 6, it can be observed that at the 100th epoch the generated blight images are of relatively high quality, mainly because blight lesions have simple and distinct features. For rust disease, there is minimal difference in image quality between the 70th and 100th epochs, which can be attributed to the small size but prominent color of rust lesions. For gray leaf spot disease, because the lesion features are difficult to extract, image quality did not meet the training requirements until the 110th epoch and was best at the 150th epoch. The training process indicates that, for complex images whose features are difficult to extract, increasing the number of training epochs can improve the quality of the generated images.
The generated maize leaf disease images produced using the CycleGAN were individually screened, resulting in a final selection of 217 images for blight, 220 images for rust disease, and 217 images for gray leaf spot disease, meeting the required criteria.
2.2.2. Generation of Maize Leaf Images with Compound Disease
The lesion characteristics of compound diseases are highly complex. To produce high-quality compound disease images, four sets of image generation experiments were designed: healthy leaves to compound disease, gray leaf spot disease to compound disease, rust disease to compound disease, and blight to compound disease.
In the experiment generating compound disease images from healthy leaves, training set A consisted of 100 images of healthy maize leaves of 256 × 256 pixels, and training set B consisted of 100 images of compound disease leaves of 256 × 256 pixels. Because compound disease data in the dataset are limited, and because the gray leaf spot experiment showed that increasing the number of training epochs can, to some extent, enhance the quality of the generated images, the model in this round of experiments was trained for 200 epochs with a batch size of 4. The generated images at different epochs during training are shown in Figure 7.
In the remaining three sets of experiments for generating compound disease images, training set A consisted of disease images of gray leaf spot, rust, and blight, respectively, with each set containing 300 images of 256 × 256 pixels. Training set B remained unchanged. Because training set A was larger and these experiments involved the transformation from a single disease to a compound disease, the model was trained for 100 epochs with a batch size of 4. The generated images at different epochs during training are shown in Figure 8.
Analyzing the image generation process in Figure 7 and Figure 8, it can be observed that the group generating compound disease images from healthy leaves produced the best results. This can be attributed to the singular and distinct features of healthy leaves, the relatively high number of training epochs, and the limited interference when transforming from the compound disease dataset to healthy leaf images. Similarly, the transformations from blight and rust disease to compound diseases gave relatively good results because the features of blight and rust disease are simple and distinct. However, the transformation from gray leaf spot disease to compound diseases gave the poorest image quality. This can be attributed to the less prominent features of gray leaf spot disease, which yielded only average results when generating compound images combining gray leaf spot disease, blight, and rust disease. The weights corresponding to the best image generation results during training were used to complete the image generation task. After screening the generated images, a total of 200 usable compound disease images were obtained for model training.
In the aforementioned transformation and generation experiments, owing to the cyclic generation characteristics of the CycleGAN, corresponding healthy maize leaf images were generated alongside the single disease and compound disease images. After screening these images, a total of 199 healthy maize leaf images that met the requirements were obtained. Examples of the generated leaf images from different categories are shown in Figure 9.
2.3. Construction of Maize Leaf Disease Dataset
Table 1 presents the number of original images used in the subsequent training of the detection model after CycleGAN-based generation of maize leaf images, together with the number of final images obtained after traditional data augmentation (translation, rotation, grayscale conversion, and mosaic).
The final dataset was constructed by utilizing the LabelImg (Version 1.8.6) tool to annotate disease targets in the images of corn leaves. Four categories were annotated, including blight, rust disease, gray leaf spot disease, and healthy, generating a labeled dataset for corn leaf diseases. Subsequently, the constructed dataset of corn leaf diseases was randomly divided into training and validation sets in a 9:1 ratio.
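The random 9:1 division described above can be sketched as follows. This is a minimal illustration, not the script used in this study: the function name `split_dataset`, the fixed seed, and the file names in the usage note are hypothetical.

```python
import random


def split_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle a list of samples and split it into training and validation parts."""
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]
```

For example, `split_dataset([f"leaf_{i}.jpg" for i in range(1000)])` returns 900 training names and 100 validation names with no overlap. Keeping each image file together with its annotation file during the split is left to the caller.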
In addition, we randomly selected 100 maize leaf disease images from the publicly available datasets, ensuring that they had not been included in the training and validation sets and that they were captured at different times and from various angles. This subset served as an independent test set, so that the reported performance metrics reflect the model’s ability to generalize and remain robust across diverse real-world conditions.
2.4. Construction of Maize Leaf Compound Disease Recognition Network
To address the low disease detection accuracy in real-world maize fields caused by compound diseases on maize leaves, this study proposes an attention mechanism-based object detection model for more precise and rapid detection of compound diseases. Subsequent experiments with YOLO series detection models showed that the YOLOv5 series has a smaller model size than YOLOv7. Considering that the model is later to be deployed on handheld devices, we opted to enhance the YOLOv5 backbone network. Drawing inspiration from the selective nature of human visual attention, an attention mechanism module was integrated into YOLOv5 to improve the accuracy of compound disease detection.
2.4.1. YOLOv5 Network Model
YOLOv5 (Version 6.1) is a deep learning-based object detection framework that leverages CSPDarknet53 as its backbone network and incorporates techniques such as adaptive prediction and dynamic convolutional kernels. These advancements aim to enhance the model’s perception capability and robustness, resulting in improved accuracy. Furthermore, YOLOv5 follows the design principle of “single-stage detection,” which enables faster and more efficient performance compared to traditional two-stage detection algorithms while maintaining high accuracy. This design philosophy provides excellent real-time capability and scalability.
YOLOv5 provides four different models of varying sizes: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The YOLOv5s model, being the most lightweight, is suitable for resource-constrained scenarios such as mobile and embedded devices. The YOLOv5m model strikes a balance between speed and accuracy and performs well in general scenarios. The YOLOv5l model offers higher accuracy and finds wide application in scenarios that demand higher precision. The largest model, YOLOv5x, achieves the highest accuracy and performs well in complex scenes [30]. Considering the requirement for fast and accurate detection of compound diseases in this study, the YOLOv5s model was selected as the base network architecture. The network architecture of YOLOv5s is illustrated in Figure 10.
2.4.2. Attention Mechanism Module
There are three commonly used attention mechanisms: the channel attention mechanism (squeeze and excitation, SE), the spatial and channel mixed attention mechanism (convolutional block attention module, CBAM), and the coordinate attention mechanism (coordinate attention, CA) [31].
The SE (squeeze and excitation) attention mechanism [32] is an adaptive attention mechanism that focuses on weighting features from different channels in a feature map to enhance attention to important features. The structure of the SE attention mechanism is depicted in Figure 11a. The SE attention mechanism operates through squeeze and excitation operations on the input feature map. The squeeze operation extracts global information from the input feature map and compresses it into a vector, while the excitation operation applies channel-wise weights based on this vector.
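The squeeze and excitation steps can be made concrete with a small sketch in plain Python. It assumes a feature map stored as nested lists indexed `[channel][row][column]` and a two-layer fully connected bottleneck for the excitation step, as in the usual SE design; the function names and the weight arguments `w1`, `b1`, `w2`, `b2` are hypothetical and would be learned in practice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(vector, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def se_block(feature_map, w1, b1, w2, b2):
    """Apply squeeze-and-excitation to a [channel][row][column] feature map."""
    # Squeeze: global average pooling compresses each channel to one scalar.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    # Excitation: bottleneck layer with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_map, gates)
    ]
    return scaled, gates
```

Because the gates depend only on per-channel averages, SE reweights whole channels and does not distinguish between spatial positions.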
The CA (coordinate attention) mechanism [33] distinguishes itself from traditional attention mechanisms by incorporating both the interrelationships between different positions in the input and their absolute positional information. In particular, the CA mechanism includes two components: spatial attention and channel attention. Spatial attention primarily emphasizes the positional information of the target in the image by encoding the position information and obtaining weights for each position. Channel attention, on the other hand, focuses on the relationships between feature channels by encoding the channel information and obtaining weights for each channel. The network structure of the CA mechanism is illustrated in Figure 11b.
The CBAM (convolutional block attention module) mechanism [34] improves model performance by applying adaptive spatial and channel-level attention weighting to the feature map. The CBAM mechanism comprises two parts: the channel attention module (CAM) and the spatial attention module (SAM). The CAM is primarily used to process information in the channel dimension of the image, while the SAM handles spatial dimension information. The structural diagram of the CBAM mechanism is depicted in Figure 11c.
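The two CBAM stages can be sketched in plain Python as below. The sketch follows the commonly described design, in which the CAM feeds average-pooled and max-pooled channel descriptors through a shared two-layer perceptron and the SAM convolves the channel-wise average and maximum maps; it is not taken from the implementation used in this study. The function names, the bias-free weights `w1` and `w2`, and the `kernel` argument are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vector, w1, w2):
    """Two-layer perceptron (ReLU in between) shared by both pooled descriptors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """CAM: one gate per channel from average- and max-pooled descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(map(max, ch)) for ch in fmap]
    logits = [a + m for a, m in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]
    return [sigmoid(z) for z in logits]


def spatial_attention(fmap, kernel):
    """SAM: one gate per position; kernel[c] is k x k (c=0 average map, c=1 max map)."""
    h, w, k = len(fmap[0]), len(fmap[0][0]), len(kernel[0])
    pad = k // 2
    pooled = [
        [[sum(ch[y][x] for ch in fmap) / len(fmap) for x in range(w)] for y in range(h)],
        [[max(ch[y][x] for ch in fmap) for x in range(w)] for y in range(h)],
    ]
    gates = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for c in range(2):
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - pad, x + j - pad
                        if 0 <= yy < h and 0 <= xx < w:  # zero padding at borders
                            acc += kernel[c][i][j] * pooled[c][yy][xx]
            row.append(sigmoid(acc))
        gates.append(row)
    return gates


def cbam(fmap, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    ca = channel_attention(fmap, w1, w2)
    refined = [[[g * v for v in row] for row in ch] for ch, g in zip(fmap, ca)]
    sa = spatial_attention(refined, kernel)
    return [
        [[v * sa[y][x] for x, v in enumerate(row)] for y, row in enumerate(ch)]
        for ch in refined
    ]
```

Unlike SE, the second stage assigns a separate weight to each spatial position, which is the property that makes CBAM attractive for small, dispersed lesions.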
2.4.3. Improved Network Architecture Based on Attention Mechanism
In this subsection, two methods are utilized to improve the network structure of YOLOv5s in its backbone network by incorporating attention mechanisms.
Method 1: An attention mechanism module is incorporated before the final SPPF (spatial pyramid pooling - fast) layer of the YOLOv5s backbone network. The refined backbone network structure is depicted in Figure 12a.
Within the backbone network, the input image is processed by convolutional modules and C3 modules until it reaches the final layer, the SPPF (spatial pyramid pooling - fast) layer. The SPPF layer is a spatial pyramid pooling feature extraction layer whose purpose is to extract features at various scales so that objects of different sizes can be detected. In the YOLOv5 implementation, this is achieved by applying max pooling with a fixed kernel size to the input feature map several times in succession and concatenating the input with the intermediate pooled outputs along the channel dimension. Each successive pooling step enlarges the effective pooling window, so the concatenated result combines features pooled at several scales while the spatial resolution of the feature map is preserved. The structure of the SPPF layer is depicted in Figure 12b.
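The pooling core of an SPPF layer can be sketched in plain Python as follows. The sketch assumes the arrangement used in the YOLOv5 reference implementation: three successive max-pooling steps with a 5 × 5 kernel and stride 1, whose outputs are concatenated with the input. The 1 × 1 convolutions that precede and follow the pooling in the real layer are omitted, and the function names are hypothetical.

```python
def max_pool_same(channel, k=5):
    """k x k max pooling with stride 1; border handling keeps the H x W size."""
    h, w, pad = len(channel), len(channel[0]), k // 2
    return [
        [
            max(
                channel[yy][xx]
                for yy in range(max(0, y - pad), min(h, y + pad + 1))
                for xx in range(max(0, x - pad), min(w, x + pad + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def sppf_pooling(feature_map, k=5):
    """Chain three poolings and concatenate them with the input along channels."""
    p1 = [max_pool_same(ch, k) for ch in feature_map]
    p2 = [max_pool_same(ch, k) for ch in p1]  # equivalent to one 9 x 9 pooling
    p3 = [max_pool_same(ch, k) for ch in p2]  # equivalent to one 13 x 13 pooling
    return feature_map + p1 + p2 + p3
```

Chaining small poolings reproduces the larger windows of the original spatial pyramid pooling design at lower cost, which is where the "fast" in the name comes from.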
An attention mechanism module is introduced in the layer before the SPPF layer to incorporate weighted differentiations into the feature map before the multiscale fusion process takes place in the SPPF layer. This is achieved by multiplying the feature map with the weight matrix of the attention mechanism. The inclusion of the attention mechanism in the preceding layer of the SPPF layer enhances the extraction of deep-level features, leading to a more pronounced emphasis on important features in the feature map after the fusion process in the SPPF layer. Consequently, this enhancement contributes to the overall improvement in the detection model’s performance.
Method 2: An attention mechanism module is incorporated after the C3 module in the backbone network of YOLOv5s. The revised structure of the backbone network is illustrated in Figure 13a.
Within the backbone network, the image undergoes four rounds of processing through the C3 module before reaching the final SPPF layer. The C3 module is composed of three convolutional (Conv) modules and one bottleneck module. The structure of the C3 module is illustrated in Figure 13b. Each of the three Conv modules in the C3 module employs a 1 × 1 convolution operation, which is responsible for either dimensionality reduction or expansion. The bottleneck module, implemented with residual connections in the backbone network, consists of two Conv modules. The first Conv module performs a 1 × 1 convolution operation to reduce the channel dimension by half, while the second Conv module performs a 3 × 3 convolution operation to double the number of channels.
Performing dimensionality reduction initially facilitates improved understanding of feature information using convolutional kernels, while dimensionality expansion enables the extraction of more comprehensive and detailed features. The adoption of residual connections in the bottleneck module addresses the issue of gradient vanishing by adding the input to the output. By incorporating an attention mechanism in the C3 module, attention weights can be introduced at the preliminary stage of feature extraction. Given that the backbone network comprises four instances of the C3 module, the attention module can exert its influence four times. Moreover, the preceding attention feature maps are adjusted using subsequent attention weights. The introduction of an attention mechanism after the C3 module effectively enhances the extraction of feature maps at both shallow and deep levels.
3. Results and Analysis
3.1. Experimental Parameter Settings and Evaluation Metrics
3.1.1. Experimental Platform and Model Parameter Settings
The model construction, training, and testing were performed on a cloud server running the Linux operating system. The network model was trained and tested using Python 3.7 and PyTorch 1.9.1. The experiments used a GeForce RTX 2080 Ti GPU with a memory capacity of 11,019 MiB. The training images were resized to 256 × 256 pixels, and the batch size was set to 16. Training ran for 300 epochs, with an initial learning rate of 0.01 and a cosine annealing decay strategy to update the learning rate. The model was trained to classify objects into four categories.
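A cosine annealing schedule of the kind mentioned above can be sketched as follows. The initial learning rate of 0.01 and the 300-step horizon come from the settings described here; the final learning rate of 0.0001 and the function name are assumptions, and the exact schedule used by the training framework may differ in detail.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_initial=0.01, lr_final=0.0001):
    """Learning rate at `step` of a cosine decay from lr_initial to lr_final."""
    progress = step / total_steps
    return lr_final + 0.5 * (lr_initial - lr_final) * (1.0 + math.cos(math.pi * progress))

# One value per training round, from round 0 to round 300 inclusive.
schedule = [cosine_annealing_lr(s, 300) for s in range(301)]
```

The rate starts at the initial value, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches the assumed final value.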
3.1.2. Evaluation Metrics for Experimental Results
Precision, recall, mean average precision (mAP), and F1 score were employed as evaluation metrics for the detection model in this study. The F1 score is a composite metric that comprehensively evaluates precision and recall. A higher F1 score indicates that the model achieves a better balance between precision and recall. The formulas for calculating mAP, precision, recall, and F1 score are presented in Equations (5)–(8), respectively.
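In their standard form, which is assumed here, these four metrics can be written as follows, where N (the number of classes) and AP_i (the average precision of class i) are symbols introduced for this sketch:

```latex
\begin{align}
\mathrm{mAP} &= \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \tag{5}\\
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{6}\\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{7}\\
F1 &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}
\end{align}
```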
In the above equations, TP represents the number of disease targets correctly detected, FP represents the number of targets falsely identified as disease by the network, and FN represents the number of disease targets missed.
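As a small worked example, precision, recall, and F1 score can be computed from these counts using their standard count-based definitions, which are assumed here. The counts passed to the function are hypothetical and are not results from this study.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 correct detections, 10 false alarms, 20 missed lesions.
precision, recall, f1 = detection_metrics(tp=80, fp=10, fn=20)
```

With these counts, precision is 80/90, recall is 80/100, and the F1 score lies between the two, closer to the lower value.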
3.2. Comparison and Analysis of Models Trained on Different Datasets
In this study, a portion of the maize leaf dataset was generated using the CycleGAN. To evaluate the effectiveness of the generated data for compound disease detection, Dataset 1 (original images, as shown in
Table 1) and Dataset 2 (final images after incorporating CycleGAN-generated images, as shown in
Table 1) were used to train and test the YOLOv5s baseline model under identical conditions. The experimental results are presented in
Table 2.
The models trained using the two datasets were applied to detect diseases in the same set of four compound disease images of maize leaves. The resulting detection images are illustrated in
Figure 14.
In
Figure 14, the red bounding box contains lesions of northern leaf blight, the pink bounding box encompasses lesions of rust disease, and the orange-yellow bounding box encloses lesions of gray leaf spot. The labels ‘B’, ‘R’, and ‘GL’ correspond to ‘Blight’, ‘Rust Disease’, and ‘Gray Leaf Spot’, respectively. Based on
Figure 14, it can be observed that the disease detection model trained on Dataset 2 demonstrates higher comprehensive recognition accuracy and generalization ability for compound diseases in maize leaves compared to the model trained on Dataset 1.
3.3. Comparison and Analysis of Different Improved Models
Six models were obtained by inserting the SE, CBAM, and CA attention mechanisms at each of the two positions (after the C3 module and before the SPPF layer). All six were trained on the same platform, using the same experimental framework and identical training parameters. The experimental results are presented in
Table 3.
Upon observing the evaluation metric mAP_0.5 in the fourth column of
Table 3, it can be noted that all the improved models show varying degrees of enhancement in detection accuracy compared to the original YOLOv5s network model. Among them, the YOLOv5s-C3CBAM model, which incorporates the CBAM after the C3 module in the backbone network, achieves the highest recognition accuracy, with an mAP_0.5 value of 83%. This represents a 3.1 percentage point increase over the original YOLOv5s network model. The primary reason for this improvement is likely that the CBAM considers both channel and spatial information in the image, providing more comprehensive attention than channel-only or spatial-only mechanisms. Moreover, the C3 module appears multiple times in the backbone network and serves as the primary feature extraction module, so adding the CBAM after it strengthens the feature extraction capability of the backbone network. Consequently, the YOLOv5s-C3CBAM model exhibits superior detection accuracy compared to the other improved models.
Furthermore, recall is a crucial metric that assesses how completely a model detects the targets. A high recall indicates that the model captures most true positives and thus misses few targets; the level of recall therefore reflects the extent of missed detections in the object detection model. In the fifth column of
Table 3, the recall values of each model are displayed, showing that all the improved models exhibit varying degrees of improvement in recall compared to the original YOLOv5s network model. Among them, the YOLOv5s-C3CA model achieves the highest recall, with a value of 79.3%, indicating favorable outcomes in the detection of small objects and the minimization of false negatives. The CA mechanism considers both the interrelationship between different positions in the input and their absolute positional information. Therefore, incorporating the CA mechanism after the C3 module allows the model to extract more comprehensive features during feature extraction, reducing the possibility of missed detections and improving recall. The sixth column of
Table 3 shows the F1 scores of each model. With the exception of the two improved models incorporating the SE attention mechanism, which show a slight decrease in F1 score, the remaining four improved models demonstrate varying degrees of increase compared to the original model. The highest F1 score is achieved by the YOLOv5s-CBAM model, representing a 0.99 percentage point improvement over the original model. The F1 scores of the YOLOv5s-C3CBAM and YOLOv5s-C3CA models are comparable, and both rank second. The last column in
Table 3 reflects the precision of each model. Among all the improved models, the YOLOv5s-CBAM model achieves the highest precision, followed by the YOLOv5s-C3CBAM model.
The experimental results demonstrate that the proposed improved models enhance the overall performance of the detection model on most metrics. Considering the research objective of improving the accuracy of maize leaf compound disease recognition, improving the mean average precision (mAP_0.5) over the baseline model is the primary goal. It is also important to increase recall as much as possible to reduce missed detections, while the F1 score reflects the combined improvement in precision and recall. Among all the improved models, the YOLOv5s-C3CBAM model achieves the highest mAP_0.5 of 83%, a 3.1 percentage point improvement over the original model. Furthermore, its recall and F1 score rank second best, with values of 75.7% and 81.98%, respectively, indicating improvements of 2.3% and 0.8% compared to the original model. Therefore, among the six improved models analyzed, the YOLOv5s-C3CBAM model is selected as the optimal one.
3.4. Analysis of Accuracy of the Optimal Improved Models for Different Leaf Diseases
To further analyze the recognition accuracy of the optimal improved model for different diseases, we conducted an analysis of the confusion matrix for the classification predictions of the YOLOv5s-C3CBAM model. The confusion matrix is presented in
Figure 15, and the recognition accuracy for different diseases is shown in
Table 4.
From
Figure 15 and
Table 4, it can be observed that the average recognition accuracy of the improved model is 83%. Among the four leaf categories, the recognition performance for healthy leaves is the best, with an accuracy of 98.8% and a recall of 95%. The recognition accuracies for gray leaf spot disease, rust disease, and blight are 61.7%, 79.6%, and 92%, respectively. The confusion matrix reveals that there is confusion between gray leaf spot disease and blight, which may be attributed to the similarity in color between the lesions of these two diseases. Additionally, the lesions of gray leaf spot disease are small, making it difficult to extract their contour features, which could potentially lead to misclassification by the model. Consequently, it is evident that there is still room for improvement in the classification of gray leaf spot disease and blight in the improved model.
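The reported average can be checked directly from the four per-class accuracies quoted above. The sketch below assumes the average is an unweighted mean over the four leaf categories; the dictionary keys are labels chosen for this example.

```python
# Per-class recognition accuracies (%) as quoted in the text.
per_class_accuracy = {
    "healthy": 98.8,
    "gray leaf spot": 61.7,
    "rust": 79.6,
    "blight": 92.0,
}

# Unweighted mean over the four categories (assumed averaging scheme).
mean_accuracy = sum(per_class_accuracy.values()) / len(per_class_accuracy)

# Category with the lowest accuracy, i.e., the main target for improvement.
weakest_class = min(per_class_accuracy, key=per_class_accuracy.get)
```

The unweighted mean is approximately 83.0%, consistent with the reported average, and gray leaf spot is the category that pulls the average down the most.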
Using the YOLOv5s-C3CBAM model, disease detection was performed on three typical compound maize leaf disease images, containing “blight and rust disease,” “blight and gray leaf spot disease,” and “blight, rust disease, and gray leaf spot disease,” respectively. The final detection results are illustrated in
Figure 16 (the colors of the bounding boxes and the disease types represented by the labels in
Figure 16 are consistent with those in
Figure 14). From the results, it is evident that YOLOv5s-C3CBAM exhibits effective detection of blight and gray leaf spot diseases. However, its performance is less satisfactory in detecting rust disease when the lesions are small and scattered.
3.5. Comparison and Analysis of the Optimal Improved Model with Other Models
The optimal improved model proposed in this study, YOLOv5s-C3CBAM, was compared with YOLOv5m (a larger variant of the original YOLOv5), YOLOv7-tiny (a lightweight variant of the newer YOLOv7), and Faster R-CNN (a widely used two-stage object detection model). The evaluation results of the different models are presented in
Table 5.
The models discussed in this subsection were trained in the experimental environment mentioned in
Section 3.1. The YOLOv5m, YOLOv7-tiny, and Faster R-CNN models were all trained for 300 epochs. Subsequently, the trained models were uniformly evaluated on the independent validation set mentioned in
Section 2.3.
From the analysis of
Table 5, it can be observed that the YOLOv5s-C3CBAM model, the optimal improved model, performs the best in terms of the mAP_0.5 evaluation metric, with a value of 83%, whereas Faster R-CNN exhibits the poorest performance. This could be attributed to the small size of lesions in maize leaf diseases such as gray leaf spot and rust, which makes it challenging to extract distinctive features. Additionally, Faster R-CNN, being a larger model, may suffer from limited training data, and its performance is further hindered by the high similarity between lesions of blight and gray leaf spot in compound disease images. Consequently, the overall recognition accuracy of the Faster R-CNN model is lower. The YOLOv5m and YOLOv7-tiny models, which belong to the same YOLO family, also perform worse than the YOLOv5s-C3CBAM model, which may be attributed to the absence of an attention mechanism module.
In terms of the recall evaluation metric, the YOLOv5s-C3CBAM model performs the best, with a value of 75.7%. This can be attributed to the addition of the CBAM after the C3 module, which enhances the feature extraction capability of the model’s backbone network, reduces missed detections, and thereby raises recall. The other three models exhibit moderate performance: Faster R-CNN has a recall of 71%, and YOLOv5m and YOLOv7-tiny have recalls of 75.2% and 73%, respectively.
Regarding the F1 score, the YOLOv5s-C3CBAM model achieves the best performance, with a value of 81.98%, whereas the Faster R-CNN model performs the worst. This can be attributed to the poor feature extraction capability of the Faster R-CNN model for rust disease and gray leaf spot, which lowers its overall precision and, in turn, its F1 score.
Among the four different detection models, YOLOv7-tiny and YOLOv5s-C3CBAM models are smaller in size, measuring 11.7 MB and 12.6 MB, respectively. In contrast, Faster R-CNN and YOLOv5m models are larger, with sizes of 108 MB and 42.2 MB, respectively.
In the comparison of the four models, YOLOv7-tiny exhibits the highest FPS, primarily due to its lightweight network architecture. YOLOv5s-C3CBAM achieves an FPS of 53; compared to YOLOv5m and Faster R-CNN, its integration of the lightweight CBAM does not significantly increase the parameter count or computational overhead, striking a favorable balance between detection speed and model size. The greater depth and width of YOLOv5m result in decreased detection speed. Meanwhile, Faster R-CNN, employing a two-stage detection approach with a deeper network, exhibits the slowest detection speed and the largest model size.
In summary, compared to the YOLO series models (YOLOv5m, YOLOv7-tiny) and the two-stage detection model Faster R-CNN, the optimal improved model YOLOv5s-C3CBAM demonstrates the best overall performance. It is most suitable for completing the detection task in this study. Therefore, YOLOv5s-C3CBAM was chosen as the final detection model in this research.
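The trade-off between recall and model size can be made explicit with a small sketch built only from the recall and size figures quoted in this subsection; mAP_0.5 and FPS are left out because the text does not quote them for all four models. The dominance check is an illustrative device introduced here, not part of the evaluation protocol of this study.

```python
# Recall (%) and model size (MB) as quoted in the text of this subsection.
models = {
    "YOLOv5s-C3CBAM": {"recall": 75.7, "size_mb": 12.6},
    "YOLOv5m": {"recall": 75.2, "size_mb": 42.2},
    "YOLOv7-tiny": {"recall": 73.0, "size_mb": 11.7},
    "Faster R-CNN": {"recall": 71.0, "size_mb": 108.0},
}

def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a["recall"] >= b["recall"] and a["size_mb"] <= b["size_mb"]
    better = a["recall"] > b["recall"] or a["size_mb"] < b["size_mb"]
    return no_worse and better

# Models that no other model beats on both recall and size at once.
non_dominated = [
    name for name, m in models.items()
    if not any(dominates(other, m)
               for other_name, other in models.items() if other_name != name)
]
```

On these two axes only YOLOv5s-C3CBAM and YOLOv7-tiny remain: the former has the higher recall, while the latter is slightly smaller.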
4. Discussion
4.1. Selection of the YOLOv5s Model
Compared to YOLOv4, YOLOv5 features a smaller model size, faster computational speed, and lower memory footprint. It is the first version in the series developed using the PyTorch framework, which facilitates use, training, and deployment. Additionally, YOLOv5 employs data augmentation techniques such as CutMix and Mosaic, enhancing detection performance for small targets and addressing imbalanced category distributions.
YOLOv7 introduces a cascaded model scaling strategy, generating models of varying sizes to reduce parameter and computational complexity, which makes it particularly suitable for real-time object detection. However, relative to YOLOv5, its network architecture is more intricate, demanding greater computational resources and exhibiting inferior performance in detecting small targets and dense scenes. YOLOv8 enhances overall performance through network structure optimization, delivering higher accuracy and greater flexibility for diverse engineering applications. Nevertheless, due to its complexity, it requires substantial computational resources and time for detection.
In the context of this study, the corn leaf disease targets are small, and the improved model is intended for subsequent real-time detection on resource-constrained handheld devices. YOLOv5s was therefore chosen from the YOLOv5 series as the foundational model, as it strikes a balance between detection accuracy, speed, and computational resource requirements.
4.2. Selection of GAN Models
The primary focus of this research is on compound diseases in maize leaves. Compound disease images are characterized by overlapping and intertwined lesions. To generate compound disease images, it is necessary to merge one or two types of lesions onto a leaf where another type of lesion is present, creating a composite effect. This requires a GAN model capable of feature fusion. Various improved GAN models exist, such as the DCGAN, InfoGAN, CycleGAN, and WGAN. Among them, the DCGAN is a generative adversarial network based on deep convolutional neural networks and is used for image generation. However, experiments have shown that the DCGAN can generate only a single class of maize leaf disease images per training run. Moreover, because the model is trained with a single type of training data, the quality of the disease patterns in the generated maize leaf images is unstable. To ensure data enrichment and balanced data sources in subsequent disease detection model training, it is necessary to generate images of three single disease types and healthy leaves, and the DCGAN’s unidirectional output limits its efficiency for this task. The CycleGAN, in contrast, is capable of bidirectional mapping, transforming images from one domain to another and vice versa. It can fuse image features, and because it takes two types of training data as input, the pathological characteristics are better preserved in the generated maize leaf disease images, resulting in higher stability. Additionally, the cyclic generation characteristic of the CycleGAN enables bidirectional image output, allowing two types of maize leaf images to be generated simultaneously and improving image generation efficiency. Therefore, in this study, the CycleGAN was chosen to generate images of compound diseases and other disease patterns in maize leaves.
4.3. Image Generation Experiment
During the disease image generation experiments, there were instances where the lesions did not align with natural growth patterns. For example, blight lesions did not follow the direction of the leaf veins but grew perpendicular to them. The primary reason for this issue was the inclusion of training images with large rotation angles. To address this problem, the training set was preprocessed to remove images with large rotation angles, thereby reducing their impact on the generated images. Additionally, the generated images were screened by comparing them with naturally occurring real images and removing any inaccurately generated ones. In the experiment on generating gray leaf spot images, initial attempts using the same number of model iterations as for blight and rust disease resulted in poor image quality. To improve the results, the experiment was redesigned with an increased number of model iterations. Ultimately, the generated images exhibited improvements in clarity and lesion distribution, meeting the requirements for training purposes. The experiment on generating compound diseases drew on the experience gained from generating single-disease images (blight and gray leaf spot): the training set was preprocessed, and the number of iterations was increased to 200, after which the experiment was completed successfully.
4.4. Innovation and Limitations
This study presents two primary innovations: (1) The utilization of the CycleGAN for generating compound disease images in maize leaves addresses the issue of limited availability of data for compound disease, which is insufficient to support the data requirements of large-scale deep learning. (2) The incorporation of attention mechanisms enhances the network model’s focus on the lesion targets, mitigating the interference caused by compound diseases and improving the accuracy of disease recognition.
However, there are certain limitations in this study. The classification of maize leaf diseases is not sufficiently detailed. The current research only focuses on the recognition of three common diseases in maize leaves. In real-world field conditions, the classification of maize leaf diseases is more refined, such as distinguishing between common rust, southern rust, tropical rust, and stalk rust. Furthermore, the fine annotation of maize leaf disease data requires a considerable amount of time and effort. Future research will aim to optimize the construction of a fine-grained maize leaf disease dataset, perform fine-grained disease annotation for multiple categories of maize leaf data, and continue training the model to enhance its capability for fine-grained detection of compound diseases in maize leaves.
4.5. Image Resolution Selection
In this study, we adopted an image resolution of 256 × 256 pixels, resulting in an image size of approximately 20 KB. This decision was based on two considerations. Firstly, because the raw images we acquired varied in size, we resized them to a uniform size before use. Many studies employ a size of 256 × 256 pixels, as higher resolutions increase the training burden and slow down training, while lower resolutions may prevent the model from fully learning the target features, impairing algorithmic performance. The choice of 256 × 256 pixels strikes a balance between training load and the effective learning of lesion features.
Secondly, from the perspective of unmanned aerial vehicle (UAV) usage, we conducted prior UAV-based research and found that images obtained using UAVs are more suitable for detecting the whole plant or the field than individual leaves. Although UAV-captured images have higher pixel counts, they do not resolve the mutual occlusion among leaves. In contrast, single-leaf images, despite their lower resolution, are better suited to capturing lesion features on individual leaves and demonstrate superior performance. This decision balances image resolution against the detailed observation of the research subject, providing a more practical and feasible data foundation for this study.
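The effect of resolution on training load can be illustrated with a rough calculation. The sketch below assumes that the per-image cost of convolutional layers grows roughly in proportion to the number of pixels and that images are stored as 8-bit RGB; the alternative side lengths (128, 512, 640) are hypothetical comparison points, not settings evaluated in this study.

```python
def pixel_count(side):
    """Number of pixels in a square image with the given side length."""
    return side * side

base = pixel_count(256)

# Pixel count relative to the 256 x 256 setting (a rough proxy for per-image cost).
relative_cost = {side: pixel_count(side) / base for side in (128, 256, 512, 640)}

# Uncompressed size of one 256 x 256 image, assuming 3 channels of 8 bits each.
raw_rgb_bytes = base * 3
```

Doubling the side length quadruples the pixel count, which is why higher resolutions quickly increase the training burden. The uncompressed buffer is also considerably larger than 20 KB, so that figure presumably refers to compressed image files.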
5. Conclusions
Currently, research on maize leaf disease detection primarily focuses on a single leaf with a single disease, with limited attention given to the detection of a compound disease on a single leaf. However, the presence of compound diseases in maize leaves poses challenges to traditional deep learning algorithms for disease detection, leading to lower accuracy. Furthermore, there are insufficient data available on compound diseases in maize leaves, making it difficult to fulfill the data requirements for large-scale deep learning. To address these issues, this study employed the CycleGAN to generate synthetic data of compound diseases in maize leaves, thereby enriching the dataset. Additionally, an approach based on attention mechanisms was proposed for the recognition of compound diseases in maize leaves.
In this study, the CycleGAN was utilized to learn the features of existing compound disease images, enabling images of healthy leaves and of leaves with a single disease (rust, blight, or gray leaf spot) to be transformed into compound disease images of maize leaves through the cyclic generation property of the CycleGAN. This approach enriched the dataset and laid a solid foundation for model training. Six improved models based on attention mechanisms were proposed, with YOLOv5s-C3CBAM identified as the optimal improved model. This model incorporates the CBAM after the C3 module, enhancing the feature extraction capability of the original model’s backbone network and improving the overall detection performance. Compared to the other five improved models, YOLOv5s-C3CBAM achieved the highest mAP_0.5 of 83%, demonstrating superior comprehensive detection capability in real field conditions. It also achieved a recall of 75.7%, the second highest among the improved models and the highest among the models compared in Section 3.5, indicating a low probability of missed detections. Moreover, the model size of YOLOv5s-C3CBAM was 12.6 MB, smaller than the other five improved models, providing a significant advantage in terms of subsequent device deployment.
The optimal improved model YOLOv5s-C3CBAM exhibits higher accuracy and stronger generalization capability than the baseline YOLOv5s model, resulting in better practical performance in real field conditions. In comparison to the YOLOv7-tiny model, another model in the YOLO series, YOLOv5s-C3CBAM significantly enhances disease recognition performance without substantially increasing the model size. Furthermore, compared to the traditional large model Faster R-CNN, YOLOv5s-C3CBAM not only demonstrates superior accuracy but also has a much smaller model size, making it more promising for practical applications. The comprehensive performance of the proposed network model surpasses that of the commonly used object detection algorithms compared in this study, indicating its high practical value. It can serve as a reference for future research on accurate identification of compound crop diseases in real field conditions.