1. Introduction
In modern agricultural production, cotton, as a vital raw material for the global textile industry, plays a crucial role in ensuring the supply of textiles [
1]. Unfortunately, the growth of cotton is frequently plagued by various pests and diseases, which not only threaten the yield and quality of the crop, but also potentially incur economic losses for farmers [
2]. Traditional methods of pest and disease identification, mostly reliant on farmers’ experience and manual inspection, are time-consuming and labor-intensive. Moreover, their accuracy heavily depends on the individual skills of farmers, making these methods unsuitable for large-scale modern agriculture [
3].
With the rapid development of information technology, particularly breakthroughs in deep learning and computer vision, new possibilities for intelligent detection of agricultural pests and diseases have emerged [
4,
5,
6]. Advanced techniques, such as deep learning, enable automatic analysis of field images of cotton, accurately identifying types and locations of pests and diseases [
7], thereby improving the efficiency and accuracy of pest and disease management [
8]. Particularly in large-scale farm management, this approach significantly reduces manual labor, enhancing the speed and precision of pest and disease identification [
9].
A model for detecting and classifying cotton plant diseases based on convolutional neural networks (CNNs) was developed and analyzed by Suriya et al., utilizing multiple convolutional and max pooling layers to extract features from images of cotton plants [
10]. However, the accuracy of their model was not notably high. In contrast, Zambare et al. proposed a deep learning model based on CNN, achieving a detection and classification accuracy of 99.38% for various diseases in cotton plant images [
11]. Yet, the performance of their model in complex field scenarios could not be guaranteed. Addressing this, Rai et al. proposed an improved Deep Convolutional Neural Network (DCNN) model for identifying and predicting different disease states in cotton plant images collected from real environments [
12]. These researchers utilized deep learning technologies, particularly models based on CNN, to enhance the accuracy of detection and classification of diseases in cotton plants. Their research not only improved the accuracy of cotton disease detection but also provided effective technological means for the early diagnosis and prevention of agricultural diseases.
Pankaj et al. introduced a model for predicting cotton diseases based on Internet of Things (IoT) hardware sensors and CNN algorithms, assisting farmers in identifying cotton diseases and recommending appropriate pesticides through a mobile application [
13]. Additionally, Shao et al. [
14] proposed a model for identifying cotton leaf diseases, enhanced by a bilinear coordinate attention module. This model, operating in natural environments, precisely locates and identifies diseased areas on cotton leaves. It integrates spatial coordinate information with features by embedding them into the feature map through the bilinear coordinate attention mechanism, reducing the loss of feature information and focusing more on diseased leaf areas while minimizing attention to redundant information, such as healthy areas. Although significant progress in cotton disease recognition was made by the teams of Pankaj and Shao, demonstrating the vast potential of utilizing IoT and deep learning technologies in practical applications, their models required substantial computational resources. Therefore, further improvements and optimizations of these technologies are crucial.
Existing pest and disease recognition technologies must overcome a series of challenges to achieve widespread application in the agricultural sector [
15]. On one hand, due to the diversity of cotton pests and diseases and the fact that some have minute appearance features or resemble natural surroundings, accurate identification becomes increasingly difficult. On the other hand, complex and variable field environments, such as lighting conditions and background noise, can affect the effectiveness of pest and disease recognition. Furthermore, most existing recognition models usually require substantial computational resources, making them unsuitable for applications on mobile devices or edge computing devices [
16,
17].
In response to these challenges, this study presents an innovative solution combining deep learning, knowledge graphs, and edge computing. Firstly, deep learning models, particularly those utilizing Transformer technology [
18], are effectively able to process complex image data, thereby enhancing the accuracy of pest and disease detection. Transformer technology, originally designed for natural language processing tasks, is centered around the self-attention mechanism. This mechanism enables the model to process all elements in the input sequence simultaneously and compute the relationships among them. In the context of image processing, this implies that the model not only focuses on local features, as traditional convolutional neural networks do, but also captures and analyzes global relationships within the image. Such a global perspective allows Transformers to more effectively recognize complex patterns in images, such as subtle changes indicative of pests and diseases. When analyzing images of cotton plant diseases, Transformer models are capable of simultaneously noting different parts of the leaves and understanding the interrelationships among these parts, which is particularly important for disease feature recognition. Additionally, Transformer models typically possess deeper network structures, further enhancing their feature extraction and recognition capabilities. Compared to traditional deep learning models, Transformers provide a higher level of abstraction, making them more effective in processing image data with complex structures and patterns, thus achieving higher accuracy and robustness in pest and disease detection applications.
Secondly, by constructing a knowledge graph specifically for cotton pests and diseases, the integration of domain experts’ knowledge with deep learning models further improves the recognition capabilities and accuracy. The rich information contained in the knowledge graph assists the model in making more precise judgments during the identification process.
Additionally, this study considers the practical application scenarios of the model, especially in resource-constrained mobile or edge computing environments. By optimizing the model structure and computational strategy, the model is enabled to run efficiently on these devices, achieving rapid pest and disease detection. This not only enhances the timeliness of pest and disease management but also provides significant technical support for the intelligent automation of agricultural production.
3. Materials and Methods
3.1. Dataset Collection
In the research presented in this paper, to ensure the accuracy and practicality of the intelligent model for cotton pest and disease identification, datasets from diversified sources were selected for experiments. These datasets were primarily acquired through manual collection and internet crawling techniques, as shown in
Table 1 and
Figure 1.
By this method, a large and varied collection of cotton pest and disease images was amassed. These images not only cover different types of pests and diseases, but also include field images of cotton in various environments. The rationale for this approach is the abundance of publicly available resources related to cotton pests and diseases on the internet, including professional agricultural websites, forums, and social media platforms. Utilizing web crawling techniques facilitated the efficient collection of these image data, providing an ample sample pool for model training.
3.2. Dataset Annotation
The annotation of the dataset constitutes a critical step in constructing an effective model for cotton pest identification. Initially, a team comprising agricultural experts and data scientists was organized to manually annotate the collected images. During the annotation process, not only were the specific locations of pest damage in the images marked but detailed descriptions of pest types were also provided, such as names and characteristics of the pests. To enhance the accuracy and efficiency of annotation, a semi-automated method was employed. Specifically, a simple pretrained model was first used for preliminary pest detection and annotation in the images. Subsequently, these automated annotations were reviewed and corrected by the expert team, as shown in
Figure 2.
Common image processing techniques were utilized to assist in the annotation process, such as edge detection and color segmentation for identifying pest areas in the images. The basic principle of edge detection can be represented by the following mathematical formula:

$$G(x, y) = \sqrt{G_x^2 + G_y^2}$$

where $G(x, y)$ represents the gradient intensity of the image at point $(x, y)$, and $G_x$ and $G_y$ are the gradients at that point in the $x$ and $y$ directions, respectively. Calculating the image’s gradient effectively detects the edges of pest areas, thus assisting the annotation process.
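As an illustration of this step, the following sketch computes the gradient magnitude with Sobel operators to highlight candidate pest regions for annotators. It assumes OpenCV and NumPy are available; the file name and threshold are hypothetical choices, not the settings used in the actual annotation pipeline.

```python
import cv2
import numpy as np

def gradient_edge_map(image_path: str, threshold: float = 60.0) -> np.ndarray:
    """Return a binary edge map highlighting candidate pest regions."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Gradients in the x and y directions (G_x, G_y)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    # Gradient magnitude G(x, y) = sqrt(G_x^2 + G_y^2)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return (magnitude > threshold).astype(np.uint8) * 255

edges = gradient_edge_map("cotton_leaf.jpg")  # hypothetical file name
cv2.imwrite("cotton_leaf_edges.png", edges)
```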
3.3. Dataset Augmentation
In the field of deep learning and computer vision, particularly in image recognition tasks, the quality and diversity of a dataset are key factors determining the effectiveness of model training and final performance. The diversity of a dataset encompasses not only the variation and complexity of images, but also the presentation of samples under different conditions, all of which directly impact the model’s generalization capability and practicality. To enhance the accuracy and robustness of the model in the specific task of cotton pest identification, dataset augmentation is employed as a vital technical strategy. The primary aim of data augmentation is to create more training samples artificially by applying various transformations and processing methods to the original dataset, as shown in
Table 1 and
Table 2.
These methods include, but are not limited to, geometric transformations, color adjustments, random cropping, and noise addition, significantly increasing the dataset’s diversity and size. This increased diversity is crucial for improving the model’s generalization ability, especially in the application scenario of cotton pest identification, where the model needs to accurately identify pest images under different lighting conditions, angles, and backgrounds. In this paper, three advanced data augmentation techniques, namely Cutout, Cutmix, and Mixup, are employed. Each technique possesses unique advantages and applicable scenarios, enhancing the model’s depth of understanding and flexibility in image interpretation.
3.3.1. Cutout
The fundamental principle of the Cutout method [
42] is to randomly select an area in the image and set the pixel values of that area to zero or other specific values. This simple yet effective strategy enhances the model’s learning of key features and reduces reliance on non-essential parts. For instance, in the scenario of cotton pest identification, using Cutout prevents the model from overly depending on certain specific pest features, such as a particular shape or color, and instead teaches it to identify pests from a more comprehensive perspective. The operation of the Cutout method is relatively straightforward. First, the size and shape of the area to be covered need to be determined. In most cases, the area is rectangular and its dimensions are set based on the specific requirements of the experiment. Once the dimensions of the covered area are determined, the next step is to randomly select a point in the image as the center of this area, as shown in
Figure 3A. The randomness of this process is key to ensuring the diversity of data augmentation. The pixels within the selected area are then set to zero or other specified values, creating a “blank” area in the image. This process is repeatedly applied to different images in the dataset, thereby enhancing the entire dataset. Through these steps, the Cutout method effectively increases the diversity of the dataset and the generalization ability of the model. In the application of cotton pest identification, this method is particularly helpful in improving the model’s ability to recognize partially obscured or incomplete pest images, thus enhancing overall identification performance.
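A minimal NumPy sketch of this operation is shown below; the square mask size and fill value are illustrative choices rather than the exact settings used in the experiments.

```python
import numpy as np

def cutout(image: np.ndarray, mask_size: int = 50, fill_value: int = 0) -> np.ndarray:
    """Zero out a randomly positioned square region of an H x W x C image."""
    h, w = image.shape[:2]
    # Randomly select the center of the masked area
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = max(0, cy - mask_size // 2), min(h, cy + mask_size // 2)
    x1, x2 = max(0, cx - mask_size // 2), min(w, cx + mask_size // 2)
    augmented = image.copy()
    augmented[y1:y2, x1:x2] = fill_value  # create the "blank" area
    return augmented
```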
3.3.2. Cutmix
The fundamental principle of the Cutmix [
43] method involves cropping a region from one image and pasting it onto a corresponding location in another image. This method not only enhances the model’s ability to learn local features but also increases its robustness across category boundaries. In operation, the cropping area chosen by Cutmix is typically rectangular, with its size and location randomly determined. This region-level image blending approach, compared to traditional pixel-level blending, more effectively preserves the structural information of images while introducing additional background and contextual information. The mathematical representation of the Cutmix method can be simplified as follows:

$$\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B$$

Here, $x_A$ and $x_B$ are two distinct training images, and $M$ is a binary mask matrix of the same size as the images, indicating the region cropped from image $x_A$, while $(1 - M)$ indicates the corresponding area retained in image $x_B$. The symbol $\odot$ denotes element-wise multiplication, and the mixing ratio $\lambda$ is determined by the proportion of the cropped area. The operational steps of Cutmix are relatively straightforward. Firstly, two training images, $x_A$ and $x_B$, are randomly chosen, along with a random position as the center of the cropping area, as shown in Figure 3B. Subsequently, a rectangular area is cropped from image $x_A$ based on preset size parameters, and this area is pasted onto the corresponding position in image $x_B$. Finally, the newly created image $\tilde{x}$ serves as the mixed result, with its label also being proportionally mixed based on the size of the cropping area. In the application of cotton pest identification, the Cutmix method effectively enhances the model’s understanding of the complex background and the relationships between different types of pests. This data augmentation technique not only allows the model to learn more robust feature representations but also shows improved performance in dealing with diverse and complex pest images.
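The following NumPy sketch illustrates this blending; the image shapes and the way the crop size is derived from the requested mixing ratio are simplified assumptions, and the paper's exact crop-size parameters are not reproduced here.

```python
import numpy as np

def cutmix(img_a: np.ndarray, img_b: np.ndarray,
           label_a: np.ndarray, label_b: np.ndarray, lam: float = 0.5):
    """Paste a random rectangle of img_a onto img_b; mix labels by area ratio."""
    h, w = img_a.shape[:2]
    # Rectangle size derived from the requested mixing ratio lam
    cut_h, cut_w = int(h * np.sqrt(lam)), int(w * np.sqrt(lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = img_b.copy()
    mixed[y1:y2, x1:x2] = img_a[y1:y2, x1:x2]      # region taken from image A
    area = (y2 - y1) * (x2 - x1) / float(h * w)    # actual mixing ratio
    mixed_label = area * label_a + (1.0 - area) * label_b
    return mixed, mixed_label
```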
3.3.3. Mixup
The core idea of Mixup [
44] is to blend two images at the pixel level, along with a proportional blending of their corresponding labels. The advantage of this method lies in its ability to enable the model to learn smoother decision boundaries during the training process, thereby improving its adaptability and robustness to variations in input data. Compared to traditional data augmentation methods, Mixup not only offers a new perspective to increase data diversity but also helps mitigate the problem of overfitting, especially in cases of smaller dataset sizes. The Mixup method can be mathematically described as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

Here, $x_i$ and $x_j$ are two distinct images, and $y_i$ and $y_j$ are their respective labels. The coefficient $\lambda$ is randomly drawn from the interval [0, 1]. This blending approach results in the generated image $\tilde{x}$ and label $\tilde{y}$ containing information from both original samples, thus increasing the diversity and complexity of the training data. In the application of this paper, the operation of Mixup is relatively simple and direct. First, two images and their corresponding labels are randomly selected from the training dataset. Then, these images and labels are linearly mixed according to a predetermined $\lambda$ value, as shown in Figure 3C. The newly blended images and labels are then used as input data for training the model. This process is repeated throughout the entire training set to ensure the effectiveness of data augmentation. In the application of cotton pest identification, the Mixup method effectively increases the diversity of the training data, enabling the model to exhibit stronger robustness and generalization capabilities when faced with images of various types of cotton pests.
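A corresponding NumPy sketch is given below, sampling $\lambda$ uniformly from [0, 1] as described above; the labels are assumed to be one-hot encoded vectors.

```python
import numpy as np

def mixup(img_i: np.ndarray, img_j: np.ndarray,
          label_i: np.ndarray, label_j: np.ndarray):
    """Pixel-level blend of two images and their (one-hot) labels."""
    lam = np.random.uniform(0.0, 1.0)          # coefficient lambda in [0, 1]
    mixed_img = lam * img_i + (1.0 - lam) * img_j
    mixed_label = lam * label_i + (1.0 - lam) * label_j
    return mixed_img.astype(img_i.dtype), mixed_label
```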
3.4. Proposed Methods
3.4.1. Transformer in Object Detection with Knowledge Graph
The focus of this study is the application of a method combining Transformer architecture with knowledge graph technology to enhance the accuracy and efficiency of cotton pest and disease detection. The Transformer model, primarily known for its success in the field of natural language processing, hinges on the self-attention mechanism. This mechanism enables the model to address long-distance dependencies in sequences, a crucial aspect for detecting pests and diseases in cotton images. However, a challenge faced by Transformers in processing images is the effective integration of domain-specific knowledge. Therefore, knowledge graph technology is employed to structurally integrate domain expertise into the model, enhancing its recognition capabilities and accuracy.
Network Structure Design:
Input Layer: The input layer receives raw cotton pest and disease image data. Images are first standardized to a fixed size and normalized through a preprocessing module.
Feature Extraction Layer: Features of the images are extracted using deep convolutional networks. This layer outputs a series of feature maps containing spatial information of the images.
Transformer Layer: Feature maps are fed into the Transformer model. Multiple self-attention mechanisms are set up in this layer, each capable of capturing correlations at different positions in the feature maps.
Knowledge Graph Integration Layer: Information from the knowledge graph is combined with the output of the Transformer layer.
This fusion can be expressed as $F_{\text{out}} = T(F) \oplus \phi(K)$, where $F$ denotes the input feature map, $K$ symbolizes the knowledge graph information, $T(F)$ is the output of the Transformer layer, $\phi(\cdot)$ converts the graph information into operational feature vectors, and $\oplus$ denotes the fusion operation. Professional knowledge is thereby converted into operational feature vectors and fused with the output of the Transformer layer.
Practical Application: In practical applications, a knowledge graph specific to cotton is used. The Transformer model in this study comprises several layers of blocks, each consisting of multiple distinct operators. These layers are stacked in a specific order to form a deep network structure.
Composition of the Block: Each Transformer block mainly consists of the following core components: joint-attention mechanism (as described above), normalization layer, feed-forward network, and residual connection.
Joint-Attention Mechanism: This is the core component of the Transformer model. In this operator, each “head” processes a different aspect of the input data. Assuming the model has N heads and each head processes data in dimension D, the total dimension of the operator is $N \times D$. Each head’s weights are controlled by a separate set of parameters, enabling the model to capture various feature correlations.
N (Number of Heads): This parameter defines the number of heads in the self-attention mechanism of the model. In this study, N is chosen as 16, meaning each Transformer block contains 16 independent attention “heads”. This design allows the model to analyze the input data from multiple perspectives simultaneously.
D (Dimension per Head): D defines the dimension each head processes. To maintain computational efficiency, a smaller value for D is typically chosen in this study, with D set to 64. This setting ensures that each head focuses on a specific set of features while keeping the overall computational complexity within a reasonable range.
Normalization Layer: A normalization layer follows each self-attention mechanism and feed-forward network to stabilize the model’s learning process. It typically employs Layer Normalization to ensure data maintains a relatively stable distribution while flowing through the network.
Feed-Forward Network: Each Transformer block also includes a feed-forward network, usually composed of two fully connected layers with a ReLU activation function in between. This network primarily functions to further nonlinearly transform the output of the self-attention layer.
Residual Connection: Each core operator in the Transformer uses residual connections. This means the input is directly added to the output of the operator before passing to the next level. Residual connections help to prevent the vanishing gradient problem in training deep networks.
Connection Method: In the Transformer, data first passes through the joint-attention mechanism, then through the normalization layer, followed by the feed-forward network, and finally through another normalization layer. Each step is accompanied by residual connections to facilitate information flow and deeper training of the network.
Network Size and Channel Quantity: In this study, the size and channel quantity of each layer of the Transformer block are optimized based on experiments and dataset characteristics. For example, initial layers might have fewer channels (32, 64) to reduce computational load, while deeper layers increase the number of channels (256, 512) to capture more complex features.
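As a concrete reference for this block structure, the PyTorch sketch below follows the ordering described above (attention, normalization, feed-forward network, second normalization, with residual connections) and the N = 16, D = 64 setting. It is a simplified stand-in for the actual implementation: nn.MultiheadAttention substitutes for the paper's joint-attention operator, and the feed-forward expansion factor is an assumption.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: attention -> norm -> feed-forward -> norm, each with a residual."""
    def __init__(self, num_heads: int = 16, dim_per_head: int = 64):
        super().__init__()
        embed_dim = num_heads * dim_per_head   # total operator dimension N x D = 1024
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(              # two fully connected layers with ReLU
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)       # self-attention over feature-map tokens
        x = self.norm1(x + attn_out)           # residual connection + normalization
        x = self.norm2(x + self.ffn(x))        # residual connection + normalization
        return x

# Example: a batch of 2 sequences of 196 feature-map tokens with 16 x 64 channels
tokens = torch.randn(2, 196, 16 * 64)
out = TransformerBlock()(tokens)
```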
3.4.2. Construction of Knowledge Graph in Agriculture
In this study, the construction of a knowledge graph plays a crucial role in enhancing the understanding and reasoning capabilities of the cotton pest and disease detection model. A knowledge graph is a semantic network that represents entities and their interrelations in a graphical form, effectively organizing and managing a vast amount of professional knowledge to provide rich background information and prior knowledge for deep learning models. The process of constructing an agricultural knowledge graph involves four main steps: data collection, entity recognition, relationship extraction, and knowledge integration. Initially, in the data collection phase, extensive data related to cotton pests and diseases were gathered, including academic papers, professional books, online databases, and agricultural reports. These data encompass information such as types, characteristics, causes, impacts, and control measures of cotton pests and diseases, providing a wealth of raw materials for constructing the knowledge graph. Subsequently, in the entity recognition phase, natural language processing models [
45] were employed to analyze the collected texts, identifying key entities related to cotton pests and diseases, such as names, symptoms, pathogens, and control agents. These entities form the foundation of the knowledge graph, representing core concepts in the field of cotton pest and disease management. Further, in the relationship extraction phase, interrelations between entities were analyzed, for example, the “manifests as” relationship between a certain pest and a specific symptom, or the “can be treated with” relationship between a control agent and a pest. Through extraction and classification of these relationships, a multi-layered, interconnected knowledge network is constructed, as shown in
Figure 4.
Finally, in the knowledge integration phase, the extracted entities and relationships were consolidated to form a structured knowledge graph. Graph database technology (Neo4j) [
46] was utilized to store and manage the knowledge graph, ensuring its efficient and stable operation. Additionally, data cleansing and quality assessment were conducted to ensure the accuracy and reliability of the information in the knowledge graph. As of now, the constructed knowledge graph and model have been internally tested by some researchers at the China Agricultural University [
47].
In the cotton pest and disease detection model, the application of the knowledge graph significantly enriches the model’s understanding capabilities. By integrating the knowledge graph with deep learning models, the model can extract features from images and utilize the professional knowledge in the knowledge graph for reasoning and judgment, thus showing higher accuracy and robustness in complex real-world applications. For instance, when the model detects a specific symptom in an image, it can quickly identify potential pests and diseases by querying the knowledge graph, and even recommend appropriate control measures. This combination of deep learning and knowledge graphs offers a new solution for smart agriculture.
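For illustration, a query of this kind can be issued against the Neo4j graph with the official Python driver. The node labels, relationship types (MANIFESTS_AS, TREATED_WITH), connection details, and symptom name below are hypothetical placeholders rather than the actual schema of the constructed graph.

```python
from neo4j import GraphDatabase

# Connection details are placeholders
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def pests_and_treatments_for_symptom(symptom_name: str):
    """Look up pests/diseases that manifest as a symptom, plus their control agents."""
    query = (
        "MATCH (p:PestOrDisease)-[:MANIFESTS_AS]->(s:Symptom {name: $symptom}) "
        "OPTIONAL MATCH (p)-[:TREATED_WITH]->(a:ControlAgent) "
        "RETURN p.name AS pest, collect(a.name) AS agents"
    )
    with driver.session() as session:
        return [record.data() for record in session.run(query, symptom=symptom_name)]

print(pests_and_treatments_for_symptom("leaf yellowing"))  # hypothetical symptom label
```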
3.4.3. Joint-Attention Mechanism for Complex Background
In the research presented here, the integration of a joint-attention mechanism and a joint-head design specifically for tiny object detection stands as a key innovation in the cotton pest and disease detection model. The joint-attention mechanism enhances the model’s recognition capabilities in complex backgrounds by combining spatial and channel attentions, particularly excelling in detecting minute pest infestations. Specifically, the joint-attention mechanism initially employs spatial attention to identify significant locations within the image, achieved through a set of convolutional layers (Conv) followed by an activation function like sigmoid. The objective of spatial attention is to highlight key areas within the image, focusing the model’s processing on these segments. The computation formula for spatial attention is expressed as $A_s(F) = \sigma(\mathrm{Conv}(F))$, where $F$ is the input feature map and $\sigma$ is the sigmoid activation. The convolutional layers in this mechanism each have specific configurations to accommodate feature extraction and processing at varying scales. The detailed configurations of these layers are as follows:
The first convolutional layer aims at initial feature extraction from the image. Utilizing a small kernel size (3 × 3) with a stride of 1, it ensures the extraction of high-resolution features. The number of channels in this layer is set between 64 and 128, depending on the complexity of the input image.
The second convolutional layer deepens the extracted features after the initial extraction, using the same kernel size (3 × 3) but increasing the number of channels to 256. This helps extract more complex features and aids the model in recognizing finer details.
Downsampling convolutional layers are incorporated to accommodate images of various scales. These layers reduce the dimensions of feature maps by increasing the stride (for example, a 2 × 2 convolution with a stride of 2), thereby capturing a broader range of contextual information.
Upsampling convolutional layers, in contrast to downsampling ones, increase the dimensions of feature maps through transpose convolution techniques, crucial for restoring detailed information in the image.
To ensure effective processing of images of different sizes, the convolutional layers in the joint-head adopt an adaptable structure. By adjusting the number and configuration of these layers, the network can flexibly adapt to inputs of various resolutions, thus ensuring effective processing of images of all sizes. This design allows the network to maintain high-resolution features while capturing wider contextual information, enhancing accuracy and robustness in detecting minute targets.
Subsequently, channel attention works on analyzing and emphasizing important feature channels in the image. This step involves processing the feature maps with average pooling (AvgPool) and maximum pooling (MaxPool), followed by a multilayer perceptron (MLP), also employing an activation function like sigmoid. The computation formula for channel attention is $A_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$. Finally, these two types of attentions are integrated and combined with the original feature maps through element-wise multiplication and addition, generating the final output feature map $F' = F \otimes A_s(F) \otimes A_c(F) + F$. This step is the crux of the joint-attention mechanism, enabling the model to concurrently focus on both spatial and channel information of the image. Such a design allows the model to accurately locate and recognize minute pest and disease features while understanding the image, thereby improving detection accuracy and robustness in complex cotton field environments. Through the joint-attention mechanism and the joint-head design for tiny objects, the presented model effectively addresses the multiple challenges in cotton pest and disease detection.
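A simplified PyTorch sketch of this joint spatial-channel attention is given below. The kernel size, the MLP reduction ratio, and the exact way the two attention maps are fused with the residual are assumptions based on the description above, not the released implementation.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Combine spatial attention A_s(F) and channel attention A_c(F), plus a residual."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Spatial attention: convolution followed by sigmoid
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # Channel attention: shared MLP over pooled channel descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        a_s = torch.sigmoid(self.spatial_conv(f))            # A_s(F): (B, 1, H, W)
        avg = self.mlp(f.mean(dim=(2, 3)))                   # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))                    # MLP(MaxPool(F))
        a_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)       # A_c(F): (B, C, 1, 1)
        return f * a_s * a_c + f                             # multiply and add with F

attended = JointAttention(channels=256)(torch.randn(2, 256, 32, 32))
```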
3.4.4. Joint-Head for Tiny Objects
In the research conducted for this paper, a specialized design joint-head was developed for tiny objects, building upon the foundation of the joint-attention mechanism. This approach aims to enhance the detection of minute targets in cotton pest and disease detection by integrating feature information from various scales through a specific fusion strategy, thus enabling precise localization and identification of small-scale features.
Principle of Joint-Head: Differing from the traditional multi-head mechanisms, joint-head does not merely fuse information at the same scale but strategically combines feature maps of different scales. This design allows the model to capture fine details while simultaneously comprehending higher-level semantic information. The implementation of joint-head is achieved through the following steps:
Initially, the input feature map is processed across different scales, including both downsampling and upsampling operations, to obtain representations at various resolutions.
Subsequently, the joint-attention mechanism is applied to these feature maps at different scales, highlighting important information at each scale.
Finally, these attention-processed feature maps are fused, integrating information across scales.
The fusion strategy can be mathematically expressed as follows:

$$F_{\text{fused}} = \sum_{i=1}^{N} w_i \cdot F_i^{\text{att}}$$

Here, $F_{\text{fused}}$ represents the fused feature map, $N$ is the number of scales, $w_i$ is the weight of the $i$th scale, and $F_i^{\text{att}}$ is the feature map processed by the attention mechanism at scale $i$. The specific implementation is as shown in Algorithm 1.
Algorithm 1 Joint-Head for Tiny Object Detection
Require: Input feature map $F$, number of scales $N$, functions for DownSampling $DS(\cdot)$, UpSampling $US(\cdot)$, and Joint Attention $JA(\cdot)$
Ensure: Enhanced feature map $F_{\text{fused}}$ for tiny object detection
1: Initialize an empty list $L$
2: for $i = 1$ to $N$ do
3:  if $i <$ median scale then
4:   $F_i \leftarrow DS(F)$ {DownSampling for smaller scales}
5:  else if $i >$ median scale then
6:   $F_i \leftarrow US(F)$ {UpSampling for larger scales}
7:  else
8:   $F_i \leftarrow F$ {Original scale}
9:  end if
10:  $F_i^{\text{att}} \leftarrow JA(F_i)$ {Apply Joint Attention}
11:  Append $F_i^{\text{att}}$ to $L$
12: end for
13: $F_{\text{fused}} \leftarrow \sum_{i=1}^{N} w_i \cdot L[i]$ {Weighted sum of feature maps}
14: return $F_{\text{fused}}$
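A compact PyTorch sketch of this procedure is shown below. The scale factors, the learnable softmax-normalized fusion weights, and the reuse of the JointAttention module sketched in Section 3.4.3 are illustrative assumptions rather than the exact released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as tf

class JointHead(nn.Module):
    """Fuse joint-attention outputs computed at several scales of one feature map."""
    def __init__(self, channels: int, scales=(0.5, 1.0, 2.0)):
        super().__init__()
        self.scales = scales
        # JointAttention is the module sketched in Section 3.4.3 above
        self.attn = nn.ModuleList([JointAttention(channels) for _ in scales])
        self.weights = nn.Parameter(torch.ones(len(scales)))        # w_i, learned

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        h, w = f.shape[2:]
        fused = 0.0
        norm_w = torch.softmax(self.weights, dim=0)
        for i, s in enumerate(self.scales):
            # Down- or up-sample relative to the original (median) scale
            f_i = f if s == 1.0 else tf.interpolate(
                f, scale_factor=s, mode="bilinear", align_corners=False)
            f_i = self.attn[i](f_i)                                  # apply joint attention
            f_i = tf.interpolate(f_i, size=(h, w), mode="bilinear",
                                 align_corners=False)                # back to original size
            fused = fused + norm_w[i] * f_i                          # weighted sum
        return fused
```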
Difference from Multi-Head Mechanism: Contrasting with the multi-head mechanism, joint-head places greater emphasis on integrating features from various scales. While multi-head mechanisms typically analyze information from multiple perspectives within the same scale, joint-head combines features from different scales to obtain a more comprehensive understanding. This approach is particularly effective for processing minute targets, as small-scale features often manifest differently across scales and require a synthesis of information at multiple levels for accurate identification.
Contribution to the Task: In the task of cotton pest and disease detection, crucial features like small pests or early-stage disease spots are often minute and challenging to detect accurately with traditional methods. The design of joint-head enables the model to effectively capture these minute features and analyze them in conjunction with a broader context. This not only improves the precision in detecting minute targets but also enhances the model’s adaptability to features of varying scales. Especially in complex cotton field environments, this capability is vital for enhancing the accuracy and robustness of detection. By integrating features across different scales, joint-head significantly improves the model’s performance in detecting minute features of pests and diseases, providing an effective solution for cotton pest and disease detection.
3.4.5. Combined Loss Function
In the research presented in this paper, a combined loss function was specifically proposed to optimize the training of the model and enhance its performance in complex tasks. This loss function amalgamates various types of losses, aiming to simultaneously address classification, localization, and other challenges in cotton pest and disease detection.
Principle of the Combined Loss Function: The core idea of the combined loss function is to consider the advantages of different loss types comprehensively, facilitating a holistic optimization of the model training. This function comprises the following components:
Classification Loss: This evaluates the model’s accuracy in identifying different categories of pests and diseases. Common classification loss functions, such as Cross-Entropy Loss, play a pivotal role in this aspect.
Localization Loss: This assesses the precision of the model in locating areas affected by pests and diseases, often involving the prediction of bounding boxes, utilizing losses like Intersection over Union (IoU) or Smooth L1 loss.
Regularization: This prevents model overfitting, ensuring good generalization ability of the model across diverse data.
The combined loss function can be mathematically expressed as follows:

$$L_{\text{total}} = \alpha L_{\text{cls}} + \beta L_{\text{loc}} + \gamma L_{\text{reg}}$$

In this study, the coefficients $\alpha$, $\beta$, and $\gamma$ are crucial for balancing the significance of classification loss $L_{\text{cls}}$, localization loss $L_{\text{loc}}$, and the regularization term $L_{\text{reg}}$ within the overall loss function. The determination of these coefficients significantly impacts the training and eventual performance of the model. Initially, the values for $\alpha$, $\beta$, and $\gamma$ are often set based on prior research and experimental insights at the commencement of training. Throughout different stages of training, the model’s performance is evaluated using a validation set. Depending on the model’s performance in classification, localization, and regularization, the values of $\alpha$, $\beta$, and $\gamma$ are adjusted accordingly. For instance, if the model underperforms in localization tasks, the value of $\beta$ may be increased to give more weight to the localization loss in the total loss. Additionally, an auxiliary network, a simple multilayer perceptron, is employed for automatic adjustment. This network receives various performance indicators of the model, such as components of loss and accuracy, as inputs. During the training process, it computes the current classification loss $L_{\text{cls}}$, localization loss $L_{\text{loc}}$, and regularization loss $L_{\text{reg}}$. The auxiliary network aims to minimize the overall loss of the main model, implying that it learns to adjust the coefficients to optimize the performance of the main model. This gradient-based automatic adjustment strategy allows for dynamic optimization of the loss function coefficients based on the model’s performance during training.
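The following PyTorch sketch illustrates this weighted combination. The choice of cross-entropy and smooth-L1 terms follows the description above, while the auxiliary coefficient network is reduced to a minimal MLP whose exact form and inputs are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """L_total = alpha * L_cls + beta * L_loc + gamma * L_reg, with MLP-predicted coefficients."""
    def __init__(self):
        super().__init__()
        # Auxiliary network: maps the three raw loss values to positive coefficients
        self.coef_net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 3))

    def forward(self, cls_logits, cls_targets, box_preds, box_targets, model_params):
        l_cls = F.cross_entropy(cls_logits, cls_targets)            # classification loss
        l_loc = F.smooth_l1_loss(box_preds, box_targets)            # localization loss
        l_reg = sum(p.pow(2).sum() for p in model_params) * 1e-4    # L2 regularization
        losses = torch.stack([l_cls, l_loc, l_reg]).detach()
        alpha, beta, gamma = F.softplus(self.coef_net(losses))      # positive coefficients
        return alpha * l_cls + beta * l_loc + gamma * l_reg
```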
3.5. Experiment Settings
This section elaborates on the experimental design of this paper, encompassing the division of the dataset, selection of baseline models, and configuration of optimizers and hyperparameters.
3.5.1. Hardware and Test Platform
The hardware configuration used in this study forms the foundation for efficient and precise detection of cotton pests and diseases. To ensure the validity and reliability of the experiments, advanced hardware platforms were utilized for training and testing the model. The details of the hardware configuration employed in this study are as follows: Primarily, the model training and testing were conducted on high-performance GPU servers. These servers were equipped with multiple NVIDIA GeForce RTX 3090 graphics cards, each possessing 24 GB of video memory, providing substantial computational power and rapid data processing capabilities. The RTX 3090, based on the Ampere architecture, supports efficient parallel computing and deep learning optimization techniques, making it highly suitable for large-scale deep learning model training and complex computational tasks. Additionally, the servers were fitted with powerful CPU processors and ample memory resources, ensuring the efficiency of the entire training process. Beyond GPU servers, specialized data collection devices were also employed. These included high-definition cameras and various sensors installed in cotton fields for real-time monitoring and collection of images related to cotton pests and diseases. These devices, characterized by their high resolution and adaptability to different environmental conditions, are capable of capturing clear images under various lighting and weather conditions, providing a high-quality data source for model training and testing.
Regarding the mobile platform, Apple’s iOS devices were chosen as the testing platform for the application. Specifically, iPhones equipped with high-performance A-series chips, such as the A14 Bionic, were used. These chips offer robust CPU and GPU performance and include a dedicated neural engine for efficiently executing deep learning models. Moreover, the high-resolution displays and stable operating systems of iOS devices provide users with a favorable interactive experience and a stable operating environment.
In summary, the hardware configuration for our experiments was designed to meet the high-performance demands of model training while also accommodating the environmental adaptability and user experience requirements of practical applications. Through these advanced hardware resources, the cotton pest and disease detection model was able to undergo efficient training and accurate inference, thus achieving favorable results in practical applications.
3.5.2. Dataset Settings
For the study, the 3129 collected images were divided into training and validation sets, ensuring the model’s generalizability and the accuracy of its evaluation. Specifically, the dataset was divided in an 80:20 ratio, with 80% of the data allocated to the training set, serving the purpose of model training and optimization, while the remaining 20% was utilized as the validation set for assessing and comparing the model’s performance. To guarantee uniformity in the distribution of data, stratified sampling was implemented during the dataset division, ensuring that both the training and validation sets were representative and diverse in terms of categories, image backgrounds, and lighting conditions.
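A minimal sketch of such a stratified split with scikit-learn is shown below; the file paths and category identifiers are placeholders, and stratification here is by pest/disease category only.

```python
from sklearn.model_selection import train_test_split

# Placeholders standing in for the 3129 annotated samples
image_paths = [f"images/img_{i}.jpg" for i in range(3129)]
labels = [i % 6 for i in range(3129)]   # placeholder category ids

train_paths, val_paths, train_labels, val_labels = train_test_split(
    image_paths,
    labels,
    test_size=0.20,          # 80% training / 20% validation
    stratify=labels,         # keep category proportions identical in both sets
    random_state=42,
)
```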
3.5.3. Baseline
To comprehensively evaluate the detection model proposed in this paper, several baseline models were selected for comparison: YOLOv7 [
48], known for its lightweight structure and efficiency in object detection, has shown excellent performance across multiple domains, particularly suited for real-time detection tasks. YOLOv8 [
49], as the latest iteration in the YOLO series, offers enhanced performance and accuracy, representing the cutting edge in the field of object detection. RetinaNet [
50], with its unique focal loss design, is particularly effective in addressing class imbalance issues, a common challenge in object detection tasks. EfficientDet [
51], renowned for its efficient network architecture and exceptional performance-to-resource ratio, is well-suited to resource-constrained environments. Detection Transformer (DETR) [
52], a Transformer-based object detection model, demonstrates good capabilities in handling complex scene object detection tasks. These models were chosen as baselines to assess the performance of the proposed model relative to current state-of-the-art technologies.
3.5.4. Optimizer and Hyperparameters
During the training process, the Adam optimizer was selected due to its proven convergence speed and stability when handling large datasets. Combining the benefits of momentum optimization and RMSprop, the Adam optimizer adaptively adjusts the learning rate for each parameter, which is particularly effective for deep learning models. The hyperparameters were set as follows: The initial learning rate was set at 0.001, with a learning rate decay strategy implemented, halving the learning rate when performance on the validation set ceased to improve. Batch size was set to 16 or 32, depending on the GPU memory capacity. Weight decay was set at 0.0001 to prevent overfitting. The number of training epochs was set to 100 but an early stopping strategy was adopted, halting training if no improvement in validation set performance was observed for 10 consecutive epochs.
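These settings correspond to a standard PyTorch configuration along the lines of the sketch below. ReduceLROnPlateau is used here as one way to halve the learning rate on a validation plateau; the model and the per-epoch training/evaluation steps are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder for the detection network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Halve the learning rate when the validation metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)

best_map, stale_epochs = 0.0, 0
for epoch in range(100):
    val_map = 0.0   # placeholder: run one training epoch, then evaluate mAP on the validation set
    scheduler.step(val_map)
    if val_map > best_map:
        best_map, stale_epochs = val_map, 0
    else:
        stale_epochs += 1
    if stale_epochs >= 10:   # early stopping after 10 epochs without improvement
        break
```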
3.5.5. Model Evaluation Metrics
In the context of this study, accuracy, mean average precision (mAP), and frames per second (FPS) serve as key indicators for assessing the performance of the model. Detailed mathematical descriptions of these evaluation metrics and their significance are provided below.
Accuracy is one of the most intuitive metrics used to measure the proportion of correctly predicted samples by the model. Its formula is given as follows:

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$

where $N_{\text{correct}}$ is the number of correctly predicted samples and $N_{\text{total}}$ is the total number of samples.
Accuracy reflects the model’s overall performance across the entire dataset, indicating its ability to correctly identify pests. While intuitive, it is not always sufficient, especially in datasets with class imbalance. In such scenarios, high accuracy may be achieved simply by predicting the majority classes correctly, which does not necessarily imply good predictive capability across all categories.
mAP is a more nuanced metric for evaluating the performance of classification models, particularly suitable for multi-category detection tasks. mAP first calculates the average precision (AP) for each category, followed by averaging the APs across all categories. AP represents the average precision of the model at different recall rates, calculated as follows:

$$AP = \int_{0}^{1} P(r)\, dr$$

where $P(r)$ is the precision at recall rate $r$. The formula for mAP is as follows:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

Here, $N$ is the total number of categories and $AP_i$ is the average precision for the $i$th category. mAP considers not only the accuracy of predictions but also their completeness. It is a metric that balances recall and precision, crucial for complex tasks requiring differentiation between various categories.
FPS is a critical metric for measuring the real-time performance of a model, especially in applications requiring real-time processing. FPS refers to the number of frames processed by the model per second, calculated as follows:

$$FPS = \frac{N_{\text{frames}}}{T_{\text{total}}}$$

where $N_{\text{frames}}$ is the number of frames processed and $T_{\text{total}}$ is the total processing time in seconds.
A high FPS indicates faster image processing by the model, which is vital for real-time pest detection systems. In agricultural applications, real-time monitoring and response are key to effective pest control. Thus, enhancing FPS not only improves user experience but also ensures the timeliness and effectiveness of pest management measures.
In summary, these evaluation metrics not only reflect the model’s accuracy and efficiency in pest detection tasks, but also provide crucial insights into the model’s performance in practical applications. By considering these metrics comprehensively, a thorough evaluation and optimization of the pest detection model’s performance can be achieved.
4. Results and Discussion
4.1. Results of Cotton Pest and Disease Detection
The experimental design in this chapter aims to evaluate and compare the performance of different deep learning models in the task of cotton pest and disease detection. By analyzing metrics such as accuracy, mean average precision (mAP), F1-score, and frames per second (FPS), insights into the strengths and limitations of each model and the underlying reasons are gained. The experimental results are presented in
Table 3.
Initially, RetinaNet exhibits relatively weaker performance in this experiment, with accuracy, mAP, and F1-score of 0.84, 0.83, and 0.84, respectively, and an FPS of 31.8. RetinaNet, an object detection model based on the feature pyramid and focal loss, is primarily designed to address class imbalance issues. However, its performance is limited in processing cotton pest and disease images due to challenges in handling fine features and complex backgrounds. YOLOv7 and YOLOv8 show better results, with accuracies of 0.91 and 0.92, mAPs of 0.91 and 0.91, F1-scores of 0.90 and 0.91, and FPS of 46.5 and 48.3, respectively. The YOLO series models are known for their speed and efficiency, suitable for scenarios requiring rapid response. They effectively capture image features through deep convolutional neural networks but may have limitations in processing extremely fine details. EfficientDet’s performance, with an accuracy of 0.89, mAP of 0.85, F1-score of 0.87, and an FPS of 22.9, indicates a slower processing speed. EfficientDet emphasizes a balance between efficiency and performance, but its capability may be constrained by the model’s design when processing complex cotton pest and disease images. The DETR model shows similar accuracy and mAP to YOLOv8, but with a lower FPS of 34.0. This suggests that, while DETR’s Transformer structure performs well in understanding global image information, it is slower than convolution-based models. The model proposed in this paper surpasses others in all metrics, with an accuracy of 0.94, mAP of 0.95, F1-score of 0.94, and FPS of 49.7. This superiority is attributed to innovations in structural design, feature extraction, and optimization strategies. The proposed model integrates Transformer technology and knowledge graphs, along with a unique joint-attention mechanism and a joint-head design for tiny objects, as shown in
Figure 5. These features enable the model to efficiently and accurately capture fine details and process complex backgrounds. Furthermore, the model’s optimization strategies ensure high processing speed, making it outstanding in real-time detection.
Overall, the performance of the model proposed in this paper in the task of cotton pest and disease detection validates its effectiveness both theoretically and practically. By considering image features and domain knowledge comprehensively, and through optimized network structure design, the proposed model not only improves in accuracy but also maintains a high level of processing speed. These results provide robust technical support for intelligent identification of cotton pests and diseases, significantly contributing to enhanced agricultural productivity and crop protection.
4.2. Visualization Analysis of Detection Results
4.2.1. Confusion Matrix Analysis
In the analysis of confusion matrices for cotton pest and disease detection, emphasis was placed on examining the performance of different models across specific pest and disease subclasses, particularly focusing on which subclasses were prone to confusion and the potential reasons behind such confusion.
Data from
Figure 6 reveal significant misclassification issues in certain subclasses by various models. For instance, the YOLO series models exhibited a degree of confusion between “Constrictor Aphid” and “Foliar Disease”. This confusion might stem from the visual similarities between these two classes, such as size, shape, or color, challenging the deep-convolutional-neural-network-based models in accurate differentiation. Similar phenomena were observed in the RetinaNet and EfficientDet models, indicating a common challenge in handling subtle features. In the case of the DETR model, despite a high overall accuracy, confusion persisted in specific categories, such as “Bacterial Blight” and “Red Spot”. This could be attributed to the challenge of distinguishing subtly different categories, despite the Transformer’s capability in capturing global contextual information. This suggests that global context might not always suffice in discerning categories with similar features. Conversely, the method proposed in this paper demonstrated higher accuracy in the confusion matrices, particularly in categories prone to confusion. For example, compared to other models, the proposed method exhibited less confusion between “Constrictor Aphid” and “Foliar Disease”. This improvement might be attributed to the application of the joint-attention mechanism and joint-head design for tiny objects, enhancing the model’s ability to distinguish subtle differences. Additionally, the use of a combined loss function might have contributed to optimizing the model’s performance in both classification and localization tasks, thereby reducing confusion.
Overall, the analysis of confusion matrices highlighted performance disparities and potential weaknesses of different models in specific subclasses. By comprehensively understanding the causes of these confusions, future model designs and improvements can be better guided, especially in handling subtle features and similar categories. This is crucial for enhancing the accuracy and robustness of cotton pest and disease detection.
4.2.2. Detection Results Visualization
In this section, an in-depth exploration was conducted on the performance of the cotton pest and disease detection model proposed in this paper, particularly in practical applications. Visualization analysis of the detection results clearly demonstrates the model’s accuracy and efficiency in identifying pest and disease features, especially its exceptional capability in processing images with complex edges and multiple detection points, as illustrated in
Figure 7.
Firstly, regarding the processing of complex edges, the model displayed significant advantages. In agricultural image processing, edge detection is often challenging due to the variable shapes of cotton leaves and their high color similarity with the background. The model accurately identified the edges of cotton leaves, even in situations where the leaf edges were blurred or highly blended with the background colors. This advantage stems from the model’s powerful feature extraction capability and effective handling of complex backgrounds, ensuring high accuracy in pest and disease detection. Secondly, the model also excelled in scenarios involving multiple detection points in an image. In practical applications, cotton pests and diseases often appear not in isolation but as clusters of multiple spots or infestation points. Under such circumstances, identifying and locating each spot or pest point becomes particularly complex. The model was not only capable of precisely locating each detection point but also distinguished between adjacent multiple spots or pest points, demonstrating its efficiency and accuracy in multi-target detection tasks. Moreover, the visualization results also revealed the model’s ability to recognize different types of pests and diseases. The model could identify not only common types but also effectively recognize some uncommon or difficult-to-detect pest and disease types. This indicates the model’s significant advantage in learning and generalization abilities, enabling it to adapt to varied practical application environments. Lastly, it is worth mentioning that the model also showed good adaptability and robustness in processing images under different lighting conditions and shooting angles. Whether in strong lighting or in shadow, the model accurately identified pests and diseases, further reflecting its capability of adapting to complex environmental conditions.
4.3. Ablation Study on Different Attention Architecture
In this section, a pivotal experiment was conducted to evaluate and compare the performance of different attention architectures in cotton pest and disease detection. The aim of this experiment was to explore the impact of various attention mechanisms on model performance, thereby gaining a better understanding of how these models operate and determining the most suitable attention architecture for this task. By comparing the accuracy, mAP, and FPS of different architectures, insights were gained into their strengths and limitations.
Table 4 and
Figure 8 present the performance of five different attention architectures. Initially, the pure self-attention architecture demonstrated good performance, achieving an accuracy of 0.93 and an mAP of 0.92, though its FPS was relatively low at 31.8. The key advantage of self-attention [
19] lies in its ability to capture long-distance dependencies, enabling the model to understand global information in images. However, due to higher computational costs, this architecture is somewhat limited in processing speed. The channel attention and spatial attention architectures [
53], focusing on the channel and spatial features of images, respectively, showed moderate performance. The channel attention architecture achieved an accuracy of 0.90 and an mAP of 0.87, while the spatial attention architecture had an accuracy of 0.88 and an mAP of 0.90. Compared to self-attention, both architectures improved in FPS, reaching 39.7 and 41.1, respectively, indicating their efficiency in processing speed. Yet, they might not be as comprehensive as self-attention in capturing global context information.
The architecture combining joint attention with joint-head proved to be the most effective in the experiment, attaining an accuracy of 0.96 and an mAP of 0.95, while maintaining a high FPS of 51.4. This efficient performance can be attributed to its simultaneous focus on spatial and channel information in images, and the integration of the joint-head design for better handling of tiny objects. This suggests that considering both spatial and channel information is crucial in complex tasks like cotton pest and disease detection. Lastly, the architecture that combined joint attention with multi-head attention [
19] also showed good performance, with an accuracy and mAP of 0.93 and 0.92, respectively, but a lower FPS of 32.6. This indicates that, while multi-head attention can provide diverse perspectives in image analysis, it may not be as fast as the joint-head design in processing.
In summary, the experimental results indicate that the application of joint-attention mechanisms combined with joint-heads is highly effective in cotton pest and disease detection tasks. Not only does it offer high accuracy and mAP, but it also maintains a high FPS, showcasing its superiority in both precision and processing speed. These findings provide valuable insights, guiding the selection and design of the most suitable attention architecture for specific tasks. Understanding the mathematical and technical characteristics of each architecture allows for a deeper analysis of their applicability and limitations in various scenarios, thus offering essential guidance for future model design and optimization.
4.4. Agricultural Large-Model Implementation with Sensors
With the advancement in agricultural technology and the development of the Internet of Things, sensors are increasingly being utilized in agriculture. They not only monitor crop growth environments in real-time but also collect valuable data to support agricultural decision making. Integrated with large-scale agricultural models, sensor technology can achieve more intelligent and efficient pest and disease monitoring and management. To incorporate data from various types of sensors, a multi-source data alignment module is implemented prior to the model in practical deployment, as shown in
Figure 9.
This module is pivotal in processing and fusing heterogeneous data from different sensors, including temperature, humidity, and light intensity data. The working principle of the multi-source data alignment module can be broken down into the following steps:
Data Preprocessing: Initially, all sensor data are subjected to preprocessing. Non-image data (such as temperature and humidity) are transformed into standardized numerical representations.
Feature Extraction: Subsequently, appropriate feature extraction methods are employed for each data type. In this study, fully connected layers are used to extract features from environmental data like temperature and humidity.
Feature Fusion: The features from different data sources are then fused. This step typically involves aligning and fusing features in a common feature space using techniques such as weighted summation and concatenation.
Data Alignment: Post feature fusion, data alignment operators ensure that features from different sources are effectively aligned in the same dimensional space. This alignment allows subsequent deep learning models to better utilize the data.
Assuming there are $n$ different data sources (in this study, $n = 3$, representing temperature, humidity, and light intensity data), each source $i$ yields a feature vector $f_i$ after feature extraction. The objective of the data alignment module is to transform these feature vectors into a unified feature space for effective fusion. Feature fusion can be represented as follows:

$$F = \sum_{i=1}^{n} w_i \cdot A(f_i; \theta_i)$$

Here, $F$ is the fused feature vector, $w_i$ is the weight of the $i$th data source, $A(\cdot)$ is the data alignment operator, and $\theta_i$ are the operator parameters. The $A(\cdot)$ operator maps feature vectors of different dimensions to a common dimensional space. This can be achieved through a fully connected layer or another appropriate neural network layer:

$$A(f_i; \theta_i) = \sigma(W_i f_i + b_i)$$

Here, $W_i$ and $b_i$ are the weights and bias of the fully connected layer, respectively, and $\sigma$ is an activation function, such as ReLU or sigmoid. In practice, the operator parameters must be obtained through training to ensure effective support for pest and disease detection tasks after data alignment. In this way, the multi-source data alignment module unifies data from different sensors, providing a comprehensive and accurate feature representation for the deep learning model, thereby enhancing the effectiveness of cotton pest and disease detection.
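A minimal PyTorch sketch of this alignment-and-fusion step follows; the per-source input dimensions and the common 128-dimensional space are illustrative values, not the deployed configuration.

```python
import torch
import torch.nn as nn

class MultiSourceAlignment(nn.Module):
    """Align temperature, humidity, and light-intensity features and fuse them."""
    def __init__(self, source_dims=(8, 8, 4), common_dim: int = 128):
        super().__init__()
        # One alignment operator A(.; theta_i) per data source: FC layer + ReLU
        self.align = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, common_dim), nn.ReLU()) for d in source_dims])
        self.weights = nn.Parameter(torch.ones(len(source_dims)))   # w_i, learned

    def forward(self, sources):
        """sources: list of tensors, one (batch, dim_i) tensor per sensor type."""
        w = torch.softmax(self.weights, dim=0)
        fused = sum(w[i] * self.align[i](f_i) for i, f_i in enumerate(sources))
        return fused   # (batch, common_dim) fused feature vector F

# Example with placeholder sensor features for a batch of 4 samples
temp, hum, light = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 4)
fused = MultiSourceAlignment()([temp, hum, light])
```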
4.5. Application on Mobile Computing Platform
In modern agricultural production, real-time monitoring and rapid identification of cotton pests and diseases are crucial for ensuring crop health and improving yield. This research aims to develop an application for mobile computing platforms that performs instant detection and analysis of cotton pests and diseases. The application not only supports server-side processing but also has the capability of completing inference analysis locally on mobile devices, greatly enhancing its application value in environments without network access, as shown in
Figure 10.
Mobile Platform Framework (iOS Platform) and Development Process: The Apple iOS platform was chosen as the target platform for the mobile application due to its broad user base and mature development environment. The iOS app is written in Swift and integrates with the Core ML framework, which is Apple’s solution for deploying machine learning models on iOS devices. Core ML supports various model formats and can leverage the advantages of Apple hardware to improve the model’s operational efficiency on devices. The development process includes the following:
Model Training and Optimization: Deep learning models are trained and validated on the server side using a large collection of cotton pest and disease image data. After training, the models are optimized to accommodate the performance limitations of mobile devices.
Model Conversion and Testing: The trained models are converted into Core ML model format and tested on iOS devices to ensure their accuracy and efficiency (a conversion sketch is shown after this list).
Application Development: Using the Xcode development environment and Swift language, the iOS application is developed by integrating the optimized model into the application and designing an intuitive user interface.
Local Inference Implementation: An optimized Core ML model is embedded in the iOS app to enable local execution. A user-friendly operational interface is designed to allow users to easily upload images and receive inference results from the model.
Performance Testing and Optimization: Extensive performance tests are conducted to ensure the application performs well across different models of iOS devices. Model parameters are adjusted based on test results to balance inference speed and accuracy.
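For the conversion step above, the sketch below shows one typical route with coremltools. The traced input size, preprocessing scale, and file names are placeholders, and the loader function is hypothetical; the actual export settings of the deployed model may differ.

```python
import torch
import coremltools as ct

# Placeholder: a hypothetical helper returning the trained detection model
model = load_trained_cotton_detector()
model.eval()

# Trace the model with a dummy input matching the preprocessing size
example = torch.rand(1, 3, 640, 640)
traced = torch.jit.trace(model, example)

# Convert the traced graph to a Core ML model for on-device inference
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=example.shape, scale=1.0 / 255.0)],
)
mlmodel.save("CottonPestDetector.mlpackage")
```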
Mobile–Server System Design: As illustrated in
Figure 11, the mobile–server system comprises several key components:
Data Collection Terminals: Multiple data collection terminals are deployed in the fields to collect images of cotton pests and diseases.
Communication Network: GSM/GPRS networks are used for data transmission to ensure the images collected are promptly uploaded to the server.
Server-Side Processing: The server stores and analyzes the collected data to produce pest and disease detection results.
Mobile Application: Users access pest and disease detection services through an application on their iOS devices. The app can communicate with the server when a network connection is available or work independently using local inference functionality when there is no network.
Local Storage and Inference: The application allows users to store images directly on the device and perform pest and disease recognition inference using the embedded Core ML model.
User Interaction and Feedback: The application provides an interactive interface for users to upload images and receive inference results. Users can also provide feedback to help developers further improve the application.
With the design and implementation of the above system, the application developed in this research realizes efficient detection of cotton pests and diseases on mobile computing platforms. Whether processing data through the server under good network connection conditions or completing inference locally on mobile devices when there is no network, the application provides users with accurate and fast pest and disease identification services, significantly enhancing the level of intelligence in modern agriculture.
5. Conclusions
In this study, a deep-learning-based intelligent recognition model is proposed for cotton pest and disease detection, achieving efficient and accurate identification of cotton pests and diseases. The significance of this task lies in not only enhancing the automation level of pest and disease identification, alleviating labor intensity for farmers, but also improving agricultural productivity, which is crucial for ensuring cotton yield and quality.
The research innovation of this paper mainly includes the following aspects: Firstly, the latest Transformer technology and knowledge graph are adopted and applied to the task of cotton pest and disease detection, which is relatively rare in previous studies. The introduction of Transformer technology allows the model to better capture long-range dependencies in images, enhancing the model’s understanding of global information; the use of the knowledge graph provides the model with a wealth of domain knowledge, enhancing the model’s accuracy in identifying pest and disease features. Secondly, the joint-attention mechanism and joint-head design for tiny objects proposed in this paper enable the model to have higher accuracy and robustness in dealing with complex backgrounds and tiny objects. With these innovative architectures, our model outperforms traditional models in all metrics, including accuracy, mAP, and FPS, all of which are validated in the experimental results. In terms of experimental results, the method proposed in this paper performs excellently in the cotton pest and disease detection task, specifically achieving an accuracy rate of 0.94, an mAP of 0.95, and an FPS of 49.7, all surpassing other comparative models. Especially when dealing with complex images where multiple tiny aphids cluster together, our model exhibits superior performance.
Nevertheless, our research still has some limitations. For instance, although the model shows good performance in the experiments, how to maintain this performance on a larger scale dataset and how to further reduce the consumption of computational resources require more in-depth study. In addition, the robustness of the model in dealing with images under extreme environmental conditions, such as extreme lighting and occlusions, needs to be improved. In future research, we plan to expand the scale of the dataset, adding more types of pest and disease images to improve the model’s generalization ability. Also, we will explore more efficient model optimization methods to reduce the inference time of the model on mobile devices, achieving faster pest and disease detection. Moreover, we will consider incorporating more types of sensor data, such as temperature and humidity information, to achieve more accurate and comprehensive pest and disease monitoring.