Article

Implementation and Evaluation of Attention Aggregation Technique for Pear Disease Detection

1 China Agricultural University, Beijing 100083, China
2 College of Biology and Food Engineering, Anyang Institute of Technology, No. 73 Huanghe Road, Anyang 455000, China
3 Taihang Mountain Forest Pests Observation and Research Station of Henan Province, Linzhou 456550, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Agriculture 2024, 14(7), 1146; https://doi.org/10.3390/agriculture14071146
Submission received: 14 June 2024 / Revised: 5 July 2024 / Accepted: 11 July 2024 / Published: 15 July 2024

Abstract

In this study, a novel approach integrating multimodal data processing and attention aggregation techniques is proposed for pear tree disease detection. The focus of the research is to enhance the accuracy and efficiency of disease detection by fusing data from diverse sources, including images and environmental sensors. The experimental results demonstrate that the proposed method outperforms existing approaches on key performance metrics such as precision, recall, accuracy, and F1-Score. Specifically, the model was tested on the Kaggle dataset and compared with existing advanced models such as RetinaNet, EfficientDet, Detection Transformer (DETR), and the You Only Look Once (YOLO) series. The experimental outcomes indicate that the proposed model achieves a precision of 0.93, a recall of 0.90, an accuracy of 0.92, and an F1-Score of 0.91, surpassing those of the comparative models. Additionally, detailed ablation experiments were conducted on the multimodal weighting module and the dynamic regression loss function to verify their specific contributions to the model performance. These experiments not only validated the effectiveness of the proposed method but also demonstrated its potential application in pear tree disease detection. Through this research, an effective technological solution is provided for the agricultural disease detection domain, offering substantial practical value and broad application prospects.

1. Introduction

In modern agricultural production, the timely and accurate detection of diseases is crucial for ensuring the healthy growth of crops and improving yield [1,2]. This is particularly vital for economically important fruit trees such as pear trees, where the prompt diagnosis and treatment of diseases are key [3]. Traditional methods for disease detection, which rely on the expertise of agricultural experts, are not only inefficient but also fail to cover extensive farmland [4]. With the rapid development of artificial intelligence technologies [5], the use of computer vision for disease detection has become an effective solution. Studies by Maryam and others [6] have surveyed research efforts using machine learning with different data modalities for disease detection, demonstrating the immense potential of deep learning (DL) fusion models when predicting with multimodal data. However, Maryam et al.'s model is relatively complex and computationally intensive. Therefore, Zhang and others combined the Transformer model with agricultural pest detection, introducing the TinySegformer model, known for its efficiency, accuracy, and lightweight design, for the identification and detection of grain pests [7]. Tian and colleagues focused on six areas: crop growth monitoring, disease prevention, automatic harvesting, quality inspection, modern farm automation management, and unmanned aerial vehicle field information monitoring. They addressed current agricultural challenges, improving the economic efficiency, versatility, and robustness of agricultural automation systems, thereby advancing agricultural equipment and systems toward greater intelligence [8]. This work provided powerful and practical solutions for agricultural pest detection and offered new insights and directions for future research in the field.
However, existing technologies still face issues with accuracy and generalization, especially when applied in complex agricultural environments [9,10]. Zhang and others used pixel-weighted fusion to combine two segmentation results, enhancing the model's robustness and segmentation performance; however, their approach still could not effectively utilize the 3D information in the dataset [11]. To address this, Zhou and others introduced the Atrous Pyramid Generative Adversarial Network (GAN) segmentation network, which reduces interference from the external environment during 3D image recognition and enhances the class-related connectivity between adjacent pixels, though it still loses a significant amount of detail at complex edges [12]. Lin and others proposed a morphology correction-based global transformer segmentation network that improved edge recognition, achieving an accuracy of 0.903. Additionally, they increased the continuity by 408% through morphological closing operations, setting dilation and erosion coefficients, and removing noise. However, the segmentation accuracy of remote sensing images, especially of small-scale objects, still needs improvement [13]. In terms of generalization, Zhang introduced the Safe Adaptive Linear Unit (Safe-SALU), which further enhances the network's generalization capabilities, resulting in faster learning during the early training stages [14]. Further developments led Zhang to propose the Symmetry GAN detection network, achieving 0.9737, 0.9845, and 0.9841 in precision, recall, and mAP, respectively, though it did not effectively utilize the temporal continuity of image sequences [15].
In recent years, the development of multimodal technologies [16] has provided new approaches to addressing the aforementioned issues [17]. By integrating data from different sources, such as image data and sensor data, multimodal technology can provide more comprehensive and abundant information, thereby enhancing the accuracy of disease detection. Moreover, rapid advancements in DL, particularly with convolutional neural networks (CNNs) [18], Transformers [19], and large models, have paved new ways to enhance model performance and generalization capabilities. For example, Zhou and others constructed the “graphic-text” multimodal collaborative representation and knowledge-assisted disease recognition model (ITK-Net), achieving the highest accuracy, precision, sensitivity, and specificity at 99.63%, 99%, 99.07%, and 99.78%, respectively [20]. Inspired by the practicality of CNN models and the agricultural IoT, Rutuja R. Patil introduced a novel multimodal data fusion framework named Rice-Fusion for diagnosing rice diseases, although it faced overfitting issues during the training phase [21]. Consequently, Ilavendhan A proposed a new technique for detecting leaf diseases using a multimodal DL approach that combines visual and non-visual data, significantly enhancing the accuracy of leaf disease detection and enabling the effective management of plant diseases [22]. Influenced by this, Chittabarmi and others [23] explored various effective ML and DL classifiers for leaf disease detection, noting that combining machine learning models with other models could enhance the performance of hybrid models and that ensemble learning algorithms could improve accuracy, highlighting this as a direction for future research. However, how to effectively integrate these technologies and apply them to pear tree disease detection [24] remains an open challenge.
This paper aims to propose a pear tree disease detection method based on a multimodal large model and an attention aggregation module. Through a deep analysis of the intrinsic connections and complementarities of multimodal data, a novel large model framework has been designed that effectively integrates image data and sensor data. Extensive experiments conducted in April 2024 validated the effectiveness of the proposed model. Additionally, attention mechanisms have been incorporated to accurately focus on disease-related features, further enhancing the detection accuracy and efficiency. The main innovations and contributions of this work include the following:
  • Large multimodal model design: A large multimodal model architecture is proposed specifically for pear tree disease detection. Not only does this architecture consider the visual features of image data, but it also effectively utilizes environmental information from sensor data, significantly enhancing the model’s accuracy and generalization capabilities through DL technologies.
  • Multimodal weighting module: To better process and integrate data from different modalities, a multimodal weighting module was designed, which adaptively adjusts the weight of different modal data within the model, further optimizing model performance.
  • Attention aggregation module: Attention mechanisms have been introduced, particularly through the design of an attention aggregation module within the model, effectively focusing on feature areas closely related to disease detection, thereby enhancing the model’s sensitivity and recognition capabilities for disease features.
  • Dynamic regression loss function: A dynamic regression loss function was designed to accommodate variations in disease stages and types, further enhancing the model’s accuracy and robustness.
With these innovations, the method proposed in this paper not only achieves significant performance improvements in pear tree disease detection but also provides new ideas and methods for multimodal data processing and DL model design. Additionally, the integration of a web application further enhances the system’s accessibility and usability for end users. This web application, built on modern web technologies, allows users to interact with the system through a browser interface, making it possible to access disease detection functionalities remotely and in real time. It is believed that the findings of this research will positively impact the development of disease detection technologies in the agricultural field. In summary, the main contribution of this study lies in proposing an innovative method for pear tree disease detection based on a large multimodal model and attention aggregation module, utilizing DL technology to achieve efficient and accurate disease detection, which is of great significance for enhancing agricultural production efficiency and ensuring food safety.

2. Materials and Methods

2.1. Dataset Collection

2.1.1. Image Data

The dataset for this study was primarily collected in Zhuozhou Experimental Garden and from farmers in Bayannur, Inner Mongolia Autonomous Region, to ensure that representative samples of pear tree diseases and related environmental data were obtained. High-resolution photographic equipment and precise environmental monitoring sensors were employed to capture comprehensive information on crop diseases and environmental variables. The image data were collected using high-performance photographic equipment, specifically the DJI Phantom 4 Pro drone and Nikon D850 camera, as shown in Figure 1 and Table 1.
These devices were chosen for their ability to provide an image quality of up to 4K resolution, capturing detailed health states of pear tree leaves, branches, and fruits. Additionally, to optimize the image quality under varying lighting conditions, the data collection was primarily scheduled during early morning and evening hours when natural light is softest, reducing issues like sunspots or overexposure. The types of pear tree diseases and pests, along with the quantity of image data collected, are summarized in the following table.
The images captured various growth states of pear trees, ranging from healthy to different stages of disease infection, including mild, moderate, and severe conditions. To ensure data diversity and sufficiency, each state was photographed under varying conditions, such as lighting, time of day, and shooting angle. Each condition of the pear trees was documented at least three times. Specifically, a parameter $\theta$, representing the coverage rate, was introduced to quantify the diversity of the image data collected under these different environmental and photographic conditions, and it was defined as
$$\theta = \frac{\text{Number of pear tree conditions collected}}{\text{Total number of pear tree conditions}}$$

In this project, an ideal value of $\theta$ close to 1 indicates that every condition of the pear tree was adequately recorded.

2.1.2. Sensor Data

Concurrently with the image data collection, the collection of sensor data was equally crucial. Multiple sensors, including temperature, moisture, soil pH, and light sensors, were installed in the pear tree fields. These sensors automatically recorded data every 15 min, ensuring the continuity and timeliness of the data, which allowed for the precise capture of environmental conditions related to the growth and disease development of pear trees. The accurate processing of sensor data is vital for the performance of multimodal disease detection models. Data preprocessing included several steps: firstly, for each type of sensor data, daily averages and coefficients of variation were calculated to assess daily variability. For example, for temperature data $T$, the daily average $\bar{T}$ and daily coefficient of variation $CV_T$ were given by
$$\bar{T} = \frac{1}{N}\sum_{i=1}^{N} T_i, \qquad CV_T = \frac{1}{\bar{T}}\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(T_i - \bar{T}\right)^2}$$
where $N$ is the number of samples per day, and $T_i$ is the temperature recorded at the $i$-th sampling. This method of data processing helped identify key environmental factors that could influence the development of pear tree diseases. The collected image and sensor data, after initial screening and preprocessing, were integrated into a multimodal dataset for training and validating the proposed disease detection model. During integration, care was taken to ensure that image data and corresponding environmental data were matched by timestamps to enhance the dataset's relevance and usability. Additionally, to improve the efficiency and effectiveness of model training, the data were standardized to ensure the comparability of different sources and types of data within the model.
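As a concrete illustration of these daily statistics, the short pandas sketch below computes the daily mean and coefficient of variation from simulated 15-min temperature readings; the date, the synthetic values, and the column names are illustrative, not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Simulated 15-min temperature readings for one day (96 samples).
idx = pd.date_range("2024-04-01", periods=96, freq="15min")
temps = pd.Series(18 + 6 * np.sin(np.linspace(0, 2 * np.pi, 96)), index=idx)

daily = temps.resample("D").agg(["mean", "std"])  # pandas std uses N-1 (sample std)
daily["cv"] = daily["std"] / daily["mean"]        # CV_T = sample std / daily mean
print(daily)
```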

2.2. Image Data Augmentation

In this study, three data augmentation techniques—Cutout [25], CutMix [26], and Mixup [27]—were employed to enhance the model’s generalization ability and effectively prevent overfitting. These methods introduce various forms of image modifications during the training phase, thereby increasing the diversity of the training data and improving the model’s ability to recognize unseen samples, as shown in Figure 2. These data augmentation techniques were applied only during the training phase, aiming to test the model’s practical performance by increasing the complexity of the training data without altering the original test data. This ensures a high recognition accuracy and reliability in real-world scenarios.
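For reference, the following is a minimal PyTorch-style sketch of two of these augmentations, Mixup and Cutout (CutMix follows the same pattern, pasting a rectangular patch from one image into another); the tensor shapes, Beta parameter, and patch size are illustrative choices.

```python
import torch

def mixup(images, onehot_labels, alpha=0.2):
    """Mixup: convexly blend random pairs of images and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    return (lam * images + (1 - lam) * images[perm],
            lam * onehot_labels + (1 - lam) * onehot_labels[perm])

def cutout(images, size=16):
    """Cutout: zero out one random square patch per image."""
    out = images.clone()
    _, _, h, w = out.shape
    for i in range(out.size(0)):
        y = torch.randint(0, h - size, (1,)).item()
        x = torch.randint(0, w - size, (1,)).item()
        out[i, :, y:y + size, x:x + size] = 0.0
    return out

imgs = torch.rand(8, 3, 224, 224)
labels = torch.eye(4)[torch.randint(0, 4, (8,))]   # hypothetical one-hot labels
mixed_imgs, mixed_labels = mixup(cutout(imgs), labels)
```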

2.3. Sensor Data Preprocessing

In modern agricultural disease detection, sensor data play a crucial role. These data encompass environmental parameters such as temperature, humidity, soil moisture, and illumination, as well as other vital information for monitoring plant growth conditions. To enhance the accuracy and reliability of the plant disease identification system, these sensor data must undergo appropriate preprocessing. The preprocessing steps include outlier removal, missing value imputation, and data normalization, each based on clear mathematical principles. Outliers in the collected sensor data may arise from equipment malfunctions, operational errors, or external environmental disturbances. If not addressed, these outliers can severely affect the training effectiveness and predictive accuracy of subsequent models. Typically, statistical methods are employed to detect and remove these anomalous data points. A commonly used technique involves filtering based on the standard deviation, assuming that the data follow a normal distribution. The mean ($\mu$) and standard deviation ($\sigma$) of the data are calculated, and points that lie several standard deviations from the mean are removed:

$$\text{Data point } x \text{ is an outlier if } |x - \mu| > k \times \sigma$$
Here, $k$ is a chosen coefficient, often set to 2 or 3. This method is straightforward and effective in most scenarios but relies on the assumption that data are normally distributed. Missing data are a common issue in agricultural environmental monitoring, potentially caused by sensor failures, data transmission interruptions, or other technical issues. Popular methods for addressing missing data include interpolation and filling with historical averages. For time-series data, linear interpolation is an effective approach, assuming a uniform change between adjacent data points. The formula for linear interpolation is as follows:

$$x_{\text{interp}} = x_{\text{before}} + \frac{x_{\text{after}} - x_{\text{before}}}{n} \cdot i$$

Here, $x_{\text{interp}}$ is the value to be imputed, $x_{\text{before}}$ and $x_{\text{after}}$ are the known data points before and after the gap, $n$ is the total number of missing values plus one, and $i$ (ranging from 1 to $n-1$) indexes the missing positions. Data normalization is another critical step in preprocessing, especially when dealing with data from various sensors and different scales. Normalization ensures that the model is not biased toward features with larger magnitudes. Common normalization techniques include min-max scaling and Z-score standardization. Min-max scaling adjusts the data to fall between 0 and 1:
$$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
Z-score standardization, on the other hand, transforms the data so that the mean is 0 and the standard deviation is 1:
$$x_{\text{standardized}} = \frac{x - \mu}{\sigma}$$
Here, $\mu$ and $\sigma$ are the mean and standard deviation of the original data, respectively. By following these preprocessing steps, which are guided by mathematical principles, the quality of the sensor data can be significantly enhanced, providing accurate and consistent inputs for subsequent disease detection models. This not only aids training effectiveness but also enhances the stability and reliability of the model in practical applications. Accurate environmental data provide essential auxiliary information that helps models better understand and predict the occurrence and development of plant diseases.
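The following NumPy sketch chains the three preprocessing steps exactly as formulated above (standard-deviation filtering, linear interpolation over gaps, and Z-score standardization); the sample sensor values are invented for illustration.

```python
import numpy as np

def remove_outliers(x, k=3.0):
    """Mark points with |x - mu| > k * sigma as missing (NaN)."""
    mu, sigma = np.nanmean(x), np.nanstd(x)
    return np.where(np.abs(x - mu) > k * sigma, np.nan, x)

def interpolate_missing(x):
    """Linearly interpolate NaN gaps, assuming uniform time steps."""
    x = np.asarray(x, dtype=float).copy()
    idx, mask = np.arange(len(x)), np.isnan(x)
    x[mask] = np.interp(idx[mask], idx[~mask], x[~mask])
    return x

def z_score(x):
    """Standardize to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

raw = np.array([55.0, 56.2, 57.1, 140.0, np.nan, 58.3, 59.0])  # spike + dropout
clean = z_score(interpolate_missing(remove_outliers(raw, k=2.0)))
```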

2.4. Proposed Method

2.4.1. Overall

The method proposed in this study for detecting pear tree diseases using a multimodal large model and attention aggregation module incorporates several key components: a design for a large model handling multimodal data, a multimodal weighting module, an attention aggregation module, and a dynamic regression loss function, as shown in Figure 3.
Each module was designed to optimize the efficiency and accuracy of extracting information from processed multimodal data, thereby enhancing the disease detection performance. Initially, the processed image and sensor data are input into a DL framework consisting of the following main parts (a skeletal implementation is sketched after the list):
  • Multimodal Data Input Layer: The input layer of the model receives processed image data and sensor data. The image data consist of high-resolution agricultural scene images, standardized to meet network input requirements; the sensor data include temperature, humidity, and other information collected from various environmental sensors, also normalized.
  • Large Multimodal Model: This module is the core of the system, containing two parallel subnetworks: a Transformer for processing image data and a multilayer perceptron (MLP) for handling sensor data. The Transformer uses its self-attention structure to effectively extract spatial features from images, while the MLP focuses on extracting environmental features from sensor data. The outputs of these two data streams are later merged for further analysis and processing.
  • Multimodal Weighting Module: At the output end of the multimodal large model, the feature vectors from the Transformer and MLP outputs are fused. A weighting module is introduced here to learn the importance of different modal data, dynamically adjusting the influence of each modality in the final decision.
  • Attention Aggregation Module: After the multimodal features are fused, an attention mechanism is introduced to further refine and enhance the model’s focus on disease-related features. Self-attention layers from the Transformer calculate dependencies among different features, enhancing the model’s ability to capture key information.
  • Dynamic Regression Loss Function: To make the model training more aligned with the actual needs of disease detection, a loss function that combines classification accuracy and regression precision is used. This loss function not only considers the error between the predictions and actual labels but also introduces a dynamic factor to adjust the weight of the error, accommodating different stages and types of disease characteristics.
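To make the data flow concrete, the PyTorch sketch below wires these parts together in simplified form; all dimensions, layer sizes, and head shapes are illustrative assumptions, and the image branch is reduced to a single linear layer standing in for the full Transformer described in the next subsection.

```python
import torch
import torch.nn as nn

class PearDiseaseNet(nn.Module):
    """Skeleton of the pipeline: two encoders, concat fusion, two task heads."""
    def __init__(self, img_dim=768, sen_dim=8, feat_dim=256, num_classes=4):
        super().__init__()
        # Stand-in for the Transformer image branch (Section 2.4.2).
        self.img_branch = nn.Sequential(nn.Linear(img_dim, feat_dim), nn.ReLU())
        # Three-layer MLP for sensor data with batch norm and ReLU.
        self.sen_branch = nn.Sequential(
            nn.Linear(sen_dim, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.BatchNorm1d(feat_dim), nn.ReLU())
        # Fusion layer: f_combined = W_fusion * Concat(f_img, f_sen) + b_fusion.
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)
        # Heads for disease classification and box/severity regression.
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.reg_head = nn.Linear(feat_dim, 4)

    def forward(self, img_feat, sen_feat):
        f_img, f_sen = self.img_branch(img_feat), self.sen_branch(sen_feat)
        f_combined = self.fusion(torch.cat([f_img, f_sen], dim=-1))
        return self.cls_head(f_combined), self.reg_head(f_combined)

logits, boxes = PearDiseaseNet()(torch.randn(2, 768), torch.randn(2, 8))
```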

2.4.2. Design of Large Models for Multimodal Data

In this study, a large multimodal model was specifically designed for the task of pear tree disease detection, aiming to optimize the fusion processing of image and sensor data. This model structure includes an image processing branch based on the Transformer architecture and an MLP branch for processing sensor data, with detailed descriptions of each branch’s design and mathematical model provided below. The image data branch employs the Transformer architecture for its exceptional capability to capture global dependencies in images, as shown in Figure 4.
The core of the Transformer is its self-attention mechanism, which allows the model to consider all positions within the entire image domain when processing images, thereby better understanding the contextual information within the images. Images are initially divided into multiple fixed-size image patches (16 × 16 pixels), each of which is converted into a fixed-dimensional embedding vector through a linear projection layer. These embedding vectors are then fed into the Transformer model. Since the Transformer inherently lacks the ability to process sequence positional information, positional encodings are added to each image patch embedding to provide positional information:
$$z_0 = x_{\text{patch}} + E_{\text{pos}}$$
where $x_{\text{patch}}$ is the embedding vector of the image patch, and $E_{\text{pos}}$ is the positional encoding. The encoder consists of multiple identical layers, each containing a Multi-Head Attention (MHA) mechanism and a simple feed-forward network (FFN), with each sublayer followed by Layer Normalization and residual connections:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

$$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^{O}$$

$$\text{head}_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable parameter matrices, $d_k$ is the dimension of the key vectors, and $h$ is the number of heads. The sensor data branch employs a three-layer MLP for processing data collected from environmental sensors, such as temperature and humidity. The input layer was designed according to the dimensions of the sensor data, with each sensor's data first preprocessed before the input. The fully connected part includes two hidden layers and an output layer, each followed by batch normalization and a ReLU activation function, to learn nonlinear representations of environmental data. After processing the individual modal data, the outputs of the Transformer subnetwork and the MLP subnetwork are sent to a fusion layer, which combines the two types of feature information:
$$f_{\text{fusion}} = \text{Concat}(f_{\text{img}}, f_{\text{sen}})$$

$$f_{\text{combined}} = W_{\text{fusion}} f_{\text{fusion}} + b_{\text{fusion}}$$
Here, Concat denotes the concatenation operation, and $W_{\text{fusion}}$ and $b_{\text{fusion}}$ are the weight and bias parameters of the fusion layer. The advantage of employing the Transformer architecture lies in its outstanding ability to capture global dependencies in images, which is crucial for disease detection, as disease features may appear at any location in the image. Furthermore, the multimodal weighting and fusion strategies ensure that the model can fully utilize complementary information from different data sources, enhancing the accuracy and robustness of disease recognition. This design enables the model not only to precisely identify the presence of diseases but also to maintain good performance under various environmental conditions and at different stages of disease.
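A minimal sketch of the image branch follows, implementing the patch embedding, the positional encoding ($z_0 = x_{\text{patch}} + E_{\text{pos}}$), and a standard Transformer encoder via PyTorch built-ins; the image size, depth, and head count are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """16x16 patchify via strided conv, then add learnable positional encodings."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img_size // patch) ** 2, dim))

    def forward(self, x):                            # x: (B, 3, 224, 224)
        z = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, 768) patch tokens
        return z + self.pos                          # z0 = x_patch + E_pos

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=6)
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
f_img = encoder(tokens).mean(dim=1)                  # pooled image feature
```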

2.4.3. Multimodal Weighting Module

In this study, a multimodal weighting module was designed to fully exploit and utilize information from different data sources, particularly image and sensor data. The core purpose of this module is to dynamically adjust the contributions of different data sources in model decision making, thereby enhancing the accuracy and efficiency of pear tree disease detection. The design of the multimodal weighting module is based on a small neural network, structured to evaluate and fuse features from two primary data sources—images and sensor data. This module accepts feature vectors from the aforementioned Transformer subnetwork and MLP subnetwork as inputs and outputs a weight parameter that adjusts the contribution of these features in the final output, as shown in Figure 5.
Specifically, the module consists of the following parts:
  • Input Layer: This layer receives the feature vectors from the Transformer subnetwork and the MLP subnetwork, with output feature dimensions of $d_{\text{img}}$ and $d_{\text{sen}}$, respectively.
  • Fully Connected Layer: The first fully connected layer takes an input dimension of $d_{\text{img}} + d_{\text{sen}}$, which is the sum of the dimensions of the output vectors from both networks. The output dimension of this layer is set to a higher dimension, such as 256, to ensure sufficient capacity to capture the features extracted from both data sources.
  • Activation Layer: This layer employs the ReLU activation function to enhance the non-linear processing capability, helping the model learn more complex inter-feature dependencies.
  • Weight Output Layer: The final layer is a fully connected layer that outputs a single weight value $\alpha$, activated by a sigmoid function to ensure that the output lies between 0 and 1, representing the relative weight assigned to the image data source versus the sensor data source:

    $$\alpha = \sigma(W_{\text{out}} \cdot h + b_{\text{out}})$$

    Here, $\sigma$ denotes the sigmoid activation function, $W_{\text{out}}$ and $b_{\text{out}}$ are the parameters of the weight output layer, and $h$ is the output of the preceding fully connected layer.
Mathematically, the weighting module adjusts the value of $\alpha$ dynamically, balancing the importance of image and sensor features in the final decision-making process. Specifically, the fused feature vector can be expressed as

$$f_{\text{fusion}} = \alpha f_{\text{img}} + (1 - \alpha) f_{\text{sen}}$$

This design enables the model to adaptively adjust its dependence on different data sources according to the specific application context. For example, in situations where image features are particularly clear, the model can enhance the influence of image data by increasing the value of $\alpha$; conversely, when environmental data are more decisive, the model can decrease the value of $\alpha$, allowing sensor data features to play a greater role. The design advantage of this multimodal weighting module lies in its flexibility and adaptability. In practical applications, the manifestation of pear tree diseases may be influenced by various factors, such as environmental conditions and disease types, where features presented in different data sources may vary. By dynamically adjusting the weights of the two data sources, this module allows the model to automatically optimize its information processing strategy based on the specific circumstances of the current task, thereby enhancing the accuracy and robustness of disease detection. Furthermore, the multimodal weighting module aids the model in addressing difficulties that traditional models may encounter when processing multimodal data, such as information redundancy and noise interference between modalities. Precisely controlling the influence of each modality within the model ensures that the model maximizes the use of relevant information and prevents irrelevant information from interfering with the final decision. This strategy is particularly suitable for handling complex and variable agricultural scenarios and is key to enhancing the performance of pear tree disease detection.
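A compact sketch of this module is given below; the hidden width of 256 matches the description above, while the feature dimensions are placeholder values.

```python
import torch
import torch.nn as nn

class ModalityWeighting(nn.Module):
    """Learn a scalar alpha in (0, 1) and fuse: alpha*f_img + (1-alpha)*f_sen."""
    def __init__(self, d_img=256, d_sen=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_img + d_sen, hidden), nn.ReLU(),  # fully connected + ReLU
            nn.Linear(hidden, 1), nn.Sigmoid())           # weight output layer

    def forward(self, f_img, f_sen):
        alpha = self.net(torch.cat([f_img, f_sen], dim=-1))  # (B, 1)
        return alpha * f_img + (1 - alpha) * f_sen, alpha

# Usage: clearer image features should push alpha toward 1 during training.
fused, alpha = ModalityWeighting()(torch.randn(4, 256), torch.randn(4, 256))
```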

2.4.4. Attention Aggregation Module

In this study, to enhance the accuracy and efficiency of pear tree disease detection, an attention aggregation module was specifically introduced, which is based on the Transformer’s self-attention mechanism but modified and optimized to accommodate the characteristics of multimodal data, as shown in Figure 6.
The standard self-attention mechanism of the Transformer, designed for processing sequence data, allows each element to interact only with other elements within the sequence to calculate attention scores. However, when dealing with multimodal data, such as image and sensor data, the substantial differences in the nature and dimensions of these data sources need to be considered. Traditional self-attention may not suffice to directly handle such modal discrepancies; hence, the attention aggregation module was specially designed in the following aspects:
  • Preprocessing before modal fusion: Unlike directly applying self-attention, the attention aggregation module first preprocesses the data from different modalities through independent linear transformations, adjusting dimensions and standardizing features to better suit subsequent attention calculations.
  • Enhanced interaction mechanism: This module introduces a special interaction mechanism, enabling features from image and sensor data not only to compute attention within their own modality but also to interact between modalities, thus fully exploiting and utilizing cross-modal complementary information.
The core of the attention aggregation module is to enable the model to effectively focus on key features relevant to disease detection through an enhanced self-attention mechanism. The specific mathematical expressions are as follows:
  • Linear Transformations: Firstly, independent linear transformations are applied to the image and sensor features:
    $$Q_{\text{img}} = W_Q^{\text{img}} f_{\text{img}}, \qquad Q_{\text{sen}} = W_Q^{\text{sen}} f_{\text{sen}}$$

    $$K_{\text{img}} = W_K^{\text{img}} f_{\text{img}}, \qquad K_{\text{sen}} = W_K^{\text{sen}} f_{\text{sen}}$$

    $$V_{\text{img}} = W_V^{\text{img}} f_{\text{img}}, \qquad V_{\text{sen}} = W_V^{\text{sen}} f_{\text{sen}}$$

    where $W_Q^{\text{img}}$, $W_K^{\text{img}}$, $W_V^{\text{img}}$, $W_Q^{\text{sen}}$, $W_K^{\text{sen}}$, and $W_V^{\text{sen}}$ are learnable weight matrices.
  • Enhanced Self-Attention Computation: The query ($Q$), key ($K$), and value ($V$) matrices of both image and sensor data are combined for the attention computation:

    $$\text{Attention} = \text{softmax}\!\left(\frac{(Q_{\text{img}} + Q_{\text{sen}})(K_{\text{img}} + K_{\text{sen}})^{T}}{\sqrt{d_k}}\right)(V_{\text{img}} + V_{\text{sen}})$$
Through these designs, the attention aggregation module not only enhances the analysis of inter-feature correlations within the model but also promotes effective information exchange between image and sensor data, which is particularly crucial for complex disease detection tasks. For instance, while image data may visually display the appearance characteristics of a disease, sensor data provide the environmental context of disease development; the combination of these can significantly enhance the accuracy and timeliness of disease recognition. Moreover, the introduction of this module enables the model to more precisely locate key disease features and make decisions based on these findings, reducing information redundancy and interference caused by modal differences. This approach shows distinct advantages in processing agricultural image data with rich backgrounds and complex disease representations, effectively enhancing the overall performance of pear tree disease detection.
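The sketch below implements the two equations above for a single attention head; it assumes, for simplicity, that the image and sensor features have already been projected to token sequences of the same shape.

```python
import math
import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Cross-modal attention: per-modality Q/K/V projections are summed
    before a single softmax-attention step (single head, for clarity)."""
    def __init__(self, dim=256):
        super().__init__()
        self.q_img, self.k_img, self.v_img = [nn.Linear(dim, dim) for _ in range(3)]
        self.q_sen, self.k_sen, self.v_sen = [nn.Linear(dim, dim) for _ in range(3)]
        self.d_k = dim

    def forward(self, f_img, f_sen):       # both: (B, N, dim) token sequences
        q = self.q_img(f_img) + self.q_sen(f_sen)
        k = self.k_img(f_img) + self.k_sen(f_sen)
        v = self.v_img(f_img) + self.v_sen(f_sen)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        return torch.softmax(scores, dim=-1) @ v

out = AttentionAggregation()(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
```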

2.4.5. Dynamic Regression Loss Function

In the application of DL to pear tree disease detection, the selection of the loss function is crucial for the training efficacy and ultimate performance of the model. Traditional Transformer models typically employ cross-entropy loss for classification tasks, which is suitable for pure classification problems. However, the real-world challenge of pear tree disease detection is a complex problem combining classification and regression. Against this backdrop, a dynamic regression loss function has been proposed, which not only integrates classification accuracy and regression precision but also introduces dynamic factors to adjust loss weights, adapting to different stages and types of disease characteristics. The dynamic regression loss function is an extension of traditional loss functions, incorporating a classification loss $L_{\text{cls}}$ and a regression loss $L_{\text{reg}}$ and weighting each sample's loss with a weight factor $w_n$. The mathematical expression is as follows:
$$L = \sum_{n=1}^{N} w_n \left[ \lambda_{\text{cls}} L_{\text{cls}}(y_n, \hat{y}_n) + \lambda_{\text{reg}} L_{\text{reg}}(y_n, \hat{y}_n) \right]$$

where $N$ is the number of training samples, $w_n$ is dynamically adjusted based on the severity and type of the disease, and $\lambda_{\text{cls}}$ and $\lambda_{\text{reg}}$ are hyperparameters used to balance the classification and regression losses. $L_{\text{cls}}$ represents the classification loss, typically employing a cross-entropy loss function. $L_{\text{reg}}$ represents the regression loss, generally utilizing the mean squared error (MSE).
The classification loss $L_{\text{cls}}$ can be defined as

$$L_{\text{cls}}(y, \hat{y}) = -\sum_{c=1}^{C} y_{o,c} \log(\hat{y}_{o,c})$$

where $y_{o,c}$ is the true value of label $y$ in category $c$, and $\hat{y}_{o,c}$ is the predicted value. The regression loss $L_{\text{reg}}$, aimed at predicting aspects such as the size and location of the disease, may use the root mean squared error:

$$L_{\text{reg}}(y, \hat{y}) = \sqrt{\sum_{i=1}^{d} (y_i - \hat{y}_i)^2}$$
where $d$ is the dimension of the regression task, and $y_i$ and $\hat{y}_i$ are the actual and predicted values, respectively. The dynamic regression loss function, by introducing dynamic weights $w_n$, enables the model to adjust each sample's loss contribution according to the different stages and types of disease. The mathematical rationale behind this design is that different types and stages of diseases impact the final detection accuracy variably; some diseases may be more challenging to detect or diagnose, thus requiring more focused attention from the model. Furthermore, by appropriately setting $\lambda_{\text{cls}}$ and $\lambda_{\text{reg}}$, the model is optimized during training to simultaneously enhance the tasks of classification and regression, balancing the performance between identifying the presence of a disease (classification) and the specific characteristics of the disease (such as size, shape, and other regression tasks). This joint optimization strategy significantly enhances the model's generalization ability and robustness, especially in complex agricultural environments where disease manifestations may vary due to multiple factors. Employing a dynamic regression loss function offers clear advantages for pear tree disease detection. Firstly, it allows the model to dynamically adjust the focus of learning based on actual disease characteristics, which is particularly useful in datasets where disease manifestations are diverse and complex. Secondly, this loss function supports the model in more effectively utilizing multimodal data (image and sensor data), improving the interpretation and utilization of the rich information contained within these data through the joint training of classification and regression, thus enhancing detection accuracy and timeliness.
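A minimal PyTorch sketch of this loss follows; the per-sample weights $w_n$ are supplied externally (here invented values standing in for stage/type-dependent weights), and cross-entropy and the root of the summed squared errors implement $L_{\text{cls}}$ and $L_{\text{reg}}$ as defined above.

```python
import torch
import torch.nn.functional as F

def dynamic_regression_loss(cls_logits, cls_targets, reg_preds, reg_targets,
                            w, lam_cls=1.0, lam_reg=1.0):
    """L = sum_n w_n * (lam_cls * L_cls_n + lam_reg * L_reg_n)."""
    l_cls = F.cross_entropy(cls_logits, cls_targets, reduction="none")  # (N,)
    l_reg = torch.sqrt(((reg_preds - reg_targets) ** 2).sum(dim=-1))    # (N,)
    return (w * (lam_cls * l_cls + lam_reg * l_reg)).sum()

logits = torch.randn(8, 3)                      # 3 hypothetical disease classes
targets = torch.randint(0, 3, (8,))
boxes, gt = torch.rand(8, 4), torch.rand(8, 4)
w = torch.tensor([1.0, 1.5, 1.0, 2.0, 1.0, 1.0, 1.5, 1.0])  # harder cases up-weighted
loss = dynamic_regression_loss(logits, targets, boxes, gt, w)
```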

2.5. Experimental Setup

2.5.1. Software and Hardware Platform

In this study, a comprehensive software and hardware platform was utilized to support large complex multimodal models and the attention aggregation module for pear tree disease detection tasks. High-performance computing resources and flexible, powerful software support are necessary for processing and analyzing extensive image and sensor data.
Python was chosen as the primary programming language due to its wide application in scientific computing and machine learning. The Python ecosystem includes numerous powerful libraries essential for building and testing complex DL models. Specifically, PyTorch was selected as the main DL framework. Renowned for its intuitive API and flexible computational graph, PyTorch is widely used in both research and industry. It supports automatic differentiation and dynamic neural networks and efficiently performs tensor operations on GPUs, facilitating efficient model training and inference. Additional tools, such as NumPy for numerical calculations, Pandas for data handling and analysis, and Matplotlib for various plotting and visualization tasks, formed the foundation of our data processing and model development. On the hardware front, high-performance computing servers equipped with NVIDIA Tesla V100 GPUs were selected. The Tesla V100 GPU, designed for DL and high-performance computing, offers exceptional floating-point performance and massive parallel processing capabilities, ideal for training and testing models. The powerful computational capabilities of this GPU significantly reduce the model training time and enhance the experimental iteration speed. The server is also equipped with substantial SSD storage and ample RAM, ensuring rapid data read/write and processing, thus maintaining the smooth operation of training and testing processes. This combination of software and hardware enables the construction of an efficient and stable experimental environment. It allows for the rapid iteration and testing of new model architectures and parameters and handles large-scale datasets effectively, advancing the research and application of pear tree disease detection technology.

2.5.2. Training Strategy

The core of the training strategy involves the appropriate setting and adjustment of various hyperparameters, as well as the adoption of scientific evaluation methods to verify the model’s generalization ability and robustness. The selection of hyperparameters critically influences the training outcome and the ultimate performance of the model. In this research, key hyperparameters include the learning rate, batch size, and number of training iterations. The learning rate, a crucial parameter controlling the speed of model weight updates, was initially set to 0.001, a common starting point suitable for most DL tasks. To address potential issues of overshooting or premature convergence during training, a learning rate decay strategy was implemented. Specifically, if there is no significant improvement in model validation performance for several consecutive epochs, the learning rate is automatically halved, aiding the model in fine-tuning as it approaches an optimal solution.
The dataset was divided into training, validation, and testing sets at a ratio of 7:2:1, respectively. This division ensures that a significant portion of the data is utilized for model training while still providing ample data for validation and independent testing to gauge model generalization. The batch size, which is the number of samples processed before the model is updated, directly affects the memory consumption and speed of model training. The choice of batch size in this study depended on the GPU memory capacity of our servers. Typically, a batch size of 32 or 64 is selected, balancing efficiency and performance. Larger batch sizes enhance GPU utilization and training speed but also require more memory and may impact the quality of model convergence. The number of training iterations, which is the count of complete passes through the entire training dataset, is a crucial parameter determining the sufficiency of model training. In this study, the number of iterations was based on the model’s performance on a validation set. A large upper limit, such as 100 epochs, was set, but training was stopped early if there was no improvement in the validation performance after 20 consecutive epochs.
To comprehensively assess the model performance and generalization ability, the K-fold cross-validation method was employed. In K-fold cross-validation, the entire dataset is evenly divided into K subsets. In each validation round, K−1 subsets are used as training data, and the remaining subset serves as the test data. This method efficiently utilizes limited data resources and reduces the likelihood of model evaluation results being influenced by random data splits. In this study, 5-fold cross-validation was chosen, meaning that each subset was used as the test set once, while the other four subsets were used for training. This approach ensures more robust and reliable model performance evaluation results.
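The training schedule described above can be sketched as follows; the stand-in model and random validation metric are placeholders for the full pipeline.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 3)                        # placeholder for the full model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when the validation metric plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5)

best, bad_epochs = 0.0, 0
for epoch in range(100):                       # upper limit of 100 epochs
    # ... one pass over the 70% training split would go here ...
    score = torch.rand(1).item()               # stand-in validation F1-Score
    scheduler.step(score)
    if score > best:
        best, bad_epochs = score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= 20:                   # early stop after 20 stagnant epochs
            break
```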

2.6. Performance Metrics

In the application of DL for plant disease detection, accurately assessing the performance of the model is crucial. Four core performance metrics were employed in this study to comprehensively evaluate the model’s recognition capabilities: precision, recall, accuracy, and F1-Score. These metrics reflect the model’s performance in pear tree disease detection tasks from various perspectives, aiding in understanding the model’s utility in practical applications. Precision measures the accuracy with which the model identifies disease samples, reflecting the proportion of correctly identified samples among those classified as diseased. It particularly focuses on the model’s false positive rate, i.e., instances where non-diseased samples are mistakenly marked as diseased. The formula for calculating precision is
$$\text{Precision} = \frac{TP}{TP + FP}$$

where $TP$ (True Positives) denotes the number of samples correctly identified as diseased, and $FP$ (False Positives) represents the number of samples incorrectly identified as diseased. A high precision indicates fewer false positives, suggesting that the model is cautious when labeling samples as diseased, with a low rate of false alarms. Recall measures the proportion of identified diseased samples out of all actual diseased samples, primarily focusing on the model's false negative rate, i.e., actual diseased samples that are missed. Recall directly impacts the timeliness and effectiveness of disease control. The formula for recall is

$$\text{Recall} = \frac{TP}{TP + FN}$$

Here, $FN$ (False Negatives) represents the number of diseased samples incorrectly identified as non-diseased. A high recall indicates that the model captures more diseased samples, even at the cost of including some false positives. Accuracy is the most intuitive performance metric, representing the proportion of all correctly identified samples (both diseased and non-diseased) out of all samples. This metric is effective when the dataset's class distribution is balanced but may provide an overly optimistic estimate in imbalanced situations. The formula for accuracy is

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TN$ (True Negatives) denotes the number of samples correctly identified as non-diseased. Accuracy offers an overall success rate, indicating the model's performance across all classification tasks. The F1-Score is the harmonic mean of precision and recall, a performance metric that considers both equally. It is particularly useful in scenarios where precision and recall are of equal importance. The formula for the F1-Score is

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1-Score is crucial for imbalanced datasets as it balances the impact of precision and recall, providing a more comprehensive assessment of model performance. These detailed performance metrics enable a comprehensive evaluation of the model in pear tree disease detection tasks. Precision and recall help us understand the accuracy and miss rate of the model in identifying diseases, while accuracy and the F1-Score provide an overview of the overall model performance. Together, these metrics ensure that the model can be evaluated and optimized from multiple dimensions to adapt to the practical application needs in the complex and variable agricultural environment.
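As a worked example, the helper below computes the four metrics from raw confusion counts; the counts are invented, chosen so the outputs land near the scores reported later (precision 0.93, recall 0.90, accuracy 0.92, F1-Score 0.91).

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1-Score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Illustrative counts: 90 hits, 7 false alarms, 95 correct rejections, 10 misses.
p, r, a, f1 = detection_metrics(tp=90, fp=7, tn=95, fn=10)
# -> p ~= 0.928, r = 0.900, a ~= 0.916, f1 ~= 0.914
```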

2.7. Baseline Models

In this study, to comprehensively evaluate the proposed large multimodal model and attention aggregation module for pear tree disease detection, several contemporary object detection models were selected as baseline models for comparison. These baseline models included YOLOv5 (You Only Look Once) [28], YOLOv9 [29], RetinaNet [30], EfficientDet [31], and a Detection Transformer (DETR) [32], each with its unique architecture and advantages suitable for various image detection tasks.
The YOLO series of models are renowned for their speed and accuracy as fast real-time object detection systems. YOLOv5 and YOLOv9, newer versions in the series, feature improved architectures and enhanced efficiency. YOLO models treat object detection as a single regression problem, directly mapping image pixels to bounding box coordinates and class probabilities in an end-to-end manner. This approach significantly accelerates processing. The output is represented by a tensor of dimensions $S \times S \times (B \times 5 + C)$, where $S \times S$ represents the grid into which the image is divided, $B$ is the number of bounding boxes predicted per grid cell, 5 denotes the coordinates and confidence score for each box, and $C$ stands for class probabilities. YOLOv5 and YOLOv9 incorporate multi-scale predictions and depthwise separable convolutions to enhance accuracy and speed. Their structures utilize the Cross-Stage Partial Network (CSPNet), which improves the model's learning capability and efficiency. The fundamental loss function includes coordinate loss, confidence loss, and classification loss:
$$L_{\text{total}} = \lambda_{\text{coord}} L_{\text{coord}} + \lambda_{\text{obj}} L_{\text{obj}} + \lambda_{\text{cls}} L_{\text{cls}}$$
where $\lambda_{\text{coord}}$, $\lambda_{\text{obj}}$, and $\lambda_{\text{cls}}$ are the weights for the different loss components. RetinaNet is an efficient single-stage detector that uses focal loss to address the class imbalance problem in object detection. Focal loss is an enhanced cross-entropy loss that reduces the relative loss for well-classified examples, focusing on difficult, easily misclassified samples:
$$L_{\text{FL}} = -\alpha (1 - p_t)^{\gamma} \log(p_t)$$
where $p_t$ is the model's predicted probability for the correct class, with $\alpha$ and $\gamma$ as tuning parameters that control the rate and extent of the down-weighting, effectively enhancing the detection performance for minority classes. EfficientDet is a structurally efficient model that achieves a balance between speed and accuracy through the uniform scaling of network width, depth, and resolution, along with a weighted bidirectional feature pyramid network (BiFPN). BiFPN enhances feature interactions and optimizes feature fusion using an attention mechanism:
$$F_i = \text{Conv}\!\left(\sum_{j} w_j \cdot F_j\right)$$
where $F_i$ is the feature at level $i$ of the pyramid, $F_j$ denotes the input features, and $w_j$ denotes the learned attention weights, indicating the importance of different features during fusion. DETR applies the Transformer architecture to object detection tasks, simplifying traditional detection pipelines through global reasoning and end-to-end training. DETR processes image features using the Transformer's encoder–decoder architecture, combined with a simple feed-forward network that directly predicts object classes and bounding boxes:
$$\text{Output} = \text{FFN}(\text{Decoder}(\text{Encoder}(F)))$$
where $F$ represents the input image features, and FFN is the feed-forward network used for prediction. DETR eliminates the need for complex post-processing steps such as Non-Maximum Suppression (NMS), significantly simplifying the detection workflow. Through comparisons with these advanced baseline models, the performance and practicality of the proposed method for pear tree disease detection in real-world applications can be assessed more comprehensively. The selected baselines span from traditional convolutional networks to the latest Transformer-based approaches, ensuring a thorough and fair evaluation. This comparison clearly identifies the advantages of the proposed method and potential areas for improvement.
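For reference, a binary-form sketch of the focal loss used by RetinaNet is shown below; $\alpha = 0.25$ and $\gamma = 2$ are the commonly used defaults, and the uniform $\alpha$ matches the formula above (implementations often use a class-dependent $\alpha_t$ instead).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """L_FL = -alpha * (1 - p_t)^gamma * log(p_t), binary form."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)   # probability of the true class
    ce = F.binary_cross_entropy_with_logits(    # elementwise -log(p_t)
        logits, targets.float(), reduction="none")
    return (alpha * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(16), torch.randint(0, 2, (16,)))
```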

3. Results and Application

3.1. Disease Detection Experimental Results

The primary aims of the experiment documented in this study were to validate and compare the performance differences between the multimodal large model with an attention aggregation module proposed in this research and other existing advanced models such as RetinaNet, EfficientDet, DETR, and the YOLO series in pear tree disease detection tasks. Through comparing different models across four key performance indicators—precision, recall, accuracy, and F1-Score—the intention was to demonstrate the advantages of the proposed method in handling complex and diverse plant disease image data. The experimental results are displayed in Table 2 and Figure 7 and Figure 8.
Looking specifically at the data in Table 2, it is evident that the performance indicators improve progressively across RetinaNet, EfficientDet, DETR, YOLOv5, and YOLOv9, with the method proposed in this research achieving the best results on all indicators. RetinaNet shows a precision of 0.82, a recall of 0.80, an accuracy of 0.81, and an F1-Score of 0.81. EfficientDet slightly outperforms RetinaNet on all indicators, suggesting that its more effective structure and parameter adjustments positively impact disease detection. DETR, a model utilizing a Transformer for detection, further improves in precision, recall, and F1-Score, illustrating the Transformer's advantage in integrating global information. Both YOLOv5 and YOLOv9, known for balancing speed and performance, exhibit higher precision and recall thanks to optimized network architectures and algorithmic improvements. In particular, YOLOv9 approaches a score of 0.90 on all indicators, highlighting its potential for fast and accurate plant disease detection. The proposed method achieves the highest scores across all performance metrics (precision of 0.93, recall of 0.90, accuracy of 0.92, F1-Score of 0.91), demonstrating the effectiveness of the multimodal data processing and attention aggregation techniques.
From a theoretical analysis perspective, the design and mathematical characteristics of each model directly impact the experimental results. RetinaNet addresses the issue of class imbalance with the focal loss function, making it suitable for handling common sample imbalance problems in disease detection. EfficientDet enhances feature integration with composite scaling and BiFPN, improving the detection accuracy for small targets. DETR simplifies the detection process with a global perspective and end-to-end training but may still face challenges in handling highly complex backgrounds of plant diseases. YOLOv5 and YOLOv9, with their more efficient architectural design and improved anchoring strategies, effectively enhance speed and accuracy, which is especially suitable for scenarios requiring real-time processing. The superior performance of the proposed method across all metrics can be attributed to several design advantages: firstly, the large multimodal model merges information from both image and sensor data, utilizing the complementary strengths of different network structures to enhance the model’s overall ability to recognize disease features. Secondly, the attention aggregation module, through the self-attention mechanism, effectively focuses on key disease features, improving recognition accuracy and efficiency. Lastly, the introduction of a dynamic regression loss function allows the model to adaptively adjust the loss weights during training, offering better recognition performance across various stages and types of diseases. These combined design elements not only enhance model performance but also strengthen the model’s applicability and robustness in complex agricultural settings.

3.2. Detailed Analysis of Pear Tree Disease Detection Performance

This section aims to thoroughly analyze the performance of the proposed pear tree disease detection model across different types of diseases, to validate the model’s precision, recall, accuracy, and F1-score. By assessing the specific detection outcomes for various diseases, this study intended to demonstrate the model’s wide applicability and effectiveness in practical applications. Additionally, the design of this experiment aimed to showcase the effectiveness of technologies such as multimodal fusion and dynamic regression loss functions in enhancing the accuracy of disease detection.
The experimental results presented in Table 3 indicate that the model performs well across different types of pear tree diseases. Specifically, the detection precision for Black spot disease is 0.94, recall is 0.91, accuracy is 0.93, and the F1-Score is 0.92; for Ring pattern disease, the precision is 0.92, recall is 0.89, accuracy is 0.91, and the F1-Score is 0.90; for Black star disease, the precision is 0.93, recall is 0.90, accuracy is 0.92, and the F1-Score is 0.91. These data illustrate the model's consistency and reliability in detecting various diseases, with accuracy rates generally exceeding 0.90, reflecting its technological advancement and practical utility.
From a theoretical and mathematical perspective, the model's exceptional performance across different disease detections is largely due to its strategy of integrating image data with environmental sensor data. The model utilizes a deep learning framework, employing a Transformer-based image branch to extract detailed image features, while a multilayer perceptron (MLP) analyzes environmental data collected from sensors, such as temperature and humidity, which directly affect disease development. In this way, the model not only identifies disease features from images but also adjusts its detection strategies based on environmental factors, thereby achieving a higher detection accuracy in complex natural environments. Additionally, the introduction of the dynamic regression loss function further enhances the model's adaptability to disease detection. This loss function dynamically adjusts error weights according to different stages of disease development, optimizing the model's learning trajectory during training and focusing more on difficult-to-detect or error-prone disease types. This not only improves the model's generalization ability but also ensures rapid and accurate responses to various diseases in practical applications. In summary, a detailed performance analysis across different diseases demonstrates that the model proposed in this paper is theoretically innovative and exhibits an outstanding performance in practical applications. In the future, we plan to further expand the model's application scope to include more types of crop disease detection, advancing the modernization and intelligence of agricultural disease management.

3.3. Model Application

In modern agricultural production, disease detection plays a crucial role in enhancing crop yield and quality. This paper introduces a novel pear tree disease detection system that integrates both a server-side and an iOS mobile component. The system leverages advanced data processing and machine learning technologies to achieve rapid and accurate disease detection. This section will detail the development background, technical implementation, and practical performance of these two systems, as shown in Figure 9.
The server-side system is deployed in Bayannur, Inner Mongolia, a renowned pear-growing region in China characterized by its typical climatic features and range of diseases. The front end of the server uses Vue.js, while the back end combines Node.js with a MongoDB database to create a robust system capable of managing large volumes of data. The system can collect and analyze data in real time from various environmental sensors, such as temperature, humidity, light, and carbon dioxide concentration, which are crucial for monitoring and predicting the development of diseases. Additionally, the server is connected to multiple cameras that continuously monitor the condition of pear orchards and perform disease detection through image recognition technology. The detection results are stored in the database and are reviewed and adjusted by professional technicians every 72 h to continuously optimize and train the model. The iOS mobile end is primarily aimed at farm workers and fruit growers, who can use their smartphones to photograph pear trees. The system analyzes the images immediately and provides feedback on the disease detection results. The iOS application was developed in the Swift programming language, utilizing Apple's Core ML framework to integrate the machine learning model, ensuring efficient and stable operation. The user interface is friendly and easy to operate, allowing users to easily access information on the health status of pear trees and take appropriate preventive measures accordingly.
This system is unique in that it combines multimodal data processing with real-time monitoring technologies, providing a more comprehensive and accurate diagnosis of diseases. Moreover, the system includes a data feedback and model iteration mechanism. With the continual input of field data, the model can adapt to different environmental changes, enhancing the accuracy and reliability of the detection. In practice, this system has been successfully deployed in multiple pear orchards, significantly improving the efficiency of disease management and overall orchard yield. In the future, we plan to further expand the functionality and application scope of the system by introducing more crop types and disease conditions, enhancing the universality and flexibility of the system. Additionally, we aim to explore the use of artificial intelligence and big data technologies to further enhance the intelligence level of the system. For example, by employing DL and image analysis technologies, we can achieve earlier disease warnings and more precise analyses of disease causes, thus contributing to the development of modern precision agriculture.

4. Discussion and Ablation Studies

4.1. Ablation Study on Multimodal Weighting Module

The purpose of this experiment was to validate the efficacy of the multimodal weighting module in pear tree disease detection tasks. Through an ablation study comparing setups with and without the multimodal weighting module, the specific contributions of the module to enhancing the performance of disease detection were explored. The key aspect of this experimental design was to demonstrate how the multimodal weighting module effectively integrates data from different sources (image and sensor data) and enhances the model’s recognition capabilities through weight adjustment. The experimental results shown in Table 4 indicate significant improvements in precision, recall, accuracy, and F1-Score upon the incorporation of the multimodal weighting module, thus confirming its practical value in pear tree disease detection.
Specifically, the results reveal that without the multimodal weighting module, the model achieved a precision of 0.88, a recall of 0.85, an accuracy of 0.87, and an F1-Score of 0.86. Upon integrating the module, precision increased to 0.93, recall to 0.90, accuracy to 0.92, and F1-Score to 0.91. This marked improvement illustrates the module's capability for handling complex data sources.

From a theoretical perspective, the multimodal weighting module is built around a weighting mechanism that allows the model to automatically adjust the weights of input features according to the importance of each data source. Mathematically, the module dynamically adjusts the weight ratios of the merged features by learning the relevance and complementarity of the different modal features. When image data and sensor data are processed through their respective networks (such as the Transformer and MLP), their feature representations must be combined for the final disease judgment. Without the weighting module, this fusion would be a simple stacking or concatenation that ignores each modality's actual contribution to the task at hand, which can lead to redundant information or the neglect of key information. With the module, the system automatically optimizes the weight assigned to each modality in the final decision. For example, under certain environmental conditions, sensor data (such as humidity and temperature) may be more critical for predicting certain diseases, while the weight of image data may be relatively lower when visual distinctions are hard to discern. This dynamic weighting enhances the model's adaptability to the data and improves its generalization ability and accuracy under varied conditions.

The module also improves the handling of data imbalance. In practice, there may be numerical imbalances between different types of disease images and the corresponding environmental data; traditional methods tend to be biased toward the majority class and neglect minor but significant categories. Through the weighting module's adjustments, the model can pay more attention to the features that contribute most to the final disease recognition, enhancing detection accuracy and robustness.
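A minimal sketch of this idea, under the assumption of a simple learned softmax gate over two modality branches (the paper does not publish the module's exact formulation), is as follows.

```python
# Illustrative modality-weighting gate, an assumed simplification of the
# multimodal weighting module rather than the authors' exact design.
import torch
import torch.nn as nn

class ModalityWeighting(nn.Module):
    def __init__(self, dim: int = 32, num_modalities: int = 2):
        super().__init__()
        # Maps the concatenated modality features to one scalar weight each.
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, image_feat: torch.Tensor, sensor_feat: torch.Tensor) -> torch.Tensor:
        both = torch.cat([image_feat, sensor_feat], dim=1)
        weights = torch.softmax(self.gate(both), dim=1)     # (B, 2), sums to 1
        # Each modality is rescaled by its learned importance before fusion.
        return torch.cat(
            [weights[:, 0:1] * image_feat, weights[:, 1:2] * sensor_feat], dim=1
        )

fuse = ModalityWeighting()
out = fuse(torch.randn(4, 32), torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 64])
```

Because the gate is trained jointly with the rest of the network, the weight assigned to the sensor branch can rise for diseases whose visual cues are weak, mirroring the behavior described above.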

4.2. Ablation Experiment of Dynamic Regression Loss Function

The purpose of this experiment was to explore the impact of different loss functions on the performance of pear tree disease detection models, in order to validate the effectiveness and advantages of the dynamic regression loss function in practical applications. Loss functions, as critical components of the optimization process, directly influence a model's convergence speed and final performance during training. By comparing traditional cross-entropy loss, focal loss, smooth L1 loss, and the newly proposed dynamic regression loss under the same dataset and network architecture, the ability of each loss function to handle samples of differing difficulty, and its impact on model generalization, could be assessed more accurately.
The experimental results shown in Table 5 indicate that the model using cross-entropy loss achieved a precision of 0.88, a recall of 0.85, an accuracy of 0.87, and an F1-Score of 0.86. The model using focal loss showed an improved precision of 0.90, recall of 0.87, accuracy of 0.89, and F1-Score of 0.88. The model using smooth L1 loss further improved to a precision of 0.92, recall of 0.89, accuracy of 0.91, and F1-Score of 0.90. The model employing dynamic regression loss demonstrated the highest performance with a precision of 0.93, recall of 0.90, accuracy of 0.92, and F1-Score of 0.91. These data clearly show that the dynamic regression loss function performed optimally on all assessment metrics, demonstrating its effectiveness and superiority in handling pear tree disease detection tasks.
The different results can primarily be attributed to the design principles and mathematical characteristics of each loss function. Cross-entropy loss, the most commonly used loss function for classification, optimizes the model by minimizing the discrepancy between the predicted probability distribution and the true distribution; however, it does not account for class imbalance or the varying difficulty of samples. Focal loss modifies cross-entropy loss by increasing the weights of samples that are difficult to classify correctly, thereby addressing class imbalance; this makes the model pay more attention to hard samples, improving recall and precision, although it may still not adequately handle all types of errors, especially in regression tasks. Smooth L1 loss, which combines the advantages of L1 and L2 losses, is commonly used in regression problems; it is less sensitive to outliers and balances gradient behavior effectively, making it suitable for tasks with continuous output values. The dynamic regression loss introduces a dynamic adjustment mechanism that reweights the loss according to the model's performance over the course of training; it therefore considers not only per-sample errors but also the model's overall training state, improving adaptability and generalization through real-time adjustment.
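To make the contrast concrete, the sketch below implements focal loss as defined by Lin et al. [30] alongside a hypothetical dynamically reweighted loss. The dynamic variant, which scales each sample's loss by a running per-class error estimate, is an illustrative assumption about how such an adjustment can work, not the paper's exact dynamic regression loss.

```python
# Focal loss (standard form) vs. a hypothetical dynamically weighted loss.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0):
    # Down-weights well-classified (easy) samples via the (1 - p_t)^gamma factor.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def dynamic_loss(logits, targets, class_error_rate):
    # Assumed dynamic scheme: classes the model currently misclassifies more
    # often (tracked during training) receive proportionally larger weights.
    ce = F.cross_entropy(logits, targets, reduction="none")
    weights = 1.0 + class_error_rate[targets]
    return (weights * ce).mean()

logits, targets = torch.randn(8, 14), torch.randint(0, 14, (8,))
err = torch.rand(14)                         # running per-class error estimate
print(focal_loss(logits, targets).item(), dynamic_loss(logits, targets, err).item())
```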

4.3. Model Robustness Validation

4.3.1. Validation of Different Data Sources

This section explores the impact of three different data inputs on model performance in pear tree disease detection: sensor data only, image data only, and a combination of the two. The experiment was designed to validate the effectiveness of multimodal data processing and to quantify the performance of each type of data used alone or in combination, thus clarifying the contribution of each data source to detection accuracy.
Firstly, the experimental results, shown in Table 6, indicate that when the model relies solely on sensor data, the precision, recall, accuracy, and F1-Score are low: 0.38, 0.35, 0.37, and 0.37, respectively. Although sensor data such as temperature, humidity, and light levels provide useful information about environmental conditions, they are ineffective on their own for identifying specific disease states: these environmental factors are related to disease development but do not directly reflect disease characteristics, so they lack the detail needed to differentiate between disease types.

When the model relies solely on image data, performance improves substantially, with precision, recall, accuracy, and F1-Score of 0.85, 0.83, 0.84, and 0.83, respectively. Image data directly capture the appearance features of diseases, such as changes in leaf color, spots, and deformities, and modern image processing techniques can extract these key features from complex backgrounds, so image-only detection already achieves a high recognition level.

The best performance, however, is achieved when sensor and image data are combined, with precision, recall, accuracy, and F1-Score of 0.90, 0.86, 0.89, and 0.88, respectively, demonstrating that multimodal fusion further enhances disease detection. Although sensor data alone are of limited use, combined with images they provide contextual information that helps the model adapt to complex scenarios: the same disease may manifest differently under different lighting or humidity conditions, and integrating environmental data allows the model to interpret these variations correctly and avoid misidentification. From a theoretical and mathematical perspective, multimodal fusion broadens the feature space coverage and sharpens its discriminability; by combining high-dimensional image features with the conditional features from sensors, the model can separate disease states with more complex decision boundaries, improving overall detection accuracy. Multimodal learning typically relies on feature-level interaction and integration techniques such as feature concatenation, feature transformation, and cross-modal feature learning, which ensure that features from different sources are integrated effectively.
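Of the fusion techniques just mentioned, cross-modal feature learning is the least self-explanatory, so the following minimal sketch shows one common form: scaled dot-product attention in which sensor features query spatial image features. The dimensions and the single-head design are illustrative assumptions, not the paper's exact mechanism.

```python
# Illustrative cross-modal attention: environmental context (the query)
# decides which visual regions (the keys/values) to emphasize.
import torch
import torch.nn.functional as F

def cross_modal_attention(sensor_feat, image_tokens):
    # sensor_feat: (B, D) query; image_tokens: (B, N, D) spatial image features.
    q = sensor_feat.unsqueeze(1)                          # (B, 1, D)
    scores = q @ image_tokens.transpose(1, 2)             # (B, 1, N)
    attn = F.softmax(scores / image_tokens.shape[-1] ** 0.5, dim=-1)
    return (attn @ image_tokens).squeeze(1)               # (B, D) attended feature

# e.g., a 7x7 feature map flattened into N = 49 tokens of dimension 32
out = cross_modal_attention(torch.randn(2, 32), torch.randn(2, 49, 32))
print(out.shape)  # torch.Size([2, 32])
```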

4.3.2. Other Dataset Validation

This section explores the performance of the proposed pear tree disease detection model across different datasets to assess its generalization ability and adaptability. The experiment was designed to verify the robustness and effectiveness of applying the same model to data from diverse sources and types. It compares the model's performance on the Rice Disease dataset, shown in Figure 10, and the PlantDoc dataset, shown in Figure 11, which represent different crops and disease environments, providing a critical perspective for assessing generalization.
Initially, the experimental results, shown in Table 7, indicate that on the Rice Disease dataset the proposed model achieved a precision of 0.92, a recall of 0.89, an accuracy of 0.90, and an F1-Score of 0.90, demonstrating high identification accuracy and reliability in recognizing and classifying rice disease states. The Rice Disease dataset likely shares image features and environmental backgrounds with the pear tree disease dataset, such as the color and shape of leaf lesions, allowing the model's recognition capabilities to transfer effectively.

The model's performance declined slightly on the PlantDoc dataset, with a precision of 0.90, a recall of 0.85, an accuracy of 0.88, and an F1-Score of 0.88. Despite the reduction, the model still shows considerable generalization capability. PlantDoc covers a wider variety of crops and a broader range of disease types, introducing more complex image features and background noise, and the decline may stem from training optimizations tuned to specific pear tree disease features that are less prominent, or appear differently, in PlantDoc's crops and diseases.

From a theoretical and mathematical perspective, these performance differences reflect the model's adaptability and robustness under shifting data distributions. Because training was primarily based on pear tree characteristics, performance can degrade when the model is applied to datasets with different statistical properties; generalization is also shaped by the model's learning mechanisms, including its feature extraction capability, the design of the loss function, and the efficiency of the optimization procedure. For datasets such as PlantDoc, further parameter adjustments or more advanced data augmentation may be needed to improve adaptability.

In summary, this experiment confirms the model's adaptability and robustness across datasets and points to ways of further enhancing generalization. Future work can optimize the model's architecture and learning strategies, for example by incorporating domain adaptation techniques or cross-domain feature learning, to address a broader range of datasets and practical application scenarios and thereby improve practicality and efficiency.
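As one example of the adjustments suggested above, a lightweight adaptation to a new dataset can freeze the shared feature extractors and fine-tune only the classification head on the target crop. The sketch below reuses the illustrative model from the earlier listing with a hypothetical number of target classes; it is not the authors' transfer procedure.

```python
# Hypothetical head-only fine-tuning for a new dataset (e.g., PlantDoc).
import torch

model = FusionDiseaseClassifier(num_classes=20)   # assumed target class count
for branch in (model.cnn, model.mlp):
    for p in branch.parameters():
        p.requires_grad = False                   # keep feature extractors fixed

optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative update on a dummy batch of images and sensor readings:
images, sensors = torch.randn(8, 3, 224, 224), torch.randn(8, 4)
labels = torch.randint(0, 20, (8,))
loss = criterion(model(images, sensors), labels)
loss.backward()
optimizer.step()
```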

4.4. Future Works

Despite certain achievements in pear tree disease detection, especially in the application of multimodal data processing and attention aggregation techniques, some limitations and directions for future research remain. Firstly, although the proposed model exhibits good adaptability and generalization across various crop disease detection tasks, there is room for improvement in the diversity and complexity of the datasets. The current experiments relied primarily on public datasets from Kaggle, which cover a variety of crops and disease types; however, the manifestation of crop diseases in real-world agricultural production may be more complex and variable. For example, the effects of environmental factors such as changes in light and humidity are not fully simulated, which may limit the model's effectiveness in practical applications.

Secondly, although this study enhanced the model's adaptability to different disease stages and types through a dynamic regression loss function, the loss design can be refined further. The current dynamic adjustments rely mainly on performance feedback during training, which could lead to overfitting or underfitting in extreme cases. Future work could explore more sophisticated loss designs, such as incorporating prior knowledge and environmental variables, to achieve more refined adjustment strategies.

Moreover, the model's computational efficiency and resource consumption pose significant challenges. Despite the use of efficient structures such as Transformers and multilayer perceptrons, the demand for computational power remains high when processing large-scale agricultural image data. In practical agricultural deployments, where the model may need to run on edge devices, resource limitations could become a bottleneck. Future research could explore model compression and optimization, for instance through knowledge distillation and network pruning, to reduce the parameter count and computational demands.

Finally, the methods for model evaluation and validation also need enhancement. The evaluation here was based primarily on traditional metrics such as precision and recall, which reflect overall performance but do not quantify the economic benefits and practical value of the model in specific agricultural scenarios. Future work could develop assessment methods tied more closely to actual production, considering factors such as the timeliness of disease detection and the accuracy of prevention recommendations. By addressing these issues and continually optimizing the model design, more precise, efficient, and practical pear tree disease detection technologies can be developed, providing robust support for modern agricultural production.

5. Conclusions

The central objective of this study was to enhance the accuracy and efficiency of pear tree disease detection, which holds significant practical implications for agricultural production. Utilizing multimodal data and advanced attention aggregation techniques, a novel disease detection model was developed that demonstrates superior performance across various metrics, particularly precision, recall, and accuracy. Experimental results show that the proposed model excels in pear tree disease detection, achieving an accuracy of 92%, with precision and recall of 93% and 90%, respectively. This performance is attributed to the multimodal data processing and dynamic regression loss function, which significantly improve the model's ability to capture disease characteristics and adapt to complex data environments.

In terms of multimodal data processing, the model effectively integrates information from images and sensors. Through a dynamically adjusted weighting mechanism, it adaptively alters the weights of its input features based on the importance of each data source. This strategy not only enhances performance on specific disease detection tasks but also improves generalizability across crops and environmental conditions.

Furthermore, the attention aggregation module introduced in this study, which employs a self-attention mechanism, strengthens the model's ability to capture key information. This design enables the model to identify and localize disease areas more accurately, especially against complex or visually similar backgrounds, enhancing the accuracy and reliability of detection.

This study also included several ablation experiments to verify the contribution of each component. Comparisons between models with and without the multimodal weighting module, and among different loss functions, showed that the weighting module significantly improved precision and recall, while the dynamic regression loss function achieved the best results across all performance metrics. These ablation experiments not only validate the effectiveness of each component but also provide a clear direction for future optimization. In addition, a web application was developed for real-time interaction with the model, enabling users to upload images, apply the detection model, and receive instantaneous diagnostic results, which further enhances the system's practical utility for end users in agricultural settings.

Lastly, the model was validated on disease detection in other crops using additional crop disease images from the Kaggle dataset. The results demonstrate that the proposed method also performs excellently on these tasks, with accuracy and precision surpassing those of existing models, further confirming the generalizability and practical value of the method.
Overall, this study not only theoretically proposed a new method for detecting pear tree diseases but also empirically verified the effectiveness and practicality of this method through a series of experiments. Future work will further explore disease detection in other types of crops and attempt to apply this method in a broader range of agricultural production scenarios, aiming to provide more precise and efficient technological support for modern agricultural production.

Author Contributions

Conceptualization, T.H., X.L. and N.Z.; data curation, N.Z., J.X., J.H., M.J. and Z.Z.; formal analysis, J.X. and X.W.; funding acquisition, M.D.; investigation, X.W.; methodology, T.H. and N.Z.; project administration, Z.Z., J.W. and M.D.; resources, J.X., J.H., M.J. and Z.Z.; software, X.L.; validation, T.H., X.L. and X.W.; visualization, J.H. and M.J.; writing—original draft, T.H., X.L., N.Z., J.X., X.W., J.H., M.J., Z.Z., J.W. and M.D.; writing—review and editing, J.W. and M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research and Demonstration of Construction and Green Control Technology of Forestry Pest Monitoring Network in Taihang Mountains (Grant No. 2023A02NY002).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, L.; Zhang, S.; Wang, B. Plant disease detection and classification by deep learning—A review. IEEE Access 2021, 9, 56683–56698.
  2. Zhang, Y.; Wa, S.; Liu, Y.; Zhou, X.; Sun, P.; Ma, Q. High-accuracy detection of maize leaf diseases CNN based on multi-pathway activation function module. Remote Sens. 2021, 13, 4218.
  3. Gu, Y.H.; Yin, H.; Jin, D.; Zheng, R.; Yoo, S.J. Improved multi-plant disease recognition method using deep convolutional neural networks in six diseases of apples and pears. Agriculture 2022, 12, 300.
  4. Khakimov, A.; Salakhutdinov, I.; Omolikov, A.; Utaganov, S. Traditional and current-prospective methods of agricultural plant diseases detection: A review. In IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2022; Volume 951, p. 012002.
  5. Li, Q.; Ren, J.; Zhang, Y.; Song, C.; Liao, Y.; Zhang, Y. Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
  6. Ouhami, M.; Hafiane, A.; Es-Saady, Y.; El Hajji, M.; Canals, R. Computer vision, IoT and data fusion for crop disease detection using machine learning: A survey and ongoing research. Remote Sens. 2021, 13, 2486.
  7. Zhang, Y.; Lv, C. TinySegformer: A lightweight visual segmentation model for real-time agricultural pest detection. Comput. Electron. Agric. 2024, 218, 108740.
  8. Tian, H.; Wang, T.; Liu, Y.; Qiao, X.; Li, Y. Computer vision technology in agricultural automation—A review. Inf. Process. Agric. 2020, 7, 1–19.
  9. Moupojou, E.; Tagne, A.; Retraint, F.; Tadonkemwa, A.; Wilfried, D.; Tapamo, H.; Nkenlifack, M. FieldPlant: A dataset of field plant images for plant disease detection and classification with deep learning. IEEE Access 2023, 11, 35398–35410.
  10. Jung, M.; Song, J.S.; Shin, A.Y.; Choi, B.; Go, S.; Kwon, S.Y.; Park, J.; Park, S.G.; Kim, Y.M. Construction of deep learning-based disease detection model in plants. Sci. Rep. 2023, 13, 7331.
  11. Zhang, Y.; Liu, X.; Wa, S.; Liu, Y.; Kang, J.; Lv, C. GenU-Net++: An Automatic Intracranial Brain Tumors Segmentation Algorithm on 3D Image Series with High Performance. Symmetry 2021, 13, 2395.
  12. Zhou, X.; Chen, S.; Ren, Y.; Zhang, Y.; Fu, J.; Fan, D.; Lin, J.; Wang, Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics 2022, 11, 911.
  13. Lin, X.; Wa, S.; Zhang, Y.; Ma, Q. A dilated segmentation network with the morphological correction method in farming area image series. Remote Sens. 2022, 14, 1771.
  14. Zhang, L.; Zhang, Y.; Ma, X. A New Strategy for Tuning ReLUs: Self-Adaptive Linear Units (SALUs). In Proceedings of the ICMLCA 2021—2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; VDE: Frankfurt am Main, Germany, 2021; pp. 1–8.
  15. Zhang, Y.; He, S.; Wa, S.; Zong, Z.; Lin, J.; Fan, D.; Fu, J.; Lv, C. Symmetry GAN detection network: An automatic one-stage high-accuracy detection network for various types of lesions on CT images. Symmetry 2022, 14, 234.
  16. Trong, V.H.; Gwang-hyun, Y.; Vu, D.T.; Jin-young, K. Late fusion of multimodal deep neural networks for weeds classification. Comput. Electron. Agric. 2020, 175, 105506.
  17. Almadani, B.; Mostafa, S.M. IIoT based multimodal communication model for agriculture and agro-industries. IEEE Access 2021, 9, 10070–10088.
  18. Li, Y.; Nie, J.; Chao, X. Do we really need deep CNN for plant diseases identification? Comput. Electron. Agric. 2020, 178, 105803.
  19. Zhang, Y.; Yang, X.; Liu, Y.; Zhou, J.; Huang, Y.; Li, J.; Zhang, L.; Ma, Q. A time-series neural network for pig feeding behavior recognition and dangerous detection from videos. Comput. Electron. Agric. 2024, 218, 108710.
  20. Zhou, J.; Li, J.; Wang, C.; Wu, H.; Zhao, C.; Teng, G. Crop disease identification and interpretation method based on multimodal deep learning. Comput. Electron. Agric. 2021, 189, 106408.
  21. Patil, R.R.; Kumar, S. Rice-Fusion: A multimodality data fusion framework for rice disease diagnosis. IEEE Access 2022, 10, 5207–5222.
  22. Ilavendhan, A.; Sreenidhi, R. Multi-Modal Deep Learning for Leaf Disease Detection: Integrating Visual and Non-Visual Data Sources. In Proceedings of the 2023 6th International Conference on Recent Trends in Advance Computing (ICRTAC), Chennai, India, 14–15 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 186–191.
  23. Sarkar, C.; Gupta, D.; Gupta, U.; Hazarika, B.B. Leaf disease detection using machine learning and deep learning: Review and challenges. Appl. Soft Comput. 2023, 145, 110534.
  24. Yang, F.; Li, F.; Zhang, K.; Zhang, W.; Li, S. Influencing factors analysis in pear disease recognition using deep learning. Peer-to-Peer Netw. Appl. 2021, 14, 1816–1828.
  25. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with Cutout. arXiv 2017, arXiv:1708.04552.
  26. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032.
  27. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
  28. Wang, H.; Shang, S.; Wang, D.; He, X.; Feng, K.; Zhu, H. Plant disease detection and classification method based on the optimized lightweight YOLOv5 model. Agriculture 2022, 12, 931.
  29. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972.
  30. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  31. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
  32. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
Figure 1. Samples of the dataset. (A) is Black spot disease, (B) is Ring pattern disease, (C) is Black star disease, (D) is Pear rust, (E) is Pear anthracnose, (F) is Pear gall midge, (G) is Looper moth, (H) is Pear leaffolder, (I) is Pear aphid, (J) is Pear stink bug, (K) is Spiny bollworm, (L) is Pear stem wasp, (M) is Fruit-sucking moth, and (N) is Pear caterpillar.
Figure 2. Samples of three image data augmentation methods.
Figure 3. Framework of the pear tree disease detection system integrating multimodal data processing and attention aggregation technology. The depicted system comprises branches for processing image data and environmental sensor data, a multimodal weighting module, and a dynamic regression loss function, highlighting the comprehensive process from data preprocessing to final disease identification.
Figure 4. Schematic of the design structure for the large model tailored to multimodal data. The image showcases key steps in the multimodal data fusion process, including feature extraction (Extract) from image and environmental sensor data, feature transformation (Transform), and the fusion (Fuse) of features from different modalities. The diagram also specifically marks the attention connection mechanism that enhances the model’s focus and the integration of crucial features across two different dimensions (Attention Dimension 1 and Attention Dimension 2).
Figure 5. Schematic of the multimodal data fusion module structure. This figure demonstrates the specific process of combining features extracted from image data through convolutional networks with sensor data via transformation and fusion operations within the pear tree disease detection system. Here, ‘E’ represents the feature extraction process, ‘T’ denotes the feature transformation process, and ‘F’ indicates the fusion points of different modal features, aimed at enhancing the model’s integration ability for disease characteristics.
Figure 6. Structure schematic of the attention aggregation module. This illustration contrasts the feature aggregation process with (a) and without (b) the attention mechanism. In the attention aggregation module, after processing the features through a series of linear layers and activation functions, the scaled dot-product attention mechanism is applied to aggregate information finely. This is followed by concatenation and linear transformation, culminating in a mean pooling to obtain the aggregated feature representation. In contrast, the structure without attention proceeds directly to the mean pooling of the features, lacking the nuanced weighting of inter-feature relationships.
Figure 7. ROC plots by baseline models and proposed method.
Figure 8. Precision–recall plots by baseline models and proposed method.
Figure 9. Smart agriculture system based on proposed method.
Figure 10. Rice Disease dataset.
Figure 11. PlantDoc dataset.
Table 1. Details of dataset.

Disease/Pest | Number | Place | Equipment
Black spot disease | 864 | Bayannur, Linhe District | DJI Phantom 4
Ring pattern disease | 579 | Bayannur, Linhe District | Nikon D850
Black star disease | 388 | Bayannur, Linhe District | DJI Phantom 4
Pear rust | 921 | Bayannur, Linhe District | DJI Phantom 4
Pear anthracnose | 654 | Zhuozhou Experimental Garden; Bayannur, Linhe District | Nikon D850
Pear gall midge | 355 | Zhuozhou Experimental Garden; Bayannur, Urad Front Banner; Wuyi Community | DJI Phantom 4
Looper moth | 629 | Zhuozhou Experimental Garden; Bayannur, Linhe District | Nikon D850
Pear leaffolder | 533 | Bayannur, Urad Front Banner; Wuyi Community, Linhe District | Nikon D850
Pear aphid | 770 | Bayannur, Urad Front Banner; Wuyi Community; Shuguang Township | DJI Phantom 4
Pear stink bug | 298 | Zhuozhou Experimental Garden; Bayannur, Shuguang Township | DJI Phantom 4
Spiny bollworm | 446 | Zhuozhou Experimental Garden; Bayannur, Shuguang Township | Nikon D850
Pear stem wasp | 588 | Zhuozhou Experimental Garden; Bayannur, Shuguang Township | DJI Phantom 4
Fruit-sucking moth | 669 | Bayannur, Urad Front Banner; Wuyi Community, Linhe District | DJI Phantom 4
Pear caterpillar | 531 | Bayannur, Urad Front Banner; Wuyi Community, Linhe District | Nikon D850
Table 2. Disease detection experimental results.

Model | Precision | Recall | Accuracy | F1-Score
RetinaNet | 0.82 | 0.80 | 0.81 | 0.81
EfficientDet | 0.84 | 0.82 | 0.83 | 0.83
DETR | 0.86 | 0.84 | 0.85 | 0.85
YOLOv5-Large | 0.89 | 0.86 | 0.88 | 0.87
YOLOv9-Large | 0.91 | 0.88 | 0.90 | 0.89
Proposed Method | 0.93 | 0.90 | 0.92 | 0.91
Table 3. Pear tree disease detection performance by disease type.

Disease/Pest | Precision | Recall | Accuracy | F1-Score
Black spot disease | 0.94 | 0.91 | 0.93 | 0.92
Ring pattern disease | 0.92 | 0.89 | 0.91 | 0.90
Black star disease | 0.93 | 0.90 | 0.92 | 0.91
Pear rust | 0.94 | 0.92 | 0.93 | 0.93
Pear anthracnose | 0.92 | 0.89 | 0.91 | 0.90
Pear gall midge | 0.91 | 0.88 | 0.90 | 0.89
Looper moth | 0.93 | 0.91 | 0.92 | 0.92
Pear leaffolder | 0.92 | 0.89 | 0.91 | 0.90
Pear aphid | 0.93 | 0.90 | 0.92 | 0.91
Pear stink bug | 0.94 | 0.91 | 0.93 | 0.92
Spiny bollworm | 0.92 | 0.90 | 0.91 | 0.91
Pear stem wasp | 0.93 | 0.91 | 0.92 | 0.92
Fruit-sucking moth | 0.94 | 0.92 | 0.93 | 0.93
Pear caterpillar | 0.91 | 0.88 | 0.90 | 0.89
Table 4. Multimodal weighting module ablation experiment results.

Model Configuration | Precision | Recall | Accuracy | F1-Score
Without Multimodal Weighting Module | 0.88 | 0.85 | 0.87 | 0.86
With Multimodal Weighting Module | 0.93 | 0.90 | 0.92 | 0.91
Table 5. Ablation experiment of different loss functions.

Loss Function | Precision | Recall | Accuracy | F1-Score
Cross-Entropy Loss | 0.88 | 0.85 | 0.87 | 0.86
Focal Loss | 0.90 | 0.87 | 0.89 | 0.88
Smooth L1 Loss | 0.92 | 0.89 | 0.91 | 0.90
Dynamic Regression Loss | 0.93 | 0.90 | 0.92 | 0.91
Table 6. Results of different data sources.

Data Source | Precision | Recall | Accuracy | F1-Score
Only Sensor Data | 0.38 | 0.35 | 0.37 | 0.37
Only Image Data | 0.85 | 0.83 | 0.84 | 0.83
Sensor Data + Image Data | 0.90 | 0.86 | 0.89 | 0.88
Table 7. Detection results from different datasets.

Dataset | Precision | Recall | Accuracy | F1-Score
Rice Disease Dataset | 0.92 | 0.89 | 0.90 | 0.90
PlantDoc Dataset | 0.90 | 0.85 | 0.88 | 0.88
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
