Article

Application of Advanced Deep Learning Models for Efficient Apple Defect Detection and Quality Grading in Agricultural Production

by Xiaotong Gao 1, Songwei Li 1, Xiaotong Su 1, Yan Li 1, Lingyun Huang 1, Weidong Tang 1, Yuanchen Zhang 1,2,* and Min Dong 1,*

1 China Agricultural University, Beijing 100083, China
2 College of Biology and Food Engineering, Anyang Institute of Technology, No. 73 Huanghe Road, Anyang 455000, China
* Authors to whom correspondence should be addressed.
Agriculture 2024, 14(7), 1098; https://doi.org/10.3390/agriculture14071098
Submission received: 10 June 2024 / Revised: 3 July 2024 / Accepted: 5 July 2024 / Published: 9 July 2024

Abstract:
In this study, a deep learning-based system for apple defect detection and quality grading was developed, integrating various advanced image-processing technologies and machine learning algorithms to enhance the automation and accuracy of apple quality monitoring. Experimental validation demonstrated the superior performance of the proposed model in handling complex image tasks. In the defect-segmentation experiments, the method achieved a precision of 93%, a recall of 90%, an accuracy of 91% and a mean Intersection over Union (mIoU) of 92%, significantly surpassing traditional deep learning models such as U-Net, SegNet, PSPNet, UNet++, DeepLabv3+ and HRNet. Similarly, in the quality-grading experiments, the method exhibited high efficiency with a precision of 91%, and both recall and accuracy reaching 90%. Additionally, ablation experiments with different loss functions confirmed the significant advantages of the Jump Loss in enhancing model performance, particularly in addressing class imbalance and improving feature learning. These results not only validate the effectiveness and reliability of the system in practical applications but also highlight its potential in automating the detection and grading processes in the apple industry. This integration of advanced technologies provides a new automated solution for quality control of agricultural products like apples, facilitating the modernization of agricultural production.

1. Introduction

Apples are among the fruits with the highest consumption and trade volumes globally [1], and ensuring their quality is crucial for satisfying consumer demand and maintaining the reputation of suppliers. However, defects such as scratches, spots and irregular shapes pose significant challenges to producers and consumers alike [2,3]. These defects not only affect the visual appeal of apples but also impact their taste, texture and nutritional value [4]. Thus, timely and accurate apple defect detection and quality grading are of great economic value and practical significance.
Traditionally, apple defect detection and quality grading have relied on manual inspection [5], a method that is time-consuming, labor-intensive and highly subjective [6]. Consequently, the demand for automated systems capable of accurately and efficiently detecting defects and grading apples by quality attributes has been increasing. With the rapid advancement of computer vision and deep learning, machine learning offers new solutions for agricultural quality management: crop images can be analyzed automatically, and different quality types recognized and categorized, significantly enhancing the efficiency and accuracy of detection. Initially, infrared spectroscopy was commonly used, but its hardness measurements were imprecise [7], and hyperspectral imaging was later proposed as an upgrade [8]. However, the information obtained from near-infrared and hyperspectral spectroscopy is easily obscured by spectral variations caused by the physical properties of food, and most instruments used in these methods are complex and expensive [9]. Magnetic methods such as magnetic resonance imaging and electrical conductivity, acoustic methods such as ultrasonography and pulse response, and other dynamic methods such as X-ray and CT scanning have also been utilized.
Recent advances in computer vision, machine learning and artificial intelligence have paved the way for the development of systems for apple defect detection and quality grading [10,11]. These systems utilize image-processing techniques to analyze digital images of apples and accurately identify various defects. Furthermore, machine learning algorithms are employed to grade apples based on predetermined criteria such as size, color and shape. Nie et al. proposed a particle swarm optimization-based support vector machine model for grading apples by fruit shape, color and defect features, achieving a classification accuracy of 92% [12]. Sun et al. introduced a structured-illumination reflectance imaging (SIRI) method based on pixel-wise convolutional neural networks to detect early fungal infections in peaches, demonstrating an excellent symptom-detection rate of 97.6% [13]. Anuja et al. further proposed an SVM-based fruit quality-grading system, achieving defect-detection accuracies of 77.24% (k-NN), 82.75% (SRC), 88.27% (ANN) and 95.72% (SVM) [14]. Su et al. developed a band-pass filter fluorescence macro-imaging system for testing the solution safety of celery [15]. Krishna et al. compared manual methods with computer vision in assessing mango attributes, developing multiple linear regression (MLR) models with accuracies exceeding 97.9%, 93.5% and 92.5%, though the monochromatic cameras they used are unsuited to broader scenarios [16]. Alencastre et al., noting that monochromatic images are ill-suited to current smartphone photography, used color image datasets and convolutional neural networks for sugarcane quality detection, doubling performance for the L 01-299 variety and improving it fivefold for the HoCP 09-804 variety, although their data volume was small [17].
Genze et al. further expanded the dataset in size and variety, utilizing convolutional neural network (CNN) architectures to achieve high mean average precision (mAP) values of approximately 97.9%, 94.2% and 94.3% on the held-out test datasets for corn, rye and fescue [18]. Li et al. built upon this foundation, proposing a CNN-based model whose best training and validation accuracies reached 99% and 98.98%, respectively [19]. Zou et al. approached quality detection from an olfactory perspective, proposing an apple quality-grading electronic nose detection system based on computational fluid dynamics simulation and a k-nearest neighbor support vector machine [20]. Hemamalini et al. utilized KNN, SVM, C4.5 and other machine learning methods to classify fruit photos and determine whether the fruit was damaged, but their specificity and sensitivity still require improvement [21].
Wieme et al. utilized deep learning for the quality assessment of fruits, vegetables and mushrooms, with CNN (ResNet/ResNeXt) F1 scores of 0.8952 and 0.8905, enhancing model specificity and sensitivity. However, this approach has drawbacks: as the problem becomes more complex, more data are usually required, increasing training time, especially with hyperspectral imaging [22]. Ismail et al. presented an efficient machine vision system based on state-of-the-art deep learning and stacked ensemble methods, achieving average accuracies of 99.2% and 98.6% on apple and banana test sets, respectively. However, it was based solely on fruit appearance and used only a single view of each fruit, lacking multi-view training [23].
This article introduces a deep learning-based system for apple defect detection and quality grading, aimed at enhancing the accuracy and efficiency of detection. The main contributions of this study are as follows:
  • Firstly, the dataset used in this article underwent data augmentation for pixel-level segmentation tasks and was semantically annotated; through pixel-level annotation, high-precision apple quality detection is achieved, providing defect detection and quality grading in practical applications.
  • Secondly, in terms of model construction, the advantages of Transformers and neural networks were successfully combined. The self-attention mechanism of the Transformer enables the model to capture long-distance dependencies in the image, significantly enhancing the model’s ability to recognize apple defect features. Concurrently, the hierarchical feature extraction capability of neural networks further optimizes the precision of image segmentation.
  • Lastly, an innovative model architecture was designed, incorporating a jump connection Segment Anything Model (SAM) and maximum entropy selection optimization (targeted at Segment Anything segmentation). Through carefully designed models, not only was high precision in defect detection in complex agricultural environments ensured, but the size and computation of the model were also effectively controlled. This design enables it to operate under limited computational resources and meet the needs for real-time segmentation and processing of images.

2. Related Work

Semantic segmentation [24] is a critical task in the field of computer vision and serves as a key computational unit in image processing. It operates on different characteristics of images to perform tasks such as analysis, recognition, matching or retrieval. The objective is to divide a given image into several visually meaningful or interesting regions, classifying each pixel for subsequent image analysis and visual understanding. The development of deep learning in recent years has provided powerful tools for research on semantic segmentation [25]. Notably, traditional image-operator models [26], the UNet model [27] and the SAM model [28] have achieved significant success in semantic segmentation tasks.

2.1. Traditional Operators Applied to Image Grading

Traditional image-processing operators, such as convolutional operators, are employed in CNNs to extract features by computing a weighted sum as a convolution kernel (also referred to as a filter or weight matrix) slides over the input data [29]. Activation functions introduce nonlinearity into the network. This article primarily utilizes the Sobel operator [30], commonly used for edge detection [31]. The Sobel operator, proposed by Irwin Sobel and Gary Feldman, is a classic edge detection operator [32]. The original Sobel operator is an isotropic gradient operator that convolves two 3 × 3 filters with the image to obtain approximate derivatives: one for horizontal and one for vertical changes. The following equation displays the basic computation of the Sobel operator [33]:
$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * I, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * I$$

where $I$ represents the input image, and $G_x$ and $G_y$ are two images containing the approximate horizontal and vertical derivatives, respectively. The $*$ symbol denotes the basic convolution operation between the input image and the two filters.
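For concreteness, a minimal sketch of this computation is given below; the kernel values are the standard Sobel filters, while the function name and the use of SciPy's convolution are illustrative choices rather than part of the grading pipeline described later.

```python
import numpy as np
from scipy.ndimage import convolve

# Standard 3x3 Sobel kernels: G_x responds to horizontal changes,
# G_y to vertical changes.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]], dtype=np.float32)

def sobel_edges(image: np.ndarray) -> np.ndarray:
    """Convolve a grayscale image with both Sobel kernels and return
    the gradient magnitude, a simple edge map."""
    gx = convolve(image.astype(np.float32), SOBEL_X)
    gy = convolve(image.astype(np.float32), SOBEL_Y)
    return np.hypot(gx, gy)  # sqrt(gx^2 + gy^2)
```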
This provides a foundation for image-based fruit quality grading [34]. Although these methods are simple and efficient, they are generally only applicable to image-processing tasks of low complexity.

2.2. UNet

The UNet model [35], a classic network in the field of medical image segmentation, is renowned for its unique symmetrical structure and skip connections [36]. The architecture of UNet is composed of two parts: an encoding path (downsampling) and a decoding path (upsampling). This symmetrical structure enables UNet to efficiently extract and restore detailed information within images.
UNet has been widely applied in various image-segmentation tasks, including defect detection in agricultural products [37]. The model is a typical encoder-decoder structured network [38], suitable for medical image segmentation and other tasks involving small samples or high-dimensional problems [27]. UNet extracts features during the encoding phase through convolution and pooling operations, and restores spatial information in images during the decoding phase through upsampling and transposed convolution operations. The encoder part of UNet can be represented by a function E as follows:
$$E : \mathbb{R}^{H_{\mathrm{in}} \times W_{\mathrm{in}} \times 3} \to \mathbb{R}^{H_{\mathrm{mid}} \times W_{\mathrm{mid}} \times C}$$

where $H_{\mathrm{mid}}$ and $W_{\mathrm{mid}}$ denote the height and width of the intermediate feature maps. The decoder part can be represented by a function $D$, namely:

$$D : \mathbb{R}^{H_{\mathrm{mid}} \times W_{\mathrm{mid}} \times C} \to \mathbb{R}^{H_{\mathrm{out}} \times W_{\mathrm{out}} \times C'}$$

Subsequently, the UNet model can be represented as $F = D \circ E$.
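As an illustration of this composition, the sketch below builds a deliberately shallow encoder and decoder in Keras (the framework used later in this paper) and chains them as F = D ∘ E; a real UNet stacks several such stages and adds skip connections between them, so this is a structural sketch only.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder():
    # E: R^(H_in x W_in x 3) -> R^(H_mid x W_mid x C); one conv +
    # pooling stage stands in for UNet's full downsampling path.
    return tf.keras.Sequential([
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
    ])

def build_decoder(num_classes=2):
    # D: R^(H_mid x W_mid x C) -> R^(H_out x W_out x C'); upsampling
    # restores the spatial resolution lost in the encoder.
    return tf.keras.Sequential([
        layers.UpSampling2D(2),
        layers.Conv2D(num_classes, 3, padding="same", activation="softmax"),
    ])

# F = D ∘ E: the segmentation network is the composition of both paths.
inputs = tf.keras.Input(shape=(256, 256, 3))
outputs = build_decoder()(build_encoder()(inputs))
model = tf.keras.Model(inputs, outputs)
```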

2.3. SAM

The SAM model, a novel image-segmentation framework [39], is characterized by a simple structure that includes a robust image encoder, a prompt encoder and a lightweight mask decoder [40,41]. This deep learning model integrates the advantages of Transformers and CNNs, making it suitable for complex image-segmentation tasks. Furthermore, by separating the image encoder and the prompt encoder/mask decoder [42], the reuse of image embeddings is facilitated, thereby reducing costs. The SAM model also serves as a tool for accelerated annotation; hints provided by users are used by the model to generate and iteratively improve masks, or to operate in a “segment everything” mode, generating masks for multiple objects that users can then select and modify. In agriculture, the SAM model has been applied to delineate agricultural field boundaries [43].
These models have achieved excellent results in their respective application domains, providing valuable insights for this study. However, these models lack specific structures and algorithms for apple defect detection and quality grading, and there is still room for improvement in efficiency and accuracy [44,45]. Therefore, a deep learning-based system for apple defect detection and quality grading has been designed, aimed at enhancing the accuracy and efficiency of detection.

3. Materials and Method

3.1. Dataset Collection

The apple image dataset utilized in this study was primarily collected from two major apple-producing regions in China: the apple orchards in Tianshui, Gansu Province, and Qixia, Yantai City, Shandong Province. These regions are renowned for their unique natural conditions and advanced cultivation techniques, which contribute to the high reputation of their apples in both domestic and international markets. The apple orchards in Tianshui, located in the southeastern part of Gansu Province, offer climatic conditions favorable for apple growth, including significant diurnal temperature variations that are beneficial for the accumulation of sugars in the apples. Apples from Tianshui are known for their bright color and fullness, making them an ideal choice for studying apple quality grading. Conversely, Qixia City, recognized as one of China’s prominent apple production bases, produces apples known for their pleasant taste and crisp texture. The region boasts a long history of apple cultivation, employing representative cultivation techniques and orchard management practices.
Data collection was strategically scheduled during the peak of the apple ripening season to ensure that the images captured reflected the characteristics of the fruit at different stages of maturity. In Tianshui, collection occurred from mid-September to early October, while in Qixia it ran from late August to mid-September. The process involved capturing high-definition images from various angles directly from the apple trees in the orchards. To ensure the diversity and comprehensiveness of the data, the collection team focused on representative apples from each tree, shooting from at least five different angles (including the top, bottom, shoulders and sides) to obtain a comprehensive set of apple images. Establishing grading standards for apple quality is crucial for subsequent data analysis and model training. Apple grading is typically based on international or national standards, aligned with market demands and consumer preferences. This research adheres to Chinese agricultural industry standards and common international trade criteria, classifying apples into four grades:
  • Grade A (Extra): no apparent defects, uniform color, regular shape, diameter greater than 75 mm, no signs of pests or diseases, no mechanical damage, and free from cracks, rot or other physiological defects.
  • Grade B (First): minor imperfections acceptable, such as slight bruising or slightly uneven coloration; diameter ranging from 65 mm to 75 mm; slight signs of pest damage or other minor non-structural defects.
  • Grade C (Second): apparent mild to moderate defects such as blotches or irregular shapes; diameter between 55 mm and 65 mm; moderate signs of pest damage, mechanical injuries or minor cracks.
  • Grade D (Third): significant visible defects, with shapes, colors and textures that do not meet the standards of the other three grades; diameter less than 55 mm or greater than 75 mm; severe pest damage, mechanical injuries or fruit rot.
In this study, approximately 5000 apple images were collected: about 1500 for Grade A, 1300 for Grade B, 1200 for Grade C and 1000 for Grade D, as shown in Table 1. This distribution aids the model in learning to recognize features of apples at different quality levels, enabling accurate automatic grading in practical applications.
Before the commencement of the data collection activities, an on-site inspection of the orchard was conducted by the collection team to evaluate the growth conditions of the apples and the orchard’s lighting conditions, planning the optimal shooting time and angles. Additionally, communication with the orchard managers ensured that the data collection activities would not interfere with the regular operations of the orchard. Professional photography equipment was used for on-site shooting within the orchard. High-quality images were obtained by choosing sunny and well-lit periods for the data collection. During the shooting process, special attention was paid to adjusting the camera settings, such as aperture, shutter speed and ISO sensitivity, to ensure image clarity and color accuracy. Each collection point had a dedicated person responsible for recording data, including the apple variety, tree age and specific shooting time and environmental conditions. The collection team was equipped with high-resolution digital SLR cameras (Nikon D850 (Japan), with a high resolution of 45.75 million pixels) and multiple lenses (including a 50 mm prime lens and a 24–70 mm zoom lens) to meet different shooting needs. Additionally, portable tripods and reflectors were used to stabilize the shots and adjust lighting. All equipment underwent strict dust and moisture protection treatments to ensure stable operation in outdoor environments. The images were taken at a resolution of 1920 × 1080 pixels, with 24-bit color depth, in JPEG format, ensuring detail richness and usability in subsequent processing. Furthermore, all images underwent preliminary digital processing after collection, including exposure adjustment and contrast optimization, as shown in Figure 1, to improve visual effects and analysis accuracy. Finally, all images were formatted and labeled in preparation for image analysis and model training, ensuring the dataset supported efficient model training and guaranteed the model’s broad applicability and accuracy in practical operations.

3.2. Data Augmentation

3.2.1. Basic Enhancement Method

In the fields of computer vision and image processing, data augmentation is a commonly used technique to increase the diversity of a dataset through image transformations. These techniques simulate various shooting conditions that may be encountered in the real world, thereby helping deep learning models to enhance their generalization capabilities to unseen data. This section details four basic image enhancement methods: flipping, cropping, translating and rotating.
Firstly, image flipping operations include horizontal and vertical flips. Mathematically, flipping can be expressed as a coordinate transformation for each pixel. For an image of size $M \times N$ (where $M$ is the height and $N$ is the width), the new coordinates of any point $(x, y)$ after a horizontal flip are $(x, N-1-y)$, and after a vertical flip they are $(M-1-x, y)$. These transformations can be expressed by the following equations:

$$\text{Horizontal flip}: (x, y) \mapsto (x,\ N-1-y),$$

$$\text{Vertical flip}: (x, y) \mapsto (M-1-x,\ y).$$

These operations enable the model to recognize and process objects from different orientations without depending on a specific image orientation. Secondly, the cropping operation is often used to generate a local view of an image. Assuming the cropped region is $m \times n$ with starting point $(x_1, y_1)$, any point $(x, y)$ in the cropped image corresponds to $(x + x_1, y + y_1)$ in the original image. This transformation is mathematically expressed as:

$$\text{Cropping}: (x, y) \mapsto (x + x_1,\ y + y_1),$$

where $(x_1, y_1)$ is randomly selected subject to $x_1 + m \le M$ and $y_1 + n \le N$. This method not only simulates visibility issues caused by camera angles or obstructions but also enhances the model's ability to recognize different regions of the image. Next, the translation operation moves the image in the horizontal or vertical direction. Let the translation vector be $(t_x, t_y)$; then a point $(x, y)$ in the original image moves to $(x + t_x, y + t_y)$:

$$\text{Translation}: (x, y) \mapsto (x + t_x,\ y + t_y),$$

where $t_x$ and $t_y$ can be positive or negative, indicating the distance the image is moved in each direction. Translation helps the model learn to capture object features at different positions, and is particularly useful for robust handling of image edges. Lastly, the rotation operation rotates the image around a point (usually the center) by an angle $\theta$. Let the center coordinates of the image be $(c_x, c_y)$; then the new coordinates of any point $(x, y)$ after rotation are obtained from the rotation matrix $R$:

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix},$$

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = R(\theta) \begin{bmatrix} x - c_x \\ y - c_y \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix}.$$
This transformation enhances the model’s adaptability to changes in the orientation of target objects, especially for objects with axial symmetry (such as apples), where rotational enhancement can significantly improve the model’s flexibility and accuracy in practical applications.
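The four transformations can be implemented directly from the coordinate rules above. The sketch below is a minimal NumPy/SciPy version; the function names and the zero-padding policy for translation are our own illustrative choices.

```python
import numpy as np
from scipy.ndimage import rotate as nd_rotate

def horizontal_flip(img):   # (x, y) -> (x, N-1-y)
    return img[:, ::-1]

def vertical_flip(img):     # (x, y) -> (M-1-x, y)
    return img[::-1, :]

def random_crop(img, m, n, rng=np.random.default_rng()):
    # Start point (x1, y1) drawn so that x1 + m <= M and y1 + n <= N.
    M, N = img.shape[:2]
    x1 = int(rng.integers(0, M - m + 1))
    y1 = int(rng.integers(0, N - n + 1))
    return img[x1:x1 + m, y1:y1 + n]

def translate(img, tx, ty):
    # Shift by (tx, ty); vacated pixels are filled with zeros.
    out = np.zeros_like(img)
    M, N = img.shape[:2]
    src = img[max(0, -tx):M - max(0, tx), max(0, -ty):N - max(0, ty)]
    out[max(0, tx):max(0, tx) + src.shape[0],
        max(0, ty):max(0, ty) + src.shape[1]] = src
    return out

def rotate(img, theta_deg):
    # Rotation about the image centre using R(theta); SciPy performs
    # the inverse-mapped resampling.
    return nd_rotate(img, theta_deg, reshape=False)
```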
Through the aforementioned enhancement methods, training datasets can be effectively expanded in terms of coverage and scene complexity without incurring additional data collection costs. This solid data support allows deep learning models to perform more effectively and stably in practical applications.

3.2.2. Data Generation Based on Diffusion Models

Diffusion models, an emerging class of generative models, pair a forward process that gradually corrupts high-dimensional data into pure noise with a learned reverse process that gradually reconstructs high-quality images from that noise. Based on the physics of random walks and gradual denoising, these models are ideally suited to generating complex natural images, such as apples with specific defects. Mathematically, diffusion models can be described by a continuous noise-addition process and a reverse denoising process. During the forward process, noise is progressively added to the data until it is completely transformed into Gaussian noise. Specifically, this process can be defined by the following stochastic differential equation:
$$dx = f(x, t)\,dt + g(t)\,dw,$$

where $x$ represents image data, $f(x, t)$ is the drift term related to time $t$, $g(t)$ is the diffusion coefficient and $w$ denotes Brownian motion. In the reverse process, the goal of the diffusion model is to learn how to progressively recover a clear image from pure noise. This is achieved by training a parameterized neural network to estimate the conditional probability $p(x_{t-1} \mid x_t)$, that is, the distribution of the image at the earlier time step $t-1$ given the noisy image $x_t$. This step is usually optimized with a variational lower bound, and the denoising step can be written as:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right),$$

where $\alpha_t$ is the noise-level coefficient, $\bar{\alpha}_t$ its cumulative product and $\epsilon_\theta$ is the noise-prediction model parameterized by the neural network. In practical applications, the diffusion model is initially pretrained on an existing dataset of apple images. This step involves extensive forward and reverse iterations to ensure the model accurately captures the distribution characteristics of apple images. Subsequently, the model's parameters are adjusted to focus more on generating images with specific defects. Additionally, generated images undergo a series of quality control steps to ensure the generated image quality meets training requirements. Filtered images that pass quality control are merged with the original dataset to train higher-performance segmentation models. This process significantly increases the number of rare defect samples in the dataset, thereby enhancing the model's ability to recognize these challenging categories. By utilizing diffusion-based data generation technology, not only are data collection constraints overcome, but the diversity and complexity of the dataset are also significantly enhanced. This method is particularly suitable for addressing imbalance in image data, where certain defect types are extremely rare in natural settings and traditional data collection methods struggle to gather sufficient samples efficiently.
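To make the reverse process concrete, the following sketch implements one denoising update of the form above, assuming a trained noise predictor `eps_model` and a precomputed noise schedule (`alphas`, `alpha_bars`); these names are illustrative, and the step-noise choice is one common convention rather than this paper's exact configuration.

```python
import numpy as np

def denoise_step(x_t, t, eps_model, alphas, alpha_bars,
                 rng=np.random.default_rng()):
    """One reverse-diffusion update from x_t to x_{t-1}.

    eps_model(x, t): trained noise predictor epsilon_theta.
    alphas / alpha_bars: per-step and cumulative schedule coefficients.
    """
    eps = eps_model(x_t, t)
    mean = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps) \
           / np.sqrt(alphas[t])
    if t == 0:
        return mean                          # final step: no noise added
    sigma = np.sqrt(1.0 - alphas[t])         # one common step-noise choice
    return mean + sigma * rng.standard_normal(x_t.shape)
```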

3.3. Proposed Method

3.3.1. Overall

The apple defect detection and quality-grading system proposed in this paper is based on a deep learning framework, integrating advanced image analysis techniques and machine learning algorithms, as shown in Figure 2. The model’s comprehensive architecture and the cohesive integration of its modules are particularly emphasized. The design aims to address the high accuracy requirements for apple defect identification while also meeting the challenges of processing speed and accuracy in practical applications. This section will detail the model’s construction process and the interconnections between its various modules. The core architecture of the system comprises four main parts: the Jump Connection SAM, Jump Connection Attention Mechanism, Maximum Entropy Selection Optimization and Jump Loss calculation. These modules work collaboratively to optimize the entire process from data input to defect detection and quality grading.
The Jump Connection SAM model employs an encoder-decoder structure, utilizing a deep convolutional network for feature extraction and image segmentation. The encoder gradually compresses the image through multiple convolutional and pooling layers, extracting high-dimensional features. Conversely, the decoder progressively restores the image’s spatial resolution and details through upsampling and convolutional layers. Importantly, jump connections directly link the low-level features from the encoder with the corresponding layers in the decoder, ensuring that details are not lost during upsampling. The mathematical expression for jump connections is given by:
$$C_l = U(C_{l+1}) + F_l(E_l)$$

where $E_l$ and $C_{l+1}$ represent the features at layer $l$ in the encoder and at layer $l+1$ in the decoder, respectively, $U$ denotes the upsampling operation and $F_l$ signifies the feature fusion function.
To further enhance the model’s ability to recognize key features, a jump connection attention mechanism has been introduced. This mechanism dynamically adjusts the importance of the feature map by computing attention weights between different feature layers. Specifically, an attention map is calculated for each feature map to either enhance or suppress certain features:
$$A_l = \sigma(\mathrm{Conv}(E_l; \theta_l))$$

where $\mathrm{Conv}$ represents the convolution operation, $\theta_l$ are the convolution parameters and $\sigma$ is the activation function used to generate attention weights for each channel.
During training, the Maximum Entropy Selection Optimization strategy is employed to automatically select the most informative samples for training. This strategy is based on the model’s current uncertainty, prioritizing data points with the highest entropy, which are the samples the model finds most challenging to distinguish. The entropy of a sample is calculated using the following formula:
$$H(x) = -\sum_{c} p_c \log p_c$$

where $p_c$ is the probability that the model predicts the sample belongs to class $c$. The Jump Loss is designed to optimize gradient transmission during training, particularly to prevent gradient vanishing in deep networks. The loss at each layer depends not only on the final output layer's loss but also includes the difference between that layer's output and the true label. Calculating the loss after each jump connection ensures that even the deeper parts of the network receive effective gradient updates:
$$L = \sum_{l} \lambda_l \left\| Y - \hat{Y}_l \right\|^2$$

where $Y$ is the true label, $\hat{Y}_l$ is the predicted output at layer $l$ and $\lambda_l$ is the weight of the loss at that layer. Through these designs and optimizations, the system developed in this study can effectively detect defects in apple images and achieve high accuracy in quality grading. This combination of advanced technologies and algorithms not only enhances processing speed but also significantly improves the reliability and accuracy of the system in practical applications.

3.3.2. Jump Connection SAM Model

In this study, the Jump Connection SAM serves as the core component of the apple defect-detection and quality-grading system, as shown in Figure 3 and Figure 4.
Emphasizing efficiency and performance in handling complex image tasks, the SAM model employs an encoder-decoder structure enhanced by jump connections. These connections facilitate effective information flow throughout the model, enabling more precise restoration of image detail and improved segmentation accuracy. The architecture of the Jump Connection SAM model is divided into two major parts: the encoder and the decoder. The encoder is primarily responsible for extracting image features, using a multi-layer convolutional structure in which each stage consists of two convolutional layers followed by a maximum pooling layer. The convolutional layers employ 3 × 3 kernels and ReLU activation functions to ensure nonlinear processing capability, and the subsequent pooling layers reduce feature dimensions while preserving essential characteristics. The output channel count in the encoder increases stage by stage, starting from 64 channels and doubling after each pooling step, up to 1024 channels. The decoder restores spatial information from the high-dimensional features extracted by the encoder. It comprises alternating upsampling and convolutional layers, with each upsampling followed by a 3 × 3 convolution to integrate features. Convolutional layers in the decoder also use ReLU activations, ensuring effective nonlinear feature transformation, and the channel count decreases symmetrically to the encoder, eventually producing an output matching the input image size.
The design of jump connections is a distinctive feature of the SAM model. These connections directly transmit low-level detailed features from the encoder to the corresponding decoder layers, aiding in the restoration of details potentially lost during upsampling. Specifically, the output from each encoder layer not only proceeds to the next layer for further processing but is also carried over to the corresponding decoder layer via jump connections. This configuration allows the model to utilize a richer context during feature reconstruction, significantly enhancing segmentation precision and image detail restoration. The mathematical foundation of the Jump Connection SAM model relies on an effective feature fusion and information transfer mechanism, expressed mathematically as follows:
$$x_{l+1} = F_{\mathrm{dec}}\left(U(x_l) + F_{\mathrm{skip}}(x_{\mathrm{en}}^{\,l})\right)$$

where $x_l$ represents the features at layer $l$ in the decoder, $U(\cdot)$ denotes the upsampling operation, $F_{\mathrm{dec}}(\cdot)$ is the convolution operation in the decoder, $x_{\mathrm{en}}^{\,l}$ corresponds to the features from the encoder layer aligned with $x_l$ and $F_{\mathrm{skip}}(\cdot)$ indicates the feature fusion function in the jump connection. The advantages of the Jump Connection SAM model's design are manifested in several aspects (a code sketch of this fusion rule appears at the end of this subsection):
  • Information Integrity: Jump connections allow the model to utilize low-level features from the encoding stage during decoding, aiding in the restoration of detailed image information. This capability is especially beneficial in high-resolution imagery, where it significantly improves segmentation accuracy.
  • Enhanced Efficiency: By increasing the feature channels layer by layer during the encoding phase, the model deepens its learning capacity. During the decoding phase, the reduction in feature channels layer by layer optimizes computational efficiency, enabling the model to maintain high processing speeds even with complex images.
  • Generalization Ability: The introduction of jump connections reduces information loss during training, enhancing the model’s generalization capability across diverse datasets, thereby stabilizing segmentation outcomes.
Through this efficient and precise network structure design, the Jump Connection SAM model offers a reliable solution for apple defect-detection and quality-grading tasks, demonstrating the significant potential of deep learning applications in the agricultural field.
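A minimal Keras sketch of one decoder stage implementing the fusion rule above follows; the kernel sizes and the 1 × 1 convolution used to align encoder channels are illustrative choices, not the exact configuration of the full model.

```python
import tensorflow as tf
from tensorflow.keras import layers

def jump_decoder_block(x_dec, x_enc, channels):
    """One decoder stage with a jump connection:
    x_{l+1} = F_dec( U(x_l) + F_skip(x_enc_l) )."""
    # U(x_l): upsample and reduce channels to match this stage.
    up = layers.Conv2D(channels, 2, padding="same")(
        layers.UpSampling2D(2)(x_dec))
    # F_skip: 1x1 convolution aligning the encoder features.
    skip = layers.Conv2D(channels, 1, padding="same")(x_enc)
    fused = layers.Add()([up, skip])          # additive feature fusion
    # F_dec: two 3x3 convolutions, mirroring each encoder stage.
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(fused)
    return layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
```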

3.3.3. Jump Connection Attention Mechanism

The Jump Connection Attention Mechanism introduced in this study builds upon the foundation of self-attention, incorporating the characteristics of jump connections to further enhance feature utilization efficiency and precision in image-segmentation tasks, as shown in Figure 5. Unlike the global focus of traditional self-attention, which emphasizes capturing relationships within entire input sequences, the Jump Connection Attention Mechanism is specifically designed for image segmentation. It underscores the connections between local features as well as the integration of features across different layers.
In the context of image segmentation, not only is it crucial to capture global information, but the emphasis is also on the interplay between local features and the integration across hierarchical levels. By embedding the attention scoring mechanism within traditional jump connections, the network is enhanced to recognize and utilize significant features during feature fusion more effectively. Particularly when merging features from different layers, this mechanism dynamically adjusts the contributions of various features based on their relevance to the current task. The mathematical description of the Jump Connection Attention Mechanism is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
Here, Q, K and V represent the query, key and value, respectively, which are used to generate the attention weights in self-attention mechanisms. Within jump connections, Q typically originates from the current layer of the decoder, while K and V are derived from corresponding layers of the encoder or the output of a preceding layer. This arrangement allows for feature fusion at each stage of jump connections, adapting the significance based on the context.
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})},$$

$$e_{ij} = a(s_{i-1}, h_j),$$
where $a$ is the learned alignment model, $s_{i-1}$ is the previous decoding state, $h_j$ represents the features from the jump connection and $\alpha_{ij}$ denotes the attention weights. Through this mechanism, each layer of jump connections can dynamically adjust its contribution to the current task based on the relevance of features from different levels, significantly enhancing feature utilization efficiency and the expressive power of the network. The design advantages of the Jump Connection Attention Mechanism include (a code sketch follows the list below):
  • Enhanced Feature Fusion Efficiency: By introducing attention mechanisms within each jump connection, the model not only simply merges features from different layers but dynamically adjusts the importance of each feature according to the current task, focusing more on the features that are crucial for the specific segmentation task.
  • Improved Segmentation Accuracy: Especially in handling complex or detail-rich images, such as minor defects or diseased areas on the surface of apples, the Jump Connection Attention Mechanism ensures that these details are not overlooked during feature fusion, thus enhancing overall segmentation precision.
  • Enhanced Model Generalization: The introduction of attention mechanisms allows the model to learn more generalized feature representations during training, reducing the risk of overfitting and enabling the model to perform well on unseen data.
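The sketch below illustrates the attention computation across one jump connection, with queries taken from the decoder and keys/values from the matched encoder layer, as described above; the flattened (batch, positions, channels) layout and the projection width d_k are illustrative assumptions.

```python
import tensorflow as tf

def skip_attention(q_dec, kv_enc, d_k=64):
    """Scaled dot-product attention across a jump connection.

    q_dec:  decoder features at the current layer  -> queries Q.
    kv_enc: encoder features from the matched layer -> keys K, values V.
    Both are assumed flattened to shape (batch, positions, channels).
    """
    q = tf.keras.layers.Dense(d_k)(q_dec)
    k = tf.keras.layers.Dense(d_k)(kv_enc)
    v = tf.keras.layers.Dense(d_k)(kv_enc)
    scale = tf.sqrt(tf.cast(d_k, tf.float32))
    scores = tf.matmul(q, k, transpose_b=True) / scale   # e_ij
    weights = tf.nn.softmax(scores, axis=-1)             # alpha_ij
    return tf.matmul(weights, v)                         # attention-weighted fusion
```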

3.3.4. Maximum Entropy Selection Optimization

The Maximum Entropy Selection Optimization strategy is grounded in the concept of entropy from information theory, which measures the uncertainty of a system and is used here to assess the potential contribution of each data point to model training. This strategy involves calculating the predictive entropy of each sample to determine its level of uncertainty, and then selecting those samples with the highest entropy values for training. The entropy of a sample is calculated using the following formula:
$$H(x) = -\sum_{c=1}^{C} p_c(x) \log p_c(x)$$

where $p_c(x)$ represents the probability that sample $x$ belongs to category $c$ and $C$ is the total number of categories. This formula computes the entropy of the predicted outcome for sample $x$, reflecting the model's uncertainty about its classification. A higher entropy value indicates greater uncertainty in the model's prediction, suggesting that the sample carries a rich amount of information and is therefore more valuable for training, enhancing the model's ability to learn from complex situations and improving its generalization capabilities.
When applied to the SAM, Maximum Entropy Selection Optimization significantly boosts the model’s performance in image-segmentation tasks, particularly when dealing with images that exhibit high variability and complexity. Traditional training of image-segmentation models often employs random or fixed-interval sample selection strategies, which can lead to inefficient training, especially in cases of unbalanced samples or when certain categories have fewer samples. In contrast, Maximum Entropy Selection Optimization dynamically evaluates the information content of each sample and prioritizes the training of those samples with the highest information content. This method not only accelerates the training process but also enhances the model’s ability to capture and learn discriminative features effectively.
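A minimal sketch of this selection rule is given below; it assumes the current model's softmax outputs are available as a NumPy array, and the function name is ours.

```python
import numpy as np

def select_high_entropy(probs, k):
    """Return indices of the k most uncertain samples by predictive entropy.

    probs: array of shape (num_samples, num_classes) holding softmax
    outputs p_c(x) of the current model.
    """
    eps = 1e-12                                    # guard against log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)   # H(x) per sample
    return np.argsort(entropy)[::-1][:k]           # descending by entropy
```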

3.3.5. Jump Loss

In the context of deep learning, particularly in deep networks, the design of the loss function is critical for the training outcomes and overall performance of the model. Traditional loss functions are typically computed at the final layer of the network. Even with the application of the backpropagation algorithm, deep networks can face issues like vanishing or exploding gradients, which limit the stability and efficiency of model training. To address these challenges, a novel loss function, termed Jump Loss, is proposed in this work. It calculates loss at every jump connection, facilitating deeper information transfer and more effective gradient flow. The design of the Jump Loss function is based on the principle that each layer of the decoder should use features transmitted from the corresponding layer of the encoder for accurate reconstruction. This design not only improves the efficiency of information utilization but also strengthens the connections between different layers within the network, particularly beneficial for those deep layers that might otherwise be difficult to train due to gradient vanishing. The mathematical expression for Jump Loss is as follows:
$$L = \sum_{l=1}^{L} \lambda_l L_l = \sum_{l=1}^{L} \lambda_l \frac{1}{N} \sum_{n=1}^{N} \left\| y_n^{(l)} - \hat{y}_n^{(l)} \right\|^2$$

Here, $L$ denotes the total number of layers in the network, $L_l$ represents the local loss at layer $l$, $\lambda_l$ is the loss weight for layer $l$, $N$ is the number of samples in the batch, $y_n^{(l)}$ is the actual label of the $n$th sample at layer $l$ and $\hat{y}_n^{(l)}$ is the corresponding predicted output. This loss function takes into account the outputs of all intermediate layers, focusing not only on the errors of the final output but also optimizing the expressive capabilities of the intermediate layers, thereby enhancing the overall learning effectiveness of the network.
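As an illustration, the sketch below computes this weighted per-layer loss in TensorFlow, assuming auxiliary predictions are collected after each jump connection and that targets have been resized to each layer's resolution; both assumptions are ours.

```python
import tensorflow as tf

def jump_loss(y_true_per_layer, y_pred_per_layer, layer_weights):
    """Jump Loss: L = sum_l lambda_l * mean_n ||y_n^(l) - y_hat_n^(l)||^2.

    y_true_per_layer / y_pred_per_layer: lists of tensors, one entry per
    supervised layer; layer_weights: the corresponding lambda_l values.
    """
    total = 0.0
    for y_t, y_p, lam in zip(y_true_per_layer, y_pred_per_layer, layer_weights):
        total += lam * tf.reduce_mean(tf.square(y_t - y_p))  # local loss L_l
    return total
```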
The design of Jump Loss can be interpreted from perspectives of information theory and gradient flow. In traditional single-loss functions, gradients must propagate through multiple layers to reach the bottom of the network, which can lead to gradient vanishing in deep networks as information is progressively lost during transmission, making it challenging to effectively train lower network parameters. Jump Loss, by introducing a loss at every layer, directly reinforces the gradient signal, ensuring that each layer is directly supervised, thereby effectively preventing the problem of gradient vanishing. Furthermore, Jump Loss also aids in enhancing the generalization ability of the model. As each layer is directly accountable for the output, the network can learn more diverse and robust feature representations, which is crucial for detailed image processing, such as identifying subtle defects and quality variations in apple defect-detection and quality-grading tasks. In tasks like apple defect detection and quality grading, Jump Loss offers several advantages:
  • Improved Accuracy: By computing loss at each layer, the model can learn features at different levels more meticulously, which is crucial for precisely identifying minor surface defects on apples.
  • Enhanced Model Stability: Jump Loss provides more stable gradient signals during training, avoiding common issues of training instability in traditional deep models.
  • Accelerated Convergence: Since each layer is directly responsible for the final outcome, the model can quickly adjust its direction early in training, reducing ineffective iterations and speeding up convergence.

3.4. Evaluation Metrics

In the study of apple defect detection and quality grading, the metrics used to assess model performance are crucial for validating the effectiveness of the methodologies employed. Precision, recall, accuracy and mean Intersection over Union (mIoU) have been selected as the primary evaluation metrics to measure the model’s performance in identifying apple defects and assessing quality.
These metrics enable a comprehensive evaluation of the model’s performance in tasks related to apple defect detection and quality grading. The balance between precision and recall is particularly crucial, as an excessively high false positive rate can lead to unnecessary losses in practical production, while a high false negative rate could compromise product quality. Accuracy provides a holistic assessment of performance, whereas mIoU focuses more on the accuracy of segmentation in complex image backgrounds.
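For reference, mIoU can be computed from a class confusion matrix as sketched below (precision, recall and accuracy follow analogously from the same counts); the small epsilon guarding empty classes is our own convention.

```python
import numpy as np

def miou(conf):
    """Mean Intersection over Union from a confusion matrix.

    conf[i, j] counts pixels of true class i predicted as class j;
    per-class IoU = TP / (TP + FP + FN), averaged over classes.
    """
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return float(np.mean(tp / (tp + fp + fn + 1e-12)))
```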

3.5. Baseline Models

For the validation of the newly proposed method for apple defect detection and quality grading, several deep learning models have been selected as baselines for comparison. These models include U-Net [46], SegNet [47], PSPNet [48], UNet++ [49], DeepLabv3+ [50] and HRNet [51], all of which have demonstrated outstanding performance in the field of image segmentation.
By comparing with these established models, a more comprehensive evaluation of the performance of the newly proposed method is facilitated, and a deeper understanding of its advantages and limitations in practical applications is gained. Not only is the technological advancement of the new method validated, but an in-depth analysis of each model’s performance also reveals their potential applications in tasks related to apple defect detection and quality grading.

3.6. Experimental Setup

3.6.1. Testbed and Platform

Initially, in terms of hardware configuration, the experiments were conducted on a server equipped with an NVIDIA Tesla V100 GPU (NVIDIA, Santa Clara, CA, USA). The Tesla V100, known for its powerful computing capability and efficient parallel processing, provides the hardware support required by deep learning models and is particularly suitable for handling large datasets and complex model architectures. The server was also equipped with ample RAM and high-speed SSD storage to ensure efficient data loading and processing. Regarding software configuration, all models were developed and trained in a Python 3.9 environment using the TensorFlow 2.16.1 and Keras 2.6.0 frameworks. TensorFlow offers a flexible and powerful platform supporting various types of deep learning models, while Keras, with its simplicity and modular design, facilitates model construction and iterative experimentation.

3.6.2. Training and Test Strategy

For the training strategy, the Adam optimizer [52] was chosen to adjust network weights because it combines momentum with adaptive learning rates, automatically adjusting the learning rate of each parameter during training, which aids rapid convergence and improves training outcomes. For hyperparameter settings, the initial learning rate was set at 0.001, balancing convergence speed while avoiding the instability caused by overly large step sizes. To address potential overfitting, a learning rate decay strategy was implemented: whenever performance on the validation set showed no improvement over ten consecutive training epochs, the learning rate was automatically halved, down to a minimum of $10^{-6}$. This dynamic adjustment allows fine-tuning of the model in later training phases to optimize performance. Each model was trained for 50 epochs, a duration sufficient to reach convergence in complex image-segmentation tasks. Furthermore, an early stopping strategy was implemented to prevent overfitting: if no further improvement on the validation set was observed over 20 consecutive epochs, training was terminated early. This not only conserves computational resources but also prevents the model from overfitting the training data at the expense of generalization capability.
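This training configuration maps directly onto standard Keras callbacks, as the sketch below shows; only the monitored quantity (`val_loss`) is our assumption, while the remaining values follow the settings stated above.

```python
import tensorflow as tf

# Adam at an initial learning rate of 1e-3; halve the rate after 10
# stagnant validation epochs, down to a floor of 1e-6; stop training
# after 20 epochs without improvement.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=10, min_lr=1e-6),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20),
]
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```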
In terms of model evaluation, a five-fold cross-validation method was employed to ensure the reliability and consistency of experimental results. In this method, the entire dataset was evenly divided into five subsets, with each subset taking turns serving as the test set while the remaining four subsets were used for training. This approach fully utilizes limited data resources by conducting multiple training and testing iterations to assess the average performance of the model, thus more accurately reflecting its behavior on unseen data. This is particularly crucial for evaluating apple defect-detection and quality-grading models, as it reduces random errors in the model assessment process, providing more robust performance metrics. Through these experimental settings, every research activity was conducted under controlled conditions, minimizing experimental errors and ensuring the reliability of the results.
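The evaluation protocol can be sketched as follows; `build_model` is a placeholder for any compiled Keras model factory, and the scalar returned by `evaluate` is assumed to be the metric of interest.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_score(X, y, build_model, epochs=50):
    """Average test performance over five folds; build_model() must
    return a freshly compiled model so no weights leak between folds."""
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0))
    return float(np.mean(scores))
```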

4. Results and Discussion

4.1. Defect-Segmentation Experiment Results

In this section of this paper, the main objective of the experimental design is to validate the performance of the proposed deep learning models in apple defect detection and quality grading. By performing defect segmentation on apple images, the experiment aims to evaluate the effectiveness of different models in accurately identifying and segmenting surface defects on apples. The results are measured using four key performance indicators: Precision, Recall, Accuracy and mIoU. These metrics collectively reflect the models’ capabilities in identifying defect areas, including accuracy of recognition, rate of missed defects, overall performance and precision of the predicted areas.
Table 2 and Table 3 show varying levels of performance across the different models. The U-Net model achieves a Precision of 0.80, Recall of 0.77, Accuracy of 0.79 and mIoU of 0.78, indicating a good baseline performance in apple defect-segmentation tasks, though it may have limitations in handling complex or subtly defined defects. The SegNet model, with improved performance over U-Net, has Precision, Recall, Accuracy and mIoU of 0.82, 0.79, 0.81 and 0.80, respectively. This improvement suggests better feature extraction and spatial information retention attributed to its unique decoder design, which uses pooling indices from the encoder to guide up-sampling, thus better restoring image details. PSPNet further enhances model performance with Precision, Recall, Accuracy and mIoU reaching 0.85, 0.82, 0.84 and 0.83, respectively. This enhancement primarily originates from PSPNet’s pyramid pooling module, which captures context at various scales, enhancing the model’s comprehension of global image structures. UNet++ and DeepLabv3+ show advances in all metrics, demonstrating the superiority of deep supervision and multi-scale feature fusion in complex image-segmentation tasks. UNet++ enhances feature transmission and reuse through dense connections at each up-sampling node; meanwhile, DeepLabv3+ enhances segmentation accuracy and robustness through dilated convolutions, which expand the receptive field. HRNet further advances Precision, Recall, Accuracy and mIoU to 0.91, 0.88, 0.90 and 0.89, respectively. Its high performance benefits from a unique multi-scale parallel processing architecture that maintains high-resolution information flow, effectively balancing speed and precision. The method proposed in this paper achieves the best performance on all metrics, with Precision, Recall, Accuracy and mIoU of 0.93, 0.90, 0.91 and 0.92, respectively, demonstrating that advanced feature fusion technology and optimized network architecture design significantly enhance the precision and robustness of defect detection.
Theoretical analysis shows that the differences in performance among the models reflect the innovative aspects of their architectural designs. For instance, innovations in feature fusion and multi-scale processing in UNet++ and DeepLabv3+ make them more effective in handling images with complex backgrounds and subtle differences. HRNet optimizes the efficiency of information transmission and utilization by maintaining high-resolution feature flows. The method presented in this paper further optimizes these strategies, incorporating new network modules and training strategies, achieving the best performance in the apple defect-detection task, fully demonstrating the potential and future prospects of deep learning in image processing. The mathematical characteristics and design philosophies of these models provide robust theoretical support for solving practical problems and offer viable directions for future research.

4.2. Quality-Grading Experiment Results

In this section of this study, the primary objective was to verify and compare the performance of different deep learning models in the task of apple quality grading. This task focuses particularly on the models’ ability to accurately judge the quality levels of apples, which is crucial for agricultural production and commodity classification. The experiment evaluates model performance using three key metrics: precision, recall and accuracy, which collectively describe the effectiveness and reliability of the models in quality grading.
From Table 4 and Table 5, it is observed how each model performs in the task of apple quality grading. The U-Net model, serving as a baseline, shows a balanced performance with a precision of 0.78, recall of 0.75 and accuracy of 0.77, but there is room for improvement. The SegNet model slightly outperforms U-Net with precision, recall and accuracy of 0.80, 0.77 and 0.79, respectively. This could be attributed to its unique encoder-decoder architecture and the effective use of pooling indices, which helps preserve more spatial information, thus more accurately restoring critical features in quality grading. PSPNet demonstrates further improvements with a precision of 0.83, recall of 0.80 and accuracy of 0.82, thanks to its pyramid pooling module that captures context information at different scales, crucial for identifying and classifying apples of various quality levels. UNet++ and DeepLabv3+ excel in all assessed metrics, with precisions of 0.85 and 0.87, recalls of 0.82 and 0.84 and accuracies of 0.84 and 0.86, respectively. UNet++ enhances information flow and feature integration through deep supervision and nested skip connections, significantly boosting the model’s learning and prediction capabilities. DeepLabv3+, with its atrous convolution strategy, expands the receptive fields and enhances the model’s ability to capture image details, thus performing exceptionally well in precise grading. HRNet further enhances performance, achieving precision, recall and accuracy of 0.89, 0.86 and 0.88, benefiting from its high-resolution network structure that maintains high-resolution information throughout the network, aiding in improving classification accuracy. The method described in this document outperforms all models, with precision, recall and accuracy reaching 0.91, 0.88 and 0.90, respectively. This superior performance is due to the integration of various network optimization techniques and efficient training strategies, such as advanced feature fusion technologies and effective loss functions, significantly enhancing the model’s accuracy in recognizing apple quality levels.
Theoretically, the mathematical characteristics and architectural designs of different models are key factors leading to these variations in results. For instance, PSPNet’s pyramid pooling effectively integrates information at various scales, adapting to the multi-scale feature requirements of apple quality grading. DeepLabv3+ and HRNet, through atrous convolution and high-resolution continuous connections, provide richer contextual information and continuous detail features, crucial for accurately identifying subtle quality differences. The method proposed in this paper, by integrating the advantages of the above technologies and introducing a network structure and algorithms optimized for specific tasks, achieves optimal performance. These models not only reflect the advancement of deep learning in image processing but also highlight the importance of model design and selection in facing complex application scenarios.

4.3. Different Loss Function Ablation Experiment

The main objective of the experiment is to evaluate the impact of various loss functions on model performance in apple defect-segmentation and quality-grading tasks. By comparing the performance of Cross-Entropy Loss, Focal Loss and Jump Loss in these tasks, the experiment aims to reveal the effects and applicability of different loss functions in handling unbalanced datasets and enhancing feature learning. This experiment is crucial for understanding the advantages and limitations of each loss function in practical applications and assists in selecting or designing loss functions that are better suited for specific tasks, as shown in Table 6.
In the defect-segmentation task, the model utilizing Cross-Entropy Loss demonstrated basic performance, with a precision of 0.83, a recall of 0.80 and an accuracy of 0.82. As a common loss function suitable for multi-class classification problems, Cross-Entropy Loss tends to perform inadequately when faced with class imbalance. Subsequently, the model using Focal Loss showed improved performance, with a precision of 0.88, a recall of 0.85 and an accuracy of 0.87. Focal Loss adjusts the weight of different class samples, reducing the contribution of easy-to-classify samples and thus focusing more on those that are difficult to classify, improving performance under class imbalance conditions. Jump Loss performed best in the defect-segmentation task, achieving a precision of 0.93, a recall of 0.90 and an accuracy of 0.91. By incorporating loss calculations at every network layer, Jump Loss enhances deep learning, ensuring effective transmission and learning of deep features, which is crucial for image-segmentation tasks that require precise pixel-level predictions. In the quality-grading task, the model using Cross-Entropy Loss showed basic classification capabilities with a precision of 0.82, a recall of 0.80 and an accuracy of 0.81, reflecting limited performance improvement potential in the face of class imbalance. The model with Focal Loss performed better, indicating its advantages in dealing with class imbalance, with a precision of 0.86, a recall of 0.84 and an accuracy of 0.85. Jump Loss also demonstrated the best performance in the quality-grading task, with a precision of 0.91, a recall of 0.88 and an accuracy of 0.90, proving its effectiveness in integrating multi-level features and enhancing classification accuracy.
Theoretical analysis suggests that the design philosophy and mathematical characteristics of the loss functions underlie these results. Cross-Entropy Loss measures the probability of each sample being classified correctly; it suits basic classification tasks but is limited under severe class imbalance. Focal Loss modifies Cross-Entropy Loss by down-weighting easy samples and amplifying the influence of difficult ones, improving the recognition of minority classes, which matters in apple quality grading where some grades have far fewer samples than others (e.g., Grade D in Table 1). Jump Loss, by computing losses at every layer of the network, mitigates the loss of information during transmission and sharpens the network's ability to capture detail; in complex segmentation tasks it handles edges and small regions better, yielding higher segmentation and classification accuracy.
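The paper describes Jump Loss as incorporating loss calculations at every network layer but does not give a closed-form definition in this section. One plausible reading, sketched below purely as an assumption rather than the authors' exact formulation, is a deeply supervised objective that sums weighted per-layer losses over intermediate outputs upsampled to the label resolution.

```python
import torch.nn.functional as F

def jump_loss(layer_logits, target, weights=None):
    """Deep-supervision-style loss over predictions from several layers.
    layer_logits: list of (N, C, H_i, W_i) tensors from intermediate layers.
    target: (N, H, W) ground-truth class map.
    weights: optional per-layer weights (uniform if None)."""
    if weights is None:
        weights = [1.0 / len(layer_logits)] * len(layer_logits)
    total = 0.0
    for w, logits in zip(weights, layer_logits):
        # Upsample each intermediate prediction to the label resolution
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode="bilinear", align_corners=False)
        total = total + w * F.cross_entropy(logits, target)
    return total
```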

4.4. Limitations and Future Work

In the current study, an apple defect-detection and quality-grading system based on deep learning was developed, integrating the Jump Connection SAM model, the Jump Connection Attention Mechanism and Maximum Entropy Sampling Optimization, which together significantly improved segmentation and classification precision against complex image backgrounds. Although the experimental results demonstrate performance superior to existing methods, several limitations remain to be addressed in future work.

Firstly, although the system handles complex backgrounds and apples at different maturity stages well, its generalization capability needs strengthening. The model was trained and tested mainly on datasets from two geographic locations and harvest seasons; this limited coverage may reduce performance under broader natural variation, such as apples maturing in different regional climates. Future research could expand data collection to apple images from more regions and seasons to improve adaptability and generalization. Secondly, while the loss functions performed well in our experiments, they still require evaluation under conditions such as highly imbalanced data distributions or extreme noise. In particular, Jump Loss, despite fostering deep feature learning, could lead to overfitting in categories with sparse data. Future studies could explore integrating Jump Loss with other regularization techniques or novel loss functions to better balance training stability and prediction accuracy, as sketched below.
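Purely as an illustration of that regularization direction, and not an evaluated configuration, the sketch below pairs the per-layer loss with label smoothing (available in PyTorch 1.10 and later) and decoupled weight decay; model and layer_logits are placeholder names, not the authors' code.

```python
import torch
import torch.nn.functional as F

def smoothed_layer_loss(layer_logits, target, smoothing=0.1):
    """Per-layer cross-entropy with label smoothing, a simple guard
    against overfitting on sparse grades such as Grade D."""
    losses = []
    for logits in layer_logits:
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode="bilinear", align_corners=False)
        losses.append(F.cross_entropy(logits, target, label_smoothing=smoothing))
    return torch.stack(losses).mean()

# Decoupled weight decay adds a second, standard regularizer:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```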

5. Conclusions

In this study, an advanced deep learning-based system for apple defect detection and quality grading was developed and validated. The system integrates state-of-the-art image-processing techniques and machine learning methods to provide an efficient, accurate automated detection solution tailored to the practical needs of the apple industry. Thorough experimental validation demonstrated the potential of deep learning technologies in agriculture and provides concrete technical support for production practice. The main contribution of this paper is a comprehensive system combining the Jump Connection SAM model, the Jump Connection Attention Mechanism and Maximum Entropy Sampling Optimization; the integration of these technologies significantly enhanced the system's ability to process complex images, excelling in particular at the key tasks of apple defect segmentation and quality grading.

Author Contributions

Conceptualization, X.G., S.L. and M.D.; Data curation, X.S. and L.H.; Formal analysis, X.S., W.T. and Y.Z.; Funding acquisition, M.D.; Investigation, Y.Z.; Methodology, X.G. and S.L.; Project administration, M.D.; Resources, X.S. and L.H.; Software, X.G., Y.L. and W.T.; Supervision, Y.Z.; Validation, S.L., Y.L. and W.T.; Visualization, Y.L. and L.H.; Writing—original draft, X.G., S.L., X.S., Y.L., L.H., W.T., Y.Z. and M.D.; Writing—review & editing, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Modern Agricultural Industrial Technology System Beijing Innovation Team (BAIC08-2024-YJ03).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Dataset augmentation.

Figure 2. This diagram illustrates the overall process of the deep learning-based apple defect detection and quality-grading system proposed in this paper.

Figure 3. This schematic shows the overall structure of the Jump Connection SAM (Segment Anything Model), including interactions between the image encoder, prompt encoder and the lightweight mask decoder.

Figure 4. This illustration demonstrates how the proposed model combines image and textual prompt information to generate accurate apple defect-segmentation masks.

Figure 5. This figure displays the working principle of the Jump Connection Attention Mechanism, which includes three key steps: alignment, fusion and decoding. It clearly indicates how features from different layers are effectively fused through jump connections to enhance feature expression during the decoding process, optimizing the efficiency and accuracy of image-segmentation tasks.
Table 1. Dataset distribution.

| Grade | Num. | Description |
|---|---|---|
| A | 1500 | No apparent defects; diameter greater than 75 mm |
| B | 1300 | Minor imperfections acceptable; diameter ranging from 65 mm to 75 mm |
| C | 1200 | Apparent mild to moderate defects; diameter between 55 mm and 65 mm |
| D | 100 | Significant visible defects; diameter less than 55 mm or greater than 75 mm |
Table 2. Defect-segmentation experiment results.

| Model | Precision | Recall | Accuracy | mIoU | F1-Score | FPS |
|---|---|---|---|---|---|---|
| U-Net | 0.80 | 0.77 | 0.79 | 0.78 | 0.78 | 45.1 |
| SegNet | 0.82 | 0.79 | 0.81 | 0.80 | 0.80 | 35.6 |
| PSPNet | 0.85 | 0.82 | 0.84 | 0.83 | 0.83 | 25.9 |
| UNet++ | 0.87 | 0.84 | 0.86 | 0.85 | 0.85 | 23.8 |
| DeepLabv3+ | 0.89 | 0.86 | 0.88 | 0.87 | 0.87 | 29.3 |
| HRNet | 0.91 | 0.88 | 0.90 | 0.89 | 0.89 | 31.0 |
| Proposed Method | 0.93 | 0.90 | 0.91 | 0.92 | 0.91 | 32.7 |
Table 3. Defect-segmentation experiment result details by proposed method.

| Grade | Precision | Recall | Accuracy | mIoU | F1-Score |
|---|---|---|---|---|---|
| Grade A | 0.95 | 0.93 | 0.94 | 0.94 | 0.94 |
| Grade B | 0.92 | 0.90 | 0.92 | 0.92 | 0.91 |
| Grade C | 0.90 | 0.88 | 0.89 | 0.89 | 0.89 |
| Grade D | 0.88 | 0.85 | 0.88 | 0.88 | 0.86 |
Table 4. Quality-grading experiment results.

| Model | Precision | Recall | Accuracy | F1-Score | FPS |
|---|---|---|---|---|---|
| U-Net | 0.78 | 0.75 | 0.77 | 0.76 | 45.1 |
| SegNet | 0.80 | 0.77 | 0.79 | 0.78 | 35.6 |
| PSPNet | 0.83 | 0.80 | 0.82 | 0.81 | 25.9 |
| UNet++ | 0.85 | 0.82 | 0.84 | 0.83 | 23.8 |
| DeepLabv3+ | 0.87 | 0.84 | 0.86 | 0.85 | 29.3 |
| HRNet | 0.89 | 0.86 | 0.88 | 0.87 | 31.0 |
| Proposed Method | 0.91 | 0.88 | 0.90 | 0.89 | 32.7 |
Table 5. Quality-grading experiment result details by proposed method.

| Grade | Precision | Recall | Accuracy | F1-Score |
|---|---|---|---|---|
| Grade A | 0.94 | 0.92 | 0.93 | 0.93 |
| Grade B | 0.92 | 0.90 | 0.91 | 0.91 |
| Grade C | 0.90 | 0.88 | 0.89 | 0.89 |
| Grade D | 0.88 | 0.86 | 0.87 | 0.87 |
Table 6. Different loss function ablation experiment.

| Task | Loss Function | Precision | Recall | Accuracy | F1-Score |
|---|---|---|---|---|---|
| Segmentation | Cross-Entropy Loss | 0.83 | 0.80 | 0.82 | 0.81 |
| Segmentation | Focal Loss | 0.88 | 0.85 | 0.87 | 0.86 |
| Segmentation | Jump Loss | 0.93 | 0.90 | 0.91 | 0.91 |
| Quality Grading | Cross-Entropy Loss | 0.82 | 0.80 | 0.81 | 0.81 |
| Quality Grading | Focal Loss | 0.86 | 0.84 | 0.85 | 0.85 |
| Quality Grading | Jump Loss | 0.91 | 0.88 | 0.90 | 0.89 |