Article

Lightweight and Optimized Multi-Label Fruit Image Classification: A Combined Approach of Knowledge Distillation and Image Enhancement

1 School of Economics, Beijing Technology and Business University, Beijing 100048, China
2 School of Information Science and Technology, Shihezi University, Shihezi 832003, China
3 The School of Mathematics and Statistics, Beijing Jiaotong University, Beijing 100044, China
4 Maynooth International Engineering College, Fuzhou University, Fuzhou 350108, China
5 Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100864, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3267; https://doi.org/10.3390/electronics13163267
Submission received: 25 July 2024 / Revised: 10 August 2024 / Accepted: 15 August 2024 / Published: 17 August 2024
(This article belongs to the Special Issue Applications of Artificial Intelligence(AI) in Agriculture)

Abstract

In our research, we aimed to address the shortcomings of traditional fruit image classification models, which struggle with inconsistent lighting, complex backgrounds, and high computational demands. To overcome these challenges, we developed a novel multi-label classification method incorporating advanced image preprocessing techniques, such as Contrast Limited Adaptive Histogram Equalization and the Gray World algorithm, which enhance image quality and color balance. Utilizing lightweight encoder–decoder architectures, specifically MobileNet, DenseNet, and EfficientNet, optimized with an Asymmetric Binary Cross-Entropy Loss function, we improved model performance in handling diverse sample difficulties. Furthermore, Multi-Label Knowledge Distillation (MLKD) was implemented to transfer knowledge from large, complex teacher models to smaller, efficient student models, thereby reducing computational complexity without compromising accuracy. Experimental results on the DeepFruit dataset, which includes 21,122 images of 20 fruit categories, demonstrated that our method achieved a peak mean Average Precision (mAP) of 90.2% using EfficientNet-B3, with a computational cost of 7.9 GFLOPs. Ablation studies confirmed that the integration of image preprocessing, optimized loss functions, and knowledge distillation significantly enhances performance compared to the baseline models. This innovative method offers a practical solution for real-time fruit classification on resource-constrained devices, thereby supporting advancements in smart agriculture and the food industry.

1. Introduction

The rapid evolution of Artificial Intelligence (AI) technology has led to its extensive application in areas like medical diagnostics [1], self-driving cars [2], language translation [3], and image identification [4]. These advancements have spurred innovation and progress across these domains. The ability of AI to handle and analyze large datasets, derive significant insights, and assist in decision-making and predictions has markedly increased both efficiency and accuracy. In the domain of image recognition, the integration of deep learning and machine vision has empowered AI to accomplish tasks such as facial recognition and scene comprehension at levels that are on par with, or even exceed, those of humans, thereby propelling the field of intelligent image processing to new heights.
In recent years, image-based fruit classification has emerged as a significant focus. The rapid urbanization and increasing population have highlighted the necessity for automated fruit classification in modern agriculture and the food industry. Real-time efficient fruit sorting not only enhances agricultural production and logistical efficiency but also plays a critical role in the food processing industry by ensuring the quality and safety of products [5]. Traditional methods rely heavily on manual labor, which is both time-consuming and error-prone [6]. Thus, employing AI for the automatic identification and classification of fruit images can substantially improve accuracy and efficiency, while also reducing labor costs [7].
However, several significant challenges are encountered in advancing research on fruit image classification. Firstly, due to variations in lighting, environment, and angles, the color and brightness of fruit images can change significantly, increasing the difficulty for classification models to recognize them [8]. Secondly, although traditional deep learning models exhibit excellent classification accuracy, their high demand for computational resources and complex model parameters make efficient deployment on resource-constrained edge devices challenging [9]. Therefore, developing a lightweight fruit image classification model that can operate efficiently under varying conditions is of significant practical value.
To address these challenges, this paper proposes a novel lightweight multi-label fruit image classification method based on illumination and color balance.
  • By incorporating advanced image preprocessing techniques such as Contrast Limited Adaptive Histogram Equalization (CLAHE) and the Gray World algorithm, our method significantly enhances the image quality and robustness of the classification model under diverse environmental conditions.
  • Utilizing Multi-Label Knowledge Distillation (MLKD), we design a lightweight deep learning model that drastically reduces computational load and storage requirements while maintaining high classification accuracy. This makes it highly suitable for efficient operation on resource-constrained edge devices.
  • Our approach achieves high-precision recognition across 20 fruit categories, demonstrating the practical application value of a lightweight fruit image classification model that operates effectively and efficiently under varying conditions.
The following is a description of the structure of this paper: Section 2 presents a review of existing literature on fruit image classification. Section 3 details the DeepFruit dataset and the image preprocessing techniques employed. Section 4 describes the multi-label classification model, including the encoder–decoder architectures and the Asymmetric Binary Cross-Entropy Loss function, as well as the MLKD method. Section 5 outlines the experimental setup and the experimental results, including performance evaluations, comparison of different loss functions, and ablation studies. Finally, Section 6 concludes our research findings and suggests directions for future work.

2. Related Work

2.1. Existing Methods for Fruit Image Classification

The classification of fruit images has constituted a prominent research avenue within the domain of computer vision. The accelerated advancement of deep learning technology has prompted a surge in the development and deployment of new methodologies and models, as evidenced by a proliferation of literature within this domain. Traditional image classification methods mainly rely on manually engineered features such as color, texture, and shape [10,11,12]. However, these methods often perform poorly when faced with complex backgrounds and diverse fruit morphologies. This is mainly because manual features lack robustness under different lighting conditions and in complex environments, and the feature extraction process usually requires considerable manual effort, making it inefficient.
The rise of deep learning, particularly convolutional neural networks (CNNs), has substantially improved image classification performance. Deep networks such as AlexNet [13], VGGNet [14], and ResNet [15] have demonstrated excellent performance on large-scale datasets like ImageNet and have been widely applied to fruit image classification tasks. For example, [16] utilized ResNet-50 to classify various fruits, achieving significant results. Additionally, some researchers have proposed region-based fruit classification methods that combine object detection models (such as Faster R-CNN) with classification networks to achieve higher classification accuracy [17]. Moreover, recent studies have explored the application of attention mechanisms and reinforcement learning in fruit classification, dynamically focusing on important regions of the images to further improve classification performance [18]. These methods not only enhance classification accuracy but also improve the model’s ability to generalize, enabling its application in diverse real-world scenarios.

2.2. Application of Image Preprocessing Techniques

In the context of image classification tasks, the implementation of image preprocessing techniques represents a pivotal step in enhancing the accuracy of the classification process. Image preprocessing techniques include image enhancement, denoising, color correction, and more, which can improve image quality and highlight target features, thereby enhancing the learning effect of models [19]. CLAHE is a commonly used image enhancement technique that enhances local contrast, making image details clearer [20]. For instance, in mineral image classification, CLAHE is used to address image quality inconsistency caused by lighting condition changes, improving the model’s capability to recognize different mineral features [21].
Furthermore, color correction techniques have been extensively utilized in numerous image classification tasks. The Gray World algorithm and Retinex theory represent two prominent approaches to color correction. The Gray World algorithm postulates that the average color in an image should be gray. To achieve color balance, the algorithm adjusts the color distribution of the image [22]. In medical image processing, the Gray World algorithm is applied to correct color deviations in tissue slice images, thereby improving the accuracy of lesion detection [23]. Retinex theory, based on the human visual system’s perception of brightness and color, adjusts the image’s color to maintain consistent color performance under different lighting conditions [24]. For example, in remote sensing image analysis, Retinex theory is used to correct lighting changes in satellite images, thereby enhancing the recognition ability of surface features [25]. These preprocessing methods demonstrate their broad application prospects in improving the robustness and accuracy of image classification models.

2.3. Application of Lightweight Models in Classification Tasks

The advent of mobile devices and the IoT has led to a growing interest in the application of lightweight models to image classification tasks. Lightweight models aim to reduce the computational and storage demands of a network while maintaining classification accuracy, thus enabling operation in resource-restricted environments. MobileNet [26] is a representative lightweight model that significantly reduces model parameters and computational load through depthwise separable convolutions, performing excellently on mobile devices. Other lightweight models such as ShuffleNet [27] and EfficientNet [28] also achieve efficient use of computational resources while maintaining high classification performance. For instance, in face recognition tasks, using MobileNet significantly enhances the real-time processing capability on mobile devices [29], while in object detection tasks for autonomous driving, the application of EfficientNet improves detection accuracy and speed [30].
In medical image classification tasks, lightweight models also demonstrate their advantages. Ref. [31] used MobileNetV2 to classify chest X-rays, achieving efficient and accurate diagnostic results. Moreover, knowledge distillation techniques, which transfer the knowledge of large models to smaller ones, are also an effective lightweight method [32]. This method has been applied in pathological image analysis, transferring the high performance of complex models to simplified models, enabling efficient image processing on resource-constrained medical devices [33]. The application of these methods allows for efficient image classification on resource-limited devices, providing feasibility for practical applications.

3. Data Preprocessing

3.1. Introduction to the DeepFruit Dataset

DeepFruit [34] is a labeled image dataset developed specifically for fruit classification. The dataset includes 21,122 images of 20 different fruits, aiming to support research in fruit detection, recognition, and classification. The images are captured on plates of different sizes, colors, and shapes, with varying angles, lighting conditions, and distances to ensure diversity and broad applicability of the dataset. The dataset has been subjected to a series of preprocessing techniques, including image rotation, scale normalization, and cropping, with the objective of achieving greater uniformity across the images. The DeepFruit dataset has been randomly divided into an 80% training set (16,899 images) and a 20% test set (4223 images), and is publicly available for free access by all researchers.
Each image in the DeepFruit dataset contains combinations of four to five different fruits. The 20 fruit categories in the dataset include mango, grape, plum, kiwi, pear, apple, orange, banana, pomegranate, strawberry, pineapple, fig, peach, apricot, avocado, zucchini, lemon, lime, guava, and raspberry. The number of images per fruit category varies; for example, there are 4360 images of mangoes and 4860 images of grapes, while plums and kiwis each have 4860 images. Figure 1 provides a detailed listing of the total number of images for each fruit category and their frequency of appearance in different combinations. For instance, grapes appear in combinations 1 and 5; thus, they have fruit set IDs 1 and 5 in the table. The dataset offers a rich variety of image combinations to support the development of multi-type fruit recognition models.
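To make the multi-label setup concrete, the following is a minimal sketch of a dataset wrapper and the 80/20 split described above. The file layout (an images folder plus a CSV file mapping each image path to the fruits it contains) is an assumption for illustration, not the published format of DeepFruit.

```python
# Minimal sketch of a multi-label dataset wrapper for DeepFruit-style data.
# The CSV layout (path, "apple;grape;kiwi") is assumed for illustration.
import csv
import random
from PIL import Image
import torch
from torch.utils.data import Dataset

FRUITS = ["mango", "grape", "plum", "kiwi", "pear", "apple", "orange", "banana",
          "pomegranate", "strawberry", "pineapple", "fig", "peach", "apricot",
          "avocado", "zucchini", "lemon", "lime", "guava", "raspberry"]
FRUIT_TO_IDX = {name: i for i, name in enumerate(FRUITS)}

class DeepFruitMultiLabel(Dataset):
    def __init__(self, csv_path, transform=None):
        with open(csv_path) as f:
            self.rows = list(csv.reader(f))          # each row: image path, label string
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        path, label_str = self.rows[idx]
        image = Image.open(path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        target = torch.zeros(len(FRUITS))            # multi-hot vector over 20 fruits
        for name in label_str.split(";"):
            target[FRUIT_TO_IDX[name]] = 1.0
        return image, target

# 80/20 random split, matching the 16,899 / 4,223 partition described above.
dataset = DeepFruitMultiLabel("deepfruit_labels.csv")
indices = list(range(len(dataset)))
random.seed(0)
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
```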

3.2. Illumination Balance Methods: Adaptive Histogram Equalization

CLAHE is a method that enhances the local contrast of an image, effectively improving its visual quality under varying lighting conditions. CLAHE divides the image into discrete blocks, or “tiles”, and applies histogram equalization within each block, thus avoiding the noise amplification introduced by traditional histogram equalization [35].
The calculation process of CLAHE involves several steps. The input image is initially partitioned into non-overlapping, relatively small blocks, which are referred to as tiles. Within each tile, histogram equalization is applied to enhance local contrast, ensuring that the details within each region are accentuated. To prevent noise amplification, a contrast limiting step is implemented. This involves setting a clipping threshold for the histogram of each tile and redistributing any parts of the histogram that exceed this threshold. Finally, bilinear interpolation is performed on the equalized tiles to synthesize and reconstruct the entire processed image, ensuring smooth transitions between the tiles.
The contrast limiting parameter (clip limit) in CLAHE is a key adjustment parameter used to control the intensity of histogram equalization. The formula is as follows:
$H' = \min(H, \text{clip limit})$
where $H$ is the original histogram and $H'$ is the clipped histogram. Through equalization and interpolation synthesis, the local contrast of the image is significantly improved, while avoiding the over-enhancement issue of traditional histogram equalization.
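As a rough illustration of this step, the sketch below applies CLAHE with OpenCV on the lightness channel of the image; the tile size and clip limit shown are common defaults, not necessarily the exact values used in our experiments.

```python
# Sketch of the CLAHE step using OpenCV; clip limit and tile size are illustrative.
import cv2

def apply_clahe(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    # Work on the lightness channel so colors are not distorted.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)                      # per-tile equalization with clipping
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

enhanced = apply_clahe(cv2.imread("fruit.jpg"))
```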

3.3. Color Balance Methods: Gray World Algorithm and Retinex Theory

The Gray World Algorithm is based on the assumption that, in a natural scene, the average color of an image should be neutral gray. According to this assumption, the algorithm adjusts the average values of the color channels in the image so that they achieve the same gray value, thereby balancing the color.
To begin with, the average values of the red, green, and blue channels are computed. These are calculated by summing the values of each channel across all pixels in the image and then dividing by the total number of pixels:
$\bar{R} = \frac{1}{N}\sum_{i=1}^{N} R_i, \quad \bar{G} = \frac{1}{N}\sum_{i=1}^{N} G_i, \quad \bar{B} = \frac{1}{N}\sum_{i=1}^{N} B_i$
where $N$ is the total number of pixels and $R_i$, $G_i$, $B_i$ are the red, green, and blue channel values of the $i$-th pixel, respectively. Each channel is then scaled by the gain $K / \bar{C}$, where $K = (\bar{R} + \bar{G} + \bar{B})/3$ is the target gray value and $\bar{C}$ is the corresponding channel mean, so that all three channel means converge to the same gray level.
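The following is a minimal sketch of this channel-mean balancing in NumPy, following the formulas above; clipping back to the 0–255 range is an implementation detail we add for illustration.

```python
# Minimal Gray World color balancing sketch: scale each channel so its mean
# matches the global gray mean of the image.
import numpy as np

def gray_world(rgb_image):
    img = rgb_image.astype(np.float32)
    channel_means = img.reshape(-1, 3).mean(axis=0)   # (R_bar, G_bar, B_bar)
    gray_mean = channel_means.mean()                   # target gray value K
    gains = gray_mean / channel_means                  # per-channel scale factors
    balanced = img * gains                             # broadcast over H x W x 3
    return np.clip(balanced, 0, 255).astype(np.uint8)
```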
Retinex theory is inspired by the human visual system’s perception of brightness and color, simulating the eye’s adaptation to changes in illumination to achieve color balance. The Retinex algorithm processes images at multiple scales and normalizes them to eliminate the effects of varying illumination conditions on image colors.
Initially, the color values of the image are converted into logarithmic space, which helps in handling the large dynamic range of pixel values. This transformation is expressed as:
$L(x, y) = \log(I(x, y))$
where $I(x, y)$ is the intensity value of the pixel at position $(x, y)$.
Then, multi-scale weighted mean filtering is applied to the image to estimate the ambient illumination. The illumination-adjusted value L R ( x , y ) is calculated by subtracting the logarithm of a weighted sum of the neighboring pixel values from the logarithmic value of the pixel itself:
$L_R(x, y) = \log(I(x, y)) - \log\left( \sum_{i=-k}^{k} \sum_{j=-k}^{k} w(i, j)\, I(x+i, y+j) \right)$
where $w(i, j)$ are the filter weights. Finally, the image details are enhanced by stretching the contrast of the filtered result, yielding an image with improved visibility of details and balanced colors.
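As a rough illustration, the sketch below implements a single-scale Retinex step in the log domain, using a Gaussian blur as the weighted neighborhood mean $w(i, j)$; the sigma value and the final min-max contrast stretch are illustrative choices rather than the exact configuration used in the paper.

```python
# Single-scale Retinex sketch matching the log-domain formula above.
import cv2
import numpy as np

def single_scale_retinex(rgb_image, sigma=80):
    img = rgb_image.astype(np.float32) + 1.0               # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)    # estimate of ambient light
    retinex = np.log(img) - np.log(illumination)           # L_R(x, y)
    # Stretch the result back to the displayable 0-255 range.
    retinex = (retinex - retinex.min()) / (retinex.max() - retinex.min())
    return (retinex * 255).astype(np.uint8)
```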
Figure 2 shows the examples of illumination balance and color balance, which demonstrate the impact of different image enhancement techniques on fruit image quality. The “Original Image” shows the raw input with no preprocessing. “CLAHE Image” illustrates the result after applying Contrast Limited Adaptive Histogram Equalization, which enhances local contrast by redistributing the intensity values. “Gray World Image” shows the outcome of applying the Gray World Algorithm for color correction, which assumes the average color of a scene is gray and balances the colors accordingly. “Retinex Image” displays the result of the Retinex algorithm, which adjusts the image based on perceived brightness and color to simulate human vision adaptation to varying light conditions. These techniques significantly improve the robustness of the fruit classification model under different lighting and color conditions.

4. Model and Methods

4.1. Multi-Label Classification Methods

The classification networks examined in this study are structured with two main components, an encoder and a decoder, as shown in Figure 3. The encoder’s function is to extract key features from the input images and convert them into latent space representations. The decoder then uses these representations to generate a sequence of component outputs. The complexity of the decoder can range from a basic linear learnable projection that directly maps the latent space to the output layer, to advanced configurations such as a multi-functional attention module that handles label interrelationships via a query mechanism, or an autoregressive decoder that sequences predictions based on prior outputs. The subsequent sections provide a detailed overview of these modules as implemented and assessed in this research.
The encoder module plays a crucial role in image recognition tasks by extracting feature representations from images, transforming the raw image data into latent space vectors suitable for further processing. In this study, we selected several convolutional neural network architectures that have shown excellent performance in image classification, including MobileNet, DenseNet, Inception, Xception, and EfficientNet. These networks have not only performed well in single-label classification tasks but also demonstrated outstanding performance on large-scale datasets such as ImageNet.
We developed the decoder module using two different strategies. The first strategy is the Global Average Pooling (GAP) decoder. This method involves projecting the extracted features into a one-dimensional vector, which is then converted into output logits via a learnable linear projection layer. The number of logits corresponds to the total number of classes. The second strategy employs a group decoding mechanism with attention features, known as the ML-Decoder. Derived from the transformer decoder, this approach aims to efficiently handle the computational complexity inherent in multi-label classification tasks. As the number of categories increases, the computational load typically increases quadratically. To address this, we removed the self-attention blocks and introduced a group decoding system. This system uses a single query token to decode multiple components, thereby reducing the number of query tokens required and optimizing computational efficiency.
By optimizing the design of both the encoder and decoder modules, it was possible to achieve both efficient feature extraction and label prediction in tasks of multi-label classification. The modular construction of the model not only enhances its scalability and computational efficiency but also improves its adaptability to different image recognition tasks. Figure 3 provides a detailed visual representation of two decoders, which allows a more intuitive understanding of their respective working principles and performance differences.
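To make the encoder–decoder pairing concrete, the following is a minimal sketch of the GAP decoding strategy; the choice of a torchvision MobileNetV3 backbone and the 960-dimensional pooled feature are assumptions for illustration, and the backbones and projection sizes used in our experiments may differ.

```python
# Sketch of an encoder + GAP decoder for multi-label fruit classification.
import torch
import torch.nn as nn
from torchvision import models

class GAPMultiLabelClassifier(nn.Module):
    def __init__(self, num_labels=20):
        super().__init__()
        backbone = models.mobilenet_v3_large(weights=None)
        self.encoder = backbone.features               # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.decoder = nn.Linear(960, num_labels)      # linear projection to logits

    def forward(self, x):
        features = self.encoder(x)                     # (B, 960, H', W')
        pooled = self.pool(features).flatten(1)        # (B, 960)
        return self.decoder(pooled)                    # one logit per fruit label

logits = GAPMultiLabelClassifier()(torch.randn(2, 3, 224, 224))   # shape (2, 20)
```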

4.2. Loss Function

In multi-label learning tasks, a commonly used loss function treats each neuron in the output layer as an independent binary classifier and calculates the total binary cross-entropy loss of these classifiers. This paper adopts an improved loss function—Asymmetric Binary Cross-Entropy Loss Function—to enhance the model’s performance when dealing with easy and hard samples.
In order to evaluate the performance of a single binary classifier, the traditional binary cross-entropy loss function is employed. The formula is as follows:
$L = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$
where y is the true label, and p is the predicted probability. In multi-label classification, this loss function is typically extended to sum the losses over all labels.
To more effectively handle easy and hard samples, we introduce the Asymmetric Binary Cross-Entropy Loss. This loss function adjusts the loss calculation method to make the model focus more on hard-to-classify samples during training.
Specifically, the positive and negative parts of the Asymmetric Binary Cross-Entropy Loss function are weighted separately to provide greater flexibility in sample differentiation:
$L_{+} = -(1 - p)^{\gamma_{+}}\, y \log(p), \qquad L_{-} = -p^{\gamma_{-}}\, (1 - y) \log(1 - p)$
where $\gamma_{+}$ is the focusing parameter for the positive sample part and $\gamma_{-}$ is the focusing parameter for the negative sample part.
Total Loss Calculation: The total classification loss is calculated by summing losses over all labels:
$L_{\text{total}} = \sum_{k=1}^{K} L_k$
where $L_k$ is the loss for the $k$-th label. For each label’s loss $L_k$, we apply the above Asymmetric Binary Cross-Entropy Loss formula.
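The following is a minimal sketch of this loss in PyTorch, following the asymmetric formulation above (in the spirit of [36]); the clamping epsilon and the reduction over labels and batch are implementation choices for illustration.

```python
# Sketch of the Asymmetric Binary Cross-Entropy Loss with separate focusing
# parameters for positive and negative terms.
import torch
import torch.nn as nn

class AsymmetricBCELoss(nn.Module):
    def __init__(self, gamma_pos=0.0, gamma_neg=5.0, eps=1e-8):
        super().__init__()
        self.gamma_pos, self.gamma_neg, self.eps = gamma_pos, gamma_neg, eps

    def forward(self, logits, targets):
        p = torch.sigmoid(logits)
        # Positive term, down-weighted by (1 - p)^gamma_pos.
        loss_pos = targets * (1 - p).pow(self.gamma_pos) * torch.log(p.clamp(min=self.eps))
        # Negative term, down-weighted by p^gamma_neg to suppress easy negatives.
        loss_neg = (1 - targets) * p.pow(self.gamma_neg) * torch.log((1 - p).clamp(min=self.eps))
        return -(loss_pos + loss_neg).sum(dim=1).mean()   # sum over labels, mean over batch

criterion = AsymmetricBCELoss(gamma_pos=0.0, gamma_neg=5.0)
```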

Parameter Selection

In this study, we selected the focusing parameters $\gamma_{+} = 0$ and $\gamma_{-} = 5$ based on a combination of literature review and experimental optimization. Initially, we referred to the experimental results and parameter settings from the work of [36] to determine an appropriate range for the focusing parameters. Based on this range, we conducted a grid search to systematically explore different combinations of $\gamma_{+}$ and $\gamma_{-}$, aiming to identify the best parameters that enhance model performance. The final choice of $\gamma_{+} = 0$ and $\gamma_{-} = 5$ was made because these values effectively reduce the contribution of easily classified negative samples to the total loss, thereby encouraging the model to focus more on hard-to-classify samples. This strategy significantly improved the model’s generalization ability and robustness, particularly in scenarios where the sample set did not exhibit severe class imbalance but did include a distinction between easy and hard samples. The experimental results supporting this parameter selection are presented in the subsequent section of the paper.
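As a rough illustration of this search, the snippet below iterates over candidate focusing parameters; the candidate grids and the evaluate() helper (which would train a model with the given loss and return validation mAP) are hypothetical placeholders, not the exact search used in our experiments.

```python
# Illustrative grid search over the focusing parameters of the asymmetric loss.
# evaluate() is a hypothetical helper returning validation mAP for a given loss.
best_map, best_params = 0.0, None
for gamma_pos in [0.0, 1.0, 2.0]:
    for gamma_neg in [2.0, 3.0, 4.0, 5.0, 6.0]:
        val_map = evaluate(AsymmetricBCELoss(gamma_pos, gamma_neg))
        if val_map > best_map:
            best_map, best_params = val_map, (gamma_pos, gamma_neg)
```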

4.3. Multi-Label Knowledge Distillation

In order to further enhance the model’s performance and minimize the number of parameters and computational complexity, we propose the MLKD method [37]. The objective of this method is to enhance the performance of the student network in multi-label learning. This is achieved by transferring knowledge from the teacher network to the student network, while simultaneously reducing the number of parameters and computational cost.
The MLKD approach is specifically designed to address the challenges inherent in multi-label learning, where traditional knowledge distillation methods—originally developed for single-label tasks—often fall short. The teacher models in this framework are typically larger and more complex networks, pretrained on extensive datasets. These teacher models are trained using standard multi-label classification techniques, ensuring they can effectively learn the complex object-label mappings required in multi-label scenarios. The essence of MLKD lies in the transfer of knowledge from the teacher to the student model as shown in Figure 4.
This is achieved through two primary mechanisms: multi-label logits distillation and label-wise embedding distillation. In multi-label logits distillation, the teacher’s output logits, which represent the probability distribution over multiple labels, are used to guide the student’s learning process. The goal is to minimize the divergence between the logits of the teacher and the student, ensuring that the student model approximates the teacher’s decision-making process as closely as possible. In contrast, label-wise embedding distillation focuses on preserving the structural relationships learned by the teacher within the student’s representations. This includes maintaining the compactness of intra-class embeddings (Class-aware Embedding Distillation, CD) and enhancing the dispersion of inter-class embeddings (Instance-aware Embedding Distillation, ID). By doing so, the student model is encouraged to learn not just the correct classifications but also the underlying feature representations that differentiate between classes.
The final objective function is:
$L_{\text{L2D}} = L_{\text{BCE}} + \lambda_{\text{MLD}} L_{\text{MLD}} + \lambda_{\text{CD}} L_{\text{CD}} + \lambda_{\text{ID}} L_{\text{ID}}$
Unlike traditional transfer learning, where a pretrained model is fine-tuned on a new task, MLKD involves a more nuanced process of knowledge transfer. It does not merely adapt the weights of the teacher model to the student model; instead, it actively distills both the output logits and the intermediate feature representations (embeddings). This dual-level distillation allows the student model to inherit the teacher’s knowledge more effectively, leading to better generalization with fewer parameters. Moreover, MLKD reduces the number of parameters in the student model by focusing on essential features and discarding redundant information. Since the teacher model already captures the complex feature interactions necessary for multi-label classification, the student model can be streamlined to retain only the most relevant aspects of this knowledge.
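The following is a minimal sketch of how an objective of this form might be assembled; collapsing the two embedding distillation terms (CD and ID) into a single feature-matching loss and the choice of λ weights are simplifications for illustration, not the exact formulation of [37].

```python
# Sketch of an MLKD-style objective: BCE on the student's own predictions plus
# multi-label logits distillation and a simplified embedding distillation term.
import torch
import torch.nn.functional as F

def mlkd_loss(student_logits, teacher_logits, student_emb, teacher_emb, targets,
              lambda_mld=1.0, lambda_emb=1.0):
    # Base multi-label classification loss.
    l_bce = F.binary_cross_entropy_with_logits(student_logits, targets)
    # Multi-label logits distillation: match the teacher's per-label probabilities.
    l_mld = F.binary_cross_entropy_with_logits(student_logits,
                                               torch.sigmoid(teacher_logits))
    # Simplified stand-in for the label-wise embedding distillation terms (CD, ID).
    l_emb = F.mse_loss(student_emb, teacher_emb)
    return l_bce + lambda_mld * l_mld + lambda_emb * l_emb
```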
The student model, created through knowledge distillation, typically exhibits a reduction in parameters and a decrease in computational complexity when compared to the teacher model. The process of knowledge distillation enables the student model to learn the essential information from the teacher model, thereby facilitating the achievement of efficient feature representation and decision-making at a reduced scale. This renders the student model more suitable for deployment in resource-constrained environments, such as mobile devices and edge computing devices.

4.4. Evaluation Metrics

In this section, we introduce the evaluation metrics used to assess the performance and complexity of our model. The selected metrics are precision, recall, mean Average Precision (mAP), floating-point operations (FLOPs), and the number of parameters, as shown in Table 1. Together, these metrics give a thorough picture of the model’s accuracy, recall capability, overall performance across classes, computational complexity, and scale.
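As a rough illustration of how the classification metrics in Table 1 can be computed, the sketch below uses scikit-learn; thresholding the scores at 0.5 for precision and recall and the micro/macro averaging choices are illustrative assumptions.

```python
# Sketch of multi-label metric computation with scikit-learn.
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

def multilabel_metrics(y_true, y_scores, threshold=0.5):
    y_pred = (y_scores >= threshold).astype(int)
    return {
        "mAP": average_precision_score(y_true, y_scores, average="macro"),
        "precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
    }
```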

5. Experiments and Results

5.1. Experiment Settings

Table 2 provides a comprehensive overview of the experimental setup, including the hyperparameters for each model, the specifications of the training environment, and the configurations of the mobile devices used for testing. This detailed setup ensures the reproducibility of the experiments and validates the performance of the models across different environments.

5.2. Performance Evaluation Metrics Analysis

In our experiments, we evaluated two decoding approaches: a GAP-based decoder and an ML-Decoder. We tested these decoders with various encoders to assess their performance and computational efficiency.
Among the models utilizing the GAP decoder, EfficientNet-B3 achieved the highest mAP of 88.5%, although it required a significant computational cost of 15.7 GFLOPs per image inference. In comparison, MobileNetV3 demonstrated superior computational efficiency, achieving a mAP of 85.2% with only 3.0 GFLOPs per inference. Table 3 illustrates that high computational complexity does not always equate to better performance. For instance, while the ResNet50 and InceptionV4 networks performed well, they did not significantly outperform less complex models.
Our analysis indicated that some of the larger models, such as ResNet101 and InceptionResNetV2, suffered from overfitting as observed in their training and validation accuracy curves. This suggests that their performance is limited by the amount of available data. However, models like EfficientNet-B3 and DenseNet201, despite their size, were able to generalize well on the test dataset, indicating that they were better at handling the variance in the data.
The ML-Decoder, designed to be a drop-in replacement for the GAP-based decoder, did not consistently outperform the GAP decoder across all tested encoders. For instance, when paired with the EfficientNet-B0 encoder, the ML-Decoder achieved a slightly higher mAP of 84.3% compared to the GAP decoder. However, with other encoders, such as MobileNetV3, the ML-Decoder’s performance was slightly inferior, with a mAP of 83.1%. This discrepancy highlights the need for further investigation into the specific conditions under which the ML-Decoder excels.
All models with the GAP decoder achieved a mAP exceeding 80%, substantiating the efficacy of this straightforward yet robust decoding strategy. The performance of the different models and decoders is summarized in Table 3, providing a comprehensive benchmark of their effectiveness and computational requirements.

5.3. Comparison of Different Loss Functions

First, to determine the optimal focusing parameters $\gamma_{+}$ and $\gamma_{-}$ for our Asymmetric Loss, we conducted a series of experiments using a grid search. Figure 5 illustrates the effect of varying $\gamma_{+}$ while keeping $\gamma_{-}$ fixed at several values. The mAP scores were recorded for each combination, and it was observed that $\gamma_{-} = 5$ consistently provided the highest performance when $\gamma_{+} = 0$. The trends in the graph indicate that as $\gamma_{+}$ increases, the mAP score tends to decrease, suggesting that higher values of $\gamma_{+}$ may cause the model to overly focus on easy samples, thus reducing overall accuracy.
Subsequently, this study investigated how various loss functions influence the performance of our optimized multi-label fruit image classification model, as shown in Table 4. Selecting and evaluating these loss functions was crucial for achieving optimal training outcomes. Our detailed analysis highlighted that the choice of loss function plays a critical role in determining model effectiveness, especially in multi-label classification settings. We tested different parameter configurations for the Asymmetric Loss function to identify the most effective setup for improving the model’s mAP. The best result, a mAP of 88.7%, was achieved with the parameters $\gamma_{-} = 5$ and $\gamma_{+} = 0$. This outcome underscores the model’s ability to maintain high accuracy and generalize effectively in complex multi-label classification tasks.

5.4. Performance Metrics before and after Knowledge Distillation

In this section, we present a detailed comparison of the performance metrics of our models before and after applying the knowledge distillation technique. The metrics include the mAP, the number of parameters, and the inference time required for processing 100 new images. The results of our experiments are summarized in Table 5. The table shows that knowledge distillation significantly reduces the number of parameters and inference time while maintaining comparable or even improved mAP scores.
The data presented in Table 5 illustrate the impact of knowledge distillation on various models. After applying the distillation process, we observe a consistent reduction in the number of parameters for all models. For instance, MobileNetV3’s parameters decrease from 5.4 million to 3.2 million, representing a significant reduction in model complexity. Similarly, the inference time required to process 100 images also decreases substantially, demonstrating improved efficiency suitable for deployment in resource-constrained environments.
Despite the reduction in complexity and inference time, the mAP scores remain stable or even show slight improvements. For example, MobileNetV3’s mAP increases from 85.2% to 85.5%, indicating that the knowledge distillation process not only preserves but can also enhance model performance. This balance between reduced computational load and maintained or improved accuracy underscores the effectiveness of the knowledge distillation method in optimizing models for practical applications, particularly in mobile and edge computing scenarios.
In addition to evaluating the inference time, we also analyzed the computation time during the training phase for our models, particularly focusing on the impact of incorporating MLKD. Table 6 summarizes the computation time for training our models compared to other methods.
As shown in Table 6, our method with MLKD requires approximately 4.5 h for training, which is slightly higher than other methods. However, the increased computation time is acceptable, and the reduced inference time and improved mAP score demonstrate that the additional computation time is a worthwhile trade-off for enhanced model performance and efficiency.
These comparisons highlight the practical benefits of knowledge distillation, making it a valuable technique for developing efficient, high-performance models in real-world applications.

5.5. Confusion Matrix Analysis

Figure 6a shows the confusion matrix of our model across the 20 fruit categories before knowledge distillation, and Figure 6b shows the confusion matrix after knowledge distillation. High accuracy is seen in categories such as grape (95.5%), pomegranate (95.0%), and raspberry (94.7%), indicating the model’s strong capability to correctly classify these fruits. Medium accuracy is observed in categories such as apple (91.3%) and strawberry (86.4%). However, lower accuracy is evident for peach (79.2%) and plum (82.7%), suggesting some difficulty in distinguishing these fruits from others. Misclassifications are generally low, with off-diagonal values predominantly below 5%, indicating that while some confusion exists, the model performs reliably in most cases.

5.6. Ablation Experiment

In our research, a series of ablation experiments were conducted to verify the effectiveness of the proposed model optimization strategies. The baseline model utilized a traditional CNN network without any modifications, serving as the performance comparison benchmark. To assess the impact on performance and computational efficiency, a lightweight model was implemented with a streamlined architecture. Additionally, image preprocessing techniques, such as CLAHE and the Gray World algorithm, were introduced to enhance image quality. The effect of improving classification accuracy was evaluated by applying the Asymmetric Binary Cross-Entropy Loss. Lastly, the combined optimization approach integrated the lightweight model architecture, image preprocessing techniques, and loss function modification to examine the overall performance improvement.
The ablation experimental results are summarized in Table 7. The combined optimization approach achieved the highest mAP of 90.2%, demonstrating the effectiveness of integrating these strategies. The baseline model, without any enhancements, achieved a mAP of 81.5%. Incorporating a lightweight architecture improved the mAP to 84.2%, highlighting the benefits of reducing computational load. Image preprocessing further boosted the mAP to 85.4%, showing the importance of enhancing image quality. Applying the Asymmetric Binary Cross-Entropy Loss increased the mAP to 87.1%, and knowledge distillation raised it further to 88.5%. Overall, the combined optimization strategy provided the best results, validating the synergy of these enhancements.

6. Conclusions and Future Work

This study proposes a lightweight multi-label fruit image classification method based on illumination and color balance, aiming to address the challenges of color and brightness deviations caused by varying illumination, environments, and angles in multi-classification problems, while overcoming the computational burden of traditional deep learning models. The application of image preprocessing methods, such as CLAHE and the Gray World algorithm, has proven effective in improving image quality and enhancing the robustness of the classification model against changes in environmental conditions.
To ensure efficient multi-label classification, our study developed a lightweight deep learning model that dramatically lowers computational and storage demands while preserving classification accuracy, allowing it to perform efficiently on devices with limited resources. Our experimental results show that this approach achieved high accuracy and low computational overhead in classifying 20 different fruit types, confirming its practicality and effectiveness in real-world scenarios.
Despite the achievements of this study, there are several directions for further exploration and improvement.
Expand Dataset Diversity: Although the DeepFruit dataset covers a variety of fruit categories, the types and forms of fruits in real-world applications are far more diverse. Future research can further expand the scale and diversity of the dataset to include more types of fruits, enhancing the model’s generalization ability.
Real-Time Applications and Edge Computing: To achieve true real-time fruit classification, it is crucial to explore how to deploy and optimize the model on edge devices. Through edge computing, data transmission delays can be reduced, and classification speed and efficiency can be improved.
Multi-Modal Data Fusion: Current research is primarily based on image data. Future studies could consider integrating other types of data, such as temperature, humidity, and ambient light information. By fusing multi-modal data, the accuracy and robustness of the model’s classification can be improved.
Through further exploration and practice in these research directions, we believe that the performance and application value of fruit image classification technology can be significantly enhanced, providing stronger guidance for the development of smart agriculture and food industry.

Author Contributions

Conceptualization, J.Z. (Juce Zhang) and H.L.; methodology, J.Z. (Juce Zhang), Y.L. and Y.G.; software, Y.G. and J.Z. (Jiayi Zhou); validation, J.Z. (Juce Zhang), Y.G. and Z.Y.; formal analysis, Y.L. and J.Z. (Jiayi Zhou); investigation, C.W.; resources, Y.L. and Z.Y.; data curation, Z.Y.; writing—original draft preparation, J.Z. (Juce Zhang); writing—review and editing, H.L. and Y.L.; visualization, C.W.; supervision, H.L.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Acknowledgments

The authors would like to express their gratitude to the anonymous reviewers for their valuable feedback, which proved instrumental in enhancing the quality of the final version of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Richens, J.G.; Lee, C.M.; Johri, S. Improving the accuracy of medical diagnosis with causal machine learning. Nat. Commun. 2020, 11, 3923. [Google Scholar] [CrossRef] [PubMed]
  2. Ni, J.; Chen, Y.; Chen, Y.; Zhu, J.; Ali, D.; Cao, W. A survey on theories and applications for self-driving cars based on deep learning methods. Appl. Sci. 2020, 10, 2749. [Google Scholar] [CrossRef]
  3. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10023–10033. [Google Scholar]
  4. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710. [Google Scholar]
  5. Nandi, C.S.; Tudu, B.; Koley, C. An automated machine vision based system for fruit sorting and grading. In Proceedings of the 2012 Sixth International Conference on Sensing Technology (ICST), Kolkata, India, 18–21 December 2012; pp. 195–200. [Google Scholar]
  6. Tian, H.; Wang, T.; Liu, Y.; Qiao, X.; Li, Y. Computer vision technology in agricultural automation—A review. Inf. Process. Agric. 2020, 7, 1–19. [Google Scholar]
  7. Meenu, M.; Kurade, C.; Neelapu, B.C.; Kalra, S.; Ramaswamy, H.S.; Yu, Y. A concise review on food quality assessment using digital image processing. Trends Food Sci. Technol. 2021, 118, 106–124. [Google Scholar] [CrossRef]
  8. Mendoza, F.; Dejmek, P.; Aguilera, J.M. Calibrated color measurements of agricultural foods using image analysis. Postharvest Biol. Technol. 2006, 41, 285–295. [Google Scholar] [CrossRef]
  9. Xiang, Q.; Wang, X.; Li, R.; Zhang, G.; Lai, J.; Hu, Q. Fruit image classification based on Mobilenetv2 with transfer learning technique. In Proceedings of the 3rd International Conference on Computer Science and Application Engineering, Sanya, China, 22–24 October 2019; pp. 1–7. [Google Scholar]
  10. Yang, Y. Fruit Image Classification Using Convolution Neural Networks. Highlights Sci. Eng. Technol. 2023, 34, 110–119. [Google Scholar] [CrossRef]
  11. Gill, H.S.; Khehra, B.S. Fruit image classification using deep learning. Multica Sci. Technol. 2022, 2, 38–41. [Google Scholar]
  12. Hossain, M.S.; Al-Hammadi, M.; Muhammad, G. Automatic fruit classification using deep learning for industrial applications. IEEE Trans. Ind. Inform. 2018, 15, 1027–1034. [Google Scholar] [CrossRef]
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Sukhetha, P.; Hemalatha, N.; Sukumar, R. Classification of fruits and vegetables using ResNet model. agriRxiv 2021, 1–5. [Google Scholar] [CrossRef]
  17. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. Deepfruits: A fruit detection system using deep neural networks. Sensors 2016, 16, 1222. [Google Scholar] [CrossRef]
  18. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  19. Jähne, B. Digital Image Processing; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
  20. Pizer, S.M.; Amburn, E.P.; Austin, J.D.; Cromartie, R.; Geselowitz, A.; Greer, T.; ter Haar Romeny, B.; Zimmerman, J.B.; Zuiderveld, K. Adaptive histogram equalization and its variations. Comput. Vis. Graph. Image Process. 1987, 39, 355–368. [Google Scholar] [CrossRef]
  21. Gao, Q.; Long, T.; Zhou, Z. Mineral identification based on natural feature-oriented image processing and multi-label image classification. Expert Syst. Appl. 2024, 238, 122111. [Google Scholar] [CrossRef]
  22. Buchsbaum, G. A spatial processor model for object colour perception. J. Frankl. Inst. 1980, 310, 1–26. [Google Scholar] [CrossRef]
  23. Xu, J.; Tu, L.; Zhang, Z.; Qiu, X. A medical image color correction method base on supervised color constancy. In Proceedings of the 2008 IEEE International Symposium on IT in Medicine and Education, Xiamen, China, 12–14 December 2008; pp. 673–678. [Google Scholar]
  24. Land, E.H.; McCann, J.J. Lightness and retinex theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef] [PubMed]
  25. Rahman, Z.u.; Jobson, D.J.; Woodell, G.A. Multi-scale retinex for color image enhancement. In Proceedings of the 3rd IEEE international Conference on Image Processing, Lausanne, Switzerland, 19 September 1996; Volume 3, pp. 1003–1006. [Google Scholar]
  26. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 18–20 June 2018; pp. 6848–6856. [Google Scholar]
  28. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  29. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  30. Li, Y.; Wang, H.; Dang, L.M.; Nguyen, T.N.; Han, D.; Lee, A.; Jang, I.; Moon, H. A deep learning-based hybrid framework for object detection and recognition in autonomous driving. IEEE Access 2020, 8, 194228–194239. [Google Scholar] [CrossRef]
  31. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  32. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  33. Ab Wahab, M.N.; Nazir, A.; Ren, A.T.Z.; Noor, M.H.M.; Akbar, M.F.; Mohamed, A.S.A. Efficientnet-lite and hybrid CNN-KNN implementation for facial expression recognition on raspberry pi. IEEE Access 2021, 9, 134065–134080. [Google Scholar] [CrossRef]
  34. Latif, G.; Mohammad, N.; Alghazo, J. DeepFruit: A dataset of fruit images for fruit classification and calories calculation. Data Brief 2023, 50, 109524. [Google Scholar] [CrossRef] [PubMed]
  35. Zimmerman, J.B.; Pizer, S.M.; Staab, E.V.; Perry, J.R.; McCartney, W.; Brenton, B.C. An evaluation of the effectiveness of adaptive histogram equalization for contrast enhancement. IEEE Trans. Med. Imaging 1988, 7, 304–312. [Google Scholar] [CrossRef] [PubMed]
  36. Ridnik, T.; Ben-Baruch, E.; Zamir, N.; Noy, A.; Friedman, I.; Protter, M.; Zelnik-Manor, L. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 82–91. [Google Scholar]
  37. Yang, P.; Xie, M.K.; Zong, C.C.; Feng, L.; Niu, G.; Sugiyama, M.; Huang, S.J. Multi-label knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17271–17280. [Google Scholar]
  38. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Figure 1. Examples of the DeepFruit dataset.
Figure 2. Examples of illumination balance and color balance.
Figure 3. The meta-structure used for the evaluated models.
Figure 4. Knowledge distillation process.
Figure 5. Effect of asymmetric focusing parameters on mAP score.
Figure 6. Comparison of confusion matrices before and after knowledge distillation. (a) Confusion matrix of different categories before knowledge distillation. (b) Confusion matrix of different categories after knowledge distillation.
Table 1. Evaluation metrics, calculation methods, and significance.

Metric | Calculation | Significance
Precision | $\text{Precision} = \frac{TP}{TP + FP}$ | The proportion of true positive samples among those predicted as positive, providing insight into the accuracy of positive predictions.
Recall | $\text{Recall} = \frac{TP}{TP + FN}$ | The proportion of actual positive samples correctly predicted as positive, reflecting the model’s capacity to discern positive samples.
mAP | $\text{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i$, with $AP = \frac{\sum_{k=1}^{n} P(k) \times \text{rel}(k)}{\text{number of relevant documents}}$ | The average precision across all classes, evaluating overall performance for information retrieval and classification systems, especially multi-label classification problems.
FLOPs | $\text{FLOPs} = H_{\text{out}} \times W_{\text{out}} \times C_{\text{out}} \times (C_{\text{in}} \times K_H \times K_W) \times 2$ | The computational complexity of the model, counted as the floating-point operations required for a single forward pass.
Parameters | $\text{Parameters} = C_{\text{out}} \times (C_{\text{in}} \times K_H \times K_W + 1)$ | The total number of trainable parameters, indicating the model’s scale and potential capacity.
Table 2. Experimental setup: model hyperparameters, training environment, and mobile testing environment.

Category | Specification
Model Hyperparameters |
ResNet50 (Baseline) | Epochs: 50, Optimizer: SGD, Learning Rate: 0.001, Batch Size: 32, Momentum: 0.9, Weight Decay: 0.0001
MobileNet | Epochs: 50, Optimizer: Adam, Dropout Rate: 0.2, Learning Rate: 0.001, Batch Size: 32
DenseNet | Epochs: 50, Optimizer: SGD, Weight Decay: 0.0005, Momentum: 0.9, Learning Rate: 0.0001, Batch Size: 16
EfficientNet | Epochs: 50, Optimizer: RMSprop, Dropout Rate: 0.3, Learning Rate: 0.001, Batch Size: 32, Weight Decay: 0.0001
Training Environment |
GPU | NVIDIA Tesla V100 (16 GB)
CPU | Intel Xeon Gold 6248 (2.5 GHz, 20 cores)
RAM | 256 GB
Mobile Test Environment |
Google Pixel 4 | CPU: Snapdragon 855, RAM: 6 GB, OS: Android 11, Framework: TFLite
Table 3. Performance evaluation and benchmark.

Encoder | Decoder | mAP (%) | GFLOPs | MParams
DenseNet121 | GAP | 84.7 | 22.6 | 7.6
DenseNet169 | GAP | 85.9 | 26.9 | 13.6
DenseNet201 | GAP | 86.4 | 34.3 | 19.4
MobileNetV1 | GAP | 82.1 | 4.5 | 3.8
MobileNetV3 | GAP | 85.2 | 3.0 | 5.4
EfficientNet-B0 | GAP | 84.3 | 4.6 | 5.3
EfficientNet-B1 | GAP | 83.8 | 7.8 | 8.6
EfficientNet-B2 | GAP | 85.0 | 12.9 | 11.6
EfficientNet-B3 | GAP | 88.5 | 15.7 | 18.7
ResNet50 | GAP | 84.1 | 25.0 | 23.5
InceptionV4 | GAP | 83.7 | 20.0 | 22.2
DenseNet121 | ML-Decoder | 83.6 | 22.6 | 7.6
MobileNetV3 | ML-Decoder | 83.1 | 3.0 | 5.4
EfficientNet-B0 | ML-Decoder | 84.3 | 4.6 | 5.3
Table 4. Comparison of different loss functions.

Loss Function | Parameter Setting | mAP (%)
Binary Cross-Entropy | – | 85.12
Focal Loss [38] | $\alpha = 0.25$, $\gamma = 2$ | 86.45
Asymmetric Loss | $\gamma_{-} = 5$, $\gamma_{+} = 0$ | 88.7
Table 5. Performance metrics before and after knowledge distillation.

Model | mAP (%) | Parameters (M) | Inference Time (s)
Before Distillation |  |  |
MobileNetV3 | 85.2 | 5.4 | 1.25
DenseNet121 | 84.7 | 7.6 | 1.41
EfficientNet-B0 | 84.3 | 5.3 | 1.22
After Distillation |  |  |
MobileNetV3 | 85.5 | 3.2 | 0.55
DenseNet121 | 85.0 | 4.8 | 0.70
EfficientNet-B0 | 84.5 | 3.1 | 0.68
Table 6. Comparison of computation time across different models.

Model | Computation Time (hours) | Inference Time (s)
ResNet50 | 3.5 | 1.30
InceptionV4 | 4.2 | 1.35
MobileNetV3 | 2.8 | 1.25
DenseNet121 | 3.0 | 1.41
EfficientNet-B0 | 3.2 | 1.22
Our Method (with MLKD) | 4.5 | 0.68
Table 7. Ablation experiment results.

Ablation Experiment | mAP (%)
Baseline Model | 81.5
Lightweight Model | 84.2
Image Preprocessing | 85.4
Loss Function Modification | 87.1
Knowledge Distillation | 88.5
Combined Optimization | 90.2

