Article

A Banana Ripeness Detection Model Based on Improved YOLOv9c Multifactor Complex Scenarios

1 College of Intelligent Equipment, Shandong University of Science and Technology, Taian 271019, China
2 Technology Department, Shandong Xinhuaan Information Technology Co., Ltd., Qingdao 266041, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(2), 231; https://doi.org/10.3390/sym17020231
Submission received: 19 December 2024 / Revised: 20 January 2025 / Accepted: 21 January 2025 / Published: 5 February 2025
(This article belongs to the Section Computer)

Abstract

With the advancement of machine vision technology, deep learning and image recognition have become research hotspots in the non-destructive testing of agricultural products, and using machine vision to identify the ripeness stages of fruit is attracting increasingly wide attention. During the ripening process, bananas undergo significant changes in appearance and nutrient content, often leading to damage and food waste. Furthermore, the transportation and sale of bananas are subject to time-related spoilage, so staff must accurately assess banana ripeness to avoid unwarranted economic losses for farmers and the market. Considering the complexity and diversity of testing environments, a detection model should account for factors such as strong and weak lighting and image symmetry (real scenes yield symmetrical banana images from different viewing angles, which the model must handle consistently to ensure stability), while also suppressing noise present in the image itself. To address these challenges, we propose methods to improve banana ripeness detection accuracy under complex environmental conditions. Experimental results demonstrate that the improved ESD-YOLOv9 model achieves high accuracy under these conditions.

1. Introduction

Bananas are one of the world’s most important crops [1] and a key commodity in international trade due to their widespread global sales [2]. Bananas are cultivated in over 120 countries and regions, with China’s primary production areas including Guangdong, Guangxi, Yunnan, Fujian, and Taiwan. Ethylene levels primarily influence the ripening process of bananas during and after harvest. During the initial post-harvest phase, bananas release minimal ethylene; however, as they mature, the ethylene emission rate rises, hastening the respiratory peak. This process further stimulates ethylene production, accelerating ripening. A sequence of physiological transformations associated with ripening and senescence takes place as banana fruits mature. These changes include a transition of the peel from green to yellow and the pulp becoming soft and sweet. As the fruit continues to ripen and senesce, bananas may develop black spots or decay [3,4]. However, consumers usually select bananas based on their appearance, so black spots or signs of decay on overripe bananas are unacceptable in the market. Moreover, due to their high respiratory activity, unripe bananas must undergo ripening before sale to ensure desirable flavor, quality, and uniform skin color [5]. Under normal circumstances, consumers do not buy unripe bananas because of these quality issues. Both under-ripeness and over-ripeness significantly reduce a banana’s economic value, leading to losses for supermarkets. Consequently, it is imperative to assess the ripeness of bananas prior to sale, as it directly affects their sensory quality and economic value.
According to international standards, fruit quality testing should consider three aspects: ripeness, geometry, and defects [6]. Ripeness can generally be judged from the color grade of the fruit. Traditional ripeness evaluation relies on experts manually assessing the appearance of the fruit, which is time-consuming and labor-intensive and can lead to errors of judgment due to external factors [7]. In the field of fruit classification and recognition, two predominant methodologies are used: the first extracts features such as texture, color, and shape and then builds a classifier using conventional machine learning techniques; the second develops a deep neural network to enhance the accuracy of fruit classification and recognition. The deployment of deep neural networks is more appropriate for agricultural production on economic grounds. In fruit ripeness detection, numerous studies employ multispectral and hyperspectral sensors to capture images. However, multispectral and hyperspectral equipment generally costs USD 3000 to USD 10,000, whereas a visible-light camera or smartphone costs approximately USD 100 to USD 150. Given that the agricultural production process does not readily accommodate high-cost technologies, the latter is better aligned with practical research and applications [8]. Accordingly, deep learning techniques have been rapidly integrated into various areas of agricultural research, including fruit ripeness detection in complex environments. Nisa, Y. A. developed an Ambon banana ripeness grading system that uses convolutional neural networks (CNNs) for image enhancement to increase the contrast of banana images; the data were classified using the CNN method with an accuracy of 95.87% [9]. Arunima, P. L. introduced a CNN-based deep learning approach for classifying banana ripeness, and the developed CNN model achieved an accuracy of 95% [10]. Inspired by the mammalian olfactory system, Zhao developed a portable system for predicting fruit ripeness based on colorimetric sensing combinatorics and a deep convolutional neural network (DCNN); the system achieved an accuracy of 97.39% on the validation dataset and 82.20% on the test dataset for the assessment of fruit ripeness [11]. Wang utilized an enhanced AlexNet model to classify the ripeness of bananas with an accuracy of 96.67% [12]. In studies of fruit maturity detection using convolutional neural networks, researchers have achieved high accuracy, largely attributable to two-stage target detection algorithms; single-stage target detection algorithms are fast but slightly weaker in accuracy. For the YOLO series, a popular family of single-stage target detection algorithms, improving detection accuracy is therefore one of researchers' primary goals. Wei proposed a detection model based on an improved YOLOv3 algorithm for the ripening stages of Shine-Muscat grapes; the improved model yielded an average precision (AP) of 96.73% and an F1 score of 91% on the test dataset, increases of 3.87% and 3%, respectively, over the original network model [13]. Gai, R. proposed an enhanced YOLOv4 deep learning algorithm for detecting the ripeness of cherry fruits; the F1 score of the enhanced model is 94.7, an improvement of 0.15 over the original model [14]. Despite the higher F1 value obtained in that study, detection in complex environments still requires human intervention to improve picking efficiency. Wang proposed an enhanced target detection algorithm based on YOLOv5n for the real-time recognition and ripeness detection of cherry tomatoes, achieving an average accuracy of 95.2% [15]; the algorithm adds a contextual attention (CA) module after the C3 module in the backbone network to mitigate interference from complex backgrounds, providing a valuable reference for detecting cherry tomato ripeness across various scenes. Yang enhanced the YOLOv7 model to address the low recognition accuracy caused by the high density, severe occlusion, and overlap of apple fruits; the model's mAP values at IoU thresholds of 0.5 and 0.5:0.95 for apple recognition are 96.5% and 92.3%, respectively, improvements of 5.6% and 8.7% over the baseline model [16]. Kazama, E. H. utilized an enhanced YOLOv8 model, termed RFCAConv, which incorporates corrected convolutional blocks for classifying coffee fruit ripening stages; the model achieved 74.20% mAP at an IoU threshold of 0.5, a 1.9% improvement over the standard YOLOv8n model [17]. That work is dedicated to further lightweighting the model to advance intelligent harvesting systems, so there is room for improvement. Chen used an improved network model based on YOLOv8 for detecting strawberry ripening; the precision, recall, mAP0.5, and F1 scores of the improved model in complex environments were 88.20%, 89.80%, 92.10%, and 88.99%, higher than those of the original YOLOv8 network by 4.8%, 2.9%, 2.05%, and 3.88%, respectively [18]. Li accurately graded the ripeness of sweet cherries in complex environments (including sunny, cloudy, under branch and leaf shade, and distant views) using the YOLOX model; the average precision values of the improved model for unripe, semi-ripe, and ripe sweet cherries were 84.18%, 76.66%, and 82.47%, respectively [19]. Despite advancements in deep learning for fruit ripeness detection, challenges remain in achieving sufficient detection accuracy and efficiency in complex natural scenarios influenced by lighting, angles, noise, and other factors. To address low detection accuracy in multi-factor, complex scenarios, we integrate the SENetV2 attention mechanism to enhance feature extraction, employ DualConv group convolution to optimize the YOLOv9 downsampling convolutional layers for better feature processing, and adopt the EIoU loss function to further boost accuracy. Our objective is to enhance detection accuracy based on YOLOv9c, and experiments confirm that the improved ESD-YOLOv9 model achieves high accuracy in complex, multi-factor environments.
The main contributions of this paper are as follows:
  • A dataset comprising four distinct ripeness levels of bananas was constructed, and a range of data augmentation techniques, including strong-light and low-light adjustment, noise injection, rotation, and flipping, were employed to enlarge the dataset. These techniques simulate the multifactorial, complex conditions encountered during banana harvesting and distribution, thereby enhancing the diversity and robustness of the dataset.
  • The introduction of the SENetV2 attention mechanism strengthens the network model’s feature extraction capabilities in complex, multifactorial scenarios and sharpens the model’s focus on banana maturity features without a substantial increase in computational expense.
  • For banana detection in scenarios with similarly shaped targets, improving the loss function sharpens the accuracy and efficiency of bounding box regression; the resulting better bounding box alignment further improves the accuracy and efficiency of model detection.
  • We propose a banana ripeness detection model based on YOLOv9c, which enhances the accuracy of ripeness detection by integrating the SENetV2 attention mechanism, refining the model’s downsampling method through DualConv group convolution, and introducing the EIoU loss function to augment the accuracy and efficiency of bounding box regression. This model serves as a valuable technical reference for the accurate detection of banana ripeness.

2. Materials and Methods

2.1. Image Collection and Dataset Construction

Bananas are beloved tropical fruits. They have many varieties and grow in a wide range of areas. Based on their uses, bananas are categorized into edible and ornamental types. Edible bananas can be classified into several types, including sweet bananas, red bananas, small bananas, and plantains [20]. Bananas are indigenous to Southeast Asia and South Asia [21]. The cultivation of bananas necessitates consideration of temperature, light, rainfall, and humidity, as well as environmental factors such as soil, altitude, pests, and diseases [22]. Consequently, global banana cultivation is primarily concentrated in tropical and subtropical regions [21]. Two banana varieties were used in this study, both from vegetable and fruit plantations in Shouguang City, Weifang City, Shandong Province, China. The two varieties were the Golden Finger Banana and the Small Banana. The former has golden yellow skin, full flesh, and a delicate taste; the fruit shape is usually straight or slightly curved, with a pointed or blunt tip [23]. Its skin is turquoise at the green-mature stage and golden yellow when ripe. The latter is small and exquisite in appearance, resembling an ordinary banana in miniature, with a mostly turquoise skin that turns dark yellow when ripe [23]. Both varieties are grown in greenhouses most of the time. As bananas ripen with longer storage time, the color of the peel changes. The freshly harvested bananas, free from any damage or defects, were stored in the laboratory at a temperature of 25 ± 3 °C and relative humidity of 50 ± 5% after being acquired from the plantations. To ensure the generalizability of the experimental results, we included Small Bananas at the ripening and over-ripening stages, along with bananas that had rotted due to storage or growth issues, to complete the dataset. The banana dataset we constructed comprised four maturity levels: unripe, ripe, overripe, and rotten. Compared to banana datasets in other agricultural fields, our dataset is more targeted, focusing on specific research needs. Furthermore, constructing the dataset ourselves allowed strict definition of the data collection methods and indicators, thereby ensuring data uniformity and reliability. Most importantly, our dataset aligns closely with the requirements of the machine learning model, ensuring that the research meets our expected objectives.

2.1.1. Image Acquisition

The banana images utilized in this study were photographed using a mobile device, a cell phone manufactured in Shenzhen, China. Images were captured from all angles while maintaining a shooting distance of 5–20 cm. Photographs of the bananas were taken from their initial day in the laboratory storage room until the onset of numerous black spots and rot. Among them, there were 642 images of the unripe stage, 1063 images of the ripe stage, 872 images of the overripe stage, and 432 images of the rotten stage. The images at this stage were all collected in the laboratory (117°07.9866′ E, 36°12.0015′ N); the banana variety was the Golden Finger Banana. In addition, in order to obtain as many pictures of bananas from growth to rot as possible, we purchased overripe and rotten Small Bananas from the market for image collection, adding 110 images of the overripe stage and 92 images of the rotten stage; this image collection was also completed in the laboratory. The whole process spanned 20 days, resulting in 3211 raw images after excluding low-quality photographs. The images had a resolution of 416 × 416 pixels and were saved in JPG format. As shown in Figure 1, the datasets were categorized into four classes (unripe, ripe, overripe, and rotten) based on the bananas’ stages of ripeness, with a class ratio of approximately 2:4:4:1.

2.1.2. Data Annotation and Dataset Production

Bananas were labeled using the labeling software LabelImg with the following rules: (1) The smallest possible outer rectangle was selected as the bounding box, ensuring that the target was entirely enclosed and closely aligned with the boundaries of the box. (2) Each target was assigned a distinct bounding box, with no sharing of boxes between multiple targets. (3) For targets near the image edge, the bounding box was maximized to include as many bananas as possible. (4) In contrast, fruits with indistinct or difficult-to-recognize features were not labeled. We saved the labeled results as .txt files in YOLO format and labeled the bananas taken at different ripeness periods, after which, they were saved in the corresponding files, respectively. We divided the datasets into a training set and a test set at a ratio of 4:1, obtaining 2560 images for training and 641 images for testing.
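To make the label format and split concrete, the sketch below copies YOLO-format images and their LabelImg-produced .txt labels into an 80/20 train/test layout. The directory names, the fixed random seed, and the helper function itself are illustrative assumptions, not the exact tooling used in this study.

```python
# Minimal sketch of a 4:1 train/test split for YOLO-format data.
# Directory names and the seed are illustrative assumptions.
import random
import shutil
from pathlib import Path

CLASSES = ["unripe", "ripe", "overripe", "rotten"]  # class ids 0..3

def split_dataset(img_dir: str, out_dir: str, train_ratio: float = 0.8) -> None:
    images = sorted(Path(img_dir).glob("*.jpg"))
    random.seed(42)                       # reproducible split
    random.shuffle(images)
    n_train = int(len(images) * train_ratio)
    for subset, subset_imgs in (("train", images[:n_train]),
                                ("test", images[n_train:])):
        for img in subset_imgs:
            # LabelImg saves one .txt per image in YOLO format:
            # "class x_center y_center width height" (normalized 0-1)
            label = img.with_suffix(".txt")
            dst = Path(out_dir) / subset
            (dst / "images").mkdir(parents=True, exist_ok=True)
            (dst / "labels").mkdir(parents=True, exist_ok=True)
            shutil.copy(img, dst / "images" / img.name)
            if label.exists():
                shutil.copy(label, dst / "labels" / label.name)

split_dataset("raw_bananas", "banana_dataset")
```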

2.2. Data Augmentation and Expansion

An insufficient training sample size may result in overfitting [24], causing the model to perform well on trained data but poorly on new datasets and yielding an under-generalized model. To increase feature diversity, prevent overfitting, and enhance the generalization ability and robustness of the model, we added collected banana images of the same varieties from different periods to expand the sample. To preserve the authenticity of the resulting images, we took into account the symmetry of the banana images, where symmetry refers to the property of an object that remains unchanged under a transformation such as rotation, reflection, or translation. We further expanded the dataset by applying data augmentation techniques such as rotation, scaling, flipping, noise injection, and brightness adjustment, as shown in Figure 2. We finally obtained 10,115 banana images from different time periods. We distributed the augmented images among the categories according to the original ratio, finally obtaining 8092 banana images in the training set and 2023 in the test set. These operations not only expanded the dataset but also simulated complex real-world scenes through the injection of various types of noise, mimicking interference from multiple factors.
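As an illustration of the augmentations described above, the sketch below composes rotation, flipping, mild scaling, brightness jitter, and Gaussian noise injection with torchvision. The specific parameter values (rotation range, noise strength, crop scale) are assumptions for demonstration, not the exact settings used to build the dataset.

```python
# Hedged sketch of the augmentation pipeline described above.
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Inject pixel noise to mimic sensor/transmission interference."""
    def __init__(self, std: float = 0.05):
        self.std = std
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return (x + torch.randn_like(x) * self.std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                # rotated views
    transforms.RandomHorizontalFlip(p=0.5),               # mirrored (symmetric) views
    transforms.RandomResizedCrop(416, scale=(0.8, 1.0)),  # mild scaling at 416 x 416
    transforms.ColorJitter(brightness=0.5),               # strong/weak lighting
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),                           # simulated noise
])
```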

2.3. YOLOv9 Detection Network

There are two main classes of deep learning-based target detection methods: the first comprises two-stage target detection algorithms based on region proposals, such as R-CNN [25], Fast R-CNN [26], and Faster R-CNN [27]; the second comprises single-stage target detection algorithms based on regression, such as YOLO [28,29,30], RetinaNet [31], and EfficientDet [32]. Among the latter, the regression-based target detection algorithm YOLOv1 has received considerable attention from researchers since it was proposed by Redmon [28] in 2016. As of 2024, the YOLO network family has been updated to its 11th generation. Compared with other versions of the YOLO series, the advantages of YOLOv9c lie mainly in its strong balance between inference speed and accuracy, as well as its optimizations for small-target detection and complex scenes.

2.4. Construction of Model

2.4.1. ESD-YOLOv9 Model

YOLOv9 is one of the more innovative versions of the YOLO series of target detection networks, known for its powerful detection performance. The publicly available YOLOv9c demonstrates superior recognition accuracy and detection performance, making it the foundational network for this study. Dataset validation confirms that although YOLOv9 [33] exhibits robust detection performance, bananas are more susceptible to interference in complex backgrounds. Furthermore, the intrinsic noise in the images necessitates a highly accurate and stable network model for detecting banana ripeness. To address these challenges, we propose enhancements to the YOLOv9c model in the following three respects. First, a DualConv group convolution layer is introduced to improve the model's downsampling method; compared with the Conv layer of the original model, DualConv group convolution further improves accuracy while reducing computational cost. Second, to augment the model's feature extraction capabilities, we incorporate a novel attention mechanism, SENetV2, which facilitates the representation of all stages of banana ripeness within complex scenarios. Third, informed by analysis of datasets of bananas at different maturity levels in similarly shaped scenarios, we introduce the EIoU loss function to improve bounding box regression precision and efficiency; the bounding box alignment process is also refined to further enhance ripeness detection accuracy. The improved network structure is shown in Figure 3.

2.4.2. SENetV2

SENetV2, proposed by Narayanan [34], is an innovative feature extraction mechanism that enhances performance by adjusting channel relationships within the convolutional network. The efficiency of feature extraction is significantly improved through the fusion of spatial, channel, and global representations within the network structure. SENetV2 further refines feature representation and the integration of global information via a multi-branch architecture. This enhancement is achieved through a novel module, squeeze-and-aggregate excitation (SaE), which extends the squeeze and excitation operations of SENetV1 across multi-branch fully connected layers, thereby improving the network's global representation learning. Figure 4 illustrates the internal mechanism of the SaE module. The outputs from the convolutional layer are directed into multi-branch fully connected (FC) layers, which are approximately sized at 32 with a base of 4, followed by the excitation process. The split inputs are recombined at the end to restore their original dimensions.
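A minimal PyTorch sketch of the SaE idea follows: squeeze by global average pooling, pass the pooled vector through several parallel FC branches, concatenate, and excite back to a per-channel weight. The branch count and reduction ratio below are illustrative assumptions; the authors' exact configuration may differ.

```python
# Hedged sketch of squeeze-and-aggregate excitation (SaE), after [34].
import torch
import torch.nn as nn

class SaE(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, branches: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # global context per channel
        self.branches = nn.ModuleList([         # multi-branch FC layers
            nn.Sequential(nn.Linear(channels, channels // reduction),
                          nn.ReLU(inplace=True))
            for _ in range(branches)
        ])
        self.excite = nn.Sequential(            # recombine to original width
            nn.Linear(branches * (channels // reduction), channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)
        a = torch.cat([branch(s) for branch in self.branches], dim=1)
        w = self.excite(a).view(b, c, 1, 1)
        return x * w                            # channel-wise reweighting

print(SaE(64)(torch.randn(1, 64, 52, 52)).shape)  # torch.Size([1, 64, 52, 52])
```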

2.4.3. DualConv

DualConv, proposed by Zhong et al. [35], is an innovative convolutional architecture that distributes N convolutional filters into G groups. Each group processes the entire input feature map: M/G channels are processed by both 3 × 3 and 1 × 1 convolutional kernels, while the remaining (M − M/G) channels are processed exclusively by 1 × 1 convolutional kernels. The arrangement of the 3 × 3 and 1 × 1 convolutional kernels is summarized in Figure 5. M corresponds to the number of input channels (i.e., the depth of the input feature maps), N signifies the number of convolutional filters, corresponding to the number of output channels (i.e., the depth of the output feature maps), and G denotes the number of groups involved in group and dual convolutions. In the DualConv architecture, 3 × 3 convolutional kernels extract spatial features from the feature map, while 1 × 1 convolutional kernels perform feature integration. Each group's convolutional kernels process a subset of the channels independently before the outputs are concatenated. This approach facilitates an efficient exchange of information and integration across different feature map channels. Consequently, DualConv not only preserves the network's depth and representational capacity but also mitigates computational complexity.
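The sketch below mirrors this structure: a grouped 3 × 3 convolution extracts spatial features on per-group channel subsets while a full 1 × 1 convolution integrates features across all channels, and the two outputs are fused by summation. The group count and the sum-based fusion are assumptions consistent with the description in [35], not necessarily the authors' exact implementation.

```python
# Hedged sketch of the DualConv idea: grouped 3x3 + pointwise 1x1, summed.
import torch
import torch.nn as nn

class DualConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, g: int = 4):
        super().__init__()
        # grouped 3x3: each of the G groups sees in_ch // G input channels
        self.gc = nn.Conv2d(in_ch, out_ch, 3, stride, 1, groups=g)
        # pointwise 1x1 over all channels for feature integration
        self.pwc = nn.Conv2d(in_ch, out_ch, 1, stride, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gc(x) + self.pwc(x)

x = torch.randn(1, 64, 52, 52)
print(DualConv(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 26, 26])
```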

2.4.4. EIoU Loss Function

The loss function of the original YOLOv9c comprises three components: classification loss, object confidence loss, and localization loss. The localization loss is represented by the CIoU loss function, which is defined as follows:
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha\nu,$$
$$\mathrm{Loss}_{\mathrm{CIoU}} = 1 - \mathrm{CIoU},$$
$$\alpha = \frac{\nu}{(1 - \mathrm{IoU}) + \nu},$$
$$\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2,$$
where $\mathrm{IoU}$ represents the intersection over union of the predicted frame with the ground-truth frame; $b$ indicates the center of the predicted frame; $b^{gt}$ signifies the center of the ground-truth frame; $\rho^2(\cdot,\cdot)$ is the squared Euclidean distance between the centers of the predicted and ground-truth frames; $c$ is the length of the diagonal of the smallest enclosing rectangle for both the predicted and ground-truth frames; $\nu$ denotes the shape loss, which assesses the agreement between the predicted and ground-truth frames with respect to the aspect ratios of width and height; and $w$ and $h$ correspond to the width and height of the frame, respectively.
The CIoU loss captures the discrepancy in aspect ratio through the shape loss term $\nu$. However, when the predicted box has the same aspect ratio as the ground-truth box, $\nu$ vanishes even if their widths and heights differ, rendering the term ineffective and potentially leading to an unreasonable optimization of similarity by the loss function. Therefore, the EIoU loss function [36] is introduced as a new loss function, defined as follows:
$$\mathrm{Loss}_{\mathrm{EIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{(w^c)^2 + (h^c)^2} + \frac{\rho^2\left(w, w^{gt}\right)}{(w^c)^2} + \frac{\rho^2\left(h, h^{gt}\right)}{(h^c)^2},$$
The EIoU loss function comprises three components: the IoU loss, the center-distance loss, and the width-height loss. It optimizes target detection through the following mechanisms: (1) It enhances the positioning accuracy of the bounding box by minimizing the distance between the centers of the predicted and ground-truth boxes. (2) It ensures that the predicted boxes are closer in shape to the ground-truth boxes by directly penalizing discrepancies in height and width. (3) The loss is normalized by the size of the smallest enclosing box containing both the predicted and ground-truth boxes, increasing its sensitivity to the object's size and position. The EIoU loss thus preserves the beneficial aspects of the CIoU loss while directly reducing the discrepancies in width and height between the predicted and ground-truth boxes. This refinement enhances regression performance, leading to more rapid convergence and more accurate localization.
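For concreteness, the toy sketch below evaluates the EIoU loss defined above for a single predicted/ground-truth box pair in (x1, y1, x2, y2) form. The box coordinates are made-up values, and the code is a worked example rather than the training implementation.

```python
# Worked sketch of the EIoU loss for axis-aligned boxes (x1, y1, x2, y2).
import torch

def iou_terms(p: torch.Tensor, t: torch.Tensor):
    """Return IoU, squared center distance, and enclosing-box size."""
    inter_w = (torch.min(p[2], t[2]) - torch.max(p[0], t[0])).clamp(min=0)
    inter_h = (torch.min(p[3], t[3]) - torch.max(p[1], t[1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    area_t = (t[2] - t[0]) * (t[3] - t[1])
    iou = inter / (area_p + area_t - inter)
    # squared distance between box centers
    rho2 = ((p[0] + p[2] - t[0] - t[2]) ** 2
            + (p[1] + p[3] - t[1] - t[3]) ** 2) / 4
    # width/height of the smallest enclosing box
    cw = torch.max(p[2], t[2]) - torch.min(p[0], t[0])
    ch = torch.max(p[3], t[3]) - torch.min(p[1], t[1])
    return iou, rho2, cw, ch

def eiou_loss(p: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    iou, rho2, cw, ch = iou_terms(p, t)
    wp, hp = p[2] - p[0], p[3] - p[1]
    wt, ht = t[2] - t[0], t[3] - t[1]
    return (1 - iou
            + rho2 / (cw ** 2 + ch ** 2)   # center-distance penalty
            + (wp - wt) ** 2 / cw ** 2     # width penalty
            + (hp - ht) ** 2 / ch ** 2)    # height penalty

pred = torch.tensor([10.0, 10.0, 50.0, 90.0])  # made-up predicted box
gt = torch.tensor([12.0, 8.0, 52.0, 88.0])     # made-up ground-truth box
print(eiou_loss(pred, gt))
```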

2.5. Experiments

2.5.1. ESD-YOLOv9 Model Experimentation

In this study, the SENetV2 attention mechanism was added to the ELAN-2 block layers of the neck and head networks; it retains more of the original information about the different maturity levels of the banana, enhancing the feature representation capability. Meanwhile, we replaced the downsampling convolutional layers in YOLOv9c with DualConv group convolution layers, which slightly improved accuracy compared with the initial model; the DualConv group convolution improves detection accuracy while reducing computational cost. In addition, we replaced the CIoU loss function of the original model with the EIoU loss function. We validated the performance of the ESD-YOLOv9 model in detecting banana ripeness through extensive experiments. We first developed the experimental model through a series of ablation studies, then trained it on the 8092 banana images of different maturity stages in the training set and tested it on the 2023 test-set images. From these experiments, we obtained the precision, recall, mAP50, mAP50-95, and F1 indicators for the four maturity levels to reflect the performance of the ESD-YOLOv9 model.

2.5.2. Ablation Study of the Improved ESD-YOLOv9

To assess the impact of the enhanced ESD-YOLOv9 algorithm on the model’s performance in detecting banana ripeness, we systematically compared the improved model against the original model, evaluating the contribution of each enhancement. In the ablation experiment, we introduced the SENetV2 attention mechanism, DualConv group convolution, and EIoU loss function on the basic model of YOLOv9c, compared each improvement pairwise, and verified the performance of the ESD-YOLOv9 model by comparing it with the initial model.

2.5.3. Comparative Analysis of Different Target Detection Networks

To qualitatively assess the detection outcomes of the enhanced ESD-YOLOv9 model, we compared its performance on the banana image test set against that of RT-DETR, Faster-RCNN, SSD, and prevalent models in the YOLO family, including YOLOv5, YOLOv7, YOLOv8, YOLOv10, and the latest YOLOv11. The effectiveness of our proposed method is demonstrated by comparison with this variety of object detection models.

3. Model Evaluation Metrics and Experimental Parameter Indicators

3.1. Model Evaluation Metrics

Methods for determining banana ripeness should take into account both detection accuracy and model performance. The model's detection accuracy was evaluated using precision, recall, and the F1 score. The mean average precision at an IoU threshold of 0.5 (mAP50) and averaged over IoU thresholds of 0.5-0.95 (mAP50-95) was chosen as the evaluation metric for model performance. In addition, we used computational cost, measured in GFLOPs, as one of the criteria for evaluating model performance. The formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%,$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%,$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \times 100\%,$$
$$AP = \frac{\sum_{1}^{K} \left(P \times R\right)}{K},$$
$$mAP = \frac{\sum_{1}^{K} AP}{K} \times 100\%,$$
where TP, FP, and FN denote the counts of true positive, false positive, and false negative detections, respectively. Precision (P) represents the accuracy of the model's detections. Recall (R) denotes the proportion of real targets correctly identified by the model. The F1 score is a composite measure that integrates precision and recall, serving as a critical indicator of model performance. AP refers to the average precision for an individual class, while mAP is the mean average precision across all K classes.
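A toy example of these formulas from hypothetical per-class counts follows; the counts are invented for illustration and are not the paper's results.

```python
# Toy sketch of precision, recall, and F1 from per-class TP/FP/FN counts.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

counts = {"unripe": (81, 19, 0), "ripe": (97, 1, 2)}  # hypothetical counts
for cls, (tp, fp, fn) in counts.items():
    p, r, f1 = prf1(tp, fp, fn)
    print(f"{cls}: P={p:.1%} R={r:.1%} F1={f1:.1%}")
```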

3.2. Experimental Environment Configuration and Network Parameters

The training and validation of the models in this study were conducted on the AutoDL arithmetic cloud server rental platform, running Windows 11. The device was configured with 12 vCPUs (Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz) and an NVIDIA GeForce RTX 3090 GPU. Python and PyTorch were used as the programming language and training framework, respectively, to train and improve the YOLOv9 model. The experimental conditions are shown in Table 1.
During training, pre-trained weights were not used because the network structure of the model had changed. SGD was used as the optimizer for weight updating, with the weight decay coefficient set to 0.0005, the learning rates set to lr0 = 0.01 and lrf = 0.01, and the momentum parameter set to 0.937. A warm-up phase was applied at the start of training, during which the learning rate was updated by one-dimensional linear interpolation; after warm-up, the learning rate followed a cosine annealing schedule. Table 2 shows the detailed hyperparameter settings.
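A sketch of this schedule in PyTorch is shown below: SGD with momentum 0.937 and weight decay 0.0005, a linear warm-up, then cosine annealing toward lrf. The warm-up length and the placeholder module are assumptions for illustration.

```python
# Hedged sketch of the optimizer and warm-up + cosine-annealing schedule.
import math
import torch

model = torch.nn.Conv2d(3, 16, 3)              # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)

epochs, warmup = 120, 3                        # warm-up length is an assumption
lrf = 0.01                                     # final LR as a fraction of lr0

def lr_lambda(epoch: int) -> float:
    if epoch < warmup:                         # linear warm-up
        return (epoch + 1) / warmup
    t = (epoch - warmup) / (epochs - warmup)   # cosine annealing afterwards
    return lrf + (1 - lrf) * 0.5 * (1 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(epochs):
    # ... one training epoch would run here ...
    scheduler.step()
```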

4. Results and Discussion

4.1. Result Analysis

4.1.1. Experimental Results of the ESD-YOLOv9 Model

The performance of the ESD-YOLOv9 model in detecting unripe, ripe, overripe, and rotten fruits is presented in Table 3. The table shows that the model achieved a precision of 94%, a recall of 98.8%, a mAP50 of 98.9%, a mAP50-95 of 95.9%, and an F1 score of 96.3%.
The improved model further integrates the feature and semantic information of bananas in various environments, enabling it to accurately extract fine-grained features of the banana surface. This allows for precise detection of bananas at different ripeness levels, as shown in Figure 6, which indicates that the model can accurately and efficiently recognize bananas of different ripeness levels under various complex backgrounds and external interference. In conclusion, the refined ESD-YOLOv9 model demonstrates high accuracy in detecting banana ripeness levels and robust performance against various challenges, including different angles, injected noise, lighting variations, and image distortions.

4.1.2. Results of Ablation Experiments

To verify the effectiveness of each improved module in detecting banana ripeness under complex environments, YOLOv9c was used as the base model for ablation experiments. The primary evaluation metrics were mAP50 and mAP50-95, with parameters serving as auxiliary references. Table 4 presents the results, where ✓ denotes the inclusion of a module, and × denotes its exclusion.
The ablation experiment results are shown in Table 5, with “-” denoting that no modifications were made to the original structure.
As the test results in the table above show, with the simultaneous introduction of the DualConv module and the SENetV2 module and the adoption of the EIoU loss function, the detection performance of the network improved the most: mAP50 increased by 0.7% and mAP50-95 by 1.2%, while the computational cost was reduced by 3.3 GFLOPs.

4.1.3. Comparison Results of Different Object Detection Models

As shown in Table 6, the model proposed in this paper attains the highest mAP50 and mAP50-95 values; the mAP50, mAP50-95, and F1 scores of the improved ESD-YOLOv9 model are 98.9%, 95.5%, and 96.3%, respectively. These results indicate the effectiveness of the enhancement method introduced in this study.
To compare the models more intuitively, we plotted the mAP50-95 of each model during training, as shown in Figure 7. The figure illustrates that the model proposed in this paper achieves the best overall efficacy over the 120 training epochs. All models converge to a fitted state at around 80 epochs, with our model converging the fastest. Given the outstanding performance of the YOLO series algorithms in object detection, we have zoomed in on the data after epoch 80 to enhance the clarity of the comparison; the zoomed region is indicated by the arrow.
In addition, the P-curve, R-curve, P-R curve, and F1-curve of the improved ESD-YOLOv9 model are shown in Figure 8. The highest mAP50 was observed for the sample-rich ripe banana category, at 99.4%; the overripe category followed at 99.2%; the rotten category achieved 98.8%; and the unripe category had the lowest mAP50.

4.2. Discussion

In this paper, we use an improved YOLOv9c detection model for the accurate detection and identification of banana fruits at different stages of ripeness in complex scenarios shaped by the multiple factors at play during banana picking and selling. From the ablation experiments, we know that introducing the DualConv group convolution into the network yielded a slight improvement in test performance, with a 0.5% improvement in mAP50 and a 0.3% improvement in mAP50-95, as well as a reduction in the model's computational cost of 1.4 GFLOPs. With the introduction of SENetV2, the detection performance improved by 0.3% for mAP50 and 0.6% for mAP50-95, with a small increase in computational cost. Adding an attention mechanism allows the model to focus on key object features, often at increased computational cost; however, SENetV2 achieves improved detection performance with minimal computational overhead compared to other mechanisms. With the simultaneous introduction of the DualConv group convolution and the SENetV2 attention mechanism, the detection performance improved further, with a 0.6% increase in mAP50 and a 0.8% increase in mAP50-95, while the computational cost decreased by 1.1 GFLOPs. In addition, replacing the loss function of the original YOLOv9 with EIoU resulted in a 0.2% increase in mAP50 and a 0.4% increase in mAP50-95. Compared with the latest state-of-the-art models, the ESD-YOLOv9 model achieved mAP50 and mAP50-95 values of 98.9% and 95.5%, respectively. These results demonstrate that our proposed detection method is effective for accurately detecting banana ripeness in complex scenarios.
Although the improved network model achieves good results in terms of accuracy, it still has certain limitations. Our research relies on hardware with moderate computing power, which ensures strong computational efficiency; however, further analysis is needed to evaluate performance on lower-powered hardware, which will be a focus of future research. In terms of banana categories, this study involves only two banana varieties. Considering the market demand for diverse varieties, we will extend our work to additional varieties to meet the demand for accurate ripeness identification across different types of bananas. In terms of the model itself, compared with YOLOv9c, the size of the improved model has increased by 1.8 M and the number of parameters has increased by 3.556 M. Considering practical constraints, enhancing the accuracy of a target detection model inevitably increases the number of parameters and decreases the FPS, limiting real-time processing capability. Moreover, since our research was completed on a cloud server, device compatibility must also be considered in actual deployment. Future research will therefore address these limitations, focusing on improving hardware performance, accommodating diverse banana varieties, and optimizing real-world deployment to advance sustainable agricultural practices.

5. Conclusions

In this study, we propose an improved YOLOv9c network model for accurately detecting and recognizing banana fruits at different stages of ripeness in complex scenarios shaped by the multiple factors at play during banana picking and selling. Based on the YOLOv9c target detection network, the SENetV2 attention mechanism is introduced to focus on the surface features of bananas in different scenarios, the DualConv group convolution is introduced to replace the model's downsampling Conv layers and strengthen its feature processing capability, and the EIoU loss function is introduced to improve bounding box regression and optimize bounding box alignment, further raising the detection accuracy of the network model. Compared to the original YOLOv9c network, the overall detection performance of the improved model is effectively enhanced. These results can provide a valuable reference for the accurate detection of fruit ripeness in picking gardens and supermarkets.

Author Contributions

Conceptualization, G.W. and Y.G.; methodology, Y.G. and F.X.; software, Y.G. and W.S.; validation, Y.H., W.S. and F.X.; formal analysis, Y.G.; investigation, Y.G.; data curation, Y.G., Y.H. and Q.L.; writing—original draft preparation, Y.G.; writing—review and editing, G.W.; supervision, G.W.; funding acquisition, G.W. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shandong Province Science and Technology-based Small and Medium-sized Enterprises Innovation Capacity Enhancement Project [2024TSGC0158].

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

We thank the editors and the anonymous reviewers for their valuable comments and suggestions. All individuals acknowledged here have consented to this acknowledgment.

Conflicts of Interest

Author Qiang Liu was employed by the company Shandong Xinhuaan Information Technology Co., Ltd. The authors declare that this study received funding from the Shandong Province Science and Technology-based Small and Medium-sized Enterprises Innovation Capacity Enhancement Project. The funder had the following involvement with the study: data curation.

References

  1. Ashokkumar, K.; Elayabalan, S.; Shobana, V.G.; Sivakumar, P.; Pandiyan, M. Nutritional value of cultivars of Banana (Musa spp.) and its future prospects. J. Pharmacogn. Phytochem. 2018, 7, 2972–2977.
  2. Bebber, D.P. The long road to a sustainable banana trade. Plants People Planet 2023, 5, 662–671.
  3. Sanaeifar, A.; Bakhshipour, A.; De La Guardia, M. Prediction of banana quality indices from color features using support vector regression. Talanta 2016, 148, 54–61.
  4. Santoyo-Mora, M.; Sancen-Plaza, A.; Espinosa-Calderon, A.; Barranco-Gutierrez, A.I.; Prado-Olivarez, J. Nondestructive quantification of the ripening process in banana (Musa AAB Simmonds) using multispectral imaging. J. Sens. 2019, 2019, 6742896.
  5. Watharkar, R.B.; Chakraborty, S.; Srivastav, P.P.; Srivastava, B. Physicochemical and mechanical properties during storage-cum maturity stages of raw harvested wild banana (Musa balbisiana, BB). J. Food Meas. Charact. 2021, 15, 3336–3349.
  6. Hernández-Sánchez, N.; Moreda, G.P.; Herrero-Langreo, A.; Melado-Herreros, Á. Assessment of internal and external quality of fruits and vegetables. In Imaging Technologies and Data Processing for Food Engineers; Springer: Berlin/Heidelberg, Germany, 2016; pp. 269–309.
  7. Bhargava, A.; Bansal, A. Fruits and vegetables quality evaluation using computer vision: A review. J. King Saud Univ.-Comput. Inf. Sci. 2021, 33, 243–257.
  8. Lv, X.; Zhang, X.; Gao, H.; He, T.; Lv, Z.; Zhangzhong, L. When Crops meet Machine Vision: A review and development framework for a low-cost nondestructive online monitoring technology in agricultural production. Agric. Commun. 2024, 2, 100029.
  9. Nisa, Y.A.; Sari, C.A.; Rachmawanto, E.H.; Yaacob, N.M. Ambon Banana Maturity Classification Based on Convolutional Neural Network (CNN). Sink. J. Dan Penelit. Tek. Inform. 2023, 7, 2568–2578.
  10. Arunima, P.L.; Gopinath, P.P.; Lekshmi, P.R.G.; Esakkimuthu, M. Digital assessment of post-harvest Nendran banana for faster grading: CNN-based ripeness classification model. Postharvest Biol. Technol. 2024, 214, 112972.
  11. Zhao, M.; You, Z.; Chen, H.; Wang, X.; Ying, Y.; Wang, Y. Integrated Fruit Ripeness Assessment System Based on an Artificial Olfactory Sensor and Deep Learning. Foods 2024, 13, 793.
  12. Wang, L.-M.; Jiang, Y. Automatic classification of banana ripeness based on deep learning. Food Mach. 2022, 38, 149–154.
  13. Wei, X.; Xie, F.; Wang, K.; Song, J.; Bai, Y. A study on Shine-Muscat grape detection at maturity based on deep learning. Sci. Rep. 2023, 13, 4587.
  14. Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2023, 35, 13895–13906.
  15. Wang, C.; Wang, C.; Wang, L.; Wang, J.; Liao, J.; Li, Y.; Lan, Y. A lightweight cherry tomato maturity real-time detection algorithm based on improved YOLOv5n. Agronomy 2023, 13, 2106.
  16. Yang, H.; Liu, Y.; Wang, S.; Qu, H.; Li, N.; Wu, J.; Yan, Y.; Zhang, H.; Wang, J.; Qiu, J. Improved apple fruit target recognition method based on YOLOv7 model. Agriculture 2023, 13, 1278.
  17. Kazama, E.H.; Tedesco, D.; Carreira, V.S.; Júnior, M.R.B.; de Oliveira, M.F.; Ferreira, F.M.; Júnior, W.M.; da Silva, R.P. Monitoring coffee fruit maturity using an enhanced convolutional neural network under different image acquisition settings. Sci. Hortic. 2024, 328, 112957.
  18. Chen, Y.; Xu, H.; Chang, P.; Huang, Y.; Zhong, F.; Jia, Q.; Chen, L.; Zhong, H.; Liu, S. CES-YOLOv8: Strawberry Maturity Detection Based on the Improved YOLOv8. Agronomy 2024, 14, 1353.
  19. Li, Z.; Jiang, X.; Shuai, L.; Zhang, B.; Yang, Y.; Mu, J. A real-time detection algorithm for sweet cherry fruit maturity based on YOLOX in the natural environment. Agronomy 2022, 12, 2482.
  20. Ploetz, R.C.; Evans, E.A. The future of global banana production. Hortic. Rev. 2015, 43, 311–352.
  21. Ray, J.D.; Subandiyah, S.; Rincon-Florez, V.A.; Prakoso, A.B.; Mudita, I.W.; Carvalhais, L.C.; Markus, J.E.R.; O’Dwyer, C.A.; Drenth, A. Geographic expansion of banana blood disease in Southeast Asia. Plant Dis. 2021, 105, 2792–2800.
  22. Arvanitoyannis, I.S.; Mavromatis, A. Banana cultivars, cultivation practices, and physicochemical properties. Crit. Rev. Food Sci. Nutr. 2009, 49, 113–135.
  23. Ferdaus, M.H.; Prito, R.H.; Rasel, A.A.S.; Ahmed, M.; Saykot, M.J.H.; Shanta, S.S.; Akter, S.; Das, A.C.; Islam, M.M.; Hasan, M.; et al. BananaImageBD: A Comprehensive Banana Image Dataset for Classification of Banana Varieties and Detection of Ripeness Stages in Bangladesh. Data Brief 2024, 58, 111239.
  24. Li, H.; Rajbahadur, G.K.; Lin, D.; Bezemer, C.-P.; Jiang, Z.M. Keeping Deep Learning Models in Check: A History-Based Approach to Mitigate Overfitting. IEEE Access 2024, 12, 70676–70689.
  25. He, K.; Gkioxari, G.; Dollár, P. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  26. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  28. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  29. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  30. Redmon, J. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  31. Lin, T. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
  32. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
  33. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
  34. Narayanan, M. SENetV2: Aggregated dense layer for channelwise and global representations. arXiv 2023, arXiv:2311.10807.
  35. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual Convolutional Kernels for Lightweight Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9528–9535.
  36. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
Figure 1. Pictures of bananas at different levels of ripeness.
Figure 2. Picture of a banana after data enhancement.
Figure 3. Improved YOLOv9c structure diagram.
Figure 4. SENetV2 structure diagram.
Figure 5. DualConv group convolution schematic.
Figure 6. Model detection sample image.
Figure 7. Comparison of ESD-YOLOv9 with different object detection models.
Figure 8. Improved ESD-YOLOv9 training result curves.
Table 1. Experimental conditions.

| Parameters | Configuration |
| --- | --- |
| GPU | NVIDIA RTX 3090/24 GB |
| Python | 3.8 (ubuntu20.04) |
| PyTorch | 2.0.0 |
| Operating system | Windows 11 |
| Accelerated environment | CUDA 11.8.0 |
Table 2. Detailed hyperparameters used in the experiment.

| Item | Value |
| --- | --- |
| Optimizer | SGD |
| Momentum | 0.937 |
| Batch size | 16 |
| Weight decay | 0.0005 |
| Training epochs | 120 |
| Initial learning rate | 0.01 |
Table 3. Test results for different maturity levels.

| Maturity Level | Precision | Recall | mAP50 | mAP50-95 | F1 Score |
| --- | --- | --- | --- | --- | --- |
| unripe | 80.9% | 100% | 98.1% | 95.4% | 89.4% |
| ripe | 99.4% | 97.8% | 99.4% | 94.6% | 98.6% |
| overripe | 97.1% | 100% | 99.2% | 97.6% | 98.5% |
| rotten | 98.7% | 97.4% | 98.8% | 94.2% | 98.0% |
| Average | 94.0% | 98.8% | 98.9% | 95.9% | 96.3% |
Table 4. Improvement results of each module on YOLOv9c.

| No. | ECA | SENetV2 | SCConv | DualConv | EIoU | Precision | Recall | mAP50 | mAP50-95 | Params/M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ✓ | × | × | × | × | 96.2% | 99.8% | 97.2% | 87.7% | 51.006 |
| 2 | × | ✓ | × | × | × | 94.0% | 98.7% | 98.5% | 95.3% | 51.350 |
| 3 | × | × | ✓ | × | × | 97.3% | 92.6% | 97.9% | 87.9% | 49.156 |
| 4 | × | × | × | ✓ | × | 93.9% | 98.3% | 98.7% | 95.0% | 51.594 |
| 5 | × | × | × | × | ✓ | 93.8% | 98.7% | 98.4% | 95.1% | 51.006 |
| 6 | ✓ | × | ✓ | × | ✓ | 96.1% | 94.4% | 97.4% | 87.3% | 49.157 |
| 7 | ✓ | × | × | ✓ | ✓ | 95.5% | 98.5% | 98.0% | 90.3% | 51.594 |
| 8 | ✓ | ✓ | × | × | ✓ | 94.2% | 98.7% | 98.0% | 89.6% | 51.424 |
| 9 | × | ✓ | × | ✓ | ✓ | 94.0% | 98.8% | 98.9% | 95.9% | 51.938 |
Table 5. Results of the ablation experiments.

| Model | Attention Mechanism | Conv | Loss Function | Precision | Recall | mAP50 | mAP50-95 | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv9c | - | - | - | 93.8% | 99.2% | 98.2% | 94.7% | 238.9 |
| YOLOv9c | - | - | EIoU | 93.8% | 98.7% | 98.4% | 95.1% | 238.9 |
| YOLOv9c | - | DualConv | - | 93.9% | 98.3% | 98.7% | 95.0% | 237.5 |
| YOLOv9c | SENetV2 | - | - | 94.0% | 98.7% | 98.5% | 95.3% | 239.2 |
| YOLOv9c | - | DualConv | EIoU | 94.0% | 99.2% | 98.8% | 95.6% | 237.5 |
| YOLOv9c | SENetV2 | - | EIoU | 94.3% | 99.0% | 98.6% | 95.4% | 239.2 |
| YOLOv9c | SENetV2 | DualConv | - | 94.2% | 99.0% | 98.8% | 95.5% | 237.8 |
| YOLOv9c | SENetV2 | DualConv | EIoU | 94.0% | 98.8% | 98.9% | 95.9% | 235.6 |
Table 6. Comparison of training results of different models.

| Model | AP (Unripe) | AP (Ripe) | AP (Overripe) | AP (Rotten) | mAP50 | mAP50-95 | Precision | Recall | Param/M | FPS | F1 Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5X | 93.0% | 99.4% | 98.5% | 98.5% | 97.3% | 92.3% | 91.9% | 98.9% | 86.193 | 19.54 | 95.3% |
| YOLOv7 | 94.9% | 99.5% | 98.0% | 98.2% | 97.6% | 91.3% | 91.5% | 98.3% | 36.492 | 40.22 | 94.8% |
| YOLOv8n | 90.2% | 99.5% | 99.5% | 99.3% | 97.1% | 93.9% | 94.3% | 97.5% | 3.006 | 162.89 | 95.9% |
| YOLOv10n | 94.0% | 99.4% | 99.3% | 99.1% | 97.9% | 93.4% | 93.9% | 97.0% | 2.962 | 172.34 | 95.4% |
| RT-DETR | 97.5% | 98.5% | 95.9% | 97.0% | 97.2% | 91.2% | 93.7% | 96.9% | 31.991 | 36.36 | 95.3% |
| Faster-RCNN | 75.9% | 66.6% | 73.5% | 63.1% | 69.8% | 64.7% | 42.5% | 82.1% | 138.43 | 11.54 | 56.0% |
| SSD | 80.1% | 74.8% | 79.4% | 84.7% | 86.2% | 82.9% | 86.2% | 37.4% | 146.52 | 9.94 | 52.2% |
| YOLOv11 | 95.7% | 99.4% | 99.2% | 98.9% | 98.3% | 95.4% | 93.5% | 99.0% | 2.582 | 120.41 | 96.1% |
| Ours | 98.1% | 99.4% | 99.2% | 98.8% | 98.9% | 95.5% | 94.0% | 98.8% | 51.639 | 44.98 | 96.3% |