Article

Research on the Strawberry Recognition Algorithm Based on Deep Learning

1 School of Mechanical and Automotive Engineering, Liaocheng University, Liaocheng 252000, China
2 School of Mechanical and Electronic Engineering, Shandong Jianzhu University, Jinan 250024, China
3 Institute of Information Science and Technology, Hunan Normal University, Changsha 410081, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(20), 11298; https://doi.org/10.3390/app132011298
Submission received: 11 September 2023 / Revised: 10 October 2023 / Accepted: 11 October 2023 / Published: 14 October 2023


Featured Application

The aim is to distinguish maturity, reduce labor costs, and reduce false or missed detections caused by strawberry leaf occlusion.

Abstract

In view of the time-consuming and laborious manual picking and sorting of strawberries, the direct impact of image recognition accuracy on automatic picking, and the rapid development of deep learning (DL), a Faster Regions with Convolutional Neural Network features (Faster R-CNN) strawberry recognition method that combines Mixup data augmentation, a ResNet50 (Residual Network) backbone feature extraction network, and the Soft-NMS (Non-Maximum Suppression) algorithm, named MRS Faster R-CNN, is proposed. In this paper, the transfer-learning backbone feature extraction networks VGG16 (Visual Geometry Group) and ResNet50 are compared, and the superior ResNet50 is selected as the backbone network of MRS Faster R-CNN. The Mixup image-fusion data augmentation method is used to improve the learning and generalization ability of the model. Redundant bounding boxes (bboxes) are removed through Soft-NMS to obtain the best region proposals. A freezing phase is added to the training process, effectively reducing video memory occupation and shortening the training time. After experimental verification, the optimized model improved the AP (Average Precision) values for mature and immature strawberries by 0.26% and 5.34%, respectively, and the P (Precision) values by 0.81% and 6.34%, respectively, compared to the original model (R Faster R-CNN). Therefore, the MRS Faster R-CNN model proposed in this paper has great potential in the field of strawberry recognition and maturity classification and improves the recognition rate of small fruit and overlapping occluded fruit, thus providing an excellent solution for mechanized picking and sorting.

1. Introduction

Strawberries are a common and highly nutritious fruit in the current market. Rich in carotenoids and low in fat, they are also used in beauty, skincare, and eye care, and regular consumption supplies substantial vitamin C to promote health, making them highly favored by consumers [1]. To meet growing market demand, the strawberry planting industry has shown significant growth momentum in China. As the industry has developed, labor costs have risen year by year, while deep learning (DL) technology has been widely applied in the field of image recognition. Using image recognition technology to extract information such as the shape, size, color, and defects of strawberries has gradually become precise and efficient, providing powerful technical support for replacing manual recognition and greatly reducing the cost and time of production. In the process of strawberry production and sales, strawberry image recognition technology based on DL has broad application prospects [2,3] (pp. 454–458).
Image recognition is an important CV (computer vision) technology. In the past decade, DL has become the most important AI (artificial intelligence) technology in the field of image recognition, and its development and application have achieved tremendous results [4,5,6,7,8]; it is widely used in the field of transportation, for example in license plate recognition [9] (pp. 1–10) and unmanned obstacle avoidance [10]. In agriculture, DJI and other brands of UAVs (Unmanned Aerial Vehicles) spray pesticides and fertilizer [11]. Research on fruit images, such as strawberry recognition, is exceptionally broad both domestically and internationally, providing technical solutions for future unmanned strawberry picking and classification. Zongmei Gao et al. [12] used a hyperspectral imaging system (HIS) and the AlexNet Convolutional Neural Network (CNN) for strawberry fruit maturity classification and achieved good results, but the experimental conditions are demanding. Xiao et al. [13] (pp. 1–12) used a Swin Transformer model to identify the maturity of apples and pears. This method achieved good classification results, but the dataset of that experiment is not extensive enough to reflect the real growth environment of fruits. Li X et al. [14] (pp. 1072–1077) proposed a method that uses the Otsu algorithm to separate the background and target and then trains a CaffeNet model; it can effectively complete the recognition of mature strawberries. However, the Otsu algorithm segments poorly when the areas of the object and background in the image differ greatly.
Niu et al. [15] (pp. 1–10) proposed a YOLO-Plum algorithm to identify the maturity of plums. The algorithm identifies mature plums well, but the dataset is relatively simple, and the identification of small plums is insufficient. Gai et al. [16] (pp. 13,895–13,906) proposed an improved YOLO v4 algorithm for identifying cherries; its mAP (mean Average Precision) value was 15% higher than that of YOLOv4, effectively realizing the recognition of small cherries. However, its recognition of occluded cherries is poor, and densely clustered cherries are often missed. Wang Guohui et al. [17] (pp. 1–12) proposed a strawberry appearance quality identification method, ResNeXt-SVM (Support Vector Machine), based on the ResNeXt network and SVM, which balances the accuracy and efficiency of strawberry recognition. This method can meet accuracy requirements, improves detection efficiency, and has great application potential. Its disadvantage is that the superiority of the results depends on a large amount of data; when the amount of data is small, the results are not ideal.
Ridho MF et al. [18] (pp. 157–162) proposed a strawberry quality evaluation robot that utilizes a Single Shot MultiBox Detector (SSD) CNN and CV technology on a single-board computer (SBC). This method effectively distinguishes between high-quality and low-quality strawberries while maintaining a high frame rate, but its anti-occlusion performance is poor. An QL et al. [19] (pp. 124,363–124,372) proposed an improved SDNet algorithm based on the YOLOX model, which effectively improves the accuracy of strawberry detection and the spatial interaction ability of the algorithm, providing technical support for unmanned farms. However, DL-based strawberry growth-state monitoring depends on a server, so limitations remain.
Strawberry maturity classification is more complicated, and recognition results are more easily confused. Zhang YC et al. [20] (p. 106,586) proposed RTSD-Net, a new lightweight deep neural network based on YOLOv4-tiny. The network adopts fewer parameters and a simplified structure, effectively improving the efficiency of strawberry detection, and has good potential in automated strawberry picking; however, its missed detections are obvious. Image recognition plays a crucial role in the strawberry production process. For picking and classifying strawberries at different maturity stages, image recognition can provide technical means for agricultural production and effectively reduce economic losses. In strawberry image recognition, traditional methods (the Gabor algorithm, the SIFT algorithm, etc.) focus mainly on the image preprocessing and feature extraction stages, and their accuracy is affected by external factors such as lighting changes, fruit reflection, and occlusion.
To improve accuracy, the studies above applied DL visual technology, convolutional neural networks (CNNs), and other methods to strawberry image recognition. Among these, YOLO-Plum and SDNet perform best and can accurately identify small targets. However, they are single-stage detection algorithms; compared with the two-stage Faster R-CNN detection algorithm, they fall somewhat short in recognition rate. It is precisely the RPN (Region Proposal Network) in Faster R-CNN, which searches the whole image globally, that yields a higher recognition rate. Fu et al. [21] (pp. 45–50) used Faster R-CNN with the ZFNet (Zeiler and Fergus network) to recognize kiwifruit in the field. The recognition rate of this method reached 92.3%, and the recognition rate for separated fruit reached an impressive 96.7%. However, this method recognizes occluded and overlapping fruits poorly, with many missed and false detections.
Therefore, this paper designs experiments and proposes three detection algorithms (R Faster R-CNN, MVS Faster R-CNN and MRS Faster R-CNN). The performance of R Faster R-CNN and MVS Faster R-CNN in detecting small and overlapping strawberries is not ideal, and the overall detection accuracy of MVS Faster R-CNN in particular is not high. MRS Faster R-CNN is therefore proposed to solve this problem, and its performance in the experiments is ideal: strawberries that are small or overlapping and occluded are identified, and the recognition precision for mature strawberries reaches 95.12%. This large improvement in detection accuracy comes from combining Mixup data augmentation, the ResNet50 backbone feature extraction network, and the Soft-NMS algorithm in the MRS Faster R-CNN maturity recognition method. These improvements greatly reduce missed detections and better complete the task of accurately recognizing and classifying small targets and occluded objects. In China, the strawberry planting area has exceeded 139.97 thousand hectares and continues to increase; however, compared with the technology of developed countries, there is still a significant gap, and further improvements in computer technology and mechanization levels are needed [22] (pp. 395–402).
In this work, MRS Faster R-CNN successfully completed the recognition and classification of strawberries and showed good robustness. In future strawberry production, this method has significant application potential for distinguishing strawberry maturity, reducing labor costs, and solving the problem of false or missed detections caused by strawberry leaf occlusion and fruit overlap.

2. Materials and Methods

2.1. Strawberry Image Acquisition

When acquiring strawberry image data, many factors need to be considered. Shooting in different scenes may result in different image quality and effects, which can adversely affect model training and performance; therefore, attention should be paid to factors such as lighting conditions, shooting distance and angle, and image resolution. Another important consideration is obtaining a sufficient sample size for effective model training while supporting data classification and annotation for use of the dataset in training. The strawberry dataset in this article consists of 527 images, comprising photographs taken by the authors and images downloaded online, all of which were preprocessed. The downloaded images are from Kaggle and Baidu web pages. The authors' photographs were taken with a 20-megapixel Sony FDR-AX100E camera in different greenhouses in Dongchangfu District, Liaocheng City, Shandong Province, China (36.4346° N, 115.9885° E) in April. The images cover different growth stages and environmental conditions of strawberries, which helps to improve the generalization and learning ability of the model (as shown in Figure 1).

2.2. Strawberry Image Preprocessing

After obtaining strawberry image data, to improve the training effect of the DL strawberry image detection model, preprocessing is carried out on the strawberry image, including the aspects described below.

2.2.1. Image Clipping

Clipping an image removes unimportant areas and retains only the strawberry portion, which reduces the amount of training data and the impact of the background, thus improving training efficiency and precision. Strawberry images can be clipped according to their shape and size (as shown in Figure 2).

2.2.2. Image Enhancement

When acquiring strawberry images, strong lighting increases the overall brightness, while weak lighting decreases it. In particular, images taken in a greenhouse are affected by light reflected from the transparent film, which can make the image too bright. Light that is too strong or too weak can cause certain pixels in the image to have similar colors, lowering the contrast between the strawberry area and its surroundings. To address this, histogram equalization can be used to convert the original image into a corresponding grayscale image based on the cumulative distribution function. This method maps pixel values between 0 and 255 through nonlinear stretching, producing a uniform distribution of pixel values and thereby improving, to a certain extent, the contrast between the strawberry and the surrounding area. Please refer to Equation (1) for the specific mapping.
$$S_k = \sum_{j=0}^{k} \frac{n_j}{n}, \quad k \in \mathbb{N}^*$$ (1)
Here, $n_j$ represents the number of pixels in the image with grayscale value $j$, $n$ represents the total number of pixels, and $k$ indexes the grayscale levels. Grayscale conversion followed by histogram equalization improves the image quality (as shown in Figure 3).
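As a concrete illustration, the grayscale conversion and equalization step described above can be reproduced with OpenCV's built-in routines; the following is a minimal sketch with an illustrative file name, not the authors' exact preprocessing code.

```python
import cv2

img = cv2.imread("strawberry.jpg")              # hypothetical input image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # grayscale conversion
eq = cv2.equalizeHist(gray)                     # Eq. (1): CDF-based remapping onto 0-255
cv2.imwrite("strawberry_eq.jpg", eq)
```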

2.2.3. Image Denoising

Due to differences in image resolution, low-resolution images may exhibit more noise. This article uses Gaussian filtering to denoise images: a 3 × 3 Gaussian kernel is convolved with the image to smooth it and reduce noise. Because two-dimensional images are processed, a two-dimensional Gaussian distribution function is needed to define the values of the convolutional kernel (please refer to Equation (2)):
$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$ (2)
Assuming the size of the convolutional kernel is (2k + 1) × (2k + 1), the calculation of the values of each element of the convolutional kernel is shown in Equation (3):
$$H_{i,j} = \frac{1}{2\pi\sigma^2} e^{-\frac{(i - k - 1)^2 + (j - k - 1)^2}{2\sigma^2}}$$ (3)
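For illustration, the kernel of Equation (3) can be built directly and applied with OpenCV. This sketch uses 0-based indices (so the center offset is (i − k, j − k) rather than (i − k − 1, j − k − 1)), assumes a value of σ, and normalizes the kernel to unit sum so overall brightness is preserved.

```python
import numpy as np
import cv2

def gaussian_kernel(k: int, sigma: float) -> np.ndarray:
    # (2k + 1) x (2k + 1) grid of offsets from the kernel center
    i, j = np.mgrid[0:2 * k + 1, 0:2 * k + 1]
    kernel = np.exp(-((i - k) ** 2 + (j - k) ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return kernel / kernel.sum()    # normalize so overall brightness is preserved

img = cv2.imread("strawberry.jpg")                          # hypothetical path
denoised = cv2.filter2D(img, -1, gaussian_kernel(1, 1.0))   # 3 x 3 Gaussian smoothing
```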

2.2.4. Image Scaling and Data Augmentation

Scaling the images to a uniform size (600 × 600) can give the training data a consistent size and facilitate the training of the Faster R-CNN model.
Data augmentation generates more training data and improves the generalization ability of DL models by performing operations such as rotation, translation, flipping, and Mixup image fusion on the original images [23] (as shown in Figure 4). The mathematical expression for Mixup data augmentation is shown in Equations (4) and (5).
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$$ (4)
$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$$ (5)
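Here, $(x_i, y_i)$ and $(x_j, y_j)$ are two training samples and $\lambda$ is the mixing coefficient; per the Mixup paper [23], $\lambda$ is drawn from a Beta(α, α) distribution. The sketch below is a minimal classification-style version (α = 0.2 is an assumption, and detection targets additionally require merging the box annotations of the two blended images):

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Blend a batch with a shuffled copy of itself per Eqs. (4) and (5)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
    perm = torch.randperm(x.size(0))        # pair each sample with a random partner
    x_mix = lam * x + (1 - lam) * x[perm]   # Eq. (4): blended image
    y_mix = lam * y + (1 - lam) * y[perm]   # Eq. (5): blended (one-hot) label
    return x_mix, y_mix
```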

2.3. Introduction of Faster R-CNN Network Structure and Optimization

Faster R-CNN [24,25,26,27] is the latest network in the R-CNN two-stage object detection series and achieves significant improvements in detection speed and precision. Compared to single-stage detectors, its algorithm is more complex and slower, but its detection precision is higher. The main steps of the Faster R-CNN used in this article include image feature extraction, the RPN (Region Proposal Network), ROI (Region of Interest) pooling, and classification and regression. First, the input image of size P × Q is resized so that its short edge is fixed to 600. Then, the backbone feature extraction network of Faster R-CNN uses a transfer-learning network (such as ResNet50 or VGG16) to obtain a shared feature layer (feature map). The shared feature layer then passes into the RPN and ROI pooling layers; the RPN, incorporating NMS, automatically learns to generate high-quality region proposals, avoiding the many redundant regions of traditional methods and significantly improving detection efficiency. Finally, in the classification and bounding-box regression stages, Faster R-CNN uses the extracted features to perform object classification and fine adjustment of bounding-box positions through fully connected layers. This stage not only improves the precision of object detection but also enables the model to predict object positions more accurately. Overall, Faster R-CNN optimizes both precision and speed by integrating the region proposal network into an end-to-end learning network, demonstrating excellent performance (as shown in Figure 5a). Figure 5b shows the MRS Faster R-CNN structural framework: the dataset is expanded and improved, the ResNet50 backbone network is used, and a Soft-NMS module is added to the RPN structure.
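The paper's pipeline is a customized implementation, but torchvision ships a publicly available Faster R-CNN with a ResNet50 backbone that follows the same backbone → RPN → ROI pooling → classification/regression structure (the stock torchvision version adds an FPN, so it is a rough stand-in rather than the authors' exact model). A sketch of adapting it to this task's class count (background, mature, immature):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Pretrained two-stage detector: backbone -> RPN -> ROI pooling -> box head
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap in a box predictor sized for this task: background + mature + immature
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=3)
```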

2.3.1. ResNet50 Backbone Feature Extraction Network

ResNet50 [28] (pp. 770–778) was proposed by He K., Zhang X., Ren S., and Sun J. It won first place in the ImageNet classification, ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation tracks of the 2015 ILSVRC and COCO competitions and has since risen to fame. Neural networks are not necessarily better with more layers: in traditional deep networks, increasing depth causes vanishing and exploding gradients and a rapidly increasing error rate. ResNet's structural highlight is that it solves this problem, allowing the error rate to fall as depth increases and effectively improving the detection precision and convergence speed of very deep networks.
The residual module can effectively learn the identity mapping, which is achieved through "shortcut connections": if the identity mapping is the optimal model, the weights in the weight layers can simply be set to 0 (Figure 6). Assume the input is $X$ and the mapping learned by the weight layers is $F(X)$. The output of the residual block is $y$ (Equation (6)); because the model learns the residual $F(X, \{W_i\})$ instead of $y$ directly, the relationship can be rewritten as Equation (7). If the number of channels of $X$ differs from that of $F(X)$, the two cannot be superimposed directly, and a 1 × 1 convolutional kernel is needed to raise the dimension (Figure 7). In a network with the same number of layers, the shortcut connects the input directly to the output, reducing a large number of parameters and effectively solving the degradation problem of deep neural networks.
$$y = F(X, \{W_i\}) + X$$ (6)
$$F(X, \{W_i\}) = y - X$$ (7)
The ResNet50 pipeline comprises a 7 × 7 convolutional layer, a 3 × 3 max-pooling layer, and four stages, followed by average pooling, a fully connected layer, and a Softmax normalization operation. The four stages consist of 3, 4, 6, and 3 blocks, respectively, each block containing three convolutional kernels of sizes 1 × 1, 3 × 3, and 1 × 1 (Table 1). A BN (Batch Normalization) layer precedes each ReLU activation function, which effectively accelerates network convergence. This paper adopts 600 × 600 input images, with conv1 to conv4 used for feature extraction and conv4 through the average pool used for classification. Figure 8 shows the ResNet50 flowchart, where dashed lines indicate shortcuts whose channel counts require up-sampling normalization and solid lines indicate those that do not.
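A minimal PyTorch sketch of one bottleneck block consistent with Table 1 and Equation (6); the channel widths shown are those of the first Conv2 block, and the 1 × 1 shortcut convolution corresponds to the dashed lines in Figure 8:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.branch = nn.Sequential(              # F(X, {W_i}): 1x1 -> 3x3 -> 1x1, BN before each ReLU
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Identity shortcut, or a 1x1 convolution when channel counts/stride differ
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1 else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + self.shortcut(x))   # y = F(X, {W_i}) + X

block = Bottleneck(64, 64, 256)   # first block of Conv2 in Table 1
```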

2.3.2. VGG16 Backbone Feature Extraction Network

VGG16 [29] consists of five convolutional blocks, three fully connected layers, and a Softmax output layer. Maximum pooling layers serve as the boundaries between blocks, and the ReLU (Rectified Linear Unit) function is used as the activation function for all hidden layers. VGG16 replaces layers with a large convolutional kernel by stacks of 3 × 3 convolutional kernels, which effectively reduces parameters, improves network operation efficiency, and increases the nonlinear fitting ability of the network (Figure 9).

2.3.3. Region Proposal Network

The main function of the RPN is to place anchor boxes [7,30,31] and determine which objects the feature map contains, laying the foundation for more precise subsequent classification and regression and improving detection precision. The anchor boxes in this article have three aspect ratios (1:2, 1:1 and 2:1) and three sizes (128, 256 and 512), resulting in nine anchor boxes (Figure 10).
The RPN is a convolutional network consisting of one 3 × 3 convolutional layer and two 1 × 1 convolutional layers. After the 3 × 3 convolution, the feature map enters the 1 × 1 convolutional layers, which output feature information of 18 and 36 channels, respectively (Figure 11). The 18 channels can be understood as 9 × 2 (9 anchor boxes, each with 2 scores used to determine whether an object is truly contained), and the 36 channels as 9 × 4 (9 anchor boxes, each with 4 values used to adjust the box).
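For illustration, the nine base anchor shapes (3 sizes × 3 aspect ratios) described above can be generated as follows; the RPN then shifts these origin-centered boxes to every feature-map cell:

```python
import numpy as np

sizes = [128, 256, 512]            # anchor sizes from the text
ratios = [0.5, 1.0, 2.0]           # width:height ratios of 1:2, 1:1, 2:1
anchors = []
for s in sizes:
    for r in ratios:
        w, h = s * np.sqrt(r), s / np.sqrt(r)      # keeps each anchor's area near s^2
        anchors.append([-w / 2, -h / 2, w / 2, h / 2])
anchors = np.array(anchors)        # (9, 4) boxes in (x1, y1, x2, y2) form
```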

2.3.4. Soft-NMS, ROI Pooling, Classification and Regression

After RPN processing, because candidate regions may overlap, the MRS Faster R-CNN proposed in this article uses the Soft-NMS algorithm [32,33] to preserve the most prominent feature regions in the image and remove redundant region proposals. The Soft-NMS algorithm effectively solves the problem in traditional NMS whereby objects with low scores are eliminated owing to high object overlap. With Soft-NMS, redundant information can be reduced without proposals being mistakenly eliminated due to high overlap, effectively improving the precision and efficiency of the algorithm (Figure 12 and Figure 13).
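The sketch below shows the linear-decay variant of Soft-NMS from Bodla et al. [33]: where classic NMS would delete any box whose IoU with the currently selected box exceeds the threshold, Soft-NMS only decays its confidence, so a heavily occluded strawberry keeps a (reduced) score. The thresholds are illustrative, and this is a simplified stand-in for the paper's implementation.

```python
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    scores = scores.copy()
    remaining, keep = list(range(len(scores))), []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])   # highest-scoring remaining box
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            ov = iou(boxes[best], boxes[i])
            if ov > iou_thresh:
                scores[i] *= 1.0 - ov      # decay the score instead of hard suppression
    return [i for i in keep if scores[i] >= score_thresh]
```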
To handle region proposals of different sizes, Faster R-CNN introduces an ROI (Region of Interest) pooling layer, which converts region proposals of different sizes into fixed-size feature representations for subsequent classification and bounding-box regression. The pooled ROI features are fed into the fully connected layers, and object classification and bounding-box regression are finally performed through two branches: the classification branch uses the Softmax function to output the probability of each category, while the bounding-box regression branch predicts the offset between the region proposal and the actual bounding box.
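For illustration, torchvision's ROI pooling operator performs exactly this fixed-size conversion; the feature-map shape, proposal coordinates, and scale below are assumptions chosen to match the 600 × 600 input and 38 × 38 Conv4 output discussed earlier.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 38, 38)                    # assumed shared feature map
proposals = [torch.tensor([[ 50.,  60., 220., 300.],  # (x1, y1, x2, y2) in image coords
                           [ 10.,  10., 120., 140.]])]
pooled = roi_pool(feat, proposals, output_size=(7, 7), spatial_scale=38 / 600)
print(pooled.shape)                                   # torch.Size([2, 256, 7, 7])
```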

2.4. Experimental Description and Evaluation Methods

The experiments used an Intel i7-12700H processor, 32 GB of memory, an NVIDIA RTX 3090 graphics card with 24 GB of video memory, CUDA 11.6, Python 3.9, the Windows 11 operating system, the PyCharm IDE, and the PyTorch 2.0 DL framework.

2.4.1. Production of Data Sets

The experimental data comprised 527 images in total, obtained with the Sony FDR-AX100E camera or downloaded online and then preprocessed. First, the strawberry object dataset must be annotated (Figure 14). The image annotation software used in this article is LabelImg, which creates a VOC (Visual Object Classes)-format dataset. Targets are labeled at the corresponding positions, with two strawberry labels: caomei, representing mature strawberries, and NO, representing immature strawberries. Next, XML (Extensible Markup Language) and txt (text) files corresponding to the image data are generated, and the dataset is divided into a combined training and validation set (474 images) and a test set (53 images) at a 9:1 ratio. The data are then fed into the network for training.
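A sketch of the 9:1 split described above (474 training/validation images and 53 test images); the directory layout and random seed are illustrative assumptions, not the authors' script.

```python
import random
from pathlib import Path

images = sorted(Path("VOCdevkit/VOC2007/JPEGImages").glob("*.jpg"))  # assumed VOC layout
random.seed(0)                        # fixed seed for a reproducible split
random.shuffle(images)
n_test = round(len(images) * 0.1)     # 53 of 527 images
test_set, trainval_set = images[:n_test], images[n_test:]
```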

2.4.2. Evaluation Analysis and Experimental Design

To evaluate the performance of the MRS Faster R-CNN model in strawberry recognition and classification, this article uses four common evaluation indicators in the field of object detection: Precision (P), Recall (R), Average Precision (AP) and $F_1$. The calculation equations for these indicators are shown in Equations (8)–(11).
$$P = \frac{TP}{TP + FP}$$ (8)
$$R = \frac{TP}{TP + FN}$$ (9)
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$ (10)
$$F_1 = \frac{2PR}{P + R} = \frac{2TP}{2TP + FP + FN}$$ (11)
In the equations, TP (True Positive) is the number of positive samples correctly detected as positive; FP (False Positive) is the number of negative samples mistakenly detected as positive; TN (True Negative) is the number of negative samples correctly detected as negative; and FN (False Negative) is the number of positive samples mistakenly detected as negative. Precision represents the proportion of correctly predicted positive samples among all predicted positive samples: the higher the value, the more accurate the recognition and the better the model. Recall represents the proportion of correctly predicted positive samples among all actual positive samples. The AP value is the area enclosed by the PR (Precision–Recall) curve, which measures the performance of the learned model in each category. The $F_1$ value is the weighted harmonic mean of precision and recall; the higher the $F_1$ value, the better the performance of the network.
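A small worked example of Equations (8), (9) and (11); the counts are made-up numbers for illustration only.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)               # Eq. (8)
    recall = tp / (tp + fn)                  # Eq. (9)
    f1 = 2 * tp / (2 * tp + fp + fn)         # Eq. (11)
    return precision, recall, f1

print(detection_metrics(tp=90, fp=5, fn=10))  # approx. (0.9474, 0.9000, 0.9231)
```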
A total of three experiments were conducted and compared; Experiment 3 used the MRS Faster R-CNN method proposed in this article. To account for performance differences between devices, a freeze-training stage was added: freezing the model backbone (whose feature extraction weights do not change) greatly reduces the video memory occupied, improves running speed, and allows the experiments to be completed more easily. Experiment 1, named R Faster R-CNN, uses ResNet50 as the backbone feature extraction network and NMS to eliminate surplus region proposals, and performs translation, rotation, denoising and clipping operations on the dataset; the hyperparameters in Table 2 were used. Experiment 2, named MVS Faster R-CNN, uses VGG16 as the backbone feature extraction network and Soft-NMS to eliminate surplus region proposals, with the Mixup image-fusion data augmentation method added; the hyperparameters in Table 3 were used. Experiment 3 uses ResNet50 as the backbone feature extraction network and Soft-NMS to eliminate surplus region proposals, with the Mixup image-fusion data augmentation method added; the hyperparameters in Table 4 were used.

3. Results

As indicated by the yellow arrows (which highlight the comparison targets of the experiment) in the predicted images of R Faster R-CNN and MRS Faster R-CNN: in Figure 15a, NMS directly removes the prediction box of the occluded strawberry, while in Figure 15e, Soft-NMS retains the prediction box of the occluded strawberry intact. MRS Faster R-CNN (the method proposed in this article) is more accurate in recognizing small, blurred and occluded objects, thanks to the application of the Mixup image-fusion augmentation algorithm and the Soft-NMS algorithm. The accuracy of the prediction boxes of MVS Faster R-CNN in Figure 15e–h is lower than that of MRS Faster R-CNN in Figure 15i–l, indicating that the ResNet50 backbone feature extraction network is more suitable for object recognition than the VGG16 backbone and has better learning ability. R Faster R-CNN and MVS Faster R-CNN have offsetting advantages and disadvantages: from the prediction boxes indicated by the yellow arrows, MVS Faster R-CNN recognizes occlusions better, while R Faster R-CNN recognizes small objects relatively better (Figure 15).
Comparing the experimental result parameters in Figure 16 (AP), Figure 17 (Recall), and Table 5, MRS Faster R-CNN achieves the best Precision, AP, Recall, and $F_1$ for both mature and immature strawberries. Its Precision values reach 95.12% and 88.79%, and its AP values reach 94.36% and 84.00%, for mature and immature strawberries, respectively. Compared to R Faster R-CNN without Mixup image fusion, the Precision for mature and immature strawberries increased by 0.81% and 6.34%, respectively, and the AP values increased by 0.26% and 5.34%, respectively. Compared to MVS Faster R-CNN using VGG16, the Precision for mature and immature strawberries was 11.61% and 16.12% higher, respectively, and the AP values were 12.61% and 15.51% higher, respectively. The Recall values for mature and immature strawberries in MRS Faster R-CNN were 95.29% and 88.17%, respectively: 1.38% and 1.9% higher than R Faster R-CNN and 11.64% and 17.76% higher than MVS Faster R-CNN. The $F_1$ values were 95.20% and 88.48%, respectively: 1.09% and 4.16% higher than R Faster R-CNN and 11.62% and 16.96% higher than MVS Faster R-CNN.
After verification, the proposed MRS Faster R-CNN model offers high precision and good robustness; it can meet production needs and provides a theoretical basis for automated strawberry picking.

4. Discussion

With the development of CV, image recognition has been widely used in agriculture. As a major producer of strawberries, China expends a great deal of labor and material resources on strawberry picking and sorting every year. Many scholars have applied image recognition algorithms such as YOLO and SSD to strawberry classification and scratch detection, aiming to reduce labor costs and improve the quality of strawberry products, and have achieved good results. However, these algorithms perform poorly when identifying small and occluded objects. To improve the mechanization level of strawberry production and reduce missed detections caused by small fruit size and occlusion during picking and sorting, this paper proposes the MRS Faster R-CNN algorithm. The method was trained and verified on a dataset of 527 strawberry images from different scenes and maturity stages, and it can be concluded objectively that it significantly improves the recognition of small targets and occluded objects.
However, the experiments also have some shortcomings. The number of experimental samples is relatively small, and the strawberry image environments are relatively simple and not fully representative; images should be taken in more environments and locations. With the added modules, model training takes longer. Compared with the single-stage YOLO and SSD algorithms in the literature, the two-stage MRS Faster R-CNN detection algorithm is theoretically more advantageous in detection accuracy but slightly inferior in computational efficiency; its high accuracy stems from its two-stage structure and the improvements proposed in this paper. The comparison of the three experiments also shows that dataset construction is key: Mixup data augmentation improves and expands the dataset and enhances the model's generalization and learning ability, ResNet50 provides strong feature extraction as the backbone network, and Soft-NMS effectively reduces missed detections.
Nevertheless, the detection speed of this method should be further improved: a recognition model should maintain high detection accuracy while also achieving a fast detection rate. This method has significant application potential in strawberry picking and quality sorting in the fruit market; in essence, it serves to improve living standards and respond to the call for smart agriculture, adding value and wealth to society and contributing to the development of smart agriculture.

5. Conclusions

The two-stage MRS Faster R-CNN detection method, comprising the powerful ResNet50 feature extraction network, Mixup's excellent data augmentation algorithm, and refined Soft-NMS processing, locates and recognizes objects more accurately and has significant application potential. On the one hand, the MRS Faster R-CNN recognition algorithm could be applied to strawberry harvesters in greenhouses or open-air strawberry plantations, helping to complete the identification and picking of mature strawberries. On the other hand, it could also be used in the quality screening machines of strawberry processing factories to effectively separate mature from immature strawberries, remove defective strawberries, and improve the quality of strawberry products. In addition, the model's ability to monitor strawberry fruit growth in greenhouses has great application potential, helping to raise the mechanization level of agriculture, improve production efficiency, and contribute economic value and benefits to society.

Author Contributions

Conceptualization, R.Z.; Methodology, H.Y.; Formal analysis, Z.G.; Writing—original draft, Y.Z.; Writing—review & editing, L.Z.; Supervision, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of Shandong Province (ZR2022MF303) and the Natural Science Foundation of Liaocheng University (No. 318012013).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because they contain some private data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Newerli-Guz, J.; Śmiechowska, M.; Drzewiecka, A.; Tylingo, R. Bioactive Ingredients with Health-Promoting Properties of Strawberry Fruit (Fragaria x ananassa Duchesne). Molecules 2023, 28, 2711. [Google Scholar] [CrossRef] [PubMed]
  2. Kim, S.-J.; Jeong, S.; Kim, H.; Jeong, S.; Yun, G.-Y.; Park, K. Detecting Ripeness of Strawberry and Coordinates of Strawberry Stalk using Deep Learning. In Proceedings of the 2022 Thirteenth International Conference on Ubiquitous and Future Networks (ICUFN), Barcelona, Spain, 5–8 July 2022; pp. 454–458. [Google Scholar] [CrossRef]
  3. Su, Z.; Zhang, C.; Yan, T.; Zhu, J.; Zeng, Y.; Lu, X.; Gao, P.; Feng, L.; He, L.; Fan, L. Application of Hyperspectral Imaging for Maturity and Soluble Solids Content Determination of Strawberry With Deep Learning Approaches. Front. Plant Sci. 2021, 12, 736334. [Google Scholar] [CrossRef] [PubMed]
  4. Jiang, Z.P.; Liu, Y.Y.; Shao, Z.E.; Huang, K.W. An improved VGG16 model for pneumonia image classification. Appl. Sci. 2021, 11, 11185. [Google Scholar] [CrossRef]
  5. Zheng, H.; Sherazi, S.W.A.; Son, S.H.; Lee, J.Y. A deep convolutional neural network-based multi-class image classification for automatic wafer map failure recognition in semiconductor manufacturing. Appl. Sci. 2021, 11, 9769. [Google Scholar] [CrossRef]
  6. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. A review of key techniques of vision-based control for harvesting robot. Comput. Electron. Agric. 2016, 127, 311–323. [Google Scholar] [CrossRef]
  7. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep learning–Method overview and review of use for fruit detection and yield estimation. Comput. Electron. Agric. 2019, 162, 219–234. [Google Scholar] [CrossRef]
  8. Ajayi, O.G.; Ashi, J.; Guda, B. Performance evaluation of YOLO v5 model for automatic crop and weed classification on UAV images. Smart Agric. Technol. 2023, 5, 100231. [Google Scholar] [CrossRef]
  9. Laroca, R.; Severo, E.; Zanlorensi, L.A.; Oliveira, L.S.; Gonçalves, G.R.; Schwartz, W.R.; Menotti, D. A robust real-time automatic license plate recognition based on the YOLO detector. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–10. [Google Scholar]
  10. Wu, J.; Ren, H.; Lin, T.; Yao, Y.; Fang, Z.; Liu, C. Autonomous Path Finding and Obstacle Avoidance Method for Unmanned Construction Machinery. Electronics 2023, 12, 1998. [Google Scholar] [CrossRef]
  11. Zhao, G.; Zhang, Y.; Lan, Y.; Deng, J.; Zhang, Q.; Zhang, Z.; Li, Z.; Liu, L.; Huang, X.; Ma, J. Application Progress of UAV-LARS in Identification of Crop Diseases and Pests. Agronomy 2023, 13, 2232. [Google Scholar] [CrossRef]
  12. Gao, Z.; Shao, Y.; Xuan, G.; Wang, Y.; Liu, Y.; Han, X. Real-time hyperspectral imaging for the in-field estimation of strawberry ripeness with deep learning. Artif. Intell. Agric. 2020, 4, 31–38. [Google Scholar] [CrossRef]
  13. Xiao, B.; Nguyen, M.; Yan, W.Q. Fruit ripeness identification using transformers. In Applied Intelligence; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–12. [Google Scholar]
  14. Li, X.; Li, J.; Tang, J. A deep learning method for recognizing elevated mature strawberries. In Proceedings of the 2018 33rd Youth Academic Annual Conference of Chinese Association of Automation (YAC), Nanjing, China, 18–20 May 2018; pp. 1072–1077. [Google Scholar]
  15. Niu, Y.; Lu, M.; Liang, X.; Wu, Q.; Mu, J. YOLO-plum: A high precision and real-time improved algorithm for plum recognition. PLoS ONE 2023, 18, e0287778. [Google Scholar] [CrossRef] [PubMed]
  16. Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2023, 35, 13895–13906. [Google Scholar] [CrossRef]
  17. Wang, G.; Zheng, H.; Li, X. ResNeXt-SVM: A novel strawberry appearance quality identification method based on ResNeXt network and support vector machine. J. Food Meas. Charact. 2023, 17, 4345–4356. [Google Scholar] [CrossRef]
  18. Ridho, M.F. Strawberry Fruit Quality Assessment for Harvesting Robot using SSD Convolutional Neural Network. In Proceedings of the 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), Semarang, Indonesia, 20–21 October 2021; pp. 157–162. [Google Scholar]
  19. An, Q.; Wang, K.; Li, Z.; Song, C.; Tang, X.; Song, J. Real-Time Monitoring Method of Strawberry Fruit Growth State Based on YOLO Improved Model. IEEE Access 2022, 10, 124363–124372. [Google Scholar] [CrossRef]
  20. Zhang, Y.; Yu, J.; Chen, Y.; Yang, W.; Zhang, W.; He, Y. Real-time strawberry detection using deep neural networks on embedded system (rtsd-net): An edge AI application. Comput. Electron. Agric. 2022, 192, 106586. [Google Scholar] [CrossRef]
  21. Fu, L.; Feng, Y.; Majeed, Y.; Zhang, X.; Zhang, J.; Karkee, M.; Zhang, Q. Kiwifruit detection in field images using Faster R-CNN with ZFNet. IFAC-PapersOnLine 2018, 51, 45–50. [Google Scholar] [CrossRef]
  22. Nishizawa, T. Current status and future prospect of strawberry production in East Asia and Southeast Asia. In Proceedings of the IX International Strawberry Symposium, Rimini, Italy, 1–5 May 2021; pp. 395–402. [Google Scholar]
  23. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  25. Qi, L.; Li, B.; Chen, L.; Wang, W.; Dong, L.; Jia, X.; Huang, J.; Ge, C.; Xue, G.; Wang, D. Ship Target Detection Algorithm Based on Improved Faster R-CNN. Electronics 2019, 8, 959. [Google Scholar] [CrossRef]
  26. Ullah, A.; Xie, H.; Farooq, M.O.; Sun, Z. Pedestrian detection in infrared images using fast RCNN. In Proceedings of the 2018 Eighth International Conference on Image Processing Theory, Tools and Applications (IPTA), Xi’an, China, 7–10 November 2018; pp. 1–6. [Google Scholar]
  27. Li, D.J.; Li, R.H. Mug defect detection method based on improved Faster RCNN. Laser Optoelectron. Prog. 2020, 57, 353–360. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  29. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  30. Zhou, C.; Hu, J.; Xu, Z.; Yue, J.; Ye, H.; Yang, G. A novel greenhouse-based system for the detection and plumpness assessment of strawberry using an improved deep learning technique. Front. Plant Sci. 2020, 11, 559. [Google Scholar] [CrossRef] [PubMed]
  31. Sun, X.; Wu, P.; Hoi, S.C. Face detection using deep learning: An improved faster RCNN approach. Neurocomputing 2018, 299, 42–50. [Google Scholar] [CrossRef]
  32. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4804–4814. [Google Scholar]
  33. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS--improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
Figure 1. Images of different growth stages.
Figure 2. Image clipping process image.
Figure 3. Brightness processing contrast image.
Figure 4. Image of data augmentation process.
Figure 5. Faster R-CNN and MRS Faster R-CNN structure diagram.
Figure 6. Residual module.
Figure 7. ResNet50 residual module.
Figure 8. ResNet50 flowchart.
Figure 9. VGG16 network structure diagram.
Figure 10. Anchor boxes diagram.
Figure 11. RPN structure.
Figure 12. The action position of Soft-NMS.
Figure 13. Soft-NMS processing process diagram.
Figure 14. LabelImg annotation interface.
Figure 15. (a–d) Prediction images of R Faster R-CNN. (e–h) Prediction images of MVS Faster R-CNN. (i–l) Prediction images of MRS Faster R-CNN.
Figure 16. AP diagram.
Figure 17. Recall diagram.
Table 1. ResNet50 structure.
Layer Name | Output Size | 50-Layer
Conv1 | 300 × 300 | 7 × 7, 64, stride 2
Conv2 | 150 × 150 | 3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
Conv3 | 75 × 75 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4
Conv4 | 38 × 38 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6
Conv5 | 19 × 19 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
— | 1 × 1 | Average pool, 1000-d fc, Softmax
Table 2. R Faster R-CNN Parameter Settings.
Initial Learning Rate (Init_lr): 10⁻² | Minimum Learning Rate: Init_lr × 0.01 | Momentum: 0.9 | Optimizer: SGD | Weight Decay: 0 | Learning Rate Decay: cos | Backbone: ResNet50 | nms_iou: 0.3 | Training Epochs: 350 | Batch Size: 8
Table 3. MVS Faster R-CNN Parameter Settings.
Initial Learning Rate (Init_lr): 10⁻² | Minimum Learning Rate: Init_lr × 0.01 | Momentum: 0.9 | Optimizer: SGD | Weight Decay: 0 | Learning Rate Decay: cos | Backbone: VGG16 | nms_iou: 0.3 | Training Epochs: 350 | Batch Size: 8
Table 4. MRS Faster R-CNN Parameter Settings.
Initial Learning Rate (Init_lr): 10⁻² | Minimum Learning Rate: Init_lr × 0.01 | Momentum: 0.9 | Optimizer: SGD | Weight Decay: 0 | Learning Rate Decay: cos | Backbone: ResNet50 | nms_iou: 0.3 | Training Epochs: 350 | Batch Size: 8
Table 5. Experimental result parameters.
Experimental Group | Classification | F1 | Precision | Recall | AP
R Faster R-CNN | Mature | 94.11% | 94.31% | 93.91% | 94.10%
R Faster R-CNN | Immature | 84.32% | 82.45% | 86.27% | 78.66%
MVS Faster R-CNN | Mature | 83.58% | 83.51% | 83.65% | 81.75%
MVS Faster R-CNN | Immature | 71.52% | 72.67% | 70.41% | 68.49%
MRS Faster R-CNN (the method of this article) | Mature | 95.20% | 95.12% | 95.29% | 94.36%
MRS Faster R-CNN (the method of this article) | Immature | 88.48% | 88.79% | 88.17% | 84.00%
