Article

MLP-YOLOv5: A Lightweight Multi-Scale Identification Model for Lotus Pods with Scale Variation

1 School of Mechanical Engineering and Mechanics, Xiangtan University, Xiangtan 411105, China
2 Engineering Research Center of Complex Track Processing Technology & Equipment, Ministry of Education, Xiangtan University, Xiangtan 411105, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(1), 30; https://doi.org/10.3390/agriculture14010030
Submission received: 10 November 2023 / Revised: 16 December 2023 / Accepted: 20 December 2023 / Published: 23 December 2023
(This article belongs to the Section Digital Agriculture)

Abstract
Lotus pods in unstructured environments often present multi-scale characteristics in the captured images. As a result, their automatic identification is difficult and prone to missed and false detections. This study proposed a lightweight multi-scale lotus pod identification model, MLP-YOLOv5, to deal with this difficulty. The model adjusted the multi-scale detection layer and optimized the anchor box parameters to enhance small object detection accuracy. The C3 module with a transformer encoder (C3-TR) and the shuffle attention (SA) mechanism were introduced to improve the model's feature extraction ability and detection quality. GSConv and VoVGSCSP modules were adopted to build a lightweight neck, thereby reducing model parameters and size. In addition, SIoU was utilized as the loss function of bounding box regression to achieve better accuracy and faster convergence. The experimental results on the multi-scale lotus pod test set showed that MLP-YOLOv5 achieved a mAP of 94.9%, 3% higher than the baseline. In particular, the model's precision and recall for small-scale objects improved by 5.5% and 7.4%, respectively. Compared with other mainstream algorithms, MLP-YOLOv5 showed more significant advantages in detection accuracy, parameters, speed, and model size. The test results verified that MLP-YOLOv5 can quickly and accurately identify multi-scale lotus pod objects in complex environments, and can thus effectively support accurate, automatic lotus pod picking by harvesting robots.

1. Introduction

Lotus seeds are the seeds of the perennial aquatic plant lotus [1]; they have a delicious taste and excellent medicinal and nutritional value [2]. For this reason, they are often used as raw materials for food, medicine, and nutrient extraction. Lotus seeds grow in the lotus pod, and harvesting the lotus pod is a prerequisite for lotus seed production. However, lotus pods grow in complex environments such as lakes, swamps, and muddy fields, so traditional manual harvesting often faces harsh working conditions (Figure 1a), low working efficiency, and heavy labor intensity. In recent years, population ageing has intensified the harvesting labor shortage, which directly impacts the lotus seed industry. Hence, research on automatic lotus pod harvesting technology is urgent.
Accurate identification is a prerequisite for developing this kind of technology. In recent years, deep learning (DL) algorithms that possess high robustness and accuracy have been widely used to identify fruits and vegetables. Among them, representative algorithms include the SSD [3], the YOLO series [4], and the Faster R-CNN series [5]. Regarding specific research reports on fruit and vegetable identification, Wang et al. [6] designed an improved SSD algorithm for identifying Lingwu long jujubes. Chen et al. [7] developed a cherry tomato detection algorithm based on YOLOv3. Zhang et al. [8] studied the real-time detection of grape clusters based on YOLOv5. Chen et al. [9] improved the YOLOv5 algorithm for Camellia oleifera detection. Du et al. [10] designed a ripe strawberry detection model named DSW-YOLO. Yang et al. [11] developed an improved YOLOv7 model for apple detection.
Common vegetables and fruits (e.g., lettuces, strawberries, cherry tomatoes, and kiwifruits) are often grown in relatively standardized environments, and the scale distribution of individual objects in the image is relatively uniform. However, lotus pods grow in a rather unstructured environment. Their colors are similar to those of surrounding objects (Figure 1), and occlusion and overlapping often occur. Moreover, the size, location, posture, and height of individual lotus pods are randomly distributed, showing typical multi-scale characteristics in the images collected by the vision system. Figure 2 shows an example of this situation.
A small-scale lotus pod object occupies few pixels and carries little feature information, so identification algorithms easily ignore it during detection. In addition, occlusion by surrounding objects further increases the variation in a lotus pod's apparent scale. These problems make the lotus pod identification process quite prone to missed and false detections, thus increasing the difficulty of automatic identification. Moreover, performing lotus pod identification in the field requires a detection algorithm with a small computation scale and fast speed. Therefore, it is important to investigate a lightweight, efficient multi-scale object detection model that can identify lotus pods rapidly and accurately.
Regarding multi-scale and small object detection research, there have been many literature reports on aerial image detection and autonomous driving. Cao et al. [12] built an improved Faster R-CNN algorithm to detect low-resolution traffic signs by optimizing the loss function, introducing an improved non-maximum suppression, and using multi-scale convolutional feature fusion. Zhu et al. [13] enhanced the YOLOv5 model by adding a detection layer, using the transformer prediction heads, and integrating an attention mechanism to identify objects at different scales in drone-captured scenarios. Ji et al. [14] improved the YOLOv4 model to enhance detection performance for small-scale objects. The main measures include adding a detection layer, using an expanded field-of-perception block, optimizing the loss function, and introducing an attention mechanism. Mahaur and Mishra [15] improved the YOLOv5 model by setting up a small-scale detection head and adjusting cross-layer connections in the neck network to enhance the detection accuracy for small-scale traffic objects. Wang et al. [16] presented an improved YOLOv5 network to detect multi-scale traffic signs. The model’s detection performance was improved by utilizing the adaptive attention and feature enhancement modules.
In precision agriculture, there have been some explorations around multi-scale plant disease identification and small-scale fruit detection. Zhao et al. [17] designed a specific multi-scale feature fusion network for the original Faster R-CNN model to detect small-scale strawberry diseases. Li et al. [18] developed a YOLOv5-based network model for multi-scale cucumber disease detection. It combined the coordinate attention mechanism, transformer modules, feature fusion network, and multi-scale training to enhance the detection performance for small targets. Lu et al. [19] developed a detection algorithm for small-scale fruits named ODL Net. The utilized label assignment strategy and semantic enhancement module enhanced the model’s small object detection performance. Hitherto, there have been few research reports on multi-scale lotus pod identification.
The above literature explored ways to enhance models' multi-scale object detection performance and provides a basic reference for multi-scale lotus pod identification research. However, the characteristics and growth environment of lotus pods differ from those of the targets in these studies, and directly adopting existing models would inevitably degrade detection quality. Therefore, it is necessary to explore a multi-scale detection model suited to lotus pods.
Measures such as adjusting the scale of the detection layer and introducing attention mechanisms and transformer modules can effectively enhance the model’s detection accuracy of small targets in complex environments. In addition, using lightweight convolution is expected to effectively reduce the size and computational complexity of the model, thereby improving deployment performance. Therefore, aiming at the difficulty of identifying multi-scale lotus pods in unstructured environments, this study proposed a lightweight multi-scale lotus pod identification model (MLP-YOLOv5). In MLP-YOLOv5, the multi-scale detection layer and network anchor box were modified and optimized to enhance the small lotus pod objects’ detection accuracy; the C3 module with transformer encoder (C3-TR) was utilized to improve the model’s feature extraction ability of low-resolution lotus pod objects; the shuffle attention (SA) mechanism was introduced to strengthen the attention to the lotus pod’s feature information and reduce the interference of irrelevant information; GSConv and VoVGSCSP modules were adopted to achieve lightweight design of the neck network, thereby reducing network parameters and calculation scale and meeting agricultural application requirements; the SIoU loss function was utilized to further enhance the model’s accuracy and convergence speed. Finally, experimental testing on the multi-scale lotus pod test set was conducted to validate the model’s effectiveness. The research results are expected to effectively support the automatic harvesting of lotus pods.

2. Materials and Methods

2.1. Lotus Pods Image Acquisition

This study took the lotus pods of the lotus seed varieties "TaiKong 36" and "CunSan" as the research object. The lotus pod images were collected in Xiangtan County, Hunan Province, China, in the summers of 2021 and 2022. The collection process covered different times of day and various weather conditions. The acquisition devices were the cameras of several smartphones (including iPhone 6s, iPhone 8 Plus, Huawei Nova 7, Vivo X23, etc.) and a drone (DJI Air 2S, Shenzhen, China). The collected images covered a variety of resolutions (including 3000 × 3000, 3000 × 4000, 3456 × 4608, 4032 × 3024, 5632 × 4224, etc.). The shooting process covered different lighting conditions, shooting angles, and distances. These conditions ensured the richness and diversity of the data in terms of appearance and scale. Figure 3 shows sample large, medium, and small-scale lotus pods in the images.

2.2. Dataset Preparation

We prepared 5000 images and divided them into training and validation sets at a ratio of 8:2. In addition, we prepared 800 images of lotus pods at different scales as an independent multi-scale test set to comprehensively evaluate the model's multi-scale detection performance. To improve the model's robustness and generalization, data augmentation was used to further expand the training set to 8000 images. The operations involved translation, flipping, blur adding, noise adding, and light changing, as sketched below.
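For illustration, the listed augmentation operations can be composed with standard image libraries. The following is a minimal sketch using torchvision; the specific parameter values are assumptions for demonstration, not the authors' exact settings, and in a detection pipeline the geometric operations must also transform the bounding boxes consistently.

```python
import torch
from torchvision import transforms

# Illustrative augmentation pipeline covering the listed operations; parameter
# values are assumptions, not the paper's settings. Geometric transforms
# (affine, flip) must be applied to the annotation boxes as well in practice.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.RandomHorizontalFlip(p=0.5),                    # flipping
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # blur adding
    transforms.ColorJitter(brightness=0.4, contrast=0.3),      # light changing
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0, 1)),  # noise adding
])
```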
The lotus pod has the characteristics of progressive ripening, and its appearance is shown in Figure 4. Figure 4a shows unripe lotus pods, which are smaller in size and mostly in the shape of a triangle or a wine glass, with very small lotus seeds inside. This study will not consider them because they are unripe and unsuitable for harvesting.
The detection object of this study is ripe lotus pods, whose appearance is shown in Figure 4b. The shape of ripe lotus pods is mostly bowl-shaped, with apparent wrinkles on the surface. The lotus seeds inside have grown up and protruded from the surface of the lotus pod. Specifically, the ripe lotus pod can be subdivided into three states: Green ripening stage, half ripening stage, and full ripening stage. They can all be harvested, so this study did not subdivide them but combined them into one category, collectively called ripe lotus pods. Based on this, LabelImg was utilized to label ripe lotus pod objects in all images with enclosing rectangular boxes (very distant targets were not labeled). The number of labeled categories was 1, and the category label name used was “Ripe-LP”.
According to the proportion of the annotation box area in the entire image area, we divided the annotation boxes into three scales: Large, medium, and small. The proportion thresholds separating small from medium and medium from large scales were 0.58% and 4%, respectively. Table 1 lists the number of images, the number of annotation boxes, and the annotation box scale statistics for the above datasets. The results indicate that the number of annotation boxes at each scale is reasonably distributed across the datasets, fully considering the training requirements of a multi-scale detection model.
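The scale assignment described above reduces to a simple area-proportion rule. A minimal sketch follows (the function name and signature are ours, not from the paper):

```python
def classify_box_scale(box_w, box_h, img_w, img_h):
    """Assign an annotation box to a scale class by the proportion of its
    area in the whole image, using the paper's thresholds of 0.58% and 4%."""
    proportion = (box_w * box_h) / (img_w * img_h)
    if proportion < 0.0058:
        return "small"
    if proportion < 0.04:
        return "medium"
    return "large"

# Example: a 150 x 150 box in a 3000 x 4000 image occupies ~0.19% -> "small".
```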

2.3. The Principle of YOLOv5

YOLOv5 is a classic deep-learning object detection algorithm. It possesses the merits of excellent accuracy, fast speed, lightweight, and easy deployment. It has attracted wide attention in many fields. This study selected the YOLOv5 (model s, v6.0) as the baseline model. Its network structure is illustrated in Figure 5.
The YOLOv5 network structure includes the input end, backbone, neck, and detection end. The input end is responsible for inputting images and performing image preprocessing. The backbone includes the CBS, C3, and SPPF modules, which extract object feature information from the input image. The neck fuses features of different scales and is composed of a bidirectional structure formed by the FPN and PAN. The detection end uses three detection heads with output feature map scales of 20 × 20, 40 × 40, and 80 × 80 (strides of 32, 16, and 8, respectively, for a 640 × 640 input) to identify large, medium, and small targets. Finally, the predicted bounding box, object category, and confidence results are output.
Specifically, the network has two C3 modules, i.e., C3-1 and C3-2 [20]. The C3-1 module comprises the CBS module and the Residual Unit (ResUnit). In comparison, the C3-2 module replaces the ResUnit in the original C3-1 with the CBS module. In addition, the SPPF module is used to achieve local and global feature fusion and improve the network’s computing efficiency. It includes the CBS, Maxpool, and Concat modules [21].

2.4. Development of the MLP-YOLOv5 Model

Although YOLOv5 has good detection performance, some limitations still exist for multi-scale lotus pod detection in unstructured growth environments: (1) Although the model sets three output scales at the detection end, the smallest-object scale of 80 × 80 is still not fine enough for some small-scale lotus pod objects. (2) The aspect ratio parameters of the pre-defined anchor boxes adopted in the original YOLOv5 may not be suitable for lotus pods. (3) The lotus pod's color is similar to that of surrounding objects, and lotus pods are often occluded and overlapped, making them more difficult to find and identify. (4) The original YOLOv5 model used for lotus pod detection has a certain parameter redundancy, and there is still room for simplification to better suit deployment on agricultural equipment.
This study developed a lightweight, multi-scale identification model for lotus pods named MLP-YOLOv5 to deal with the above difficulties. Five improvements were performed: (1) Optimized the output size of the multi-scale detection layer and the anchor box parameters; (2) introduced the C3-TR module; (3) added a shuffle attention mechanism; (4) designed a lightweight neck; (5) improved the loss function. The network structure of MLP-YOLOv5 is illustrated in Figure 6.

2.4.1. Optimization of the Multi-Scale Detection Layer Structure and Anchor Box Parameters

A new small object detection layer with a 160 × 160 scale was added to the detection end of the MLP-YOLOv5 network, and the 20 × 20 large object detection layer in the original YOLOv5 was deleted. Figure 7 schematically shows the network structure after adjusting the detection layers. After the neck performs its second up-sampling to obtain the 80 × 80 feature map, a third up-sampling is performed, and the result is concatenated with the 160 × 160 feature map from the backbone (the 2nd layer in Figure 6). This expands the output feature map from 80 × 80 to 160 × 160 (a stride of 4 for the 640 × 640 input) and forms the new small object detection head. After this adjustment of the detection layer output sizes, the MLP-YOLOv5 model can better utilize shallow feature information, improve small target positioning, and reduce lotus pod feature information loss, while also reducing the scale of the model's parameters.
The sizes of the pre-defined anchor boxes in a DL model directly impact the training quality and detection performance of the model. The original YOLOv5 pre-defined three sets of nine anchor boxes according to the clustering result on the COCO dataset. Since we adjusted the output scale of the network’s detection layer, and the lotus pod had unique shape characteristics, optimizing the anchor box sizes was necessary to strengthen the model’s detection accuracy for multi-scale lotus pod objects.
This study used K-means clustering to optimize the anchor box sizes. Based on the height and width data of the lotus pod annotation boxes in the training set, nine optimized anchor box size parameters were finally obtained. The results are shown in Table 2. Among them, the anchor boxes corresponding to the 160 × 160 output layers were used for small object detection, and the other anchor boxes were used primarily for medium and large object detection.
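A minimal sketch of this clustering step is given below, using scikit-learn's K-means on the (width, height) pairs of the training set annotation boxes. Note that this uses Euclidean distance, whereas some YOLO implementations cluster with an IoU-based distance, so treat it as an illustration rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(box_wh: np.ndarray, k: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs of training annotation boxes, scaled to
    the 640 x 640 training resolution, into k anchor sizes sorted by area."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_wh)
    anchors = kmeans.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]

# The three smallest anchors would be assigned to the 160 x 160 detection
# layer, the next three to the 80 x 80 layer, and the largest three to the
# 40 x 40 layer, as in Table 2.
```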

2.4.2. C3 Module with a Transformer Encoder

A small-scale lotus pod occupies only a few pixels in the image, so context information is easily lost during detection. The transformer is a deep neural network that utilizes the self-attention mechanism. It can adaptively and globally aggregate object features to achieve powerful feature expression capabilities [22]. Its encoder block structure can improve the model's detection performance, especially for low-resolution objects [23]. The structure of the transformer encoder block is illustrated in Figure 8a. It comprises two main sublayers: a multi-head attention layer followed by a multi-layer perceptron (MLP). In the MLP-YOLOv5 network, the transformer encoder block was integrated into the C3 module at the end of the backbone to form the C3-TR module. The specific location is shown in Figure 6.
Figure 8b shows the basic structure of the C3-TR, where the ResUnit in the original C3-1 structure was replaced by the transformer encoder block (Figure 8a). In addition, the layer normalization (LN) layers were removed from the original transformer encoder block to achieve faster speed and better performance.
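The following PyTorch sketch illustrates the encoder block just described: multi-head self-attention followed by a two-layer MLP, each with a residual connection, and no layer normalization. It is a simplified re-implementation for illustration (YOLOv5-style transformer blocks typically also add a learnable position embedding, omitted here); the head count and MLP expansion ratio are assumptions.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlockNoLN(nn.Module):
    """Transformer encoder sublayers used in C3-TR: multi-head attention and
    an MLP, each with a residual connection, and no layer normalization."""
    def __init__(self, c: int, num_heads: int = 4):
        super().__init__()
        # c must be divisible by num_heads
        self.attn = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads,
                                          batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(),
                                 nn.Linear(4 * c, c))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        seq = seq + self.attn(seq, seq, seq)[0]   # self-attention + residual
        seq = seq + self.mlp(seq)                 # MLP + residual
        return seq.transpose(1, 2).reshape(b, c, h, w)
```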

2.4.3. Shuffle Attention Mechanism Module

During detection, the large amount of low-level structural and textural information may prevent the model from correctly distinguishing the objects from the background, resulting in false detection of surrounding objects. Attention mechanisms, which select the most important information from a large amount of data, can deal with this problem [24]. By adjusting weights to suppress irrelevant information and increase attention to useful information, they can effectively improve the model's detection performance.
This study added the shuffle attention mechanism (SA) [25] to the MLP-YOLOv5 model. SA is an efficient attention mechanism module combining spatial attention and channel attention. Its structure is illustrated in Figure 9. The working process is as follows: First, SA divides the input feature map X into N groups. Then, each sub-feature map Xk is split into Xk1 and Xk2 branches. Branch Xk1 is used to form a channel feature map, while branch Xk2 is used to form a spatial feature map. Subsequently, the two branches are concatenated, the sub-features are aggregated, and the cross-group information communication is realized through the channel shuffle operation. Finally, the refined output feature map Y is obtained.
By establishing the feature dependence of space and channel, the SA allows the detection model to focus on the two critical pieces of information of the lotus pods in the image: The spatial position information (“where”) and the semantic information (“what”), thus reducing missed and false detection. In MLP-YOLOv5, the SA layer was added before the three Concat layers in the neck part, and the specific location is shown in Figure 6.
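To make the grouping, branching, and shuffle steps described above concrete, below is a compact re-implementation of SA following the description in [25]. It is a sketch for illustration; details such as the default group count are assumptions, and the input channel count is assumed divisible by twice the group count.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of shuffle attention (SA) after Zhang and Yang [25]. The input
    is split into groups; each group is halved into a channel-attention
    branch and a spatial-attention branch, and a channel shuffle recombines
    the groups."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)              # channels per branch
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # for the channel branch
        self.gn = nn.GroupNorm(c, c)              # for the spatial branch
        self.cw = nn.Parameter(torch.ones(1, c, 1, 1))
        self.cb = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sw = nn.Parameter(torch.ones(1, c, 1, 1))
        self.sb = nn.Parameter(torch.zeros(1, c, 1, 1))

    @staticmethod
    def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
        b, c, h, w = x.shape
        return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)   # split into groups
        x1, x2 = x.chunk(2, dim=1)                            # two branches
        xc = x1 * torch.sigmoid(self.cw * self.avg_pool(x1) + self.cb)  # "what"
        xs = x2 * torch.sigmoid(self.sw * self.gn(x2) + self.sb)        # "where"
        out = torch.cat([xc, xs], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, 2)   # cross-group information exchange
```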

2.4.4. Lightweight Design of the Neck Network

Harvesting robots working in natural environments have high requirements for the detection model’s accuracy and speed performance. Usually, the greater the scale of model parameters, the larger the model’s size will be, and the difficulty and cost of actual deployment of the model in harvesting equipment will also increase. Therefore, to effectively decrease the model’s complexity while maintaining its detection accuracy, this study introduced a lightweight convolution named GSConv [26] to replace the CBS module in the original neck network.
Figure 10a schematically illustrates the structure of GSConv. GSConv combines the information generated by depthwise separable convolution (DWConv) and standard convolution. Then, it uses the shuffle operation to exchange information between different channels and subsequently acquires the output results. GSConv fully combines the advantages of the two kinds of convolution and enhances the expression ability of the network. It not only effectively reduces the computational complexity but also promotes the balance between the model’s accuracy and speed [27].
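The structure in Figure 10a can be sketched in a few lines of PyTorch. The SiLU activation (as in YOLOv5's CBS modules) and the depthwise kernel size are assumptions based on common implementations of the slim-neck design [26], not details confirmed by the paper.

```python
import torch
import torch.nn as nn

def conv_bn_act(c1, c2, k, s, groups=1):
    """Convolution + BatchNorm + SiLU, mirroring YOLOv5's CBS block."""
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, s, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c2),
        nn.SiLU(),
    )

class GSConv(nn.Module):
    """Sketch of GSConv [26]: a standard convolution generates half of the
    output channels, a depthwise convolution generates the other half from
    them, and a channel shuffle mixes the two feature types."""
    def __init__(self, c1: int, c2: int, k: int = 1, s: int = 1):
        super().__init__()
        c_ = c2 // 2
        self.conv = conv_bn_act(c1, c_, k, s)                 # standard conv
        self.dwconv = conv_bn_act(c_, c_, 5, 1, groups=c_)    # depthwise conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.conv(x)
        y = torch.cat([x1, self.dwconv(x1)], dim=1)           # (B, c2, H, W)
        b, c, h, w = y.shape                                  # channel shuffle
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```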
Based on GSConv, we also introduced the VoVGSCSP module, whose structure is shown in Figure 10b. In VoVGSCSP, the input feature map is processed along two routes. In one route, the input is processed by convolution, and the generated features are then extracted by the GS bottleneck built from GSConv. In the other route, a convolution module processes the input as a residual connection; finally, the two routes are concatenated for output. VoVGSCSP fully exploits the advantages of GSConv and the GS bottleneck, improving feature extraction ability while reducing model parameters and thus improving the lightweight characteristics [28]. This study replaced the original C3-2 module in the YOLOv5 neck with the VoVGSCSP module, which further reduces computational complexity while maintaining sufficient accuracy.

2.4.5. Improvement of Loss Function

The loss function is particularly important for object detection models as the measure of the difference between the model's predictions and the ground truth. YOLOv5 uses CIoU as the loss function for bounding box regression; it considers the center point distance, overlap area, and the boxes' aspect ratio. However, CIoU does not take into account the direction relationship between the prediction and the ground truth, which may lead to less effective and slower model convergence. As a newer loss function, SIoU [29] redefines the penalty metric and considers the direction matching between the prediction and the ground truth, which helps alleviate the slow convergence and low efficiency of conventional loss functions [30]. This study replaced the preset CIoU in YOLOv5 with SIoU as the bounding box regression loss function for lotus pod detection.
Figure 11 illustrates the geometric relationship between the ground truth (gt, red) and prediction (blue) boxes. The calculation process of the SIoU loss function takes into account the costs of the angle, distance, shape, and IoU parameters. The relevant calculation formulas are as follows [29,30,31].
(1) Angle cost $\Lambda$:

$$\Lambda = \cos\left(2 \times \left(\arcsin\frac{c_{h1}}{\sigma} - \frac{\pi}{4}\right)\right) \tag{1}$$

where $c_{h1}$ is the center point height difference between the two boxes, and $\sigma$ is the center point distance between the two boxes (Figure 11a). The calculations are as follows:

$$\sigma = \sqrt{\left(b_{cx}^{gt} - b_{cx}\right)^{2} + \left(b_{cy}^{gt} - b_{cy}\right)^{2}} \tag{2}$$

$$c_{h1} = \max\left(b_{cy}^{gt},\, b_{cy}\right) - \min\left(b_{cy}^{gt},\, b_{cy}\right) \tag{3}$$

$\left(b_{cx}^{gt}, b_{cy}^{gt}\right)$ and $\left(b_{cx}, b_{cy}\right)$ are the center point coordinates of the ground truth and prediction boxes, respectively.

(2) Distance cost $\Delta$:

$$\Delta = \sum_{t=x,y} \left(1 - e^{-\gamma \rho_{t}}\right) \tag{4}$$

$$\rho_{x} = \left(\frac{b_{cx}^{gt} - b_{cx}}{c_{w2}}\right)^{2}, \qquad \rho_{y} = \left(\frac{b_{cy}^{gt} - b_{cy}}{c_{h2}}\right)^{2} \tag{5}$$

$$\gamma = 2 - \Lambda \tag{6}$$

where $c_{w2}$ and $c_{h2}$ represent the width and height of the minimum enclosing rectangle formed between the two boxes (Figure 11b).

(3) Shape cost $\Omega$:

$$\Omega = \sum_{t=w,h} \left(1 - e^{-W_{t}}\right)^{\theta} \tag{7}$$

$$W_{w} = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}, \qquad W_{h} = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)} \tag{8}$$

where $(w, h)$ and $\left(w^{gt}, h^{gt}\right)$ represent the width and height of the prediction and ground truth boxes, respectively; $\theta$ defines the degree of concern for the shape cost.

(4) IoU cost:

$$\mathrm{IoU} = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \tag{9}$$

where $B$ and $B^{gt}$ represent the areas of the prediction and ground truth boxes, respectively (Figure 11c).

Based on the above calculations, the final expression of the SIoU loss function is shown in Equation (10):

$$L_{SIoU} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2} \tag{10}$$
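To show how the four cost terms combine, the following is a minimal PyTorch implementation of Equations (1)-(10) for corner-format boxes. It is our illustrative sketch, not the authors' training code; the default θ = 4 is a commonly used value within the range recommended by the SIoU paper.

```python
import math
import torch

def siou_loss(pred, gt, theta: float = 4.0, eps: float = 1e-7):
    """SIoU bounding box regression loss per Equations (1)-(10) above.
    pred, gt: (N, 4) tensors of (x1, y1, x2, y2) boxes."""
    # IoU cost (Eq. 9)
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Box centers and the minimum enclosing rectangle (cw2, ch2)
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    gcx, gcy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    cw2 = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch2 = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # Angle cost (Eqs. 1-3)
    sigma = torch.sqrt((gcx - pcx) ** 2 + (gcy - pcy) ** 2) + eps
    ch1 = torch.max(pcy, gcy) - torch.min(pcy, gcy)
    lam = torch.cos(2 * (torch.arcsin((ch1 / sigma).clamp(0, 1)) - math.pi / 4))

    # Distance cost (Eqs. 4-6)
    gamma = 2 - lam
    rho_x = ((gcx - pcx) / (cw2 + eps)) ** 2
    rho_y = ((gcy - pcy) / (ch2 + eps)) ** 2
    delta = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost (Eqs. 7-8)
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    gw, gh = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    ww = (pw - gw).abs() / (torch.max(pw, gw) + eps)
    wh = (ph - gh).abs() / (torch.max(ph, gh) + eps)
    omega = (1 - torch.exp(-ww)) ** theta + (1 - torch.exp(-wh)) ** theta

    # Eq. (10): per-box loss
    return 1 - iou + (delta + omega) / 2
```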

2.5. Experimental Environment and Training Parameters

The computing platform used in this study was a laptop (T58-V37, Machinist, Suzhou, China). The hardware configurations include an 8-core AMD Ryzen 9 5900 HX CPU, 32 GB RAM, NVIDIA GeForce RTX 3070 (8 GB) GPU, and 1 TB SSD. The software configurations include the Windows 11 operating system, the PyTorch 1.12.1 deep learning framework (with the CUDA version of 11.6 and the cuDNN version of 8.0), and the Python 3.7 programming language. The basic training parameters of the MLP-YOLOv5 model include: the maximum number of training epochs was 200; the image input size was 640 × 640; the batch size was set to 16; the momentum was set to 0.937; the initial learning rate was set to 0.01, and the optimizer used was SGD.
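Assuming training is run with the Ultralytics YOLOv5 repository's train.py entry point, a setup equivalent to the listed hyperparameters could be invoked as sketched below; the dataset configuration file name is a hypothetical placeholder, and argument names may differ slightly between repository versions.

```python
# Hedged sketch: reproducing the training setup with the Ultralytics YOLOv5
# repository's train.py entry point. "lotus_pod.yaml" is a hypothetical
# dataset config (image paths plus the single class "Ripe-LP").
import train  # run from the yolov5 repository root

train.run(
    data="lotus_pod.yaml",
    weights="yolov5s.pt",   # YOLOv5s (v6.0) baseline weights
    imgsz=640,              # 640 x 640 input size
    epochs=200,             # maximum training epochs
    batch_size=16,
    optimizer="SGD",        # lr0 = 0.01 and momentum = 0.937 are the
)                           # defaults in YOLOv5's hyperparameter file
```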

2.6. Indicators of Model Evaluation

In this study, P (precision), R (recall), mAP (mean average precision), parameters, model size, GFLOPs (giga floating-point operations, a measure of computational cost), and FPS (frames per second) were used as the indicators for model performance evaluation.
Regarding detection performance, P and R are used to judge the model’s false and missed detections, respectively. mAP reflects the comprehensive detection performance of the model. The IoU threshold value of mAP was set to 0.5. The formulas for calculating the above metrics are listed as follows (higher is better):
$$P = \frac{TP}{TP + FP} \times 100\%$$

$$R = \frac{TP}{TP + FN} \times 100\%$$

$$AP = \int_{0}^{1} P(R)\, dR$$

$$mAP = \frac{\sum_{i=1}^{C} AP_{i}}{C}$$
where TP (true positive) denotes the number of correctly predicted lotus pods, FP (false positive) indicates the number of falsely predicted lotus pods, FN (false negative) denotes the number of lotus pod objects missed, and C represents the number of object categories, which is taken as 1 in this study.
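As a quick numerical illustration of the P and R formulas (the counts below are hypothetical, chosen only to show the arithmetic):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall in percent, per the formulas above."""
    p = tp / (tp + fp) * 100
    r = tp / (tp + fn) * 100
    return p, r

# Hypothetical counts: 937 correct detections, 63 false detections,
# 95 missed objects -> P = 93.7%, R ~ 90.8%.
print(precision_recall(937, 63, 95))
```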
Regarding model size and detection speed, parameters denote the number of learnable weights in the model, GFLOPs represent the scale of computation the model requires, and FPS reflects the model's real-time performance. In addition, model size refers to the storage space required for the model weights; the smaller the value, the easier deployment on agricultural equipment becomes.

3. Experimental and Results Analysis

3.1. Experimental Comparison before and after Model Improvement

3.1.1. Model Training

The YOLOv5s and MLP-YOLOv5 models were trained and tested to verify the effectiveness of the model improvement. Figure 12 shows the loss curves during model training. As can be seen from the figure, with the increase in training epochs, both models’ loss curves steadily decreased and eventually stabilized after approximately 180 epochs. The MLP-YOLOv5 model achieved a lower loss value than the YOLOv5s model from the initial stage to the final convergence stage. This means that by improving the loss function and optimizing anchor box parameters, the loss value of the model was effectively reduced, thereby making the model more robust [18].

3.1.2. Experimental Results

The test results of the two models are listed in Table 3. As the table shows, the P, R, and mAP@0.5 values achieved by the MLP-YOLOv5 model were 93.7%, 90.8%, and 94.9%, respectively, increases of 1.7%, 3.9%, and 3% over the baseline. Meanwhile, the model size, parameter count, and GFLOPs of MLP-YOLOv5 were significantly reduced, by 25.7%, 30.0%, and 14.6%, respectively.
In addition, we specially extracted and calculated the experiment results of all the small-scale objects in the test set, as listed in Table 4. Compared with the YOLOv5s model, the P and R values of the MLP-YOLOv5 model increased by 5.5% and 7.4%, respectively. This fully proves that the improvements made in this study have effectively improved the model’s detection performance for small-scale lotus pods and reduced the occurrence of missed and false detections.
Furthermore, to investigate the feature extraction effects of MLP-YOLOv5 for small-scale lotus pod objects, the feature maps before and after the addition of the 160 × 160 detection layer were analyzed visually. Specifically, an image containing several small-scale lotus pods was selected and sent to the model for detection. The representative feature maps of the 17th, 19th, 20th, and 22nd layers of the network were then extracted; the corresponding visualization results are shown in Figure 13.
The original image contains four ripe lotus pods (Figure 13a). However, the feature map on the 17th layer only displayed three bright spot-like regions representing lotus pods (Figure 13b), which means that the model failed to effectively capture the feature information of the upper left small-scale lotus pod (red circle in Figure 13b). The 17th layer corresponds to the 80 × 80 feature map. Combined with the analysis in Section 2.4, it can be known that the feature map’s size has not been expanded at this time. When the feature map’s size was expanded to 160 × 160, the corresponding feature map of the 19th layer (Figure 13c) obtained more effective lotus pod feature information, thus successfully focusing on this small-scale lotus pod. Furthermore, as displayed in the corresponding feature map of the 20th layer (Figure 13d), when SA was added to the model network, the feature information of the four lotus pods received continuous attention. Subsequently, in the corresponding feature map of the 22nd layer (Figure 13e), those areas where these four lotus pods were located have become bright and prominent. In contrast, the irrelevant background information has become dim. This means that SA successfully suppressed the irrelevant information in complex backgrounds and highlighted the feature information of the lotus pods. Finally, the model detected all four ripe lotus pods, as shown in Figure 13f.

3.1.3. Comparison of Detection Effects

Here, some scenes were selected to show the detection effects of the models, including scenes with background interference (Figure 14), scenes with occlusion and lighting influence (Figure 15), and medium-small-scale lotus pod detection scenes (Figure 16). In these images, the FP mark denotes falsely identified objects, and the FN mark denotes missed objects.
The first column and second column of Figure 14 show that the YOLOv5s model falsely identified the unripe lotus pods in the image, while MLP-YOLOv5 achieved correct identification. Moreover, in the third column, the YOLOv5s model falsely identified the lotus leaves in the image as lotus pods. However, the MLP-YOLOv5 model could maintain accurate judgment when facing background object interference in complex environments.
The first column and second column of Figure 15 show that the YOLOv5s model missed identifying the lotus pods occluded by lotus leaves, while the MLP-YOLOv5 model could successfully identify these lotus pods. As seen from the third column, the YOLOv5s model failed to identify the lotus pod in the shadow area, while the MLP-YOLOv5 model still achieved correct identification. This means MLP-YOLOv5 adapted well to the occlusion and lighting changes.
According to the detection results for medium-small scale lotus pods shown in Figure 16, the YOLOv5s model produced missed and false detections for small-scale lotus pod targets. In contrast, the MLP-YOLOv5 model showed a better detection effect in identifying small-scale lotus pods.
The above results show that MLP-YOLOv5 exhibited excellent performance for the lotus pod detection task in unstructured growth environments. The model improved the identification accuracy of small objects and reduced false and missed detection in complex environments. It also possesses strong robustness and generalization ability.

3.2. Ablation Experiments

3.2.1. Ablation Experimental Analysis of the Improvement Methods

Ablation experiments were performed to verify the model improvement methods’ contribution to the model’s performance. Table 5 lists the corresponding experimental results.
After the multi-scale detection layer improvement and anchor box optimization (MS), the model's mAP@0.5 increased by 1.9%, and the model size was reduced by 1.6 MB compared with YOLOv5s. After adding the small object detection layer, the model could use shallow-layer information more effectively for multi-scale feature fusion, enhancing its ability to detect small-scale lotus pod objects and thereby increasing detection accuracy. Meanwhile, removing the redundant 20 × 20 detection layer effectively reduced the model size. Introducing C3-TR into the backbone had no impact on the model size yet increased the model's mAP@0.5 by 1%; the P and R values also improved. This is attributable to C3-TR strengthening the attention to, and extraction of, local and low-resolution features. After adding SA to the neck, the model's mAP@0.5 increased by 1.7%, and R increased significantly by 3.6%. This indicates that SA effectively avoids the interference of complex backgrounds and reduces the probability of missed identification. After the lightweight design of the neck, the model size was reduced by 2.1 MB, and the mAP@0.5 increased by 1.7%, showing that the GSConv and VoVGSCSP modules not only improved accuracy but also had an excellent lightweight effect. Meanwhile, the mAP@0.5 increased by 0.7% after using the SIoU loss function.
Finally, the MLP-YOLOv5 model, which incorporates all the improvements, outperformed other models with a single improvement. The results indicated that the above five improvement measures effectively improved the model’s detection performance and reduced the model’s size, and all these improvements played their due roles.

3.2.2. Ablation Experimental Analysis of the Attention Mechanism

An ablation experiment on the impact of attention mechanisms was performed to explore the suitability of SA. Representative mainstream attention mechanisms were selected for comparison, i.e., SE (squeeze and excitation) [32], CBAM (convolutional block attention module) [33], and CA (coordinate attention) [34]. Each attention mechanism was placed at the same location in the MLP-YOLOv5 model. Table 6 lists the experimental results.
Introducing SA gave the model the fewest parameters among the compared attention mechanisms. In terms of accuracy, adding SA produced the largest mAP improvement. Analyzing further, SE focuses attention only on the channel dimension, which limits the effective capture of local information on lotus pods. CBAM considers both channel and spatial attention, but it only attends to local region information without establishing long-distance dependencies. Although the CA mechanism performs well in capturing local context information, it increases computational complexity and likewise cannot capture long-distance dependencies. In contrast, SA applies channel and spatial attention to each sub-feature and then uses channel shuffle operations to achieve cross-group information communication. It effectively reduces the interference of complex backgrounds and strengthens the extraction of the lotus pods' location information. Therefore, choosing the SA attention mechanism in this study was reasonable.

3.3. Comparative Analysis of Other Mainstream Detection Models

Further, we compared MLP-YOLOv5 with the currently mainstream one-stage object detection algorithms, i.e., YOLOv3, YOLOv4, YOLOv7, YOLOv8, and SSD. Table 7 lists the test results.
The MLP-YOLOv5 model showed more significant advantages in detection accuracy, speed, parameters, model size, and GFLOPs when compared with the above mainstream algorithms. It is worth mentioning that although the YOLOv7 and YOLOv8 models appeared later than YOLOv5, the proposed MLP-YOLOv5 model in this study still possesses advantages in almost all aspects.
A radar chart analysis was performed based on Table 7, and the results are shown in Figure 17. The graphic area of the MLP-YOLOv5 model was the fullest, indicating that its performance in all aspects was closer to the ideal state than other models.
Figure 18 shows the representative detection effects of the above-mentioned models for lotus pods in unstructured growth environments. For medium- and large-scale lotus pods, all models achieved accurate identification (Figure 18a). However, it can be observed from Figure 18b that the other models had false or missed detection for some small-scale lotus pods. In contrast, the MLP-YOLOv5 model detected all lotus pod objects correctly and showed the best detection results.

3.4. Multi-Scale Lotus Pod Detection Field Test

3.4.1. Test in a Laboratory Environment

A multi-scale lotus pod detection test in a laboratory environment was carried out based on the vision system of the lotus pod harvesting robot developed by our team (as shown in Figure 19a). The image acquisition device was the camera of a smartphone (Huawei Nova 7), whose position was fixed and in a top-down shooting posture. Three lotus pod samples (Figure 19b) were selected for the experiment. Then, they were placed at a series of heights (characterized using the camera-pod distances) from the camera and photographed by the camera to present different scale effects in the images. The principle is shown in Figure 19c. The captured lotus pod images were sent to the MLP-YOLOv5 model for detection.
The detection effects of the test are demonstrated in Figure 20. The model achieved accurate detection for the same group of lotus pods at different heights (the camera-pod distance range was about 0.5~2.2 m), even though their pixel sizes in the image changed continuously.

3.4.2. Test in a Real Natural Growth Environment

In addition, a multi-scale lotus pod detection test in a natural environment was conducted. As shown in Figure 21a, images of the same group of lotus pods at different scales were captured by adjusting the distance between the lotus pods and the camera (a Huawei Nova 7) mounted on an adjustable tripod. These images were then tested with the MLP-YOLOv5 model, and the corresponding detection effects are shown in Figure 21b. When the camera-pod distance was changed from dist. 1 to dist. 4 (from 0.5 m to 2.0 m, at intervals of 0.5 m), all three lotus pods in the images were detected accurately.
The above results indicated that the MLP-YOLOv5 model had good adaptability to lotus pod detection under scale change, both in the laboratory and natural growth environments.

4. Discussion

This study established the MLP-YOLOv5 model based on the multi-scale characteristics of lotus pods. The above experimental results show that the model effectively improved the identification accuracy of lotus pod objects at different scales while achieving a lightweight design. The model's performance has advantages over other mainstream algorithms.
In terms of detection accuracy, MLP-YOLOv5 achieved a mAP of 94.9% by adopting measures such as adjusting the multi-scale detection layer, optimizing anchor box parameters, and introducing the transformer encoder block and SA attention mechanism. Compared with the baseline model, the precision and recall for small-scale lotus pods increased by 5.5% and 7.4%, respectively. Similar research has been reported elsewhere. Reference [19] introduced a semantic enhancement module and a "fair" label assignment strategy into the designed ODL Net model to improve the detection accuracy of small-scale green fruits; the model's detection accuracy for small-scale pears increased by 2.4% (AP). That study focused on spherical fruit targets, whose shape changes little with viewing angle. In our research, however, the shape of lotus pods changes with their posture in the image, increasing the detection complexity. Reference [35] introduced a Bidirectional Feature Pyramid Network (BiFPN) into the YOLOv5 model and enhanced the detection accuracy of small-scale litchi targets in UAV images by improving the regression loss function and using an image slicing method; the AP value of small target detection increased from 27.8% to 57.3%. However, that work is oriented toward yield estimation in large scenes: the litchi targets are relatively small in the image, and the scale distribution has particularities, which differs from the medium- and short-range identification required by an automatic picking task. Reference [36] proposed a DSE-YOLO model for detecting strawberries at different stages of maturity and obtained a mAP of 86.58%. The size differences of strawberries at different ripening stages form multi-scale features to a certain extent, but the differences among strawberry individuals were mainly due to maturity stage. This differs from the scale differences caused by the random distribution of the size and position of ripe lotus pods in our study.
MLP-YOLOv5 achieved its lightweight design by improving the multi-scale detection layer and using the GSConv and VoVGSCSP modules; model size, parameters, and GFLOPs were reduced by 25.7%, 30.0%, and 14.6%, respectively. Reference [37] proposed an apple flower detection method based on YOLOv4. Through channel pruning, the model's parameters were reduced by 96.74%, the inference time was decreased by 39.47%, and the model size was reduced by 231.51 MB. However, the model's mAP also decreased accordingly, which may be related to the degree of compression. Reference [38] introduced the ShuffleNetV2 network and ECA attention mechanism into the YOLOv5 model to form a lightweight Camellia oleifera detection algorithm. The model's size was reduced by 56.55% compared to the baseline, but the detection accuracy (mAP) was only increased by 0.3%. This may be because the objects' scale is relatively large and the baseline accuracy is already high, leaving limited room for improvement.
The MLP-YOLOv5 model achieved an initial balance between accuracy and lightweight design. However, there is still room for further compression to better meet the fieldwork requirements of automatic harvesting equipment. In future work, we will investigate how to lighten the model further while preserving accuracy, as well as the multi-scale identification of lotus pods at different maturity levels.

5. Conclusions

This study proposed a lightweight multi-scale lotus pod identification model called MLP-YOLOv5. The following improvements were introduced: The model adjusted the multi-scale detection layer and optimized the anchor box parameters to reduce the model’s size and enhance the detection accuracy of small objects. The C3-TR and SA modules were introduced to improve the feature extraction ability and detection quality of the model. GSConv and VoVGSCSP lightweight modules were adopted to build a lightweight neck, thereby reducing model parameters and size. SIoU was utilized as the loss function of bounding box regression to achieve better accuracy and faster convergence.
The experimental results on the multi-scale lotus pod test set showed that the P, R, and mAP@0.5 values acquired by MLP-YOLOv5 were 93.7%, 90.8%, and 94.9%, respectively, increases of 1.7%, 3.9%, and 3% over the baseline. Significantly, the P and R achieved by the model for small objects increased by 5.5% and 7.4%, respectively. In addition, the model size, parameter number, and GFLOPs were significantly reduced, by 25.7%, 30.0%, and 14.6%, respectively. The results indicate that the introduced improvements effectively improved the model's detection accuracy and speed and made it more lightweight, making it better suited to multi-scale lotus pod detection tasks. Compared with other mainstream algorithms, i.e., SSD, YOLOv3, YOLOv4, YOLOv7, and YOLOv8, MLP-YOLOv5 showed more significant advantages in detection accuracy, speed, parameters, model size, and GFLOPs. The model effectively reduced false and missed detections in multi-scale object detection and could effectively support accurate, automatic lotus pod picking by harvesting robots.

Author Contributions

Conceptualization, A.L.; methodology, A.L. and J.L.; software, J.L.; validation, J.L. and A.L.; formal analysis, J.L. and A.L.; investigation, J.L., H.C., A.L. and L.M.; writing—original draft preparation, A.L. and J.L.; writing—review and editing, A.L.; supervision, Q.M.; funding acquisition, A.L. and Q.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (NSFC) (Grant Nos. 52205285, 52175255) and the Natural Science Foundation of Hunan Province (Grant No. 2023JJ40629).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y.; Xu, Y.; Wang, Q.; Zhang, J.; Dai, X.; Miao, S.; Lu, X. The antioxidant capacity and nutrient composition characteristics of lotus (Nelumbo nucifera Gaertn.) seed juice and their relationship with color at different storage temperatures. Food Chem. X 2023, 18, 100669.
  2. Li, J.; Deng, Z.; He, Y.; Fan, Y.; Dong, H.; Chen, R.; Liu, R.; Tsao, R.; Liu, X. Differential specificities of polyphenol oxidase from lotus seeds (Nelumbo nucifera Gaertn.) toward stereoisomers, (−)-epicatechin and (+)-catechin: Insights from comparative molecular docking studies. LWT 2021, 148, 111728.
  3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  6. Wang, Y.; Xing, Z.; Ma, L.; Qu, A.; Xue, J. Object Detection Algorithm for Lingwu Long Jujubes Based on the Improved SSD. Agriculture 2022, 12, 1456.
  7. Chen, J.; Wang, Z.; Wu, J.; Hu, Q.; Zhao, C.; Tan, C.; Teng, L.; Luo, T. An improved Yolov3 based on dual path network for cherry tomatoes detection. J. Food Process Eng. 2021, 44, e13803.
  8. Zhang, C.; Ding, H.; Shi, Q.; Wang, Y. Grape cluster real-time detection in complex natural scenes based on YOLOv5s deep learning network. Agriculture 2022, 12, 1242.
  9. Chen, S.; Zou, X.; Zhou, X.; Xiang, Y.; Wu, M. Study on fusion clustering and improved YOLOv5 algorithm based on multiple occlusion of Camellia oleifera fruit. Comput. Electron. Agric. 2023, 206, 107706.
  10. Du, X.; Cheng, H.; Ma, Z.; Lu, W.; Wang, M.; Meng, Z.; Jiang, C.; Hong, F. DSW-YOLO: A detection method for ground-planted strawberry fruits under different occlusion levels. Comput. Electron. Agric. 2023, 214, 108304.
  11. Yang, H.; Liu, Y.; Wang, S.; Qu, H.; Li, N.; Wu, J.; Yan, Y.; Zhang, H.; Wang, J.; Qiu, J. Improved Apple Fruit Target Recognition Method Based on YOLOv7 Model. Agriculture 2023, 13, 1278.
  12. Cao, C.; Wang, B.; Zhang, W.; Zeng, X.; Yan, X.; Feng, Z.; Liu, Y.; Wu, Z. An improved faster R-CNN for small object detection. IEEE Access 2019, 7, 106838–106846.
  13. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788.
  14. Ji, S.; Ling, Q.; Han, F. An improved algorithm for small object detection based on YOLOv4 and multi-scale contextual information. Comput. Electr. Eng. 2023, 105, 108490.
  15. Mahaur, B.; Mishra, K.K. Small-object detection based on YOLOv5 in autonomous driving systems. Pattern Recognit. Lett. 2023, 168, 115–122.
  16. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 2023, 35, 7853–7865.
  17. Zhao, S.; Liu, J.; Wu, S. Multiple disease detection method for greenhouse-cultivated strawberry based on multiscale feature fusion Faster R_CNN. Comput. Electron. Agric. 2022, 199, 107176.
  18. Li, S.; Li, K.; Qiao, Y.; Zhang, L. A multi-scale cucumber disease detection method in natural scenes based on YOLOv5. Comput. Electron. Agric. 2022, 202, 107363.
  19. Lu, Y.; Du, S.; Ji, Z.; Yin, X.; Jia, W. ODL Net: Object detection and location network for small pears around the thinning period. Comput. Electron. Agric. 2023, 212, 108115.
  20. Li, S.; Zhang, S.; Xue, J.; Sun, H. Lightweight target detection for the field flat jujube based on improved YOLOv5. Comput. Electron. Agric. 2022, 202, 107391.
  21. Xue, Z.; Lin, H.; Wang, F. A small target forest fire detection model based on YOLOv5 improvement. Forests 2022, 13, 1332.
  22. Zhang, Q.; Tang, J.; Zheng, H.; Lin, C. Efficient object detection method based on aerial optical sensors for remote sensing. Displays 2022, 75, 102328.
  23. Zhu, X.; Hang, X.; Gao, X.; Yang, X.; Xu, Z.; Wang, Y.; Liu, H. Research on crack detection method of wind turbine blade based on a deep learning method. Appl. Energy 2022, 328, 120241.
  24. Kang, J.; Liu, L.; Zhang, F.; Shen, C.; Wang, N.; Shao, L. Semantic segmentation model of cotton roots in-situ image based on attention mechanism. Comput. Electron. Agric. 2021, 189, 106370.
  25. Zhang, Q.; Yang, Y. SA-Net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239.
  26. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424.
  27. Zhang, Y.; Shen, S.; Xu, S. Strip steel surface defect detection based on lightweight YOLOv5. Front. Neurorobotics 2023, 17, 1263739.
  28. Li, J.; Pan, H.; Li, J. ESD-YOLOv5: A Full-Surface Defect Detection Network for Bearing Collars. Electronics 2023, 12, 3446.
  29. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
  30. Chen, J.; Ma, B.; Ji, C.; Zhang, J.; Feng, Q.; Liu, X.; Li, Y. Apple inflorescence recognition of phenology stage in complex background based on improved YOLOv7. Comput. Electron. Agric. 2023, 211, 108048.
  31. Li, J.; Li, J.; Zhao, X.; Su, X.; Wu, W. Lightweight detection networks for tea bud on complex agricultural environment via improved YOLOv4. Comput. Electron. Agric. 2023, 211, 107955.
  32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  33. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
  34. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
  35. Xiong, Z.; Wang, L.; Zhao, Y.; Lan, Y. Precision Detection of Dense Litchi Fruit in UAV Images Based on Improved YOLOv5 Model. Remote Sens. 2023, 15, 4017.
  36. Wang, Y.; Yan, G.; Meng, Q.; Yao, T.; Han, J.; Zhang, B. DSE-YOLO: Detail semantics enhancement YOLO for multi-stage strawberry detection. Comput. Electron. Agric. 2022, 198, 107057.
  37. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742.
  38. Li, Z.; Kang, L.; Rao, H.; Nie, G.; Tan, Y.; Liu, M. Camellia oleifera Fruit Detection Algorithm in Natural Environment Based on Lightweight Convolutional Neural Network. Appl. Sci. 2023, 13, 10394.
Figure 1. (a) Manual lotus pod harvesting scene. (b) Growth environment of lotus pods.
Figure 2. Example of the multi-scale characteristics of the lotus pods in the image.
Figure 3. Sample large-, medium-, and small-scale lotus pods in the images.
Figure 4. Representative appearances of unripe and ripe lotus pods.
Figure 5. Illustration of the YOLOv5 network structure.
Figure 6. Illustration of the MLP-YOLOv5 network structure.
Figure 7. Network structure of the improved multi-scale detection layer in the proposed MLP-YOLOv5 network.
Figure 8. Structure diagram of (a) the transformer encoder block and (b) the C3-TR module.
Figure 9. Structure of the shuffle attention mechanism.
Figure 10. Structure of (a) the GSConv and (b) the VoVGSCSP module.
Figure 11. Geometric relationship between the prediction and ground truth boxes in SIoU.
Figure 12. Total loss curves of the YOLOv5s and MLP-YOLOv5 models.
Figure 13. Visualization results of the extracted feature maps. (a) Original image. Feature maps on (b) the 17th layer, (c) the 19th layer, (d) the 20th layer, (e) the 22nd layer, and (f) final results.
Figure 14. Detection results under background object interference.
Figure 15. Detection results under occlusion and light changing.
Figure 16. Detection results for medium-small scale lotus pods.
Figure 17. Radar chart of the models' comprehensive performance comparison.
Figure 18. Detection effects of different models for (a) medium- and large-scale lotus pods and (b) small-scale lotus pods.
Figure 19. Multi-scale detection test of lotus pods in a lab environment. (a) The overall structure of the lotus pod harvesting robot. (b) Picture of the testing process. (c) Principle of the test.
Figure 20. Detection effects of the same group of lotus pods at different heights (camera-pod distances).
Figure 21. Multi-scale detection test in the natural environment. (a) Experimental scene. (b) Detection effects under different camera-pod distances.
Table 1. Statistics information of the established datasets.

| Dataset | Number of Images | Large-Scale | Medium-Scale | Small-Scale | Total Annotation Boxes |
|---|---|---|---|---|---|
| Training set | 8000 | 1808 | 8152 | 12,906 | 22,866 |
| Validation set | 1000 | 222 | 1019 | 1710 | 2951 |
| Test set | 800 | 132 | 676 | 1932 | 2740 |
| Total | 9800 | 2162 | 9847 | 16,548 | 28,557 |
Table 2. Size of anchor boxes after clustering.

| Detection Layer | Optimized Anchor Box Sizes |
|---|---|
| Small (160 × 160) | [11, 13] [18, 18] [25, 27] |
| Medium (80 × 80) | [38, 37] [49, 55] [80, 58] |
| Large (40 × 40) | [73, 94] [129, 96] [151, 172] |
Table 3. Performance results of the baseline and the improved models.

| Model | P (%) | R (%) | mAP@0.5 (%) | Parameters (M) | Model Size (MB) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv5s | 92.0 | 86.9 | 91.9 | 7.0 | 13.6 | 15.8 |
| MLP-YOLOv5 | 93.7 | 90.8 | 94.9 | 4.9 | 10.1 | 13.5 |
| Improvement | 1.7% ↑ | 3.9% ↑ | 3.0% ↑ | 30.0% ↓ | 25.7% ↓ | 14.6% ↓ |
Table 4. Model performance comparison on pure small-scale objects on the test set.

| Model | P (%) | R (%) |
|---|---|---|
| YOLOv5s | 84.1 | 84.9 |
| MLP-YOLOv5 | 89.6 | 92.3 |
| Improvement | 5.5 ↑ | 7.4 ↑ |
Table 5. Experimental results of the ablation experiments.

| Model | Model Size (MB) | P (%) | R (%) | mAP@0.5 (%) |
|---|---|---|---|---|
| YOLOv5s | 13.6 | 92.0 | 86.9 | 91.9 |
| YOLOv5s + MS | 12.0 | 90.4 | 90.9 | 93.8 |
| YOLOv5s + C3-TR | 13.6 | 92.6 | 88.6 | 92.9 |
| YOLOv5s + SA | 13.7 | 91.8 | 90.5 | 93.6 |
| YOLOv5s + GSConv + VoVGSCSP | 11.5 | 91.3 | 90.9 | 93.6 |
| YOLOv5s + SIoU | 13.6 | 92.0 | 88.5 | 92.6 |
| Ours | 10.1 | 93.7 | 90.8 | 94.9 |
Table 6. Results of the ablation experiments of the attention mechanism.

| Attention Mechanism | Parameters | P (%) | R (%) | mAP@0.5 (%) |
|---|---|---|---|---|
| SE | 4,928,966 | 90.8 | 90.3 | 93.8 |
| CBAM | 4,929,260 | 90.4 | 91.2 | 94.0 |
| CA | 4,932,622 | 91.5 | 90.5 | 93.9 |
| SA (Ours) | 4,925,990 | 93.7 | 90.8 | 94.9 |
Table 7. Results of the model performance comparison test.

| Model | P (%) | R (%) | mAP@0.5 (%) | Parameters (M) | Model Size (MB) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| SSD | 91.2 | 85.7 | 84.5 | 23.8 | 90.6 | 60.9 | 31.5 |
| YOLOv3 | 92.9 | 83.3 | 82.4 | 61.5 | 235.0 | 155.1 | 40.0 |
| YOLOv4 | 91.4 | 79.1 | 77.7 | 63.9 | 244.0 | 141.4 | 30.3 |
| YOLOv5s | 92.0 | 86.9 | 91.9 | 7.0 | 13.6 | 15.8 | 43.3 |
| YOLOv7 | 92.9 | 92.4 | 93.5 | 36.5 | 71.3 | 103.2 | 41.0 |
| YOLOv8 | 91.6 | 89.3 | 93.4 | 11.1 | 21.5 | 28.4 | 46.5 |
| MLP-YOLOv5 | 93.7 | 90.8 | 94.9 | 4.9 | 10.1 | 13.5 | 49.0 |