Article

Robot Operating Systems–You Only Look Once Version 5–Fleet Efficient Multi-Scale Attention: An Improved You Only Look Once Version 5-Lite Object Detection Algorithm Based on Efficient Multi-Scale Attention and Bounding Box Regression Combined with Robot Operating Systems

1
College of Computer Science and Technology, Changchun University, Changchun 130022, China
2
Key Laboratory of Intelligent Rehabilitation and Barrier-Free Access for the Disabled, Ministry of Education, Changchun 130022, China
3
Jilin Provincial Key Laboratory of Human Health State Identification and Function Enhancement, Changchun 130022, China
4
College of Computer, Jilin Normal University, Siping 136000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7591; https://doi.org/10.3390/app14177591
Submission received: 23 July 2024 / Revised: 14 August 2024 / Accepted: 20 August 2024 / Published: 28 August 2024
(This article belongs to the Special Issue Object Detection and Image Classification)

Abstract

This paper primarily investigates enhanced object detection techniques for indoor service mobile robots. The robot operating system (ROS) supplies rich sensor data, which boosts a model's ability to generalize. However, performance can be constrained by the limited processing power, memory capacity, and communication capabilities of robotic devices. To address these issues, this paper proposes an improved you only look once version 5 (YOLOv5)-Lite object detection algorithm based on efficient multi-scale attention and bounding box regression combined with ROS. The algorithm incorporates efficient multi-scale attention (EMA) into the traditional YOLOv5-Lite model and replaces the C3 module with a lightweight C3Ghost module to reduce computation and model size during the convolution process. To enhance bounding box localization accuracy, the minimum point distance intersection over union (MPDIoU) loss is employed to optimize the model, resulting in the ROS–YOLOv5–FleetEMA model. The results indicate that, relative to the conventional YOLOv5-Lite model, the ROS–YOLOv5–FleetEMA model improved the mean average precision (mAP) by 2.7% after training, reduced the giga floating-point operations (GFLOPs) by 13.2%, and decreased the number of parameters by 15.1%. In light of these experimental findings, the model was incorporated into ROS, leading to the development of a ROS-based object detection platform that offers rapid and precise object detection.

1. Introduction

Object detection is an important branch of machine vision. Its purpose is to automatically identify and locate targets of interest in images or videos. In the field of service robots, object detection technology is mainly used to identify objects and people in the environment so as to realize functions such as autonomous navigation, task execution, and human–computer interaction [1]. However, the diversity and complexity of service robot application scenarios pose many challenges for object detection, such as illumination changes, occlusion, and scale changes. Despite these challenges, object detection remains an indispensable capability for service robots.
Due to the high accuracy and high stability of deep learning technology in image processing, many researchers have begun to use deep learning technology to solve the problem of target detection in computer vision [2]. At present, the commonly used object detection network based on deep learning can be roughly divided into the following two categories: one-stage and two-stage [3].
For two-stage object detection, Ross Girshick et al. proposed the classical region-based convolutional neural network (R-CNN) [4] algorithm. First, about 2000 region proposals are obtained using selective search, then the features of each region proposal are extracted by AlexNet [5], and these features are finally classified and the boxes regressed by multiple classifiers. Subsequently, He et al. proposed the spatial pyramid pooling network (SPPNet) [6], which performs the convolution operations on the entire image to extract richer feature information and avoid the computational redundancy that arises when R-CNN extracts features for every candidate region separately. Accordingly, the fully connected neural network (F-CNN) [7] adds an SPP layer between the last convolutional layer and the fully connected layer to extract a fixed-length feature vector and avoid normalizing the region proposals. Drawing on SPPNet, Ross Girshick proposed Fast R-CNN [8], which simplifies the SPP layer to a region of interest (ROI) pooling layer and applies singular value decomposition (SVD) to the fully connected layers to accelerate inference. Fast R-CNN combines classification with bounding box regression, but it still suffers from excessive computation. To address this, Ren et al. proposed Faster R-CNN [9], which uses a region proposal network (RPN) instead of selective search to extract region proposals, greatly improving detection efficiency. On the basis of Faster R-CNN, Lin et al. proposed the feature pyramid network (FPN) [10], which uses the RPN to extract candidate regions on a feature pyramid; by fusing deep and shallow feature information and predicting at different scales, it enhances the semantic content of shallow feature maps and thereby improves the accuracy of small-target detection. To further improve detection speed, Dai et al. proposed the region-based fully convolutional network (R-FCN) [11], which replaces the fully connected layers with fully convolutional layers so that the features of each candidate region can be convolved directly to obtain per-category confidences. Although two-stage object detection achieves high accuracy, it performs poorly in real time. For this reason, practical detection systems usually adopt one-stage algorithms, which achieve real-time detection while maintaining accuracy comparable to two-stage detectors [12].
The you only look once (YOLO) algorithm is an object detection algorithm that divides the input image into a grid, with each grid cell responsible for detecting the objects that fall inside it. Since it was first proposed in 2016, YOLO has occupied an important position in the field of object detection thanks to its excellent detection speed and accuracy. YOLOv1 [13] treats object detection as a regression problem, yielding an end-to-end method with fast detection speed and good real-time performance; however, its detection accuracy is low, and small objects are difficult to detect. To address this, Redmon et al. improved YOLOv1 by introducing batch normalization and dimension clustering to raise detection accuracy, naming the result YOLOv2 [14]. Building on YOLOv2, Redmon et al. made a further series of improvements, such as using a residual network to restructure the backbone and produce multi-scale outputs, thereby improving detection accuracy, and named the improved algorithm YOLOv3 [15]. Because accuracy improved substantially with YOLOv3, it became one of the most widely used algorithms. Bochkovskiy et al. then proposed YOLOv4 [16] to address remaining shortcomings of YOLOv3: the algorithm adopts the cross-stage partial Darknet-53 (CSP-Darknet53) [17] structure to optimize the network and uses data augmentation during training to further improve training speed and accuracy. In 2020, Ultralytics released the open-source YOLOv5 [18], whose backbone balances detection efficiency and recognition quality; the model size can be scaled, and its recognition accuracy compares favorably with other detectors, but the computational load remains large and the structure contains redundancy. In response to these challenges, this paper proposes an improved YOLOv5-Lite object detection algorithm that combines multi-scale attention and bounding box regression, aiming to further improve detection performance while preserving the algorithm's lightweight characteristics. As a widely used robot software platform, ROS provides a wealth of tools and libraries to support the development and integration of algorithms. Integrating the improved YOLOv5-Lite algorithm with ROS not only enables rapid deployment of the algorithm but also, thanks to the modular nature of ROS, facilitates interaction with other robot perception and decision-making modules. The following sections introduce the design and implementation of the improved YOLOv5-Lite algorithm in detail, together with the experimental process and result analysis in the ROS environment.

2. Related Work

As a one-stage object detection algorithm, the YOLO series offers high detection accuracy and achieves a good balance between accuracy and speed, making it suitable for object detection in complex natural environments [19]. As the latest lightweight version of this series, YOLOv5-Lite is designed for environments with constrained computing resources: it provides acceptable accuracy while maintaining a high detection speed. Although YOLOv5-Lite performs well in some application scenarios, there is still room for improvement in specific robot vision tasks, for example, in the detection accuracy of small targets in dynamic environments or in robustness under different lighting conditions.

2.1. YOLOv5-Lite Network Model

The YOLOv5-Lite model adopts a lightweight design to reduce computational complexity and improve operating efficiency while maintaining high detection accuracy. This structural optimization makes the algorithm more suitable for running on resource-constrained devices. The network structure is shown in Figure 1.
The structure can be roughly divided into the backbone network, the neck network, and the detection head. The algorithm removes the focus layer, reducing the volume of the model and making it lighter; at the same time, the four slice operations are removed, which lowers cache occupancy on the chip and reduces the processing burden. Compared with the YOLOv5 algorithm, the YOLOv5-Lite algorithm avoids repeated use of the C3 module [20], which occupies a large amount of memory at run time and thus reduces processing speed. In this way, the accuracy of the YOLOv5-Lite model can be kept within a reliable range while making it easier to deploy. At the beginning of the backbone, YOLOv5-Lite uses a Conv_Batch_Norm_ReLU structure [21] in place of the traditional focus structure.
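As an illustration of this stem design, the following is a minimal PyTorch sketch of a convolution–batch-normalization–ReLU block used in place of a focus layer; the exact channel count, kernel size, and stride used in YOLOv5-Lite are assumptions here.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Stem block of the form Conv -> BatchNorm -> ReLU, replacing a Focus layer."""
    def __init__(self, in_ch=3, out_ch=32, k=3, s=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# A 640 x 640 RGB image is downsampled once by the stride-2 stem.
x = torch.randn(1, 3, 640, 640)
print(ConvBNReLU()(x).shape)  # torch.Size([1, 32, 320, 320])
```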
The deep stacking module of ShuffleNet V2 [22] divides the input feature channels into two parts through a channel split. The left branch does not participate in convolution and is passed through unchanged, acting as a residual edge. After feature fusion, a channel shuffle is performed so that the left and right features can communicate effectively. Because the down-sampling module changes the size of the feature map, a depthwise separable convolution is also added to the residual edge on the left side and the number of feature channels is adjusted so that the two branches can be fused after convolution. Both modules finally fuse the grouped features and let them communicate through the channel shuffle. The basic units of ShuffleNet V2 are shown in Figure 2.
The ShuffleNet V2 model offers a good trade-off between speed and accuracy in image recognition: at the expense of some prediction accuracy, it obtains faster inference and fewer model parameters. YOLOv5-Lite uses a large number of Shuffle_Block operations in its backbone, which reduces memory access and the number of convolution operations and meets the lightweight design requirements.
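For clarity, a minimal PyTorch sketch of the channel split and channel shuffle operations described above is given here; the composition of the convolutional branch (branch_conv) is an assumption and stands in for the 1 × 1 / depthwise / 1 × 1 convolution stack used in the real block.

```python
import torch

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so that the two branches exchange information."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()     # swap the group and channel axes
    return x.view(b, c, h, w)

def shuffle_unit(x, branch_conv):
    """Basic (non-downsampling) unit: split channels, convolve one half, concatenate, shuffle."""
    left, right = x.chunk(2, dim=1)        # channel split; the left half acts as a residual edge
    right = branch_conv(right)             # convolutional branch (hypothetical callable)
    out = torch.cat((left, right), dim=1)  # feature fusion
    return channel_shuffle(out, groups=2)  # let the two halves communicate
```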

2.2. Efficient Multi-Scale Attention

Research on human vision shows that the brain selectively extracts visual information from its regions of interest while ignoring visual information from other regions. For example, when reading, humans focus on a few key words and skip non-key words. In recent years, deep learning researchers have drawn on this way the brain processes vision and applied the attention mechanism to deep learning models; experimental results show that it can improve model performance to a certain extent.
EMA [23] is an efficient multi-scale attention mechanism that reshapes some channels into the batch dimension, thereby avoiding channel dimensionality reduction, retaining the information of each channel, and reducing the computational cost. EMA not only adjusts the channel weights of the parallel sub-networks using global information encoding but also fuses the output features of the two parallel sub-networks through cross-dimension interaction. The overall structure of EMA is shown in Figure 3.
In the figure, "c" denotes the number of channels in the input feature map, "h" and "w" represent the height and width of the feature map, respectively, "g" denotes the number of groups, "X Avg Pool" denotes the 1D horizontal global pooling, and "Y Avg Pool" denotes the 1D vertical global pooling. The expression "c//g×h×w" gives the number of elements in each grouped feature map, obtained by dividing the channel count (c) by the group count (g) to determine the channels per group and multiplying by the height (h) and width (w) of the feature map.
For the input features, EMA divides them into g sub-features along the channel dimension to learn different semantics. Without loss of generality, it is assumed that the learned weight descriptors are used to enhance the feature representation of the region of interest in each sub-feature.
In deep learning, convolution kernels of different sizes can capture features at different scales. Convolutional kernels sized 1 × 1 are typically used to capture fine-grained detail information, while 3 × 3 convolutional kernels can capture a wider range of contextual information. By combining these two sizes of convolution kernels, EMA can simultaneously obtain local and slightly global features, thereby enhancing the expressive power of the features.
EMA extracts the weight descriptor of the grouping feature map through two parallel paths on the 1 × 1 branch and one on the 3 × 3 branch. In the 1 × 1 branch, two 1D global average pooling operations are used to encode the channel along two spatial directions, and the two coding features are connected so that it does not reduce the dimension on the 1 × 1 branch. Then, the output after 1 × 1 convolution is re-decomposed into two vectors, and two Sigmoid nonlinear functions are used to fit the 2D binary distribution on the linear convolution. Finally, the cross-channel interaction is realized by multiplying the channel attention. In the 3 × 3 branch, a 3 × 3 convolution is used to capture the multi-scale feature representation.
2D global average pooling is used to encode the global spatial information in the outputs of the 1 × 1 and 3 × 3 branches, and the pooled output is reshaped to the corresponding dimensions. A Softmax nonlinearity is then applied to fit the linear transformation. The outputs of the two branches, which have the same size, are combined, and a matrix dot product of the results processed in parallel above produces a spatial attention map of shape 1 × H × W that collects spatial information at different scales. The final output of EMA has the same size as the input X, which makes it convenient to insert directly into the YOLOv5-Lite network.
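To make the data flow above concrete, the following is a PyTorch sketch of an EMA block following the structure of the published reference implementation; the grouping factor and normalization choices here are assumptions and may differ from the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient multi-scale attention: channels are reshaped into the batch dimension
    (grouping), a 1x1 branch encodes directional context, a 3x3 branch captures
    multi-scale context, and cross-dimension interaction fuses the two."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)              # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # X Avg Pool (1D horizontal)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # Y Avg Pool (1D vertical)
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, 1)
        self.conv3x3 = nn.Conv2d(channels // groups, channels // groups, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        grp = x.reshape(b * self.groups, -1, h, w)        # (b*g, c//g, h, w)
        # 1x1 branch: encode along the two spatial directions without channel reduction
        x_h = self.pool_h(grp)                            # (b*g, c//g, h, 1)
        x_w = self.pool_w(grp).permute(0, 1, 3, 2)        # (b*g, c//g, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(grp * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: multi-scale feature representation
        x2 = self.conv3x3(grp)
        # Cross-dimension interaction: pooled descriptors of one branch re-weight the other
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (x11 @ x12 + x21 @ x22).reshape(b * self.groups, 1, h, w)
        return (grp * weights.sigmoid()).reshape(b, c, h, w)  # same size as the input
```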

2.3. MPDIoU Loss Function

Bounding box regression (BBR) [24] strongly influences how accurately the model localizes and recognizes objects and is the key link in achieving efficient and accurate object detection. Most existing BBR loss functions fall into the following two categories: loss functions based on the l_n norm and loss functions based on the intersection over union (IoU). When the predicted box and the ground-truth annotation box share the same aspect ratio but differ in size or position, traditional bounding box regression loss functions degenerate and cannot be optimized effectively. The MPDIoU loss function introduces the concept of minimum point distance, improving regression efficiency and accuracy by minimizing the distances between the top-left and bottom-right corner points of the predicted box and the ground-truth box. This process can be described as follows:
$$d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2$$
$$d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2$$
$$\mathrm{MPDIoU} = \frac{A \cap B}{A \cup B} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}$$
The parameters A and B denote two arbitrary convex shapes, w is the width and h is the height; $(x_1^A, y_1^A)$ and $(x_2^A, y_2^A)$ are the coordinates of the upper-left and lower-right corners of A, respectively; $(x_1^B, y_1^B)$ and $(x_2^B, y_2^B)$ are the coordinates of the upper-left and lower-right corners of B, respectively; $d_1^2$ is the squared Euclidean distance between the upper-left corners of A and B; $d_2^2$ is the squared Euclidean distance between the lower-right corners of A and B; and MPDIoU is the intersection over union (IoU) of A and B minus the normalized minimum point distances.
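A minimal PyTorch sketch of this computation for axis-aligned boxes is shown below. Treating w and h as the width and height of the input image used for normalization is an assumption, and in training the quantity 1 − MPDIoU would typically be used as the loss.

```python
import torch

def mpdiou(box_a, box_b, img_w, img_h, eps=1e-7):
    """MPDIoU for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # IoU term: |A ∩ B| / |A ∪ B|
    ix1 = torch.max(box_a[..., 0], box_b[..., 0])
    iy1 = torch.max(box_a[..., 1], box_b[..., 1])
    ix2 = torch.min(box_a[..., 2], box_b[..., 2])
    iy2 = torch.min(box_a[..., 3], box_b[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    iou = inter / (area_a + area_b - inter + eps)

    # Squared distances between matching corners: d1 (top-left), d2 (bottom-right)
    d1 = (box_b[..., 0] - box_a[..., 0]) ** 2 + (box_b[..., 1] - box_a[..., 1]) ** 2
    d2 = (box_b[..., 2] - box_a[..., 2]) ** 2 + (box_b[..., 3] - box_a[..., 3]) ** 2
    norm = img_w ** 2 + img_h ** 2

    return iou - d1 / norm - d2 / norm   # the loss is typically 1 - MPDIoU
```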

2.4. C3Ghost Module

GhostNet is a new lightweight deep neural network architecture proposed by Huawei Noah’s Ark Laboratory [25]. In general, a large number of redundant feature maps generated by convolution have little complementary effect on the main feature maps in the actual detection task, which is not helpful for the network to improve detection accuracy. However, generating these redundant feature maps consumes a lot of computing power. Therefore, GhostNet constructs the ghost module and uses it to generate redundant feature maps faster and more efficiently. The GhostNet lightweight network can greatly reduce the amount of calculation and parameters of the network while maintaining the size and channel size of the original convolution output feature map. The implementation principle is to divide the traditional convolution into two steps, which are ordinary convolution and cheap linear calculation. Firstly, a part of the feature map is generated by using fewer convolution kernels, then the channel convolution is performed on this part of the feature map to generate more feature maps, and finally, the two sets of feature maps are spliced to generate the GhostNet feature map. The traditional convolution and GhostNet convolution processes are shown in Figure 4.
The head part of YOLOv5-Lite adopts multiple C3 structures, which involve a large number of parameters and slow down detection. Therefore, this study replaces the C3 module with the new C3Ghost module to achieve a lightweight effect. The specific structure is shown in Figure 5.
The GhostBottleneck module is an innovative network component whose design is inspired by the ghost module. It is mainly composed of two GhostConv modules and a residual block. The first GhostConv module acts as an expansion layer; its core function is to increase the number of channels of the input feature map, which provides a richer feature representation for subsequent deep feature extraction and information fusion. The second GhostConv module performs dimensionality reduction, decreasing the number of channels of the output feature map. This not only helps to reduce the computational complexity but also ensures that the output feature map matches the other structures in the network (such as the residual edge) in the number of channels, enabling more efficient information transmission and processing. Between the two GhostConv modules, a residual edge with depthwise convolution is embedded; this design allows the expanded and reduced features to be fused effectively with the features of the residual edge, enhancing the expressive power of the model. In addition, the main purpose of introducing depthwise convolution (DWConv) is to further reduce the number of model parameters, thereby reducing the computational burden and improving the practicability of the model. The C3Ghost module is an improvement based on the C3 block: it replaces the traditional residual component Resunit with the reusable GhostBottleneck module. This replacement not only removes a large number of convolution operations from the traditional structure but also significantly compresses the model size and reduces its computational complexity. In this way, the C3Ghost module achieves a lightweight model while maintaining performance, making it better suited to deployment and operation in resource-constrained environments and enhancing its adaptability and flexibility in practical applications.
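The following PyTorch sketch illustrates the GhostConv and GhostBottleneck ideas described above in their simplest (stride-1) form; the kernel sizes, activation, and shortcut handling are assumptions and are simplified relative to the actual YOLOv5-Lite/C3Ghost implementation.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Half the output channels come from an ordinary convolution; the other half are
    'ghost' features generated from that result by a cheap depthwise convolution."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),  # depthwise
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat((y, self.cheap(y)), dim=1)

class GhostBottleneck(nn.Module):
    """Two GhostConv modules (expansion then channel reduction) plus a residual edge."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.ghost = nn.Sequential(
            GhostConv(c_in, c_mid, 1, 1),    # first GhostConv: expansion layer
            GhostConv(c_mid, c_out, 1, 1))   # second GhostConv: match c_out channels
        self.shortcut = (nn.Identity() if c_in == c_out
                         else nn.Conv2d(c_in, c_out, 1, bias=False))

    def forward(self, x):
        return self.ghost(x) + self.shortcut(x)

# C3Ghost then follows the usual C3 layout but stacks GhostBottleneck blocks
# in place of the standard residual units.
```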

3. Experimental Test and Result Analysis

The previous sections described the background of the robot system and the technologies related to ROS-based robot object detection. Combining these technologies, EMA is inserted into the YOLOv5-Lite model, and a lightweight C3Ghost module is designed to replace the C3 module in the traditional network to compress the computation and model size of the convolution process. To further improve the localization accuracy of the bounding box, the MPDIoU loss function is used to optimize the resulting ROS–YOLOv5–FleetEMA algorithm. This section introduces the object detection service platform built for the experiments, trains the proposed model, and presents the comparative and ablation experiments. Finally, the object detection technology is integrated and deployed on the robot for testing. The experimental process and results are as follows.

3.1. Hardware Equipment

In this paper, an Ackerman differential car integrated with ROS is selected as the experimental platform. The robot integrates a variety of sensors and computing equipment: a lidar for environmental perception, a camera for visual information capture, an inertial measurement unit (IMU) for attitude and motion information, encoder-equipped motors for precise motion control, and embedded computing hardware for data processing and algorithm execution. Servo motors and stepper motors are used to precisely control the robot's motion. The IMU provides acceleration and angular velocity data, which are crucial for positioning and navigation. The lidar is commonly used to detect static and dynamic obstacles. The camera is the main visual sensor for object detection and captures two-dimensional images of the scene; through the image processing and computer vision algorithms running on a Raspberry Pi 4B, objects in the images can be recognized and classified. The detailed layout and configuration of the hardware are shown in Figure 6.

3.2. Experimental Equipment

This paper uses the ROS Melodic distribution with the corresponding Ubuntu 18.04, installed on a virtual machine.

3.2.1. SSH Remote Connection

When debugging the car, we usually need to run commands on the ROS host. However, directly connecting a display, keyboard, mouse, and other input devices to the car is very inconvenient while the car is moving and may even affect the safety and efficiency of the operation. To avoid this and to allow flexible debugging and control while the car is in motion, we adopted remote control of the car.
Usually, we use secure shell (SSH) login for remote control. SSH is a widely used network protocol that provides security for remote login sessions and other network services. Through SSH, we can safely execute commands on the remote host on the local computer, just like operating directly next to the car.

3.2.2. Deep Learning Environment

First, install Miniforge3 on the car; after installation, enter the following commands to create a virtual environment:
  • conda create -n yolo python=3.8
  • conda activate yolo
  • conda install pytorch torchvision torchaudio cpuonly -c pytorch

3.3. ROS-Based Object Detection Service Platform

To simplify the complex compilation and parameter-modification steps involved in using ROS, this part designs an object detection service platform based on ROS. The platform combines Qt and ROS technology; through an intuitive graphical user interface (GUI) [26] with buttons, input boxes, and other controls, one-click operation is achieved, which greatly improves the user experience and operational efficiency. For example, users can log in to the device over SSH, mount the device's file system, start the object detection function, and perform other operations simply by clicking a button, making the debugging process more convenient and transparent.
After ensuring that the host and the car are in the same network environment, start the software and first click the SSH button to connect to the device remotely. Since password-free SSH login has been configured, this step requires no additional password input, which simplifies the connection process. Next, to view and modify the car's source files in the virtual machine, a network file system (NFS) mount is used to mount the device's files into the virtual machine; clicking the NFS button completes the mount automatically. After the files are mounted, the object detection deep learning environment must be activated before the object detection function is started. To this end, this section develops an object detection node. First, a node handle is created and used to create image transport objects and subscribe to topics. When a new message is received, a callback function converts the ROS image message into OpenCV format and then, according to the depth and channel count of the cv::Mat, into a QImage object of the appropriate format so that the image can be displayed visually. The workflow is shown in Figure 7.
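As a concrete illustration of this callback, here is a minimal Python/PyQt sketch of converting a ROS image message to OpenCV format and then to a QImage for display; the actual platform may be implemented with roscpp and C++ Qt, and the topic name and display widget are assumptions.

```python
import cv2
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge
from PyQt5.QtGui import QImage, QPixmap

bridge = CvBridge()

def image_callback(msg, display_label):
    """Called for each frame: ROS Image -> OpenCV array -> QImage -> Qt widget."""
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")    # H x W x 3 uint8 array
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)                  # Qt expects RGB byte order
    h, w, ch = rgb.shape
    qimg = QImage(rgb.data, w, h, ch * w, QImage.Format_RGB888).copy()  # copy to own the buffer
    display_label.setPixmap(QPixmap.fromImage(qimg))              # show the frame in the GUI

# Subscription set up once when the platform window is created (topic name is an assumption):
# rospy.Subscriber("/usb_cam/image_raw", Image, image_callback, callback_args=image_label)
```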

3.4. Experimental Test

Based on the abovementioned platform technology, this part will carry out model training on the algorithm proposed in this paper, design comparative experiments and ablation experiments, and finally, integrate the object detection technology into the robot equipment for testing. The experimental process and experimental results are as follows.

3.4.1. Model Training

The computer configuration used for model training is shown in Table 1.
In this paper, a series of training parameters and strategies is used to ensure that the model learns efficiently and stably. The pretrained weight file used for training is v5lite-s.pt, and the dataset file and label paths are specified through a yaml file. The batch size is set to 16, the number of training epochs to 150, the confidence threshold to 0.45, and the IoU threshold to 0.65. A warmup learning rate strategy is used: in the initial epochs, a smaller learning rate is selected to increase the stability of training, and after this warmup phase, training continues with the preset cyclical learning rate to improve convergence speed. The hyperparameter settings used in training are shown in Table 2.
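To illustrate the warmup strategy with the values from Table 2, here is a simplified per-epoch Python sketch of a linear warmup followed by a cosine-style cyclical decay; the exact schedule used by the YOLOv5-Lite training code (which also warms up momentum and interpolates per iteration) may differ.

```python
import math

def lr_at_epoch(epoch, lr0=0.001, lrf=0.2, warmup_epochs=3.0, total_epochs=150):
    """Learning rate for a given epoch: linear warmup to lr0, then cosine decay to lr0 * lrf."""
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs              # small, ramping learning rate
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr0 * (lrf + (1 - lrf) * 0.5 * (1 + math.cos(math.pi * progress)))

# Early epochs stay small for stability; later epochs decay smoothly towards lr0 * lrf.
print([round(lr_at_epoch(e), 6) for e in (0, 1, 3, 75, 149)])
```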

3.4.2. Dataset

The pattern analysis, statistical modeling, and computational learning visual object classes (PASCAL VOC) challenge is an international computer vision benchmark. This paper selects PASCAL VOC 2007, which, as one of the early datasets, holds an important historical position in the field of computer vision. PASCAL VOC 2007 contains over 12,000 labeled objects and is one of the standard test datasets commonly used for the YOLO algorithm. Thanks to its high-quality annotations and a variety of common object categories, it is consistent and accurate in category definition and labeling and includes many occlusions and complex scene layouts, making it suitable for robot object detection and for fully testing the performance of the model.

3.4.3. Comparative Experiment

The specific indicators for evaluating the performance of the algorithm in this paper include precision (P), recall (R), and mAP, where mAP @ 0.5 denotes the mean average precision when the intersection over union (IoU) threshold is 50%, and mAP @ 0.5–0.95 denotes the mean average precision over IoU thresholds ranging from 50% to 95%. These quantities are defined as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$mAP = \frac{\sum_{k=1}^{N} \int_{0}^{1} P(R)\,dR}{N}$$
In these formulas, TP is the number of true positives (correctly detected objects), FP is the number of false positives, FN is the number of false negatives, and N is the number of object classes in the dataset.
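As a small worked example, the sketch below computes precision and recall from hypothetical TP/FP/FN counts (the numbers are purely illustrative and not results from this paper); mAP then averages, over the N classes, the area under each class's precision–recall curve at the chosen IoU threshold.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only.
p, r = precision_recall(tp=85, fp=15, fn=35)
print(f"P = {p:.3f}, R = {r:.3f}")   # P = 0.850, R = 0.708
```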
The number of parameters (Params) is an important indicator of model complexity: the more parameters a model has, the more computing resources and data it requires for training and inference. GFLOPs denotes the number of giga floating-point operations needed to run the network once, i.e., the amount of computation performed during a single forward pass, and thus reflects the computational cost and speed of the model.
To verify the effectiveness of the attention-mechanism improvement, the added attention module is replaced in turn with squeeze-and-excitation (SE) [27], the convolutional block attention module (CBAM) [28], efficient channel attention (ECA) [29], coordinate attention (CA) [30], and EMA, and the effects of the different attention mechanisms on detection performance are compared and analyzed. The "-" entry indicates that no attention mechanism is applied to the model. The same parameters are used during training, and the experiments are performed on the VOC dataset. The results are shown in Table 3.
SE is a classic channel attention mechanism, which strengthens the importance of feature channels by compressing and stimulating processes. As another form of channel attention, ECA enhances feature representation by effectively capturing cross-channel correlations but ignores spatial location information. CA integrates location information into channel attention and processes features in different spatial directions through two feature coding steps, thereby generating weights that fuse channel and spatial information. CBAM combines the advantages of channel and spatial attention mechanisms and models the channel and spatial weights independently, which not only strengthens the relationship between channels but also considers the spatial interaction and realizes the comprehensive optimization of features. The experimental results reveal the specific effects of different attention mechanisms on model performance. The model with SE and CBAM attention mechanisms suffered a 0.4% and 0.2% decrease in detection accuracy, respectively, indicating that the two mechanisms did not effectively improve performance on the current dataset. In contrast, when the model combines the CA, ECA, and EMA attention mechanisms, the detection accuracy is improved by 0.4%, 0.3%, and 1.1%, respectively. For the latter two attention mechanisms, the detection accuracy is significantly improved. On the whole, the introduction of the EMA attention mechanism not only accelerates the detection speed but also effectively improves the detection accuracy of the model, which makes it more advantageous in practical applications.
In order to verify the performance of the ROS–YOLOv5–FleetEMA model proposed in this paper, the model is compared with the traditional YOLOv5-Lite model based on deep learning. In the case of using the same dataset and experimental environment, the average accuracy improvement effect is shown in Table 4.
Through the analysis of the results, the mAP @ 0.5 of the ROS–YOLOv5–FleetEMA model proposed in this paper is 2.7% higher than that of the traditional YOLOv5-Lite model, and in a wider accuracy range mAP @ 0.5–0.95, the ROS–YOLOv5–FleetEMA model proposed in this paper is 4.3% higher than the traditional YOLOv5-Lite model.
In order to evaluate the lightweight improvement of the ROS–YOLOv5–FleetEMA model more comprehensively, this paper introduces the traditional YOLOv5s model as a comparison benchmark. The experimental results are shown in Table 5.
The results show that the ROS–YOLOv5–FleetEMA model proposed in this paper has achieved significant optimization in the two key indicators of GFLOPS and Param. Compared with the traditional YOLOv5s model, the GFLOPs of the ROS–YOLOv5–FleetEMA model are reduced by 79.3%, and the parameter amount is reduced by 81.1%. This optimization not only reduces the consumption of computing resources but also makes the model more suitable for deployment on resource-constrained devices. At the same time, compared with the YOLOv5-Lite model, the GFLOPs of the ROS–YOLOv5–FleetEMA model are reduced by 13.2%, and the amount of parameters is reduced by 15.1%.
By comparing the experimental results, it is verified that the ROS–YOLOv5–FleetEMA model shows significant advantages in computational efficiency and resource consumption while maintaining high detection accuracy, which proves its practicability and effectiveness in a resource-constrained environment.

3.4.4. Ablation Experiment

To further verify the effectiveness of the proposed ROS–YOLOv5–FleetEMA model, the following ablation experiments are designed: EMA, C3Ghost, and MPDIoU are combined with the traditional YOLOv5-Lite model in different ways, and the performance of each configuration is compared. These experiments show how each component affects overall performance, including detection accuracy, computational efficiency, and resource consumption. By systematically removing or replacing components of the model and observing the impact on performance, the ablation study provides an empirical basis for design decisions and confirms the practicality and effectiveness of the proposed ROS–YOLOv5–FleetEMA model in resource-constrained environments. In the experimental design, "−" indicates that an improvement has not been applied to the model, while "+" indicates that it has been integrated; in this way, the specific impact of each combination on model performance can be clearly demonstrated. The specific results are shown in Table 6 and Figure 8.
The experimental analysis shows that introducing the EMA attention module into the traditional YOLOv5-Lite model significantly improves mAP @ 0.5 while adding very few parameters. In addition, replacing the traditional CIoU loss function with the MPDIoU loss function further improves the bounding box localization accuracy: by directly considering the corner-point distances between the predicted and ground-truth boxes, MPDIoU makes the predicted boxes coincide more closely with the real boxes. The results show that mAP @ 0.5 increases by 0.4%, indicating that the MPDIoU loss makes the bounding box regression more stable and the predictions more accurate. After introducing the C3Ghost module, the parameters and GFLOPs of the model are significantly reduced while a high detection accuracy is maintained; C3Ghost lowers the consumption of computing resources by optimizing the feature extraction process without affecting the detection effect. Finally, when all of these improvements are applied to the YOLOv5-Lite model, mAP @ 0.5 is significantly improved and the number of parameters is reduced by 15.1%. This shows that these optimization strategies can markedly reduce the computational complexity and resource consumption of the model without sacrificing detection accuracy, making it better suited to deployment on resource-constrained devices such as mobile devices and embedded systems.

3.4.5. Integrated Experiment

The Jilin Provincial Key Laboratory of Human Health Status Identification and Function Enhancement was selected as the experimental site.
To realize the object detection function, the deep learning object detection model is deployed on the robot. For this purpose, an object detection function package based on ROS–YOLOv5–FleetEMA is developed. Enter the src directory of the catkin_ws workspace and open a terminal; run conda activate yolo to enter the virtual environment, change into the function package directory, and run sudo pip install -r requirements.txt to install the dependencies required for object detection.
After the installation is completed, run the roslaunch yolov5_ros yolo.launch command to start usb_cam and the ROS–YOLOv5–FleetEMA-based object detection function at the same time. usb_cam is a ROS package for interacting with USB cameras: it publishes the camera images as ROS topics so that they can be used throughout ROS. By subscribing to the image topic published by usb_cam, we employ cv_bridge to transform ROS image messages into the OpenCV image format; within the callback function, we run the detector on the converted image, then convert the processed image back into a ROS image message and publish it to a new YOLOv5 topic, as sketched below.
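A minimal rospy sketch of such a node is given here for illustration; the topic names, the torch.hub model loading, and the weight file path are assumptions and are not taken from the paper's function package.

```python
#!/usr/bin/env python3
import numpy as np
import rospy
import torch
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

class YoloNode:
    """Subscribe to the usb_cam image topic, run detection on each frame, and
    republish the annotated image on a new topic."""
    def __init__(self):
        self.bridge = CvBridge()
        # Loading via torch.hub is an assumption; the package may load its weights directly.
        self.model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
        self.pub = rospy.Publisher("/yolov5/image_out", Image, queue_size=1)
        rospy.Subscriber("/usb_cam/image_raw", Image, self.callback, queue_size=1)

    def callback(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        rgb = np.ascontiguousarray(frame[:, :, ::-1])       # BGR -> RGB for the model
        results = self.model(rgb)                           # run detection
        annotated = results.render()[0]                     # RGB image with boxes drawn
        out = self.bridge.cv2_to_imgmsg(np.ascontiguousarray(annotated[:, :, ::-1]),
                                        encoding="bgr8")
        out.header = msg.header                             # keep the original timestamp/frame
        self.pub.publish(out)

if __name__ == "__main__":
    rospy.init_node("yolov5_ros")
    YoloNode()
    rospy.spin()
```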
At this point, open the ROS-based object detection service platform, set the IP address of the car, and connect to the device; a single click of the Qt button subscribes to the newly established YOLOv5 topic, and the results are displayed on the ROS-based object detection platform, as shown in Figure 9.
The experimental analysis verifies the effectiveness of the proposed ROS–YOLOv5–FleetEMA model in the ROS robot system. The model not only performs well in a resource-constrained environment but also integrates with the ROS-based object detection platform to achieve efficient and fast object detection. Specifically, the system can accurately identify and track multiple targets, such as pedestrians and monitors. When the car moves at 0.5 to 1.5 m per second, a single detection takes 34.4 ms, corresponding to roughly 30 FPS, which keeps the detection process fluent. This not only improves the robot's perception ability in complex environments but also provides strong support for further decision-making and execution.

4. Conclusions

This paper introduces the development process of ROS and analyzes object detection technology in depth. The experimental environment is built around the hardware and software platform of the ROS robot. On this basis, an improved YOLOv5-Lite object detection algorithm combining multi-scale attention and bounding box regression is proposed, forming the ROS–YOLOv5–FleetEMA model, and the object detection function is integrated and deployed on the ROS robot platform. The experiments show that, relative to the conventional YOLOv5-Lite model, the ROS–YOLOv5–FleetEMA model improves mAP @ 0.5 by 2.7%, reduces GFLOPs by 13.2%, and decreases the number of parameters by 15.1%, demonstrating that the proposed model can achieve near real-time object detection. Compared with the traditional model, ROS–YOLOv5–FleetEMA shows significant advantages in resource-constrained environments, including but not limited to high detection accuracy, small model size, low cost, and fast inference speed. These advantages give the model high reference and practical value in real applications. Although the ROS–YOLOv5–FleetEMA model performs well in the specific experimental environment considered here, its generalization to other datasets or practical application scenarios may be insufficient. In future work, we will study how to further improve the generalization ability of the model so that it can adapt to a wider range of application requirements.

Author Contributions

Conceptualization, H.W. and J.Z.; methodology, Z.S. and G.G.; software, Z.S. and G.G.; validation, H.W., C.L. and J.Z.; visualization, Z.X. and C.L.; writing—original draft, Z.S.; writing—review and editing, Z.S., G.G., Z.X. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the "Jilin Province Science and Technology Development Plan Project, grant number YDZJ202201ZYTS549", the "Changchun Science and Technology Development Plan Project, grant number 21ZGM30", and the "Science and Technology Research Project of Education Department of Jilin Province, grant number JJKH20220597KJ".

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. These data can be found here: https://github.com/rbgirshick/rcnn/issues/48 (accessed on 21 July 2024).

Acknowledgments

We would like to express our deepest gratitude to all those who have contributed to the completion of this research and the writing of this paper. Finally, special thanks to the Changchun University scholar climbing program for providing guidance on this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cao, Y. Analysis of the application of computer vision technology in automation. China New Commun. 2021, 23, 123–124. [Google Scholar]
  2. Gu, Y.; Zong, X. A review of research on object detection based on deep learning. Mod. Inf. Technol. 2022, 6, 76–81. [Google Scholar]
  3. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; IEEE: New York, NY, USA, 2001; pp. 786–790. [Google Scholar]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: New York, NY, USA, 2014; pp. 580–587. [Google Scholar]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [PubMed]
  7. Zhao, W.; Fu, H.; Luk, W.; Yu, T.; Wang, S.; Feng, B.; Ma, Y.; Yang, G. F-CNN: An FPGA-based framework for training Convolutional Neural Networks. In Proceedings of the 2016 IEEE 27th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), London, UK, 6–8 July 2016; IEEE: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
  8. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: New York, NY, USA, 2017; Volume 39, pp. 1137–1149. [Google Scholar] [CrossRef]
  10. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  11. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks; Curran Associates Inc.: Red Hook, NY, USA, 2016. [Google Scholar] [CrossRef]
  12. Qian, W. Research on Indoor Mobile Robot Target Detection and Location Grasping. Master’s Thesis, Nanjing Forestry University, Nanjing, China, 2023. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only Look Once: Unified, Real-time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on CVPR, Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 6517–6525. [Google Scholar]
  15. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  18. Ge, Y.; Qi, Y.; Meng, X. YOLOv5 Improved Lightweight Mask Face Detection. Comput. Syst. Appl. 2023, 32, 195–201. [Google Scholar] [CrossRef]
  19. Lyu, Z.; Xu, Y.; Xie, Z. Detection of safety equipment for coal mine electric power personnel based on lightweight YOLOv5. J. Heilongjiang Univ. Sci. Technol. 2023, 33, 737–742. [Google Scholar]
  20. Park, H.; Yoo, Y.; Seo, G.; Han, D.; Yun, S.; Kwak, N. C3: Concentrated-Comprehensive Convolution and its application to semantic segmentation. arXiv 2018, arXiv:1812.04920. [Google Scholar] [CrossRef]
  21. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar] [CrossRef]
  22. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  23. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023. [Google Scholar]
  24. Hajič, J., Jr.; Pecina, P. Detecting Noteheads in Handwritten Scores with ConvNets and Bounding Box Regression. arXiv 2017, arXiv:1708.01806. [Google Scholar] [CrossRef]
  25. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 1577–1586. [Google Scholar]
  26. Support Government. Design and Implementation of Human Computer Interaction Interface for Seven Degree of Freedom Robotic Arm Based on ROS and Qt. Master’s Thesis, China University of Petroleum (East China), Dongying, China, 2019. [CrossRef]
  27. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I. CBAM: Convolutional Block Attention Module; Springer: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
  30. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
Figure 1. YOLOv5-Lite network structure.
Figure 2. Basic units of ShuffleNet V2. (a) Deep stacking module Stage 1; (b) deep stacking module Stage 2.
Figure 3. Efficient multi-scale attention.
Figure 4. Traditional convolution and GhostNet convolution processes.
Figure 5. Ghost module.
Figure 6. Hardware structure of Ackerman differential car.
Figure 7. Workflow of object detection function.
Figure 8. Ablation experimental results.
Figure 9. ROS-based object detection platform.
Table 1. Experimental environment.
Environment Configuration | Name | Related Configuration
Hardware environment | CPU | Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz
Hardware environment | Running memory | 8 GB
Hardware environment | GPU | NVIDIA GeForce GTX 1050Ti
Software environment | Operating system | Windows 10
Software environment | Python | 3.8
Software environment | Deep learning framework | PyTorch
Software environment | CUDA | 11.3
Table 2. Hyperparameter settings.
Parameter | Parameter Description | Value
lr0 | Initial learning rate | 0.001
lrf | Cyclical learning rate | 0.2
weight_decay | Weight attenuation parameter used to prevent model over-fitting | 0.0005
warmup_epochs | Warmup learning rounds | 3.0
momentum | Warmup learning momentum | 0.8
warmup_bias_lr | Warmup learning rate | 0.1
IoU loss coefficient | Measures the overlap between the predicted bounding box and the real bounding box | 0.05
cls loss coefficient | Measures the prediction accuracy of the model for the target category | 0.5
cls BCELoss | Positive sample weight | 1.0
Table 3. Comparison of attention improvement effects.
Model | mAP @ 0.5 | mAP @ 0.5–0.95 | Precision | Recall
- | 0.765 | 0.515 | 81.2 | 66.8
SE | 0.761 | 0.523 | 78.4 | 68.1
CBAM | 0.763 | 0.505 | 83.6 | 60.5
ECA | 0.768 | 0.525 | 83.2 | 63.9
CA | 0.769 | 0.521 | 73.7 | 66.3
EMA | 0.776 | 0.534 | 85.6 | 69.1
Table 4. mAP improvement effect comparison.
Model | mAP @ 0.5 | mAP @ 0.5–0.95
YOLOv5-Lite | 0.765 | 0.515
ROS–YOLOv5–FleetEMA | 0.792 | 0.558
Table 5. Lightweight improvement effect comparison.
Model | GFLOPS | Param
YOLOv5s | 15.9 | 7,064,065
YOLOv5-Lite | 3.8 | 1,566,561
ROS–YOLOv5–FleetEMA | 3.3 | 1,332,471
Table 6. Ablation experimental results.
Method | EMA | C3Ghost | MPDIoU | mAP @ 0.5 | mAP @ 0.5–0.95 | GFLOPs | Param
YOLOv5-lite | − | − | − | 0.768 | 0.515 | 3.8 | 1,566,561
YOLOv5-lite + EMA | + | − | − | 0.776 | 0.534 | 3.8 | 1,566,575
YOLOv5-lite + C3Ghost | − | + | − | 0.769 | 0.519 | 3.3 | 1,328,617
YOLOv5-lite + MPDIoU | − | − | + | 0.772 | 0.522 | 3.8 | 1,566,561
YOLOv5-lite + EMA + C3Ghost | + | + | − | 0.783 | 0.539 | 3.3 | 1,328,617
YOLOv5-lite + EMA + MPDIoU | + | − | + | 0.778 | 0.536 | 3.8 | 1,566,575
YOLOv5-lite + C3Ghost + MPDIoU | − | + | + | 0.771 | 0.533 | 3.3 | 1,328,617
ROS–YOLOv5–FleetEMA | + | + | + | 0.792 | 0.558 | 3.3 | 1,332,471
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

