Article

Research on the Rapid Recognition Method of Electric Bicycles in Elevators Based on Machine Vision

1 College of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China
2 Key Laboratory of Grain Information Processing and Control (Henan University of Technology), Ministry of Education, Zhengzhou 450001, China
3 Henan Special Equipment Inspection Technology Research Institute, Zhengzhou 450000, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(18), 13550; https://doi.org/10.3390/su151813550
Submission received: 1 August 2023 / Revised: 2 September 2023 / Accepted: 5 September 2023 / Published: 11 September 2023

Abstract

People are increasingly embracing low-carbon lifestyles and green transportation, and given the severe urban traffic congestion, electric bicycle commuting has become the preferred mode of short-distance transportation for many. Because electric bicycles are battery-powered, they produce no greenhouse gas emissions in use, which is in line with the worldwide requirement for sustainable development. With the industry's rapid development and the fast growth in the number of electric bicycles worldwide, the public has become increasingly concerned about the safety issues they bring. In particular, the unauthorized admission of electric bicycles into elevators seriously compromises safe elevator operation and building safety. To meet the need for fast detection and identification of electric bicycles in elevators, we designed a modified YOLOv5-based identification approach in this study. We propose the use of the EIoU loss function to address the occlusion problem in electric bicycle recognition. By considering the intersection ratio and overlap loss of the target frames, we enhance localization accuracy and reduce the missed detection rate for occluded targets. Additionally, we introduce the CBAM attention mechanism in both the backbone and head of YOLOv5 to improve the expressive power of feature maps. This allows the model to prioritize important regions of the target object, leading to improved detection accuracy. Furthermore, we utilize the CARAFE operator during upsampling instead of the nearest operator in the original model. This enables our model to recover details and edge information more accurately, resulting in finer sampling results. The experimental results demonstrate that our improved model achieves an mAP of 86.35 percent, a recall of 81.8 percent, and a precision of 88.0 percent. Compared to the original model under the same conditions, our improved YOLOv5 model shows an average detection accuracy (mAP) increase of 3.49 percent, a recall increase of 5.6 percent, and a precision increase of 3.5 percent. Tests in application scenarios demonstrate that, after deploying the model on the Jetson TX2 NX hardware platform, stable and effective identification of electric bicycles can be accomplished.

1. Introduction

As global urbanization has accelerated, cities have witnessed a rise in high-rise buildings and traffic congestion, and there is a growing emphasis on sustainable development worldwide. In this context, electric bicycles have emerged as a significant option for reducing traffic congestion and promoting low-carbon, environmentally friendly travel. However, the problem of charging large numbers of electric bicycles has become increasingly prominent. Some individuals resort to prohibited practices, such as using elevators to transport electric bicycles into high-rise buildings for indoor charging. This violation poses multiple safety risks, jeopardizing not only the lives of passengers but also the safety and sustainability of urban infrastructure. To address the increasing number of fires caused by electric bicycles illegally entering high-rise buildings for indoor charging, it is crucial to develop a system that can accurately identify these vehicles in elevators. According to a report on building fires published by the Fire Service [1], elevators are the sole means by which electric bicycles enter tall buildings. Despite the management regulations issued by building management departments, the difficulty of manual supervision makes it hard to prevent electric bicycles from entering elevators. Therefore, it is necessary to study detection and identification systems and methods for the entry of electric bicycles into elevators, in order to mitigate the safety hazards caused by their illegal entry and address this management problem.
Previous studies attempted to identify electric bicycles by combining magnetic induction and pressure sensing. However, this approach was susceptible to interference from environmental factors, leading to a high false-positive rate, and assigning different pressure thresholds to different types of electric bicycles proved challenging, making it difficult to detect electric bicycles in elevators accurately. The key to solving this problem lies in target detection algorithms based on image processing. Traditional target detection algorithms, which rely heavily on data samples and lack robustness, are being replaced by emerging detection algorithms based on deep learning. Thanks to technological advancements and extensive research, computer vision has made significant progress in various domains [2]. Object detection is a crucial task in computer vision, and its efficiency and accuracy can be improved by analyzing image information. Currently, object detection algorithms fall into two main categories: classical methods and deep learning methods. Classical methods typically employ hand-designed feature extractors, such as Haar features, Histograms of Oriented Gradients (HOG), and the Scale-Invariant Feature Transform (SIFT), combined with machine learning classifiers. Shaoqing Ren et al. [3] proposed a method that combines Haar features with the Region Proposal Network (RPN) to generate candidate target regions, enhancing the accuracy and speed of object detection. However, the use of Haar features may result in the loss of certain low-level features, which limits its effectiveness for complex backgrounds and small targets. Object detection based on deep learning, on the other hand, uses convolutional neural networks or their variants to achieve end-to-end detection [4]. Deep learning techniques, such as the Faster Region-based Convolutional Neural Network (Faster R-CNN), the Single Shot MultiBox Detector (SSD), and the You Only Look Once (YOLO) algorithm, predict object locations and categories directly from images. These algorithms leverage large-scale datasets and train deep network models to automatically learn feature representations from raw data [5]. Faster R-CNN has been enhanced by Jiangxiang Ju et al. [6] through the introduction of a feature pyramid network (FPN), which effectively addresses the challenge of multi-scale recognition in object detection and improves accuracy; however, FPN also adds computational complexity, leading to slower processing. Wu Shan et al. [7] proposed a multi-scale fusion and hybrid attention mechanism to better preserve semantic information in the SSD algorithm. With the increasing popularity of deep learning, numerous highly effective object detection algorithms have been developed. Among them, the YOLO series is widely used across research domains because of its good accuracy and fast processing speed. It relies on deep convolutional networks to perform object detection [8,9]; by acquiring features and contextual information from images, it can precisely detect objects and regress their bounding boxes, leading to significant achievements in the field of object detection.
YOLOv1, the first version of the YOLO series, introduced a novel object detection strategy: it divides the image into grid cells, and each cell predicts the object's class and location [10]. YOLOv2, also known as YOLO9000, builds upon YOLOv1 by incorporating key techniques such as multi-scale prediction, anchor boxes, and feature pyramids [11]. YOLOv3 further enhances the network structure with multiple-scale predictions, feature pyramid networks, and skip connections [12]. YOLOv4, the fourth version, integrates advanced object detection techniques, including the CIoU loss function, the Mish activation function, a self-attention mechanism, and the SAM module, to improve accuracy and detection speed [13]. The subsequent YOLOv5 series introduces adaptive anchor box computation, enhancing adaptability to target objects of different sizes and shapes, and its detection results have proven valuable across various scenarios and datasets.
Ziyi Li et al. [14] introduced a YOLOv5s-D model, which improved the convergence rate and detection accuracy by replacing the detection head in YOLOv5s with a decoupled head. Jiajia Liu et al. [15] proposed a modified YOLOv5-based method for workpiece detection in dense scenes on industrial production lines, addressing the detection difficulties caused by high similarity and the disordered arrangement of workpieces. Yongbin Guo et al. [16] presented an AC-YOLOv5 model that incorporates an atrous (dilated) convolution pyramid module, which effectively mitigates false and missed detections of textile defects. Qing An et al. [17] utilized the K-Means++ method to modify the anchor boxes, aiming to improve the low accuracy of the YOLOv5 model in helmet detection. In a similar vein, Liu Jianqi et al. [18] addressed the issue of gradient disappearance in YOLOv5 by incorporating the Feature Pyramid Transformer (FPT) attention mechanism; however, this improvement came at the expense of detection speed. Yibin Lin et al. [19] proposed a deep, multi-scale fusion-based approach for detecting electric bicycles in elevator environments, which effectively improves detection accuracy and robustness by integrating multi-scale feature maps. Nevertheless, deploying this method on mobile terminals may require substantial computational resources.
To enhance the extraction of network features, Zhenhai Wang et al. [21] integrated the FacNet network model into the backbone network of YOLOv5, extracting multiple finite frequency-domain components and making effective use of the input channel information, which improved the accuracy of the electric bicycle detection system. However, that study did not investigate the potential impact of the narrow space inside the elevator and the resulting occlusion on the detection performance of the model. Caifeng Zhang et al. [20] enhanced the backbone network of YOLOv5s by incorporating GhostNet, reducing the model parameters and saving memory and computing resources, and employed QFocal Loss to improve the classification and localization of electric bicycles. However, the network missed detections when the targets were densely packed. Xianyu Yang et al. [22] used the lightweight network MobileNetV2 as a replacement for the backbone of YOLOv3 to make an electric bicycle detection algorithm deployable on an edge device inside an elevator. MobileNetV2 reduced the computational effort and the number of model parameters, but detection accuracy was compromised by the limitations of the YOLOv3 network. In a similar context, Peng Huang et al. [23] employed a deep learning-based SSD object detection network to identify electric bicycle violations inside elevators; however, detection accuracy was inadequate when occlusion occurred in the elevator.
To address the requirements of various industrial applications, Chuyi Li et al. [24] introduced YOLOv6, which incorporates post-training quantization and quantization-aware training into the YOLO model, improving both speed and accuracy. Chien-Yao Wang et al. [25] proposed YOLOv7, which introduces a coarse-to-fine label assignment method and trainable bag-of-freebies, tackling the challenges of replacing original modules with re-parameterized modules and assigning dynamic label assignment strategies to different output layers. These later versions, YOLOv6 and YOLOv7, bring significant enhancements to the YOLO series and show notable performance improvements on the COCO dataset, but they generalize less well to custom datasets. Since our experiments use a self-built dataset, and taking into account the processing performance and economic constraints of the mobile device in our target environment, we chose YOLOv5 as the base model.
Based on the improvements of the YOLOv5 model, this paper proposes a more suitable model for detecting electric bicycles inside elevators. Firstly, the Enhanced Intersection over Union (EIoU) loss function is used instead of the Complete Intersection over Union (CIoU) loss function in the original model to reduce the missed detection rate of occluded objects. Secondly, to enhance the expressive power of feature maps, the model focuses more on important regions by integrating the Convolutional Block Attention Module (CBAM) attention mechanism. The suitable locations for insertion are experimentally selected. Finally, the upsampling operator in the original model is improved, and the nearest operator is replaced with Content-Aware ReAssembly of Features (CARAFE) to obtain more accurate sampling results, recover spatial resolution, and obtain finer localization information.

2. YOLOv5 Model Algorithm

YOLOv5 is a deep learning-based object detection algorithm in the YOLO family. Compared to previous versions, it offers improved detection performance, speed, and model size. YOLOv5 uses a single-stage detection approach, completing the object detection task with a single forward pass [26]. It uses a deep convolutional neural network as the backbone and incorporates feature pyramid networks and feature fusion techniques to extract multi-level features of the target from feature maps of varying scales. Its highly optimized network structure ensures efficient inference without compromising detection accuracy. In contrast to YOLOv4, YOLOv5 employs a lighter network design that reduces the number of parameters and the computational load of the model [27]. YOLOv5 is composed of four main components: input, backbone, neck, and detect. The input side handles image input and Mosaic data augmentation. The backbone is primarily responsible for extracting image features; YOLOv5 adopts CSPDarknet53 [28] as its default backbone, a deep convolutional network with residual connections and cross-stage feature fusion. The neck fuses the multi-scale features extracted by the backbone; YOLOv5 adopts a feature pyramid structure that combines Feature Pyramid Networks (FPN) [29] and the Path Aggregation Network (PANet) [30], integrating features from different levels through upsampling and downsampling to improve detection accuracy. The detect component generates bounding boxes and category predictions. Together, these components form the fundamental framework of the YOLOv5 object detection network, which is trained with optimization algorithms such as backpropagation and gradient descent.
The overall network structure of YOLOv5 is depicted in Figure 1. The Conv-Batch Normalization-Leaky ReLU (CBL) module plays a crucial role in the YOLOv5 network by extracting features and performing nonlinear transformations. It consists of a convolutional layer, batch normalization, and an activation function, which together improve the nonlinear capability and stability of the network and accelerate its convergence. A ResUnit [31] is a residual unit composed of two CBL modules with a skip connection that adds the input to the output; such residual elements help mitigate vanishing and exploding gradients. CSP1_X denotes a module in the CSPDarknet network, with X indicating the number of residual units it contains. Focus is a feature extraction module that uses slicing operations to split the input feature map into four sub-feature maps, stacks them along the channel dimension to create a deeper feature map, and then extracts object features through a convolution operation. Additionally, the Spatial Pyramid Pooling Fast (SPPF) [32] module is an improved version of the Spatial Pyramid Pooling (SPP) [33] module that increases the computational speed of the model.
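For illustration, the following is a minimal PyTorch sketch of the CBL block and residual unit described above; the class names, kernel sizes, and the LeakyReLU slope are our own illustrative choices rather than the exact values used in the official YOLOv5 code.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv -> BatchNorm -> LeakyReLU, the basic feature-extraction block described above.
    Kernel size, stride, and the LeakyReLU slope are illustrative defaults."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    """Residual unit built from two CBL blocks with a skip connection, as used inside CSP1_X."""
    def __init__(self, channels):
        super().__init__()
        self.cbl1 = CBL(channels, channels, kernel_size=1)
        self.cbl2 = CBL(channels, channels, kernel_size=3)

    def forward(self, x):
        return x + self.cbl2(self.cbl1(x))

# Example: a 640 x 640 RGB image passed through one CBL block and one residual unit
x = torch.randn(1, 3, 640, 640)
y = ResUnit(32)(CBL(3, 32)(x))
print(y.shape)  # torch.Size([1, 32, 640, 640])
```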

3. Improved YOLOv5 Model Algorithm

3.1. Improved Scheme for the YOLOv5 Algorithm

This study focuses on using machine vision recognition to identify electric bicycles in elevators. The recognition of electric bicycles in occlusion poses a challenge for the YOLOv5 algorithm. To address this issue, the YOLOv5 model is modified in this paper. Figure 2 illustrates the modified structure of the YOLOv5 algorithm, with the specific algorithm improvements highlighted in red boxes.
The YOLOv5 algorithm incorporates the following improvements: (1) The CBAM attention mechanism is integrated into the backbone and neck (box a in Figure 2) to increase the model's attention to the target area and improve detection accuracy. (2) In the upsampling process, the nearest operator in the original model is replaced with the CARAFE operator (boxes b and c in Figure 2), strengthening the semantic information in the feature map. (3) The original model's CIoU loss is replaced with the EIoU loss to improve the accuracy of bounding box localization.

3.2. CBAM Module

The original YOLOv5 model could not effectively capture feature information at different levels when detecting an electric bicycle inside an elevator, mainly because the visual features of the bicycle vary widely across scales and poses. The Convolutional Block Attention Module (CBAM) addresses this limitation by modeling both the channel and spatial dimensions of the feature map, allowing the network to focus more accurately on the key areas that characterize electric bicycles. The CBAM module is lightweight and can be easily integrated with various Convolutional Neural Network (CNN) architectures, including the YOLOv5 network, without significantly increasing computational cost.
The CBAM module utilizes a feed-forward neural network mechanism. The module structure diagram is depicted in Figure 3. The overall process begins by accepting a feature map F, which was generated by the previous convolution, as the input feature map. Next, the channel attention module generates the feature graph F′. This feature graph is then used as the input feature for the spatial attention module, resulting in a new feature graph F″. The channel attention and spatial attention weighted results are multiplied together to obtain the final feature re-calibration result. This recalibration enhances attention to both the channel and spatial locations of the feature maps, thereby improving feature expression.
The CBAM module consists of two main components: the channel attention module and the spatial attention module. The channel attention module, illustrated in Figure 4, determines the importance of each channel by assigning attention weights along the channel dimension of the convolutional feature maps. It uses global average pooling and global maximum pooling followed by a shared multilayer perceptron to generate channel attention weights, which are applied to each channel of the feature map to strengthen the representation of crucial channels. The spatial attention module, depicted in Figure 5, focuses on the spatial dimension of the feature maps and learns the significance of each spatial location by applying a convolution over the channel-wise pooled maps to generate spatial attention weights. These weights are applied to each spatial location of the feature map to strengthen the representation of important locations.
In the CBAM module, the channel attention module is positioned at the beginning, while the spatial attention module is placed at the end. The channel attention module assigns weights to the feature channels in the preceding convolutional layers to emphasize the important information for the task. By incorporating the spatial attention module after the channel attention module, we can enhance the focus on the relationship between different spatial locations using the improved channel features. Additionally, the spatial attention mechanism can utilize additional relevant features. This approach enables the capture of informative features at both local and global levels, thereby enhancing their representation.
The input to the channel attention mechanism is a feature map F with spatial dimensions H × W and C channels. Global maximum pooling and global average pooling are applied to obtain two 1 × 1 × C feature maps. These are fed into a Multi-Layer Perceptron (MLP) [34] with two layers: the first layer has C/r neurons (where r is the reduction ratio) and uses the Rectified Linear Unit (ReLU) activation function, and the second layer has C neurons; the weights of both layers are shared between the two pooled inputs. The two MLP outputs are summed element-wise, and the Sigmoid activation function is applied to generate the weighting coefficients Mc. Mc is multiplied by the input feature map to obtain the output feature map F′. The calculation of the channel attention module is given in Equation (1).
$$M_{C}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_{1}(W_{0}(F_{\mathrm{avg}}^{C})) + W_{1}(W_{0}(F_{\mathrm{max}}^{C}))\big) \tag{1}$$
where F is the input feature map and σ is the Sigmoid activation function; MLP denotes the shared multilayer perceptron; AvgPool(F) and MaxPool(F) denote average pooling and maximum pooling of F, respectively; $F_{\mathrm{avg}}^{C}$ and $F_{\mathrm{max}}^{C}$ are the average-pooled and maximum-pooled features; and $W_{0}$ and $W_{1}$ are the weight matrices of the MLP.
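The channel attention computation in Equation (1) can be sketched in PyTorch as follows; this is a minimal illustrative implementation (the shared MLP is realized with 1 × 1 convolutions, and the reduction ratio r = 16 is an assumed default), not the exact module used in our network.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of CBAM (Equation (1)): a shared MLP applied to globally
    average-pooled and max-pooled features, followed by a Sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # Shared two-layer MLP (W0, W1), realized here with 1 x 1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        mc = self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))  # (B, C, 1, 1)
        return x * mc  # F' = Mc(F) * F
```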
The spatial attention module is illustrated in Figure 5. The feature map F′ produced by the channel attention module serves as its input. First, two H × W × 1 feature maps are obtained by applying maximum pooling and average pooling along the channel dimension. These two maps are concatenated and processed by a 7 × 7 convolution to obtain a spatial attention map of size H × W × 1, to which the Sigmoid activation function is applied to generate the spatial attention weights Ms. Ms is then multiplied by the input feature map of the spatial attention module to obtain the final output feature map F″. The calculation of the spatial attention module is given in Equation (2):
$$M_{s}(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times 7}([F_{\mathrm{avg}}^{s};\, F_{\mathrm{max}}^{s}])\big) \tag{2}$$
where F is the input feature of the spatial attention module, $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel, σ is the Sigmoid activation function, and AvgPool(F) and MaxPool(F) denote average pooling and maximum pooling of F along the channel dimension.
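Continuing the sketch above, the spatial attention of Equation (2) and the full CBAM composition (channel attention followed by spatial attention) can be written as follows; the 7 × 7 kernel matches the text, while the rest of the wiring is an illustrative assumption and reuses the ChannelAttention class from the previous sketch.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of CBAM (Equation (2)): a 7 x 7 convolution over the
    channel-wise average and maximum maps, followed by a Sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)    # H x W x 1 average map
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # H x W x 1 maximum map
        ms = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * ms  # F'' = Ms(F') * F'

class CBAM(nn.Module):
    """Channel attention first, then spatial attention, as described in this section.
    Reuses the ChannelAttention class defined in the previous sketch."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sa(self.ca(x))
```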
This paper uses the Gradient-weighted Class Activation Mapping (Grad-CAM) method to generate heat maps. To evaluate the effect of integrating the CBAM module into the algorithm, heat map visualizations are computed for the same image data; the results are presented in Figure 6. Figure 6a shows the heat map obtained by applying Grad-CAM to the original YOLOv5 algorithm; the algorithm also attends to areas that are not part of the detection target, such as the ground and the right side of the image. Figure 6b shows the heat map obtained by applying Grad-CAM to the YOLOv5 algorithm with CBAM integrated. It clearly indicates that the improved algorithm pays more attention to the target area where the electric bicycle is located. Integrating the CBAM module thus helps the improved YOLOv5 algorithm extract the visual features of the monitored target and enhances the accuracy of target detection.
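To make the visualization procedure concrete, the sketch below shows a bare-bones Grad-CAM computed in PyTorch: the activations of a chosen target layer are weighted by the spatially averaged gradients of a scalar score and then rectified and normalized into a heat map. The choice of target layer and score function are placeholders, not the exact configuration used to produce Figure 6.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, score_fn):
    """Bare-bones Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of a scalar score, then apply ReLU and normalize to [0, 1]."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.zero_grad()
    output = model(image)      # forward pass through the detector
    score = score_fn(output)   # scalar score of interest, e.g., the confidence of one detection
    score.backward()           # backward pass fills the gradient hook
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]       # both of shape (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear', align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heat map
```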

3.3. Improved Upsampling Operator

Feature upsampling plays a critical role in object detection. The main upsampling methods currently in use are nearest-neighbor interpolation [35] and bilinear interpolation [36], which consider only the adjacent pixel space and therefore fail to adequately represent the semantic information of the feature maps. The original YOLOv5 model uses nearest-neighbor interpolation for upsampling. Because the nearest-neighbor operator simply duplicates pixels, the boundary of a detected electric bicycle may become blurred. Accurate object localization in an electric bicycle detection system relies heavily on the details of object boundaries, and losing these details can result in false detections or incorrect localization. To address this issue, we replace the original upsampling operator in the YOLOv5 framework with the CARAFE operator. When an electric bicycle is brought into an elevator, its image may vary in size and proportion; the CARAFE operator can dynamically adjust the receptive field based on the content of the input image, allowing it to accommodate targets of different scales. CARAFE also includes a content-adaptive mechanism that adjusts its weights according to the features of local regions, enabling it to handle different regions more effectively, and its reassembly operation during upsampling allows more informative interactions between pixels. This leads to improved detection results for electric bicycles.
Moreover, introducing the CARAFE operator into YOLOv5 does not significantly increase the number of parameters or the computational effort. Table 1 compares several commonly used upsampling operators on the COCO dataset and shows that CARAFE improves the precision, recall, and mean accuracy of the model [37]. Nearest interpolation (nearest-neighbor interpolation) assigns each new pixel the value of the original pixel closest to the target position. Bicubic interpolation computes new pixel values from the 16 neighboring pixels around the target pixel, producing a relatively smooth result. ConvTranspose2d is an upsampling method that generates a larger output by convolving each pixel of the original image with a learned kernel. Among the four operators, CARAFE achieves the highest precision and preserves image features and information better than the others. In terms of recall, Nearest, Bicubic, and CARAFE perform similarly, but CARAFE enables better upsampling while maintaining image features. Nearest and Bicubic achieved the highest mean accuracy of 42.6 percent. Overall, the CARAFE operator performs well in terms of precision, recall, and mean accuracy, making it suitable for a wide range of tasks and applications.
The execution flow of the CARAFE operator is illustrated in Figure 7. It is composed of two main parts. The first is the upsampling kernel prediction module: with an upsampling rate σ and an input feature map of size H × W × C, this module predicts the upsampling kernels. The second is the feature reassembly module, which completes the upsampling to produce an output feature map of size σH × σW × C. In the kernel prediction module, the input feature map of size H × W × C first passes through a 1 × 1 convolution that reduces the number of channels to $C_{m}$ to minimize computation. A convolutional layer then predicts upsampling kernels of size $k_{up} \times k_{up}$, with $C_{m}$ input channels and $\sigma^{2}k_{up}^{2}$ output channels. The channel dimension is expanded into the spatial dimension to obtain upsampling kernels of shape $\sigma H \times \sigma W \times k_{up}^{2}$. Softmax is applied to normalize the predicted kernels so that the weights of each kernel sum to one, completing the kernel prediction. In the feature reassembly module, each location in the output feature map is mapped back to the input feature map, the $k_{up} \times k_{up}$ region centered at that location is selected, and its dot product with the predicted upsampling kernel at that point is computed. Different channels at the same output location share the same upsampling kernel. This process yields an output feature map of size σH × σW × C.
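The two-stage flow just described can be sketched in PyTorch as follows; this is a simplified reference implementation (the compressed channel count, k_up, and the encoder kernel size are illustrative defaults), not the optimized CUDA version of CARAFE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware upsampling: predict a k_up x k_up reassembly kernel for every
    output location, then recombine the corresponding input neighborhood with it."""
    def __init__(self, channels, scale=2, c_mid=64, k_up=5, k_enc=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Kernel prediction module: channel compressor + content encoder
        self.compressor = nn.Conv2d(channels, c_mid, 1)
        self.encoder = nn.Conv2d(c_mid, (scale * k_up) ** 2, k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1. Predict and normalize the upsampling kernels (softmax makes each kernel sum to one)
        kernels = self.encoder(self.compressor(x))  # (b, s^2 * k^2, h, w)
        kernels = F.pixel_shuffle(kernels, s)       # (b, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)
        # 2. Feature reassembly: gather each k x k neighborhood and apply the predicted kernels
        feats = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        feats = F.interpolate(feats, scale_factor=s, mode='nearest')
        feats = feats.view(b, c, k * k, s * h, s * w)
        return (feats * kernels.unsqueeze(1)).sum(dim=2)  # (b, c, s*h, s*w)

# Example: upsample a 20 x 20 feature map with 128 channels to 40 x 40
y = CARAFE(channels=128)(torch.randn(1, 128, 20, 20))
print(y.shape)  # torch.Size([1, 128, 40, 40])
```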

3.4. The EIoU Loss Function

In deep learning, the loss function measures the disparity between the model's predicted output and the true label, and minimizing it optimizes the model's performance. YOLOv5 employs the Complete Intersection over Union (CIoU) loss [38] as its loss function. The CIoU loss is designed for bounding boxes [39] and measures the discrepancy in Intersection over Union (IoU) [40] between the predicted bounding box and the ground-truth bounding box; by combining the IoU loss with the bounding box coordinates, it improves the accuracy of bounding box regression. However, the CIoU loss uses the aspect ratio of the predicted and target boxes to predict the target location, and this aspect-ratio penalty term is imperfect, especially under occlusion, which compromises localization accuracy. To address this issue, we adopt the EIoU loss function in this paper. The EIoU loss optimizes the model loss and improves the regression accuracy of predicted boxes by considering the overlapping regions. The CIoU loss used in the original model is computed as described in Equation (3):
$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{3}$$
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{4}$$
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{5}$$
where IoU is the intersection-over-union ratio, v measures the difference in aspect ratio between the true and predicted boxes, and α is a weight parameter. IoU is computed by dividing the intersection area of the predicted and true boxes by the area of their union.
In Figure 8, the prediction box is shown in blue and the true box in green. Equation (3) includes the Euclidean distance between the center points of the true box and the predicted box; this is the distance between markers 1 and 2 in Figure 8. c denotes the diagonal length of the smallest enclosing box that covers both the predicted box and the true box, that is, the distance between markers 3 and 4.
$$\frac{\partial v}{\partial w_{p}} = -\frac{h_{p}}{w_{p}}\,\frac{\partial v}{\partial h_{p}} \tag{6}$$
During the calculation, the CIoU loss reflects the overall difference in aspect ratio but not the real differences between the widths and heights and their confidences. Taking the derivative, as shown in Equation (6), reveals that the gradients of v with respect to the width and height of the prediction box have opposite signs: if one side of the prediction box grows, the other side must shrink. Hence, the CIoU loss suffers from inaccurate localization when the predicted box is close to the true box. Equation (7) gives the EIoU loss function:
$$L_{\mathrm{EIoU}} = L_{\mathrm{IoU}} + L_{\mathrm{dis}} + L_{\mathrm{asp}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}} \tag{7}$$
where $C_{w}$ and $C_{h}$ are the width and height of the smallest enclosing box that contains the predicted and true boxes. $L_{\mathrm{IoU}}$ is the overlap loss between the predicted box and the true box, $L_{\mathrm{dis}}$ is the center-distance loss, and $L_{\mathrm{asp}}$ is the width-and-height loss of the predicted box. By splitting the aspect-ratio penalty into separate width and height differences, each normalized by the size of the smallest enclosing box, EIoU improves both convergence speed and regression accuracy.
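Equation (7) can be implemented directly for axis-aligned boxes; the following is a minimal PyTorch sketch of the EIoU loss for boxes given in (x1, y1, x2, y2) format, written for illustration rather than taken from the improved model's code.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss of Equation (7) for boxes in (x1, y1, x2, y2) format:
    overlap loss + center-distance loss + width/height loss."""
    # intersection area
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)

    # union area and IoU
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # smallest enclosing box (Cw, Ch) and its squared diagonal c^2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared distance between box centers
    rho2 = ((pred[..., 0] + pred[..., 2]) - (target[..., 0] + target[..., 2])) ** 2 / 4 + \
           ((pred[..., 1] + pred[..., 3]) - (target[..., 1] + target[..., 3])) ** 2 / 4

    # width and height losses, each normalized by the enclosing box size
    loss_wh = (w1 - w2) ** 2 / (cw ** 2 + eps) + (h1 - h2) ** 2 / (ch ** 2 + eps)

    return 1 - iou + rho2 / c2 + loss_wh

# Hypothetical boxes: a prediction slightly offset from the ground truth
pred = torch.tensor([[50.0, 50.0, 150.0, 200.0]])
gt = torch.tensor([[60.0, 40.0, 160.0, 190.0]])
print(eiou_loss(pred, gt))  # small positive loss for a well-aligned pair
```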

4. Experiment and Analysis

4.1. Data Set Introduction

To address the lack of image data for electric bicycles inside elevators, we built our own dataset for this study. The dataset comprises 500 images of individual electric bicycles, 200 images of people riding electric bicycles, 500 images of electric bicycles being pushed into an elevator, and 500 background images without any detection targets. The photographs of electric bicycles capture various real-life scenarios, while the images of electric bicycles in elevators cover different perspectives, such as the front view, the rear view, a person pushing the bicycle into the elevator, a person riding it into the elevator, and other complex or occluded situations. Of these samples, 1400 are assigned to the training set and 300 to the test set for model training. The distribution of the dataset is illustrated in Figure 9, and Figure 10 shows sample training images of electric bicycles.

4.2. Training Environment

The model described in this paper was trained on Ubuntu 18.04. The GPU is an NVIDIA RTX A2000 with 12 GB of memory, the CPU is an Intel(R) Xeon(R) E5-2680, and the motherboard is an X10DRG-O+-CPU. The hard drive is a Samsung SSD 870 with 128 GB of capacity. The model is implemented in Python using the PyTorch framework.

4.3. Evaluation Index

In order to evaluate the performance of the trained model, several evaluation metrics are selected in this paper: mean Average Precision (mAP), Precision (P), Recall (R), the number of parameters (Params), Giga Floating-Point Operations (GFLOPs), and Frames Per Second (FPS). P, R, AP, and mAP are calculated as in Equations (8) and (9):
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{8}$$
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i} \tag{9}$$
where TP denotes the number of positive samples correctly detected (true positives), FP denotes the number of negative samples incorrectly predicted as positive (false positives), and FN denotes the number of positive samples that were not detected (false negatives). The average precision (AP) evaluates the detection results for a single category and measures the trade-off between precision and recall for that category; it is computed by calculating precision and recall at various confidence thresholds and averaging. AP values range from 0 to 1, with higher values indicating better performance in that category. mAP comprehensively evaluates the detection results over multiple classes by averaging the AP over all classes, yielding the model's average precision over the entire dataset. mAP is an important metric for measuring the overall performance of a model and is used to compare different models or different configurations of the same model.
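As a small numerical illustration of Equations (8) and (9), the following sketch computes precision, recall, and mAP from hypothetical detection counts and per-class AP values; all numbers are made up for the example.

```python
def precision_recall(tp, fp, fn):
    """Equation (8): precision and recall from true positives, false positives, and false negatives."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """Equation (9): mAP is the mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts: 88 correct detections, 12 false alarms, 18 missed targets
p, r = precision_recall(tp=88, fp=12, fn=18)
print(round(p, 2), round(r, 2))              # 0.88 0.83
print(mean_average_precision([0.91, 0.82]))  # 0.865 for a two-class detector
```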

4.4. Result Analysis

To evaluate the impact of each module on the model's performance, we examine the effect of adding each module on recall, precision, and mean precision under the experimental conditions described in Section 4.2. The results of the ablation experiments are shown in Table 2. The first model, YOLOv5, serves as the baseline. YOLOv5-DIoU is the result of replacing the loss function in the original YOLOv5 model with DIoU, and YOLOv5-EIoU replaces the CIoU loss with the EIoU loss. With the EIoU loss function, the model's precision, recall, and mean accuracy increase by 1.3%, 2.3%, and 0.5%, respectively, compared to the original model, indicating that the EIoU loss helps reduce missed and false detections in the baseline experiment to some extent.
YOLOv5-CBAM incorporates the CBAM attention mechanism into the base model, giving a notable 3.3% improvement in recall over the original model and a 2.3% increase in mean accuracy. YOLOv5-CARAFE introduces the CARAFE operator separately, improving precision by 2.3% and recall by 1.5% over the original model. This demonstrates that the CARAFE operator enhances the network's ability to recognize and localize the target, refines the target's edge information, and effectively mitigates feature loss during training. Lastly, the CBAM-EIoU model incorporates the CBAM module into the YOLOv5 algorithm and replaces the loss function with EIoU.
Compared to the original model, each single-module addition improves both recall and mean accuracy. The EIoU-CARAFE model adds the upsampling operator and the EIoU loss to YOLOv5 and improves precision over the single-module cases, although recall decreases slightly. The CBAM-CARAFE model, built on YOLOv5 with the CBAM attention mechanism and the CARAFE operator, improves all three metrics (precision, recall, and mean accuracy) over both the original model and the single-module additions. The modified algorithm proposed in this paper introduces all three improvement modules, yielding a 3.5% improvement in precision, a 5.6% improvement in recall, and a 3.5% improvement in mean accuracy compared to the original model. Since the final model must be deployed on mobile terminals, it is essential to balance computation and parameter requirements while maintaining the detection accuracy of the model.
In Figure 11, mAP_0.5 denotes the mean accuracy computed at an IoU threshold of 0.5, i.e., the model's detection performance when the overlap between the target and the detection box is 50 percent. Figure 12 shows mAP_0.5:0.95, the mean accuracy averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. Both mAP_0.5 and mAP_0.5:0.95 are used to evaluate model performance. The results in Figure 11 and Figure 12 show that the improved model outperforms the other models on both metrics. The original model performs well in detecting objects that occupy large areas of the image, but it may overlook objects that occupy small areas near the corners. After incorporating the CBAM attention mechanism, the modified model focuses more on the regions where the target is present and performs well in detecting small-target electric bicycles that occupy a relatively small portion of the image.
The test results for electric bicycles in the elevator are shown in Figure 13. Figure 13a shows electric bicycles waiting outside the elevator, with five bicycles stacked together, and Figure 13b shows the recognition result for Figure 13a, demonstrating the model's ability to locate each bicycle accurately and provide a confidence level. Figure 13c presents a person riding an electric bicycle into the elevator; in Figure 13d the model successfully identifies both the person and the electric bicycle and marks them with rectangular boxes. Figure 13e shows the moment when an electric bicycle has just entered the elevator, and Figure 13f shows that the model still detects it accurately.
Figure 14 shows the training loss values for each iteration during training: Figure 14a for the original algorithm and Figure 14b for the improved algorithm. The training loss is composed of the box loss, objectness loss, and classification loss, denoted val/box_loss, val/obj_loss, and val/cls_loss, respectively. val/box_loss evaluates how accurately the model predicts the position of the bounding box; it measures the difference between the predicted and true bounding boxes, and a lower value indicates a more accurate prediction of the target location. val/obj_loss measures the model's ability to predict whether a target is present; it reflects the accuracy of the model's confidence in determining the presence or absence of the target, and a lower value indicates stronger detection capability. val/cls_loss measures how accurately the model predicts the target category, quantifying the difference between the prediction and the true category; a lower value indicates better classification ability. As the number of iterations increases, the loss values gradually decrease; in the initial training phase the model learns efficiently and the training loss curves converge quickly. Figure 14 shows that the improved model converges faster than the original model.
Comparing the results in Figure 14a,b, it is evident that the modified EIoU loss function reduces the bounding box loss and class loss by 0.005% and 0.0025%, respectively, compared to the original model. However, there is a slight increase of 0.001% in the target confidence loss. The improved accuracy in bounding box localization provides more precise information about the target’s position. This is particularly beneficial when the target object is partially occluded, as it helps the model estimate the occluded part more accurately, leading to a decrease in false detections or missing detections. The improved model demonstrates a significant reduction in both bounding box loss and class loss, indicating its enhanced ability to localize the target and predict its class with greater accuracy. Although there is a slight increase in target confidence loss, it helps prevent some false predictions.
To assess the accuracy, detection rate, model size, and computational cost of the proposed model, we conducted a comparative experimental study against commonly used object detection models: YOLO-M, YOLOv5-m, EfficientNetB3, and VGG16. The experimental results, presented in Table 3, include mean average precision (mAP), frames per second (FPS), the number of parameters, and the amount of computation. YOLO-M replaces the backbone with the lightweight MobileNetV2 architecture, which reduces the number of parameters and the computational effort but slightly degrades detection accuracy. YOLOv5-m is another variant of YOLOv5 with different depth and width scales and more parameters than the original model. We also analyzed EfficientNetB3, which belongs to the EfficientNet family; the modified model has more parameters and a higher computational cost than EfficientNetB3, but the difference in detection accuracy is minor. The modified model improves mean accuracy by 2% compared to VGG16, while VGG16, with its deep structure and large number of parameters, is computationally expensive and has a lower detection rate. Comparing these results, we conclude that the proposed model better fulfills the real-world requirements for both detection rate and accuracy.

4.5. Test Environment Description

The NVIDIA Jetson TX2 NX development board, shown in Figure 15, was used for the field test. The board is equipped with a dual-core NVIDIA Denver 2 64-bit CPU together with an Arm Cortex-A57 MPCore complex, and its GPU is based on the NVIDIA Pascal architecture. The setup includes a 128 GB NVM Express (NVMe) P2000 SSD and a 1080P RMONCAM camera. The Jetson TX2 NX is an edge computing device that offers high computing power at low power consumption and has strong real-time inference capabilities, making it suitable for detecting electric bicycles inside elevators. The test environment used Ubuntu 18.04 as the operating system and Python as the programming language, with Python 3.6, OpenCV, CUDA 10.2, cuDNN 8.0, torch 1.9, and torchvision 0.8.1. Field tests on this hardware platform showed that the model, with 185 layers, 7,331,242 parameters, 16.3 GFLOPs, a frame rate of 30 FPS, and an average processing time of 45 ms per image, can effectively meet the need for high-speed detection of electric bicycles in elevators.
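For reference, below is a hedged sketch of how the trained weights might be run against the elevator camera stream on this platform using OpenCV and the standard Ultralytics YOLOv5 hub interface; the weight path, camera index, class name, and confidence threshold are placeholders rather than the authors' exact deployment code.

```python
import cv2
import torch

# Load custom-trained YOLOv5 weights through the Ultralytics hub entry point
# ('best.pt' is a placeholder path for the improved model's weights).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.conf = 0.5  # confidence threshold (illustrative value)

cap = cv2.VideoCapture(0)  # elevator camera; the device index is assumed
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # BGR -> RGB before inference
    detections = model(rgb).pandas().xyxy[0]      # boxes, confidences, and class names
    if (detections['name'] == 'electric_bicycle').any():    # class name as labeled in the dataset
        print('Electric bicycle detected in the elevator')  # e.g., trigger an alarm or hold the door
cap.release()
```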

5. Conclusions

This paper proposes an improved algorithm based on YOLOv5 to address the challenges of occlusion and slow recognition speed when an electric bicycle is pushed into an elevator. The original YOLOv5 algorithm lacks localization accuracy because its aspect-ratio penalty forces the predicted box's width and height to change in opposite directions: when one side grows, the other must shrink. To address this, we use the EIoU loss function instead of the original CIoU loss to improve the localization accuracy of the model and reduce the missed detection rate for occluded targets. The CBAM attention mechanism is integrated into the backbone and head of YOLOv5, improving the model's focus on key regions and allowing it to extract richer feature information. Additionally, CARAFE is used as the upsampling operator, helping the model retain more details and edge information and thereby improving detection accuracy. Experiments show that the improved model achieves a mean detection accuracy 3.5 percent higher than the original model and is better able to detect occluded and incomplete targets. In future work, we will further optimize the model size and improve the detection speed on mobile terminals while maintaining detection accuracy. This study examines the safety risk of electric bicycles entering elevators and, through the improvements to the YOLOv5 algorithm and its field deployment on the Jetson TX2 NX, effectively enhances the safety of the elevator environment.

Author Contributions

Conceptualization, Z.Z. (Henan University of Technology; Key Laboratory of Grain Information Processing and Control (Henan University of Technology)), and S.L. (Henan University of Technology); methodology, Z.Z. (Henan University of Technology; Key Laboratory of Grain Information Processing and Control (Henan University of Technology)); formal analysis, C.W. (Henan University of Technology); investigation, X.W. (Henan Special equipment inspection Technology Research Institute); writing—original draft preparation, Z.Z. (Henan University of Technology; Key Laboratory of Grain Information Processing and Control (Henan University of Technology)), and S.L. (Henan University of Technology). All authors have read and agreed to the published version of the manuscript.

Funding

This paper is supported by the Natural Science Program of the Henan Provincial Department of Education (22A440009); High-level Talents Research Start-up Fund Project of Henan University of Technology (2020BS011); Open Project of Key Laboratory of Grain Information Processing and Control (KFJJ-2021-111); Natural Science Project of Zhengzhou Science and Technology Bureau (22ZZRDZX07); Open Project of Henan Engineering Laboratory for Optoelectronic Sensing and Intelligent Measurement and Control (HELPSIMC-2020-005); Henan Provincial Science and Technology Research and Development Plan Joint Fund (222103810084).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and materials are available from the authors upon request.

Acknowledgments

The authors would like to thank everyone who helped with this study for their insightful remarks.

Conflicts of Interest

The authors declare they have no conflicts of interest.

References

  1. Yar, H.; Khan, Z.A.; Ullah, F.U.M.; Ullah, W.; Baik, S.W. A modified YOLOv5 architecture for efficient fire detection in smart cities. Expert Syst. Appl. 2023, 231, 120465. [Google Scholar] [CrossRef]
  2. Hossein, H.M.; Hadis, M. Fine-tuned YOLOv5 for real-time vehicle detection in UAV imagery: Architectural improvements and performance boost. Expert Syst. Appl. 2023, 231, 120845. [Google Scholar]
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  4. Jiang, J.; Han, Y.; Zhao, H.; Suo, J.; Cao, Q. Recognition and sorting of coal and gangue based on image process and multilayer perceptron. Int. J. Coal Prep. Util. 2023, 43, 54–72. [Google Scholar] [CrossRef]
  5. Du, B.; Wan, F.; Lei, G.; Xu, L.; Xu, C.; Xiong, Y. YOLO-MBBi: PCB Surface Defect Detection Method Based on Enhanced YOLOv5. Electronics 2023, 12, 2821. [Google Scholar] [CrossRef]
  6. Ju, J.X.; Liang, D.X. Railway Catenary Insulator Recognition Based on Improved Faster R-CNN. Autom. Control. Comput. Sci. 2023, 56, 553–563. [Google Scholar]
  7. Wu, S.; Zhou, F. Small Object Detection Based on Improved SSD Algorithm. Comput. Eng. 2023, 49, 179–188. [Google Scholar]
  8. Liu, H.; Duan, X.; Lou, H.; Gu, J.; Chen, H.; Bi, L. Improved GBS-YOLOv5 algorithm based on YOLOv5 applied to UAV intelligent traffic. Sci. Rep. 2023, 13, 9577. [Google Scholar] [CrossRef]
  9. Yang, Y.; Wang, X. Recognition of bird nests on transmission lines based on YOLOv5 and DETR using small samples. Energy Rep. 2023, 9, 6219–6226. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  12. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  13. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  14. Li, Z.; Zhang, W.; Yang, X. An Enhanced Deep Learning Model for Obstacle and Traffic Light Detection Based on YOLOv5. Electronics 2023, 12, 2228. [Google Scholar] [CrossRef]
  15. Liu, J.; Zhang, S.; Ma, Z.; Zeng, Y.; Liu, X. A Workpiece-Dense Scene Object Detection Method Based on Improved YOLOv5. Electronics 2023, 12, 2966. [Google Scholar] [CrossRef]
  16. Guo, Y.; Kang, X.; Li, J.; Yang, Y. Automatic Fabric Defect Detection Method Using AC-YOLOv5. Electronics 2023, 12, 2950. [Google Scholar] [CrossRef]
  17. An, Q.; Xu, Y.; Yu, J.; Tang, M.; Liu, T.; Xu, F. Research on Safety Helmet Detection Algorithm Based on Improved YOLOv5s. Sensors 2023, 23, 5824. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, J.; Yan, H.; Wang, X.; Li, M. Improved YOLOv5 Object Detection Network with Pyramid and Skip Connections. Control Decis. 2023, 38, 1730–1736. [Google Scholar]
Figure 1. Network structure diagram of YOLOv5.
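Figure 1 shows the stock YOLOv5 architecture used as the baseline. For orientation, the unmodified detector can be pulled and run through the official torch.hub entry point in a few lines; the image path below is a placeholder, not a file from this study.

```python
import torch

# Load the stock YOLOv5s detector from the official Ultralytics hub entry point.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("test_image.jpg")   # placeholder path; file paths, URLs, or numpy arrays are accepted
results.print()                     # per-class detections with confidences and boxes
```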
Figure 2. Network structure diagram of the improved YOLOv5.
Figure 3. Structure diagram of the CBAM module.
Figure 4. Structure diagram of the channel attention module.
Figure 5. Structure diagram of the spatial attention module.
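Figures 3-5 summarize the CBAM block and its two sub-modules. For readers who want to reproduce the attention stages, a minimal PyTorch sketch follows; the reduction ratio and the 7 x 7 spatial kernel follow the original CBAM paper, and the class and variable names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global avg- and max-pooled descriptors -> shared MLP -> sigmoid weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise avg/max maps -> 7x7 conv -> sigmoid weight map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Apply channel attention first, then spatial attention, as in the original CBAM design."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

# Example: refine a 128-channel feature map from one backbone stage.
feat = torch.randn(1, 128, 40, 40)
print(CBAM(128)(feat).shape)  # torch.Size([1, 128, 40, 40]); attention preserves the feature shape
```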
Figure 6. Comparison of heat maps before and after the improvement: (a) heat map of the original algorithm; (b) heat map of the algorithm with the CBAM module.
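Figure 6 compares activation heat maps before and after adding CBAM. The paper does not state which visualization tool produced these maps; the sketch below shows one common way to obtain a comparable map by hooking an intermediate layer and averaging its activations. The choice of layer, the normalization, and the `yolo.model[6]` handle in the usage comment are all assumptions.

```python
import torch
import torch.nn.functional as F

def activation_heatmap(model, layer, image, size):
    """Return a [0, 1] heat map from the mean activation of `layer` for a 1xCxHxW input."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o.detach()))
    with torch.no_grad():
        model(image)
    handle.remove()
    cam = feats["out"].mean(dim=1, keepdim=True)                        # average over channels -> 1x1xhxw
    cam = F.interpolate(cam, size=size, mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)            # min-max normalize to [0, 1]
    return cam[0, 0]                                                     # HxW tensor, ready to overlay

# Usage (names assumed): heat = activation_heatmap(yolo, yolo.model[6], img, img.shape[-2:])
```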
Figure 7. Structure diagram of the CARAFE operator.
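Figure 7 shows the two stages of the CARAFE operator: a kernel prediction module and a content-aware reassembly step. A minimal, unoptimized PyTorch re-implementation is sketched below; the hyper-parameters (compressed channels, k_up, k_enc) follow the original CARAFE paper rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-Aware ReAssembly of FEatures: predict a k_up x k_up kernel per output pixel, then reassemble."""
    def __init__(self, channels, scale=2, c_mid=64, k_up=5, k_enc=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Kernel prediction module: channel compressor followed by a content encoder.
        self.compress = nn.Conv2d(channels, c_mid, 1)
        self.encode = nn.Conv2d(c_mid, (scale * k_up) ** 2, k_enc, padding=k_enc // 2)

    def forward(self, x):
        n, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) Predict reassembly kernels on the upsampled grid and normalize them with softmax.
        kernels = self.encode(self.compress(x))               # n, s^2 * k^2, h, w
        kernels = F.pixel_shuffle(kernels, s)                  # n, k^2, s*h, s*w
        kernels = F.softmax(kernels, dim=1)
        # 2) Gather each source pixel's k x k neighbourhood and map it to the target grid (nearest mapping).
        patches = F.unfold(x, k, padding=k // 2)               # n, c * k^2, h * w
        patches = patches.view(n, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(n, c, k * k, s * h, s * w)
        # 3) Weighted sum of each neighbourhood with its predicted kernel.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)     # n, c, s*h, s*w

x = torch.randn(1, 256, 20, 20)
print(CARAFE(256)(x).shape)  # torch.Size([1, 256, 40, 40])
```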
Figure 8. Schematic diagram of the prediction box and the ground-truth box.
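Figure 8 illustrates the quantities the EIoU loss is built from: the overlap between the predicted and ground-truth boxes, their center distance, and their width and height differences, each normalized by the smallest enclosing box. A minimal sketch for boxes in (x1, y1, x2, y2) format:

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for box tensors of shape (N, 4) in (x1, y1, x2, y2) format.
    L = 1 - IoU + d_center^2 / c^2 + d_w^2 / cw^2 + d_h^2 / ch^2
    """
    # Intersection and union -> IoU
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box: width, height, and squared diagonal
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Center, width, and height distance penalties
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])

    return 1 - iou + (dx ** 2 + dy ** 2) / c2 + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)
```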
Figure 9. Distribution of the data set: (a) the number of samples; (b) the relative locations of the bounding-box center points.
Figure 10. Sample images from the electric bicycle training data set: (a) electric bicycles parked outdoors; (b) electric bicycles parked indoors; (c) people riding electric bicycles into elevators; (d) multiple electric bicycles densely parked; (e) an electric bicycle under partial occlusion; (f) an elevator without electric bicycles.
Figure 11. Comparison of mAP_0.5 test results.
Figure 12. Comparison of test results for mAP with the IoU threshold varied from 0.5 to 0.95 in steps of 0.05.
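As a reminder of how the metric in Figure 12 is formed, mAP_0.5:0.95 is simply the mean of the average precision evaluated at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The snippet below illustrates the averaging; the per-threshold AP values are placeholders, not results from this paper.

```python
import numpy as np

# COCO-style mAP_0.5:0.95: average the AP obtained at IoU = 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 0.951, 0.05)
ap_at_threshold = np.array([0.90, 0.88, 0.85, 0.82, 0.78, 0.72, 0.64, 0.52, 0.38, 0.20])  # placeholders

print(len(iou_thresholds))      # 10 thresholds
print(ap_at_threshold.mean())   # the reported mAP_0.5:0.95 is this average
```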
Figure 13. Detection results for electric bicycles entering the elevator: (a) image awaiting detection of an electric bicycle outside the elevator; (b) detection result for the electric bicycle outside the elevator; (c) image awaiting detection with the electric bicycle fully inside the elevator; (d) detection result with the electric bicycle fully inside the elevator; (e) image awaiting detection with only the front wheel of the electric bicycle inside the elevator; (f) detection result with only the front wheel of the electric bicycle inside the elevator.
Figure 14. Loss curves during training: (a) training loss of the original algorithm; (b) training loss of the improved algorithm.
Figure 15. Hardware platform and elevator test scenario: (a) the hardware platform; (b) a test scenario in an elevator.
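Figure 15 shows the Jetson TX2 NX deployment. The paper does not detail the deployment pipeline; a common route for this class of device is to export the trained PyTorch weights to ONNX and then build a TensorRT engine on the board. The sketch below covers the export step only; the checkpoint layout, file names, and input size are assumptions rather than the authors' exact procedure.

```python
import torch

# Load the trained detector. YOLOv5-style checkpoints typically store the module under a "model" key,
# but this path and loading code are assumptions, not the authors' pipeline.
model = torch.load("improved_yolov5_ebike.pt", map_location="cpu")["model"].float().eval()

dummy = torch.zeros(1, 3, 640, 640)   # one 640x640 RGB input (assumed training resolution)
torch.onnx.export(
    model, dummy, "improved_yolov5_ebike.onnx",
    opset_version=12,
    input_names=["images"], output_names=["predictions"],
)

# On the Jetson TX2 NX the ONNX file can then be converted with TensorRT's trtexec, e.g.:
#   trtexec --onnx=improved_yolov5_ebike.onnx --saveEngine=ebike.engine --fp16
```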
Table 1. Test comparison table of various upsampling algorithms.

Upsampling Operator | Precision (%) | Recall (%) | Mean Average Precision (%)
Nearest | 48.2 | 42.9 | 42.1
Bicubic | 46.5 | 42.8 | 42.6
ConvTranspose2d | 44.3 | 39.2 | 37.6
CARAFE | 49.0 | 42.9 | 42.6
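The first three operators in Table 1 are standard PyTorch layers, while CARAFE corresponds to the sketch given after Figure 7. The snippet below is a quick, self-contained check of their interfaces and output shapes; the channel count and feature-map size are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 20, 20)

nearest = nn.Upsample(scale_factor=2, mode="nearest")
bicubic = nn.Upsample(scale_factor=2, mode="bicubic", align_corners=False)
deconv  = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)   # learned 2x upsampling

for name, op in [("nearest", nearest), ("bicubic", bicubic), ("ConvTranspose2d", deconv)]:
    print(name, op(x).shape)   # each yields (1, 256, 40, 40); CARAFE (sketched earlier) matches this shape
```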
Table 2. The results of the ablation experiment.

Model | DIOU | CBAM | EIoU | CARAFE | Precision (%) | Recall (%) | mAP_0.5 (%)
YOLOv5 | | | | | 84.5 | 76.2 | 82.86
YOLOv5-DIOU | ✓ | | | | 84.9 | 79.9 | 85.12
YOLOv5-CBAM | | ✓ | | | 85.3 | 79.5 | 85.15
YOLOv5-EIoU | | | ✓ | | 86.6 | 77.7 | 84.01
YOLOv5-CARAFE | | | | ✓ | 86.8 | 78.5 | 84.32
CBAM-EIoU | | ✓ | ✓ | | 86.6 | 79.7 | 85.55
EIoU-CARAFE | | | ✓ | ✓ | 87.3 | 78.4 | 85.37
CBAM-CARAFE | | ✓ | | ✓ | 86.8 | 79.5 | 85.83
The model proposed in this paper | | ✓ | ✓ | ✓ | 88.0 | 81.8 | 86.35
Table 3. Experimental results of algorithm performance comparison.

Model | mAP/% | FPS | Parameter/M | Computation/G
YOLO-M | 85.3 | 45 | 1.6 | 3.9
YOLOv5-m | 85.2 | 45 | 1.7 | 4.1
EfficientNetB3 | 86.1 | 40 | 5.1 | 12.2
VGG16 | 88.1 | 24 | 5.7 | 15.5
The model proposed in this paper | 86.3 | 45 | 2.3 | 5.3
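Table 3 reports parameter counts, computation, and frame rate. Parameter counts and a rough FPS figure can be reproduced with a few lines of PyTorch; the model object and input size below are placeholders, and FLOP counting would additionally require a profiler such as thop or fvcore.

```python
import time
import torch
import torch.nn as nn

def count_params_millions(model: nn.Module) -> float:
    """Total trainable and non-trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def measure_fps(model: nn.Module, input_size=(1, 3, 640, 640), runs=100):
    """Average frames per second on the current device (on GPU, add torch.cuda.synchronize() around the timed loop)."""
    model.eval()
    x = torch.zeros(input_size)
    for _ in range(10):            # warm-up iterations, excluded from timing
        model(x)
    start = time.time()
    for _ in range(runs):
        model(x)
    return runs / (time.time() - start)

# Usage with any detector instance loaded elsewhere:
# print(count_params_millions(model), measure_fps(model))
```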