Article

A Method for Detecting Key Points of Transferring Barrel Valve by Integrating Keypoint R-CNN and MobileNetV3

Canyu Huang, Zeyong Lei, Linhui Li, Lin Zhong, Jieheng Lei and Shuiming Wang
1 School of Mechanical Engineering, University of South China, Hengyang 421001, China
2 School of Electrical Engineering, University of South China, Hengyang 421001, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(20), 4306; https://doi.org/10.3390/electronics12204306
Submission received: 11 September 2023 / Revised: 8 October 2023 / Accepted: 16 October 2023 / Published: 18 October 2023

Abstract
Industrial robots need to accurately identify the position and rotation angle of the handwheel of a chemical raw material barrel valve during opening and closing, in order to avoid interference between the robot gripper and the handwheel. This paper proposes a handwheel keypoint detection algorithm for fast and accurate acquisition of the handwheel's position and rotation pose. The algorithm is based on the Keypoint R-CNN (Region-based Convolutional Neural Network) keypoint detection model and integrates the lightweight MobileNetV3 network, the Coordinate Attention module, and an improved BiFPN (Bi-directional Feature Pyramid Network) structure to increase detection speed, enhance feature extraction for the handwheel, and improve the representation of small targets at keypoint locations. Experimental results on a self-built handwheel dataset demonstrate that the proposed algorithm outperforms the Keypoint R-CNN model in both detection speed and accuracy, reducing per-image detection time by 54.6%. Object detection accuracy and keypoint detection accuracy reach 93.3% and 98.7%, respectively, meeting the requirements of the application scenario and enabling accurate control of the robot's rotation of the valve handwheel.

1. Introduction

During the transfer and unloading of chemical materials, small amounts of harmful dust or gas may escape, so operators must enter the workshop wearing full protective suits, which is inconvenient. Robots are widely used in industrial manufacturing, logistics sorting, and other fields [1,2], and integrating image recognition algorithms into robots to replace or assist in component localization has become a hot research topic. By visually recognizing the unloading valve on the transfer barrel, the unloading robot can be controlled to open and close the valve, replacing manual operation. This shortens unloading operations, relieves operators of having to enter the workshop and handle chemical materials, and reduces labor intensity. To open the unloading valve, the robot must insert its gripper into the handwheel at the top of the transfer barrel and rotate it, as shown in Figure 1. During insertion, interference may occur between the gripper and the handwheel's spokes, causing frictional wear of the mechanical coaxial positioning device and the handwheel. By using visual recognition to locate the center point of the handwheel and the deflection angle of the spokes, such problems can be effectively avoided. Traditional feature matching algorithms [3,4] are only suitable for specific objects; moreover, changes in the appearance of the grasping target and in external lighting readily degrade detection performance, so robot recognition and localization systems designed on these methods have poor robustness. As a result, researchers have begun to apply deep learning to robot grasping problems [5].
Widely used deep learning object detection algorithms fall into two types: one-stage and two-stage algorithms. Common one-stage algorithms include YOLO (You Only Look Once) [6] and SSD (Single Shot MultiBox Detector) [7]; they do not generate candidate regions but directly output the positions of detection boxes, which makes computation relatively fast. The typical two-stage algorithm is Faster R-CNN, proposed by Ren et al. [8]: the first stage generates candidate regions that may contain objects, and the second stage selects the optimal result from these candidates [9]. These algorithms achieve higher accuracy but run more slowly.
Building on object detection, many researchers have studied keypoint detection to further obtain pose information of target objects, such as human keypoint detection [10,11,12] and instrument keypoint detection [13]. Keypoint detection can rely on 3D (three-dimensional) features of the object collected as point clouds [14] to determine keypoint locations, or on visual image detection in the 2D (two-dimensional) plane. Xu Zhengyang [15] used a UNet backbone with residual networks to extract features for mouse pose estimation. Gao Jiangjin [16] proposed a VGG (Visual Geometry Group) network for facial keypoint detection. However, such network models are relatively large and not conducive to lightweight deployment. Qingqi Zhang [17] recognized water meter pointer readings by combining a YOLOv4-tiny object detection module with an RFB-Net (Receptive Field Block Net) keypoint detection module: the YOLOv4-tiny module first extracts image features to locate the water meter dial region, the image is then cropped to that region and passed to the RFB-Net module, which re-extracts features to detect the pointer keypoints. However, using multiple modules for feature extraction increases training complexity [18], and the involvement of several feature networks lengthens the overall detection process.
Considering the operational speed of industrial robots, this study adopts the high-accuracy two-stage algorithm Keypoint R-CNN as the detection method for handwheel position and rotation angle. In order to improve the detection speed and accuracy, and meanwhile facilitate usage on embedded devices [19], improvements were made to the Keypoint R-CNN model:
  • The backbone network discards the ResNet50 residual network and instead adopts the lightweight MobileNetV3 network. Compared to ResNet50, this reduces model parameters and computational complexity, thereby improving detection speed, and it avoids the complex network structure that makes deployment on embedded devices difficult.
  • Embedding the CA (Coordinate Attention) module enhances the model’s feature extraction capability, enabling more accurate localization and recognition of the targets.
  • Incorporating the BiFPN (Bi-directional Feature Pyramid Network) enhances the effective fusion of high-level and low-level features in the network.
  • The deflection angle is calculated from the coordinates of the handwheel center and the keypoints on the spokes, and the obtained angle is used to control the robot to loosen or tighten the valve.
The remainder of this article is organized as follows: Section 2 describes the materials required for the model improvements presented in this paper and their background; Section 3 compares the improved models and discusses the differences in recognition speed and accuracy among them; Section 4 summarizes the research work presented in this article.

2. Materials and Methods

2.1. Keypoint R-CNN Model

The Keypoint R-CNN model used in this paper is an extension of the Faster R-CNN model with an additional keypoint detection branch. The detection part of Faster R-CNN can be divided into four main modules:
  • The conv layers (feature extraction network) were used to extract features. Convolution and pooling operations were applied to extract feature maps from the image, which were used for subsequent RPN layers and proposal generation.
  • The Region Proposal Network (RPN) generates candidate boxes through two tasks: the first is classification, which determines whether objects are present within predefined anchors; the second is bounding box regression, which corrects the anchors to obtain more precise proposals, as shown in Figure 2.
  • RoI Pooling (Region of Interest Pooling) was used to gather the proposals (the coordinates of each box) generated by RPN and extract them from the feature maps of the feature extraction network. This results in the proposal feature maps that were inputted into subsequent fully connected layers for classification and regression tasks.
  • Classification and regression. Utilizing the proposal feature maps, the specific class was determined, and a further round of bounding box regression was performed to obtain the precise position of the final detection box.
The Keypoint R-CNN model replaces the RoI Pooling used in Faster R-CNN with RoI Align. In RoI Pooling, when pooling is applied to the feature-map regions corresponding to the candidate boxes produced by the region proposal network, the candidate box coordinates are rounded to integers. This rounding can introduce significant position errors, especially in tasks involving small objects. RoI Align instead performs bilinear interpolation and pooling on the corresponding feature maps to obtain fixed-size feature maps; by interpolating, it avoids introducing position deviation and thus preserves detection accuracy. A comparison between RoI Pooling and RoI Align is shown in Figure 3. The keypoint detection branch utilizes the feature maps of the RPN proposals to obtain keypoint coordinates through pooling, convolution, and heatmap regression. With the addition of this branch, the model can perform both object detection and keypoint detection. The network structure of the Keypoint R-CNN model is illustrated in Figure 4.
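To make the difference concrete, the short sketch below (not the authors' code) contrasts the two operators using torchvision.ops; the feature map size, the proposal coordinates, and the spatial_scale value are illustrative assumptions.

```python
# Hedged sketch: RoI Pooling vs. RoI Align on one proposal, using torchvision.ops.
# The feature map, box coordinates, and spatial_scale below are made up for illustration.
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 256, 50, 50)                       # backbone feature map (stride-16 assumption)
boxes = torch.tensor([[0.0, 13.7, 25.2, 210.4, 318.9]])  # (batch_idx, x1, y1, x2, y2) in image coords

# RoI Pooling quantizes the scaled box to integer bins, which can shift small objects
pooled = roi_pool(feat, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)

# RoI Align samples with bilinear interpolation and avoids the rounding step
aligned = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1.0 / 16,
                    sampling_ratio=2, aligned=True)
print(pooled.shape, aligned.shape)                       # both torch.Size([1, 256, 7, 7])
```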

2.2. MobileNetV3 Backbone Network

Many object detection algorithms adopt MobileNetV3 to reduce the number of model parameters and the computational complexity [20,21,22,23], enabling efficient inference and good accuracy even on small devices. The Keypoint R-CNN model uses ResNet50 as the backbone network for feature extraction, which results in a large model, a large number of training parameters, and long prediction times. This paper therefore selects MobileNetV3 as the backbone network instead of ResNet50. MobileNetV3, proposed by Google [24], relies on depth-wise separable convolution, which splits a convolution into two parts: a depth-wise convolution and a point-wise convolution. The depth-wise convolution applies a separate kernel to each input channel to extract features, and the point-wise convolution then combines the channels and expands the channel dimension. Compared with regular convolutions, this significantly reduces the number of training parameters and the detection time.
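As a rough illustration of this building block, the sketch below implements a depth-wise separable convolution in PyTorch; the channel counts, kernel size, and the BatchNorm/Hardswish choices are assumptions for the example rather than details taken from the paper.

```python
# Hedged sketch of a depth-wise separable convolution (depth-wise + point-wise).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        # Depth-wise: one kernel per input channel (groups = in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Point-wise: 1x1 convolution mixes channels and changes the channel count
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()     # activation used in parts of MobileNetV3

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 56, 56])
```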

2.3. Coordinate Attention Module

Compared to MobileNetV2 [25], MobileNetV3 adds the SE (Squeeze-and-Excitation) [26] attention module, which acts on the channels of the feature map and establishes channel relationships so that the network learns which channels to attend to, increasing the model's sensitivity to informative channels in its decisions. The SE module consists of two steps, squeeze and excitation, which embed global information and adaptively reweight channel relationships [27]. However, the SE module considers only the importance of channels and ignores positional information, whereas keypoint detection is highly sensitive to position. Hou Qibin proposed the Coordinate Attention module [28], which combines positional information with channel attention through weighted fusion, allowing the network to gather information from a larger area without excessive computational cost. The structure of the CA Attention module is shown in Figure 5.
To capture attention on the width and height of the image and encode precise positional information, the CA Attention module applies global average pooling separately in both the width and height dimensions to the input feature map. This process generates feature maps for the width and height directions. The formulae are as follows:
z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)
z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)
Then, the feature maps obtained from the previous step for the width and height directions are concatenated along the channel dimension. This concatenated feature map was passed through a 1 × 1 convolutional kernel to compress its channel dimension to C / r . After that, the resulting feature map was processed in the width and height directions using BatchNorm and a non-linear activation function to obtain feature map F 1 . This feature map F 1 was then passed through a Sigmoid activation function to obtain the final feature map f with a size of 1 × ( W + H ) × C / r . The formula is as follows:
f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)
The feature map f was passed through a 1 × 1 convolutional layer to adjust its channel dimension, resulting in feature maps F h and F w with the same number of channels as the original feature map. These feature maps are then passed through a Sigmoid activation function to obtain the attention weights for the height and width directions, g h and g w , respectively. The formulae are as follows:
g^h = \sigma\left(F_h\left(f^h\right)\right)
g^w = \sigma\left(F_w\left(f^w\right)\right)
Finally, the original feature map was weighted using the attention weights obtained from the previous step through element-wise multiplication. This results in a final feature map that has attention weights for both the width and height directions. The formula is as follows:
y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)
The CA Attention module excels in feature extraction, delivering superior performance [29,30]. To enhance the feature extraction capability of the model, this paper incorporates a Coordinate Attention module into MobileNetV3. The CA module suppresses attention to unimportant regions of the image, directing the model to focus more on the handwheel during feature extraction and thereby improving the accuracy of handwheel object detection.
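A minimal PyTorch sketch of a Coordinate Attention block following the formulas above is given below; the reduction ratio, the use of Hardswish as the non-linearity δ, and the layer names are assumptions, and this is not the authors' exact implementation.

```python
# Hedged sketch of Coordinate Attention [28]: pool along H and W, encode jointly,
# then produce per-direction attention weights that re-weight the input.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> N x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> N x C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                       # non-linear activation (delta)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                            # N x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)        # N x C x W x 1
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                        # height attention
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))    # width attention
        return x * g_h * g_w

print(CoordinateAttention(64)(torch.randn(2, 64, 28, 28)).shape)  # torch.Size([2, 64, 28, 28])
```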

2.4. Bi-Directional Feature Pyramid Network

In convolutional networks, low-level features carry less semantic information but provide accurate target position information, while high-level features carry rich semantic information but give only a coarse estimate of the target's position. The Keypoint R-CNN detection model must recognize not only the handwheel target but also the keypoints on the handwheel, so the detector needs to handle both large targets and tiny objects. The FPN (Feature Pyramid Network) [31] addresses this issue by up-sampling the high-level features and fusing them with the corresponding low-level features to obtain high-resolution, semantically strong features, which benefit the detection of small objects. FPN is the most fundamental feature fusion structure, known for its simplicity and effectiveness: it combines deep features with shallow features through a top-down operation. However, this unidirectional information flow limits its feature fusion capability. In this paper, we enhance the feature information in the network by employing a bi-directional feature pyramid structure [32] called BiFPN, which consists of multi-scale feature fusion modules and deep feature fusion modules. For keypoint detection, where the targets at the keypoints are relatively small, shallow feature maps with smaller receptive fields are better suited to detecting such small objects. Therefore, in this study, we adjust the input layers of the BiFPN to perform feature fusion on shallow feature maps; capturing the features of small target objects improves the accuracy of keypoint detection. The network architecture of the Bi-Directional Feature Pyramid Network is illustrated in Figure 6.
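To make the bi-directional fusion idea concrete, the sketch below implements BiFPN-style fast normalized weighted fusion over three feature levels; the level resolutions, channel count, and module names are illustrative assumptions rather than the configuration used in this paper.

```python
# Hedged sketch of BiFPN-style weighted feature fusion [32]: each fusion node combines
# its inputs with learnable non-negative weights before a 3x3 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse same-shaped feature maps with learnable, normalized weights."""
    def __init__(self, num_inputs: int, channels: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        w = F.relu(self.weights)
        w = w / (w.sum() + 1e-4)                  # fast normalized fusion
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Three levels (P3 shallow ... P5 deep), all projected to 256 channels (assumed sizes)
p3, p4, p5 = (torch.randn(1, 256, s, s) for s in (64, 32, 16))
p4_td = WeightedFusion(2, 256)([p4, F.interpolate(p5, scale_factor=2)])          # top-down pass
p4_out = WeightedFusion(3, 256)([p4, p4_td, F.max_pool2d(p3, kernel_size=2)])    # bottom-up pass
print(p4_out.shape)  # torch.Size([1, 256, 32, 32])
```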

2.5. Improved Backbone Network of the Model

The detection model proposed in this paper changes the backbone network from ResNet50 to MobileNetV3 to make the model lightweight. Meanwhile, it replaces the SE module with the CA attention module and FPN with BiFPN to improve accuracy with only a small increase in model parameters. The backbone network of the improved model is shown in Figure 7.
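Assuming the torchvision implementation of Keypoint R-CNN, a baseline with a plain MobileNetV3 backbone (before the CA and BiFPN modifications described above) can be assembled roughly as follows; the anchor sizes, pooling settings, and the choice of four keypoints (center plus three spoke holes) are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: Keypoint R-CNN with a MobileNetV3-Large feature extractor in torchvision.
import torch
import torchvision
from torchvision.models.detection import KeypointRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator

backbone = torchvision.models.mobilenet_v3_large().features   # single feature map backbone
backbone.out_channels = 960                                    # channels of the last feature map

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
box_pool = torchvision.ops.MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
kp_pool = torchvision.ops.MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

model = KeypointRCNN(backbone,
                     num_classes=2,             # background + handwheel
                     num_keypoints=4,           # center + three spoke holes (assumed)
                     rpn_anchor_generator=anchor_generator,
                     box_roi_pool=box_pool,
                     keypoint_roi_pool=kp_pool)
model.eval()
with torch.no_grad():
    out = model([torch.rand(3, 720, 1280)])     # returns boxes, labels, scores, keypoints
print(out[0].keys())
```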

2.6. Experimental Platform

The camera used in this study was the Intel RealSense D455, with a maximum color image resolution of 1280 × 720. The experiments were run on Windows 10 with an Intel [email protected] GHz CPU and an NVIDIA RTX 3060 Ti graphics card. The deep learning models were developed in Python 3.6 and trained with PyTorch 1.10, accelerated by CUDA 11.3.

2.7. Image Dataset

To the best of our knowledge, public image datasets such as ImageNet do not include a category specifically for handwheels. To validate the proposed method's performance in detecting the handwheel's position and rotation angle, images of handwheels were collected with the Intel RealSense D455 camera, as shown in Figure 8. The dataset consists of top-down views of the handwheel, with objects such as wrenches, screwdrivers, and bolts placed as background noise. The handwheel is a circular metal ring with three spokes radiating from the center. The keypoints to be detected are the center of the handwheel and the centers of the circular holes on the spokes. During unloading, the robot needs to insert its gripper into the gaps between the spokes, with the gripper's three fingers as the driving object and the spokes as the driven object, to rotate the handwheel. The handwheel dataset was divided into training and validation sets at a ratio of 5:1.
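Assuming the torchvision training interface for Keypoint R-CNN, the annotations for one image could be organized as below; all coordinate values are illustrative, with one bounding box per handwheel and four keypoints (the center plus the three spoke holes) marked as visible.

```python
# Hedged sketch of a per-image training target in the format Keypoint R-CNN expects.
import torch

target = {
    "boxes": torch.tensor([[400.0, 100.0, 900.0, 600.0]]),   # [N, 4] as (x1, y1, x2, y2)
    "labels": torch.tensor([1]),                              # 1 = handwheel (0 is background)
    "keypoints": torch.tensor([[[650.0, 350.0, 1.0],          # [N, K, 3] as (x, y, visibility)
                                [650.0, 150.0, 1.0],
                                [823.0, 450.0, 1.0],
                                [477.0, 450.0, 1.0]]]),
}
print(target["keypoints"].shape)  # torch.Size([1, 4, 3])
```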

2.8. Training Scheme

We trained the model on the self-built handwheel dataset with a batch size of 2, a learning rate of 0.002, and a total of 50 epochs. To address the issue of limited training data, one can employ semi-supervised learning [33] or augment the dataset using data augmentation techniques. During training, data augmentation was performed with the fast, OpenCV-based augmentation library Albumentations; the augmentations include random rotation and random brightness and contrast adjustments.
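As an example of such a pipeline (the probability and limit values are assumptions, not settings reported in the paper), Albumentations can apply the rotation and brightness/contrast augmentations while keeping the bounding box and keypoints consistent with the transformed image:

```python
# Hedged sketch of an Albumentations pipeline with box- and keypoint-aware augmentations.
import numpy as np
import albumentations as A

transform = A.Compose(
    [
        A.Rotate(limit=30, p=1.0),                                        # random rotation
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=1.0),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

image = np.zeros((720, 1280, 3), dtype=np.uint8)                          # placeholder frame
bboxes = [[400, 100, 900, 600]]                                           # handwheel box
keypoints = [(650, 350), (650, 150), (823, 450), (477, 450)]              # center + spoke holes (illustrative)
out = transform(image=image, bboxes=bboxes, labels=[1], keypoints=keypoints)
print(len(out["keypoints"]), out["bboxes"])
```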

2.9. Experimental Comparison

To verify the impact of the improved model on the performance of handwheel keypoint detection, four models were used for comparative experiments.
  • Model 1 (ResNet50) was a KeyPoint R-CNN model for keypoint detection, with ResNet50 as the backbone network.
  • Model 2 (MobileNetV3) was a KeyPoint R-CNN model for keypoint detection, with the backbone network replaced by MobileNetV3.
  • Model 3 (MobileNetV3 + CA) was a KeyPoint R-CNN model for keypoint detection, with the backbone network as MobileNetV3 and the SE attention module replaced by the CA Attention module.
  • Model 4 (MobileNetV3 + CA + mFPN) builds upon Model 3 with an additional enhancement: the number of layers in the feature pyramid was increased, allowing feature fusion to take place in high-resolution shallow feature maps.
  • The model proposed in this paper (MobileNetV3 + CA + BiFPN) was a KeyPoint R-CNN model for keypoint detection, with MobileNetV3 as the backbone network. It incorporated the CA attention module and BiFPN.
The model accuracy was evaluated using the COCO object detection and keypoint detection evaluation criteria. The mean average precision (mAP) was used to assess the quality of the model by calculating the ratio of the number of detected objects to the total number. A higher mAP value indicated better detection performance of the object detection model on the given dataset. The accuracy was calculated using the following formula:
Accuracy = \frac{TP + TN}{P + N}
TP represents the number of instances where the actual positive detection result was positive, TN represents the number of instances where the actual negative detection result was negative, P represents the number of positive instances, and N represents the number of negative instances.

3. Results

3.1. Experimental Results

As shown in Figure 9 and Figure 10, the models converged after 20 epochs of training. The average object detection mAP and keypoint detection mAP were calculated over the last 30 epochs. From the validation set, 100 random images (at a resolution of 1280 × 720) were selected for inference with the experimental models. The inference time for each image is shown in Figure 11, and the average inference time per image was then computed. The final experimental results are presented in Table 1.
The detection speed improved significantly when the ResNet50 backbone of Model 1 was replaced with MobileNetV3 in Model 2: the average inference time per image decreased from 1.19 s to 0.41 s, a 65% reduction. However, owing to the smaller number of parameters in MobileNetV3, the model's expressive power was limited, resulting in a noticeable drop in accuracy for both object detection and keypoint detection.
After the SE attention module in Model 2 was replaced with a CA module, Model 3 attended to the handwheel image features in both the channel and spatial dimensions, allowing the CA module to capture more informative features and increasing object detection accuracy by 2.4% and keypoint detection accuracy by 0.6%. As shown in Figure 12, the CA module performed better than the SE attention module, assigning higher weights to regions of the feature maps that contain objects, so that features related to the handwheel are emphasized further.
Compared to Model 2, Model 3 showed a significant improvement in object detection but a relatively smaller improvement in keypoint detection. Model 4's feature pyramid extended fusion to shallow feature maps to acquire richer detailed features, giving the model a stronger ability to recognize tiny keypoint features and resulting in a 1.9% increase in keypoint detection accuracy and a 1.1% improvement in object detection accuracy. However, because the shallow feature maps are large, the average detection time per image of Model 4 increased noticeably.
To better capture detailed information about keypoints, the model in this paper builds on Model 4: the mFPN was removed and BiFPN (Bi-directional Feature Pyramid Network) was integrated into the model. The integration of BiFPN enhanced the fusion of high-level and low-level features, increasing both object detection accuracy and keypoint detection accuracy by 0.9%, while the detection time increased by only 0.02 s. Both the mFPN in Model 4 and the BiFPN in this paper's model fuse the output feature maps from bneck4, bneck7, and bneck13 of MobileNetV3, yielding three 256-channel feature maps. The experiments demonstrate that, on the handwheel dataset, BiFPN outperformed FPN in feature fusion and extraction capability, achieving a large improvement in detection performance with only a small increase in model parameters.
Compared to the KeyPoint R-CNN keypoint detection model, our model in this paper reduced the average inference time per image from 1.19 s to 0.54 s, contributing to a 54.6% decrease in detection time. Moreover, it achieved an improvement of 1.8% in object detection accuracy and 0.4% in keypoint detection accuracy. Our model demonstrated higher detection accuracy for the handwheel target and keypoints while having fewer parameters and a faster detection speed.

3.2. Recognition of Handwheel Rotation Angle

After the handwheel image is input into the keypoint detection network, the model obtains the keypoint information of the handwheel center and spokes. The coordinate system is defined as shown in Figure 13, where the X-axis is positive to the left and the Y-axis is positive downwards. The intersection of the two red lines represents the center of the camera lens. One green line connects the keypoints, and the other green line is parallel to the X-axis; these two green lines are used to calculate the rotation angle of the spoke. First, the robot gradually moves the gripper above the handwheel based on the coordinates of the handwheel center, aligning it with the center of the lens. Then, the insertion angle of the gripper is adjusted by the rotation angle of the handwheel to avoid interference and collision between the gripper and the handwheel.
To offset the coordinates of the output results from the detection network, the handwheel center keypoint was used as the origin for angle calculations. Defining the coordinates of the handwheel center keypoint as ( x c , y c ) and the coordinates of the spoke keypoint as ( x s , y s ) , the calculation method for the angle output is as follows:
angle = \begin{cases} 0, & y_s - y_c = 0,\ x_s - x_c > 0 \\ 180, & y_s - y_c = 0,\ x_s - x_c < 0 \\ \dfrac{\sqrt{(y_s - y_c)^2}}{y_s - y_c} \arccos \dfrac{x_s - x_c}{\sqrt{(x_s - x_c)^2 + (y_s - y_c)^2}}, & y_s - y_c \neq 0 \end{cases}
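A small sketch of this calculation (a hypothetical helper function, not the authors' code) is shown below; it follows the piecewise formula, with the sign of the vertical offset determining the sign of the angle.

```python
# Hedged sketch of the spoke angle computation from the handwheel center (xc, yc)
# and one spoke keypoint (xs, ys), in degrees.
import math

def spoke_angle(xc: float, yc: float, xs: float, ys: float) -> float:
    dx, dy = xs - xc, ys - yc
    if dy == 0:
        return 0.0 if dx > 0 else 180.0
    sign = dy / math.sqrt(dy * dy)                    # +1 or -1, as in the formula
    return sign * math.degrees(math.acos(dx / math.hypot(dx, dy)))

print(spoke_angle(640, 360, 740, 260))   # spoke keypoint up and to the right -> -45.0
```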
Based on the recognized target and keypoint information of the handwheel, code was written to test the images. The test results are shown in Figure 14, where the green box marks the detected handwheel target, the green endpoints mark the detected handwheel keypoints, and the point on the red circle marks the insertion point of the robot gripper.

4. Conclusions

This paper focuses on the valve handwheel in an automatic unloading system and proposes a handwheel keypoint detection method. The method is based on the Keypoint R-CNN model, with MobileNetV3 used as the backbone network for feature extraction to avoid the slow running speed on small devices caused by Keypoint R-CNN's large parameter volume. By integrating the CA module, the method better captures positional information and global relationships, extracting more comprehensive features and improving the model's ability to recognize handwheels. To enhance the feature extraction capability at keypoint locations, BiFPN was introduced to fuse features from shallow-level feature maps so that more comprehensive semantic information could be extracted, improving the accuracy of both object detection and keypoint detection. Compared to the Keypoint R-CNN model, the proposed MobileNetV3-KP model reduces detection time by 54.6%, resulting in higher efficiency, and improves the mAP values for object detection and keypoint detection by 1.8% and 0.4%, respectively. In future work, we will further optimize the network model so that it runs more efficiently on small devices.

Author Contributions

Methodology, J.L.; Software, L.L. and S.W.; Data curation, L.Z.; Writing—original draft, C.H.; Writing—review & editing, C.H.; Funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Technology of the People’s Republic of China, grant number 2019YFC1907704.

Data Availability Statement

Limited data are available on request due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Aleksei, K.; Alice, K.; Maximilian, H.; Johannes, S.; Wolfgang, E. Evaluation of Grasps in an automatic Intermodal Container Unloading System. Procedia Comput. Sci. 2021, 192, 2210–2219. [Google Scholar]
  2. Lei, L.; Yuefeng, D.; Xiaoyu, L.; Tiantian, S.; Weiran, Z.; Guorun, L.; Lichao, Y.; Du, C.; Enrong, M. An automatic forage unloading method based on machine vision and material accumulation model. Comput. Electron. Agric. 2023, 208, 107770. [Google Scholar]
  3. Wang, Y.; Yuan, Y.; Lei, Z. Fast SIFT Feature Matching Algorithm Based on Geometric Transformation. IEEE Access 2020, 8, 88133–88140. [Google Scholar] [CrossRef]
  4. Zhao, Y.; Su, J. Local sharpness distribution–based feature points matching algorithm. J. Electron. Imaging 2014, 23, 013011. [Google Scholar] [CrossRef]
  5. Kulshreshtha, M.; Chandra, S.S.; Randhawa, P.; Tsaramirsis, G.; Khadidos, A.; Khadidos, A.O. OATCR: Outdoor Autonomous Trash-Collecting Robot Design Using YOLOv4-Tiny. Electronics 2021, 10, 2292. [Google Scholar] [CrossRef]
  6. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  9. Ziquan, L.; Huifang, W. Automatic Detection of Transformer Components in Inspection Images Based on Improved Faster R-CNN. Energies 2018, 11, 3496. [Google Scholar]
  10. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
  11. Yanpeng, H.; Yuxuan, L.; Yanguang, X.; Yinghui, W. Human steering angle estimation in video based on key point detection and Kalman filter. Control Theory Technol. 2022, 20, 408–417. [Google Scholar]
  12. Zhang, J.; Chen, Z.; Tao, D. Towards High Performance Human Keypoint Detection. Int. J. Comput. Vis. 2021, 129, 2639–2662. [Google Scholar] [CrossRef]
  13. Alexeev, A.; Kukharev, G.; Matveev, Y.; Matveev, A. A Highly Efficient Neural Network Solution for Automated Detection of Pointer Meters with Different Analog Scales Operating in Different Conditions. Mathematics 2020, 8, 1104. [Google Scholar] [CrossRef]
  14. Hui, L.; Xu, R.; Xie, J.; Qian, J.; Yang, J. Progressive point cloud deconvolution generation network. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 397–413. [Google Scholar]
  15. Zhengyang, X.; Ruiqing, L.; Zhizhong, W.; Songwei, W.; Juncai, Z. Detection of Key Points in Mice at Different Scales via Convolutional Neural Network. Symmetry 2022, 14, 1437. [Google Scholar]
  16. Jiangjin, G.; Tao, Y. Research on Real-Time Face Key Point Detection Algorithm Based on Attention Mechanism. Comput. Intell. Neurosci. 2022, 2022, 6205108. [Google Scholar]
  17. Qingqi, Z.; Xiaoan, B.; Biao, W.; Xiaomei, T.; Yuting, J.; Yuan, L.; Na, Z. Water meter pointer reading recognition method based on target-key point detection. Flow Meas. Instrum. 2021, 81, 102012. [Google Scholar]
  18. Zhang, H.; Qian, J.; Zhang, B.; Yang, J.; Gong, C.; Wei, Y. Low-Rank Matrix Recovery via Modified Schatten-p Norm Minimization with Convergence Guarantees. IEEE Trans. Image Process. 2019, 29, 3132–3142. [Google Scholar] [CrossRef]
  19. Hassan, S.M.; Maji, A.K.; Jasiński, M.; Leonowicz, Z.; Jasińska, E. Identification of Plant-Leaf Diseases Using CNN and Transfer-Learning Approach. Electronics 2021, 10, 1388. [Google Scholar] [CrossRef]
  20. Yuquan, Z.; Chen, X.; Rongxiang, D.; Qingchen, K.; Daoliang, L.; Chunhong, L. MSIF-MobileNetV3: An improved MobileNetV3 based on multi-scale information fusion for fish feeding behavior analysis. Aquac. Eng. 2023, 102, 102338. [Google Scholar]
  21. Jihong, H.; Yongfeng, Q.; Jiaying, W. Skin Disease Classification Using Mobilenet-RseSK Network. J. Phys. Conf. Ser. 2022, 2405, 012017. [Google Scholar]
  22. Xiaochao, H.; Rong, Y.; Qi, W.; Fei, Y.; Bo, H. A novel method for real-time ATR system of AUV based on Attention-MobileNetV3 network and pixel correction algorithm. Ocean Eng. 2023, 270, 113403. [Google Scholar]
  23. Tianmin, D.; Yongjun, W. Simultaneous vehicle and lane detection via MobileNetV3 in car following scene. PLoS ONE 2022, 17, e0264551. [Google Scholar]
  24. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  25. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  26. Jie, H.; Li, S.; Samuel, A.; Gang, S.; Enhua, W. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 7132–7141. [Google Scholar]
  27. Jiang, X.; Gao, T.; Zhu, Z.; Zhao, Y. Real-time face mask detection method based on YOLOv3. Electronics 2021, 10, 837. [Google Scholar] [CrossRef]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  29. Guangbo, L.; Guolong, S.; Jun, J. YOLOv5-KCB: A New Method for Individual Pig Detection Using Optimized K-Means, CA Attention Mechanism and a Bi-Directional Feature Pyramid Network. Sensors 2023, 23, 5242. [Google Scholar]
  30. Zhao, Y.; Ju, Z.; Sun, T.; Dong, F.; Li, J.; Yang, R.; Fu, Q.; Lian, C.; Shan, P. TGC-YOLOv5: An Enhanced YOLOv5 Drone Detection Model Based on Transformer, GAM & CA Attention Mechanism. Drones 2023, 7, 446. [Google Scholar]
  31. Lin, T.-Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2212. [Google Scholar]
  32. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  33. Gong, C.; Zhang, H.; Yang, J.; Tao, D. Learning with inadequate and incorrect supervision. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 889–894. [Google Scholar]
Figure 1. Chemical raw material unloading system.
Figure 2. Region proposal network. Candidate boxes are generated for the handwheel images; different colors distinguish candidate boxes of different sizes and aspect ratios.
Figure 3. Comparison between RoI Pooling and RoI Align. Rectangular boxes of different colors show the respective candidate regions.
Figure 4. The structure of the Keypoint R-CNN model.
Figure 5. Comparison between the Squeeze-and-Excitation module and the Coordinate Attention module.
Figure 6. Comparison between the Feature Pyramid Network and the Bi-Directional Feature Pyramid Network. Black arrows between feature maps represent convolution operations, blue arrows represent upsampling operations, and red arrows represent pooling operations.
Figure 7. The backbone network based on MobileNetV3. The SE attention module is replaced with the CA attention module, and the FPN network is replaced with the BiFPN network. The bneck indicated by the green arrow has a CA attention module added. The red arrow points to the feature map extracted from a specific bneck, which is used for feature fusion.
Figure 8. Dataset sample images.
Figure 9. The accuracy of object detection.
Figure 10. The accuracy of keypoint detection.
Figure 11. Inference time per single image.
Figure 12. Heatmap comparison after adding the Coordinate Attention module.
Figure 13. Coordinate definitions. The intersection of the solid red lines represents the camera perspective and also the center of the robot’s gripper. The green dot represents a key point on the handwheel. The solid green line is used to calculate the rotation angle of the robot’s gripper. The dashed red line represents the offset path of the robot’s gripper.
Figure 14. Recognition results. The detected handwheel is enclosed in the green rectangle. The green dots indicate the detected handwheel keypoints, and the green lines connect these keypoints. The red dot represents the calculated insertion point for the robot’s gripper.
Table 1. Comparison of experimental results of various models.

Experimental Model            Object Accuracy (%)    Keypoint Accuracy (%)    Average Time (s)
ResNet50                      91.5                   98.3                     1.19
MobileNetV3                   88.9                   95.3                     0.41
MobileNetV3 + CA              91.3                   95.9                     0.45
MobileNetV3 + CA + mFPN       92.4                   97.8                     0.52
MobileNetV3 + CA + BiFPN      93.3                   98.7                     0.54