1. Introduction
In the context of intelligent manufacturing, robot arms with multiple joints play increasingly important roles in various industrial fields [1,2]. For instance, robot arms are utilized to accomplish automatic drilling, riveting and milling tasks in aerospace manufacturing; in automobile and traditional machinery manufacturing, robot arms are frequently employed in automatic loading/unloading, automatic measurement and other production or assembly tasks.
In most industrial applications, a robot arm works according to a pre-planned program. However, if the robot arm goes out of control due to a fault, serious collision or injury accidents may occur, especially in human–machine cooperation contexts. It is therefore critical to deploy monitoring measures to ensure safety. On-site attitude monitoring of working robot arms is also essential for the collaborative operation of multiple robot arms.
Machine vision is one of the most suitable and widely used monitoring means due to its relatively low cost, high applicability and good accuracy. To reduce the difficulty of image feature recognition and to improve monitoring accuracy and reliability, a common industrial practice is to arrange cooperative visual targets on the monitored object [3,4]. However, arranging cooperative targets is usually cumbersome and time-consuming, and targets at industrial sites are prone to staining or falling off. Accurately estimating the attitude of robot arms without relying on cooperative visual markers therefore remains a significant research challenge [5,6].
With their excellent ability to extract image features, deep neural networks have been widely used in computer vision. They can extract natural features and deep semantic information from images and accomplish various vision tasks based on this rich information, without relying on cooperative visual markers. In most industrial scenes, the base of the robot arm is fixed to the ground or a workbench, so the motion attitude of the arm is completely determined by the rotation angle of each joint; monitoring the arm attitude thus amounts to determining these joint angles. One approach is to construct an end-to-end neural network model that directly predicts the attitude parameters of the robot arm from an input image. However, the end-to-end approach requires more computing resources, and it is difficult for it to fully exploit the kinematic constraints of the robot arm and the imaging constraints between 3D space and the 2D image, so its attitude estimation accuracy is hard to guarantee. Another possible approach consists of two stages: the feature points of the robot arm are first detected in the image, and a system of equations is then established to solve for the angle of each joint. This strategy better leverages the respective advantages of deep learning and 3D machine vision theory.
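For illustration, the forward kinematics that maps joint angles to 3D keypoint positions for a base-fixed arm can be sketched as follows. This is a minimal sketch assuming the standard Denavit–Hartenberg convention; the placeholder parameters are assumptions for the sketch, not the kinematic description of any particular arm in this paper.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform between consecutive links (standard DH convention)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def joint_positions(joint_angles, dh_params):
    """3D positions of the joints (candidate keypoints) in the base frame."""
    T = np.eye(4)
    points = [T[:3, 3].copy()]          # base position
    for theta, (d, a, alpha) in zip(joint_angles, dh_params):
        T = T @ dh_transform(theta, d, a, alpha)
        points.append(T[:3, 3].copy())  # position of the next joint
    return np.array(points)

# Placeholder link parameters (d, a, alpha) for a six-joint arm, NOT a real UR10:
dh_params = [(0.12, 0.0, np.pi / 2)] * 6
keypoints_3d = joint_positions(np.zeros(6), dh_params)  # (7, 3): base + 6 joints
```

Once such a mapping is fixed, the joint angles are the only unknowns relating the arm's attitude to image observations.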
Keypoint detection is a major application of deep learning methods. Toshev et al. [7] directly located the image coordinates of keypoints on the human body with convolutional neural networks to determine the human pose. Instead of outputting a single determined position for each detected keypoint, subsequent keypoint detection networks commonly output positions in the form of heatmaps [8,9,10,11]. Newell et al. [8] proposed SHNet (Stacked Hourglass Network), which stacks several hourglass modules and detects keypoints based on multi-scale image features. Chen et al. [9] proposed CPNet (Cascaded Pyramid Network), which cascades two convolutional neural network modules: the first detects all keypoints, and the second corrects the poor-quality detections of the first to improve the final accuracy. Peng et al. [10] proposed PVNet (Pixel-wise Voting Network), which obtains superior keypoint detection results even when the target object is partially occluded. Sun et al. [11] proposed HRNet (High-Resolution Network), which processes feature maps in parallel branches at multiple resolutions, so that the network maintains a relatively high-resolution representation throughout forward propagation. These networks have been applied successfully in tasks such as human pose estimation, where qualitative understanding rather than quantitative accuracy is the main concern. Owing to the huge computing resources demanded by network training, these networks have to downsample during front-end feature extraction, which leaves the resolution of the output heatmaps insufficient for high-accuracy estimation.
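Such heatmap-based detectors are typically trained to regress one Gaussian target map per keypoint, and at inference the keypoint is read off as the location of the heatmap maximum. The following minimal sketch of target generation uses illustrative values (the sigma and the resolution figures are assumptions, not taken from the cited papers):

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=2.0):
    """Render a 2D Gaussian target centered at the keypoint location.

    The network regresses this map; at inference, the keypoint is
    recovered as the argmax of the predicted heatmap.
    """
    xs = np.arange(width)               # (W,)
    ys = np.arange(height)[:, None]     # (H, 1), broadcasts against xs
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# One heatmap channel per keypoint, often at 1/4 of the input resolution;
# e.g., a 256x256 input yields 64x64 heatmaps, which bounds the achievable
# localization accuracy to roughly the heatmap pixel pitch.
target = gaussian_heatmap(64, 64, center=(20.5, 33.0))
```

The downsampled heatmap grid is precisely why the output resolution limits quantitative positioning accuracy.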
How to make a neural network output higher-resolution heatmaps without significantly increasing computing resource consumption is therefore a problem worth investigating. Super-resolution image recovery based on deep learning has made great progress in recent years, with networks such as SRCNNet (Super-Resolution Convolutional Neural Network) [12], VDSRNet (Very Deep Super-Resolution Network) [13] and EDSRNet (Enhanced Deep Super-Resolution Network) [14]. Early super-resolution reconstruction networks must upsample the low-resolution input image to the target resolution before training and prediction, so their computational complexity is high. To reduce this complexity, Shi et al. [15] proposed ESPCNNet (Efficient Subpixel Convolutional Neural Network), which processes low-resolution feature maps throughout training and only adds a subpixel convolution layer at the end to realize the upsampling operation, effectively increasing the speed of super-resolution image reconstruction. Introducing this idea of super-resolution reconstruction into a keypoint detection network has the potential to improve the resolution of the output heatmaps and, in turn, the keypoint positioning accuracy.
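The following minimal PyTorch sketch illustrates the subpixel convolution (pixel shuffle) idea behind ESPCNNet: convolution is performed at low resolution, and the channel dimension is then rearranged into a larger spatial grid. The channel counts and upscale factor are illustrative assumptions, not the configuration of any cited network or of SRKDNet.

```python
import torch
import torch.nn as nn

class SubpixelUpsample(nn.Module):
    """ESPCN-style head: convolve at low resolution, then rearrange each
    group of r*r channels into an r-times-larger spatial grid."""

    def __init__(self, in_channels, out_channels, upscale_factor=4):
        super().__init__()
        # Produce r^2 channels per output map so PixelShuffle can fold
        # them into an upscaled spatial grid.
        self.conv = nn.Conv2d(in_channels,
                              out_channels * upscale_factor ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale_factor)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Hypothetical example: 17 keypoint heatmaps upscaled from 64x64 to 256x256.
head = SubpixelUpsample(in_channels=256, out_channels=17, upscale_factor=4)
heatmaps = head(torch.randn(1, 256, 64, 64))  # -> (1, 17, 256, 256)
```

Because all convolutions before the shuffle operate on low-resolution maps, the extra cost of the high-resolution output is small.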
To monitor the attitude of a robot arm, the rotation angle of each joint must be solved. Based on a depth image of the robot arm, Widmaier et al. [16] used a random forest regression model to estimate the arm attitude. Labbe et al. [17] and Zuo et al. [18] estimated the arm attitude from a single grayscale image. However, serious joint occlusion is inevitable in a single-perspective image, which makes some keypoints hard to detect and may even lead to wrong estimation results. Moreover, the depth ambiguity inherent in monocular vision may yield multiple solutions in attitude estimation, reducing the monitoring reliability.
In this paper, we present a two-stage high-precision attitude estimation method for base-fixed six-joint robot arms based on multi-view images. The contributions are as follows: (1) A new super-resolution keypoint detection network (SRKDNet for short) is proposed. Its novelty lies in incorporating a subpixel convolution module into the backbone network HRNet [11] to learn how to recover the resolution of the downsampled feature maps. This alleviates the disadvantages of low-resolution heatmaps and improves keypoint detection accuracy without significantly increasing computing resource consumption. (2) A coarse-to-fine detection mechanism based on dual SRKDNets is put forward. A full-view SRKDNet first produces a relatively rough keypoint detection result; a close-up SRKDNet then refines it on an image cropped to the ROI determined by the full-view result. This dual-SRKDNet mechanism outperforms one-time detection and drastically improves keypoint detection accuracy. (3) Efficient virtual-and-real sampling and neural network training methods are proposed and verified. Virtual sample data are first used to train the network, and a small number of real samples are then applied to fine-tune the model. This achieves accurate keypoint detection on real data without consuming a huge amount of time and manpower. (4) Constraint equations for solving the rotation angle of each joint are established; they relate the keypoints detected in the multi-view images, the camera imaging model and the kinematic model of the robot arm (a simplified sketch of this formulation follows this paragraph). A screening strategy based on keypoint detection confidence is incorporated into the solving process and proves critical for ensuring attitude estimation accuracy. Experiments demonstrate that the whole set of proposed methods realizes high-accuracy estimation of robot arm attitude without cooperative visual markers.
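As a simplified illustration of contribution (4), the sketch below minimizes the multi-view reprojection error over the joint angles. The hard confidence threshold, the pre-calibrated projection matrices and the forward_kinematics callback (e.g., the earlier sketch) are assumptions for this sketch; the paper's actual equation system and screening strategy may differ.

```python
import numpy as np
from scipy.optimize import least_squares

def solve_joint_angles(detections, projections, forward_kinematics,
                       theta0, conf_threshold=0.5):
    """Estimate joint angles by minimizing multi-view reprojection error.

    detections:  per-view lists of (u, v, confidence) per keypoint
    projections: per-view 3x4 camera projection matrices, calibrated in advance
    forward_kinematics: maps joint angles -> (N, 3) keypoint positions
                        in the robot base frame
    """
    def residuals(theta):
        pts = forward_kinematics(theta)                   # (N, 3)
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
        res = []
        for P, det in zip(projections, detections):
            uv = (P @ pts_h.T).T
            uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide
            for (u, v, c), (up, vp) in zip(det, uv):
                if c >= conf_threshold:                   # screen out low-confidence keypoints
                    res.extend([up - u, vp - v])
        return np.asarray(res)

    return least_squares(residuals, theta0).x
```

Screening by confidence exploits the redundancy of multiple views: a keypoint occluded in one view can be dropped there while its observations in the other views still constrain the solution.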
The remainder of this paper is arranged as follows: In Section 2, we introduce the whole set of methods, including high-precision keypoint detection (Section 2.1), automatic virtual sample generation (Section 2.2) and robot arm attitude estimation (Section 2.3). Experiments on virtual and real robot arms are reported in Section 3. We conclude the paper in Section 4.
4. Conclusions
We have proposed a set of methods for accurately estimating robot arm attitude from multi-view images. By incorporating a subpixel convolution layer into the backbone neural network, the proposed SRKDNet outputs high-resolution heatmaps without significantly increasing computational resource consumption. A virtual sample generation platform and a keypoint detection mechanism based on dual SRKDNets were proposed to improve keypoint detection accuracy. The keypoint prediction accuracy for the real robot arm reaches 96.07% when the position deviation between the predicted and real keypoints is within 6 pixels. An equation system, involving the camera imaging model, the robot arm kinematic model and the keypoints detected with confidence values, was established and solved to obtain the rotation angles of the joints. The confidence-based keypoint screening scheme makes full use of the information redundancy of the multi-view images and proves effective in ensuring attitude estimation accuracy. Extensive experiments on virtual and real robot arm samples show that the proposed method significantly improves robot arm attitude estimation accuracy. The average estimation error of the joint angles of the real six-joint UR10 robot arm under three views is as low as 0.53 degrees, much lower than that of the comparison methods. The proposed method is thus well suited to industrial applications with high precision requirements for robot arm attitude estimation.
In the real triple-view monitoring scenario, the keypoint detection stage and the attitude-solving stage together took 0.37 s, with keypoint detection accounting for most of the time, since our method detects keypoints in multi-view images with dual SRKDNets. The efficiency of the proposed method is therefore lower than that of single-view-based methods.
In this study, we only conducted experiments on one UR10 robot arm. In the future, we will extend our method to real industrial scenes with more types of robot arms.