Applied Sciences
  • Article
  • Open Access

12 June 2024

FAGD-Net: Feature-Augmented Grasp Detection Network Based on Efficient Multi-Scale Attention and Fusion Mechanisms

1 School of Electrical Engineering and Automation, Xiamen University of Technology, Xiamen 361024, China
2 Xiamen Key Laboratory of Frontier Electric Power Equipment and Intelligent Control, Xiamen 361024, China
3 School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
4 School of Engineering Mathematics and Technology, Faculty of Engineering, University of Bristol, Beacon House, Queens Rd, Bristol BS8 1QU, UK

Abstract

Grasping robots always confront challenges such as uncertainties in object size, orientation, and type, necessitating effective feature augmentation to improve grasping detection performance. However, many prior studies inadequately emphasize grasp-related features, resulting in suboptimal grasping performance. To address this limitation, this paper proposes a new grasping approach termed the Feature-Augmented Grasp Detection Network (FAGD-Net). The proposed network incorporates two modules designed to enhance spatial information features and multi-scale features. Firstly, we introduce the Residual Efficient Multi-Scale Attention (Res-EMA) module, which effectively adjusts the importance of feature channels while preserving precise spatial information within those channels. Additionally, we present a Feature Fusion Pyramidal Module (FFPM) that serves as an intermediary between the encoder and decoder, effectively addressing potential oversights or losses of grasp-related features as the encoder network deepens. As a result, FAGD-Net achieved advanced levels of grasping accuracy, with 98.9% and 96.5% on the Cornell and Jacquard datasets, respectively. The grasp detection model was deployed on a physical robot for real-world grasping experiments, where we conducted a series of trials in diverse scenarios. In these experiments, we randomly selected various unknown household items and adversarial objects. Remarkably, we achieved high success rates, with a 95.0% success rate for single-object household items, 93.3% for multi-object scenarios, and 91.0% for cluttered scenes.

1. Introduction

With the improvement in robot autonomous manipulation, grasping has been extensively studied in recent years [1,2]. Artificial Intelligence (AI)-based robot grasping methods possess significant application prospects, such as in industrial workpiece sorting [3]. While traditional robot grasping approaches applied in industry with 2D or 3D vision require knowledge of the candidate object model (size, orientation and type), deep learning grasping methods do not rely on such information because of their strong learning and reasoning capabilities. However, the precise grasping of a random object still remains challenging because of uncertainty in target size, orientation, and type. Thus, there are issues that need further resolution in order to model the relationship between visual image and reliable grasping pose, which is required to extract grasp-related features effectively and generalize the learned knowledge to new objects.
Data-driven approaches such as deep learning [4,5,6,7,8] for robot grasping are becoming mainstream and showing great potential in unstructured scenarios. In the work [9], a pioneering representation method for grasping and detecting rectangular boxes was introduced. Subsequently, numerous deep learning methods were developed. Among them, detection-based approaches demonstrate superior performance. Detection-based approaches generate pixel-level heatmaps of grasp detection from an N-channel input feature, where each pixel indicates a distinct level of grasp confidence. Morrison et al. [4] proposed a Generative Grasping CNN structure to model the relationship between depth images and grasping positions. This approach was also adopted by Kumra et al. [6], who improved model performance by continuously stacking residual blocks. However, these models lack sufficient emphasis on extracting grasp-related features, resulting in suboptimal grasping performance.
In recent years, several studies have incorporated attention mechanisms into grasp detection networks, and their experiments have confirmed that attention mechanisms can enhance feature extraction efficiency and improve network performance. S. Wang et al. [10] integrated the self-attention mechanism of Transformer into a grasp model to learn global features. However, this approach may lead to the loss of some local features. The works [11,12] employed Squeeze-and-Excitation attention [13] in grasp detection networks to reweight the importance of channels. However, this method overlooks spatially dependent features, thereby limiting the performance of grasp models. Zhou et al. [14] incorporated Coordinate Attention [15] into a grasp model for capturing position-related features; however, their method ignores mutual features across the entire spatial position, and the convolution is not conducive to cross-channel interactions. Furthermore, many grasp detection networks rely on encoders and decoders. While the encoding process yields advanced grasp semantic features, the gradual reduction in resolution during encoding results in the loss of spatial features. It is noteworthy that the spatial features of objects are closely related to grasp poses, and multi-scale features can help the learning system gain a comprehensive understanding of the object’s appearance and structure, which significantly impacts robot grasping quality.
To overcome these problems, we developed the Feature-Augmented Grasp Detection Network (FAGD-Net) based on encoder–decoder architecture. Within this network, we introduced two modules to enhance its focus on grasp-related features. Firstly, we proposed a Residual Efficient Multi-Scale Attention (Res-EMA) module, aimed at effectively regulating the significance of feature channels while preserving precise spatial information. Additionally, we introduced a Feature Fusion Pyramidal Module (FFPM) as an intermediary between the encoder and decoder, effectively addressing potential oversights or losses of grasp-related features as the encoder network deepens. Summarizing, the main contributions of this paper are as follows:
(1)
A new module termed Residual Efficient Multi-Scale Attention (Res-EMA) was proposed to adjust the importance of the feature channels and ensure the preservation of accurate grasping space information within the channels.
(2)
A Feature Fusion Pyramidal Module (FFPM) was constructed between the encoder and decoder. This module aims to enhance the grasp-related features that may otherwise be overlooked or lost during the encoding process.
(3)
A Feature-Augmented Grasp Detection Network (FAGD-Net) was developed based on the Res-EMA and FFPM network frameworks, which is used for the real-time prediction of optimal grasp configuration. Experimental evaluations on the Cornell dataset and Jacquard dataset demonstrate exceptional performance, where the proposed method achieved high accuracy rates of 98.9% and 96.5%, respectively.
(4)
We integrated the network into an actual robot grasping system and conducted real-world grasping tests to validate the performance of our approach. It achieves advanced performance in both public datasets and real-world grasping tasks.

3. Methods

3.1. Grasp Configuration

To facilitate the representation of robot grasp detection, we abstract the grasp into the determination of a grasp point and a line segment, which offers a more concise and efficient representation compared with the use of a grasp rectangle [4,6]. Specifically, in the image space, the grasp configuration is detected from a depth image $D \in \mathbb{R}^{h \times w}$, which can be described as:
$$G_i = \left(x_i, y_i, \theta_i, W_i, Q\right) \tag{1}$$
where $(x_i, y_i)$ is the grasp center, $\theta_i$ is the rotation angle in image coordinates, with a range of $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$, and $W_i$ is the required grasp width in the image, within a range of $[0, W_{\max}]$. $Q$ is the grasp confidence corresponding to each pixel, with values between 0 and 1. The goal of robot grasping is to infer the grasp configuration $G_i$ with the maximum confidence from the detected grasp configuration, which can be described as:
$$G_i = \underset{Q}{\arg\max}\; G_i \tag{2}$$
In the robot’s 3D workspace, a grasp pose can be described as:
$$G_r = \left(P_r, \theta_r, W_r, Q\right) \tag{3}$$
where $P_r = (x, y, z)$ represents the center position of the gripper tip, $z$ is derived from the depth value of the predicted grasp point by coordinate transformation [30], $\theta_r$ denotes the gripper orientation around the Z-axis, $W_r$ denotes the closing width of the robot gripper, and $Q$ is the grasp quality score used to predict the probability of successful grasping.
To effectively perform robot grasping tasks in real-world scenarios, it is essential to translate the grasp configurations from the image space to the robot workspace. We utilize the hand–eye calibration method [30] to compute the transformation matrix from the camera coordinate system to the robot base coordinate system. This transformation is shown in Figure 1 and described in Equation (4) [6].
$$G_r = T_{cr}\left(T_{ic}\left(G_i\right)\right) \tag{4}$$
where $T_{ic}$ refers to converting the grasp configuration from the image coordinate system to the camera’s 3D coordinate system, while $T_{cr}$ refers to converting the camera’s 3D coordinate system to the robot’s world coordinate system.
Figure 1. The transformation of grasp configurations from the image space to the robot workspace.
In the experiment, within the robot’s workspace, the robot approached the object based on the predicted grasp position $P_r = (x, y, z)$, adjusted the gripper orientation according to the predicted angle $\theta_r$, and closed the gripper to grasp the object based on the predicted width $W_r$.
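The image-to-robot mapping of Equation (4) can be sketched for the grasp position alone. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera model with intrinsics `fx, fy, cx, cy` and a 4 × 4 homogeneous camera-to-base matrix from hand–eye calibration, and all function names are hypothetical. The transformation of the grasp angle and width is omitted.

```python
def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Deproject pixel (u, v) with depth in metres into camera coordinates.

    Assumes an ideal pinhole model (no lens distortion).
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def camera_to_robot(point_cam, T_cam_to_base):
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3D point."""
    homogeneous = (point_cam[0], point_cam[1], point_cam[2], 1.0)
    return tuple(
        sum(T_cam_to_base[r][c] * homogeneous[c] for c in range(4))
        for r in range(3)
    )


def grasp_image_to_robot(u, v, depth_m, intrinsics, T_cam_to_base):
    """Map a grasp center from image space to the robot base frame."""
    fx, fy, cx, cy = intrinsics
    point_cam = pixel_to_camera(u, v, depth_m, fx, fy, cx, cy)
    return camera_to_robot(point_cam, T_cam_to_base)
```

In this sketch the first function plays the role of $T_{ic}$ and the second the role of $T_{cr}$; composing them gives the grasp position $P_r$.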

3.2. The Proposed FAGD-Net for Grasping Detection

The architecture of the proposed grasping network, FAGD-Net, is depicted in Figure 2. The primary goal in designing the FAGD-Net framework is to equip the grasping system with robust feature extraction capabilities to ensure high accuracy in grasp detection while keeping its architecture lightweight. FAGD-Net employs an encoder–decoder architecture to enhance contextual understanding from input data.
Figure 2. The structure of the Feature-Augmented Grasp Detection Network (FAGD-Net).
Initially, a depth image undergoes feature extraction through a backbone network, which comprises only four down-sampling modules to keep the model lightweight. Each down-sampling module consists of convolutional layers, batch normalization, and ReLU activation functions. The first down-sampling module utilizes a 9 × 9 convolutional kernel, while subsequent modules use a 4 × 4 kernel. These larger convolutional kernels offer broad receptive fields and rich feature representations. The size of the output feature maps is halved after each down-sampling module. Subsequently, features are further processed through the Residual Efficient Multi-Scale Attention (Res-EMA) module, which adjusts the importance of feature channels to highlight effective features while preserving precise spatial information within channels. In the Res-EMA module shown in Figure 2, “g” denotes the number of groups along the feature channel dimension after passing through conv block 2, “X Avg Pool” represents global pooling along the horizontal direction, and “Y Avg Pool” indicates global pooling along the vertical direction. Further details are provided in the subsequent sections.
Following the encoder stage, we introduce the Feature Fusion Pyramidal Module (FFPM) between the encoder and decoder to enhance information utilization without compromising the receptive field. This effectively addresses potential oversights in grasp features during encoder down-sampling.
Finally, the up-sampling module is utilized to expand and restore image dimensions. It consists of transposed convolutional layers, batch normalization, and ReLU activation functions, with the transposed convolutional kernel size set to 4 × 4. Ultimately, the final grasp configuration is obtained through transposed convolution, completing the network’s comprehensive grasp detection process, as described by Equations (5) and (6).
$$
\begin{aligned}
F_1 &= f_{backbone}\left(I_{depth}\right) \\
F_2 &= f_{Res\text{-}EMA}\left(F_1\right) \\
F_3 &= f_{FFPM}\left(F_2\right) \\
F_4 &= f_{Up\text{-}sampling1}\left(F_3\right) \\
F_5 &= f_{Up\text{-}sampling2}\left(F_4\right)
\end{aligned}
\tag{5}
$$
where $F_1$, $F_2$, $F_3$, $F_4$, and $F_5$ represent the feature maps after passing through the backbone ($f_{backbone}$), Res-EMA module ($f_{Res\text{-}EMA}$), Feature Fusion Pyramidal Module ($f_{FFPM}$), the first up-sampling block ($f_{Up\text{-}sampling1}$), and the second up-sampling block ($f_{Up\text{-}sampling2}$), respectively.
$$
\begin{aligned}
QualityMap &= f_{trans\_conv}\left(F_5\right) \\
AngleMap_{\sin} &= f_{trans\_conv}\left(F_5\right) \\
AngleMap_{\cos} &= f_{trans\_conv}\left(F_5\right) \\
WidthMap &= f_{trans\_conv}\left(F_5\right)
\end{aligned}
\tag{6}
$$
where $f_{trans\_conv}$ represents the transposed convolution and $F_5$ denotes the feature map output by the up-sampling block. Ultimately, the network generates feature maps of grasp quality, grasp angle, and grasp width.
It is worth noting that we do not directly output the final grasping angle $\theta$. Instead, we separately output $\sin(2\theta)$ and $\cos(2\theta)$ to eliminate discontinuities near $\pm\frac{\pi}{2}$ and maintain a unique mapping between $\theta$ and the interval $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$. Finally, the grasping angle is generated using Equation (7).
$$\theta = \arctan\left(\frac{\sin(2\theta)}{\cos(2\theta)}\right) / 2 \tag{7}$$
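A minimal sketch of how the four output maps could be decoded into a single grasp follows. It is an illustration under stated assumptions rather than the authors' code: the maps are plain nested lists of equal shape, the function name is hypothetical, and `math.atan2` is used in place of the plain arctangent of Equation (7) so that the case $\cos(2\theta) = 0$ is handled without a division.

```python
import math


def decode_best_grasp(quality, sin2t, cos2t, width):
    """Return (row, col, theta, width, q) for the pixel with the highest quality.

    All four inputs are 2D nested lists with the same shape.
    theta is recovered as atan2(sin 2θ, cos 2θ) / 2, which lies in (-π/2, π/2].
    """
    best = None
    for r, row in enumerate(quality):
        for c, q in enumerate(row):
            if best is None or q > best[4]:
                theta = 0.5 * math.atan2(sin2t[r][c], cos2t[r][c])
                best = (r, c, theta, width[r][c], q)
    return best
```

Halving the angle after the arctangent is what makes the doubled-angle encoding work: $\theta$ and $\theta + \pi$ describe the same parallel-jaw grasp, and both map to the same $(\sin 2\theta, \cos 2\theta)$ pair.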

4. The Components of FAGD-Net

4.1. Res-EMA Module

Existing works claim that attention mechanisms play a crucial role in improving the expressive capability of grasping networks, as evidenced by studies such as [10,11,12,14]. However, there is still potential for optimization in these attention mechanisms. To bridge this gap, we bring the Efficient Multi-Scale Attention (EMA) [31] into the realm of grasping and build a novel Res-EMA module around it.
Figure 3 illustrates the structure of Res-EMA. The first part comprises the residual block structure [32], and the second part is EMA. Through skip connections, the features output by the backbone are aggregated with those from the attention module, helping the network avoid vanishing and exploding gradients. Within the residual block structure, there are two convolution modules. Convblock1 includes a 3 × 3 convolution, batch normalization, and Rectified Linear Unit (ReLU) activation. The second convolution module repeats the operations of the first convolution module, followed by a 3 × 3 convolution and batch normalization.
Figure 3. The Res-EMA module.
The second part consists of Efficient Multi-Scale Attention (EMA). Firstly, the features after convblock2, denoted as $F \in \mathbb{R}^{C \times H \times W}$, are divided into $G$ sub-features in the channel dimension. This grouping aims to learn distinct grasp semantic information, thereby enhancing the model’s expressive capacity. The grouping process can be described by Equation (8).
$$F = \left[F_0, F_i, \ldots, F_{G-1}\right], \quad F_i \in \mathbb{R}^{C /\!/ G \times H \times W} \tag{8}$$
Subsequently, similar to the operations in Coordinate Attention [15], given the input features, 1D global average-pooling is applied along the horizontal and vertical directions separately to encode spatial position information. Through this operation, positional information is encoded into the feature map to enhance sensitivity to spatial positions, highlighting feasible grasping regions. The 1D global average-pooling along the horizontal direction, denoted for the c-th channel over the spatial dimension H , can be described by Equation (9).
$$z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} f_c(H, i) \tag{9}$$
Similarly, the 1D global average pooling along the vertical axis $W$ for the c-th channel can be described by Equation (10):
$$z_c^W(W) = \frac{1}{H} \sum_{0 \le j \le H} f_c(j, W) \tag{10}$$
where $c$ indexes the input channels, $f_c$ denotes the input feature at the c-th channel, and $H$ and $W$ denote the spatial dimensions of the input features. The aforementioned operations are consistent with Coordinate Attention [15]. The shared components with Coordinate Attention are named the 1 × 1 branch, while the branch with the 3 × 3 kernel is named the 3 × 3 branch.
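The channel grouping of Equation (8) and the two directional poolings of Equations (9) and (10) can be illustrated on small nested lists. This is a minimal sketch with hypothetical function names; a single-channel feature map is represented as a list of rows, and the number of channels is assumed to be divisible by the number of groups.

```python
def split_into_groups(channels, groups):
    """Split a list of channels into `groups` consecutive sub-lists (Equation (8))."""
    size = len(channels) // groups  # integer division, C // G
    return [channels[i * size:(i + 1) * size] for i in range(groups)]


def pool_along_width(feature):
    """Average each row over the width: one value per height index (Equation (9))."""
    return [sum(row) / len(row) for row in feature]


def pool_along_height(feature):
    """Average each column over the height: one value per width index (Equation (10))."""
    height = len(feature)
    return [sum(col) / height for col in zip(*feature)]
```

For a 2 × 3 map `[[1, 2, 3], [4, 5, 6]]`, pooling along the width keeps one value per row and pooling along the height keeps one value per column, so each pooled vector still records where along the other axis a response occurred.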
After feature encoding along the two spatial directions, a shared 1 × 1 convolution is employed. Following the decomposition of the output of the 1 × 1 convolution into two vectors, the sigmoid function is utilized to fit a 2D binomial distribution on the linear convolution. Feature interaction across channels is achieved by using multiplication to aggregate two channel-wise attention maps within each group, facilitating comprehensive feature representation. The 3 × 3 branch employs a 3 × 3 convolutional operation to capture inter-channel feature interactions, thereby expanding the feature space.
The output of the 1 × 1 branch is aggregated with the output of the 3 × 3 branch across different spatial dimensions, enabling a more comprehensive aggregation of grasp-related features. Following the output of the 1 × 1 branch, 2D global average pooling is applied to encode global spatial features and model long-range dependencies. The 2D global average pooling method can be described by Equation (11).
$$z_c = \frac{1}{H \times W} \sum_{j}^{H} \sum_{i}^{W} f_c(i, j) \tag{11}$$
After 2D global average pooling, a Softmax function is applied to fit a linear transformation. Through matrix multiplication, a spatial attention map is obtained, collecting spatial information features at different scales. Similarly, in the 3 × 3 branch, 2D global average pooling is used to encode global spatial information. Another Softmax and matrix multiplication produce a second attention map, preserving precise spatial position information. Finally, the two sets of weights are aggregated, and a sigmoid function is employed to highlight the global context for all pixels.
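The cross-branch aggregation described above can be sketched for a single channel group. This is one plausible reading of the text, not the authors' implementation: each branch output is assumed to be a list of per-channel 2D maps of equal shape, the pooled and Softmax-normalized channel descriptor of one branch weights the maps of the other branch, and all names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def global_avg(feature_map):
    """2D global average pooling of one channel (Equation (11))."""
    return sum(sum(row) for row in feature_map) / (len(feature_map) * len(feature_map[0]))


def weighted_map(weights, maps):
    """Channel-weighted sum of 2D maps: a (1 x C) by (C x HW) matrix product."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(wt * m[r][c] for wt, m in zip(weights, maps)) for c in range(w)]
            for r in range(h)]


def cross_spatial_attention(branch_1x1, branch_3x3):
    """Combine both branches into one H x W attention map with values in (0, 1)."""
    w1 = softmax([global_avg(m) for m in branch_1x1])
    w3 = softmax([global_avg(m) for m in branch_3x3])
    map_a = weighted_map(w1, branch_3x3)  # 1x1 descriptor applied to 3x3 features
    map_b = weighted_map(w3, branch_1x1)  # 3x3 descriptor applied to 1x1 features
    return [[1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]
```

The resulting map has one weight per pixel, which is why it can rescale the group's features without discarding their spatial layout.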

4.2. Feature Fusion Pyramidal Module

The encoder progressively deepens the network, extracting higher-level semantic information. However, this deepening is accompanied by a reduction in feature resolution, potentially neglecting and gradually losing certain spatial features. Consequently, crucial grasp-related information may be compromised during the up-sampling stage, limiting the overall quality of grasp detection. To tackle this challenge, Liu et al. [29] effectively reduced feature information loss by employing spatial pyramid pooling. Nevertheless, a large dilation rate (rate = 6, 12, 18) in this context may result in suboptimal information utilization. To overcome these concerns, we introduce the Feature Fusion Pyramidal Module (FFPM), strategically designed with a small dilation rate. The FFPM acts as a crucial bridge between the encoder and decoder, effectively mitigating the aforementioned issues and contributing to improved grasp detection accuracy.
As illustrated in Figure 4, the Feature Fusion Pyramidal Module (FFPM) incorporates a multi-scale feature extraction process. Initially, the features are input into four parallel feature pyramids, each utilizing atrous convolution with different dilation rates of 1, 3, 3, and 6, respectively. This unique structure enhances information utilization while maintaining a broad receptive field. To enable comprehensive feature interaction, the extracted features from the pyramids are concatenated and added together. This process ensures a holistic representation of features across different scales. Subsequently, feature extraction is finalized through a convolution module coupled with a skip connection. This convolution module comprises a 3 × 3 convolution operation, batch normalization, and Rectified Linear Unit (ReLU) nonlinear activation. The intricate design of the FFPM promotes effective multi-scale feature extraction, fostering enhanced information integration within the grasping network.
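The trade-off between dilation rate and information utilization can be made concrete with the standard effective-kernel-size formula for atrous convolution. The sketch below assumes 3 × 3 atrous kernels, which the text does not state, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def skipped_pixels_between_taps(dilation):
    """Number of input pixels skipped between two adjacent kernel taps."""
    return dilation - 1


# Dilation rates named in the text for the FFPM, and the larger rates it is contrasted with.
small_rates = [effective_kernel_size(3, d) for d in (1, 3, 6)]
large_rates = [effective_kernel_size(3, d) for d in (6, 12, 18)]
```

Under this assumption the small rates span 3, 7, and 13 pixels, while the large rates span 13, 25, and 37 pixels with up to 17 skipped pixels between taps, which illustrates why large rates may sample a feature map too sparsely.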
Figure 4. The Feature Fusion Pyramidal Module.

4.3. Loss Function

In this paper, we treat the grasp detection problem as a regression task, framing it as the minimization of the error between the predicted grasp and its corresponding label. To achieve this, we employ the L1 loss function, which has the inherent advantage of diminished sensitivity to outliers, encompassing instances that are notably large or small. This characteristic renders the L1 loss function robust and stable for our training model. The mathematical representation of the L1 loss function is denoted as Equation (12).
$$
\mathrm{L1\,loss}(x) =
\begin{cases}
0.5x^2, & \text{if } |x| < 1 \\
|x| - 0.5, & \text{otherwise}
\end{cases}
\tag{12}
$$
In each prediction task, the loss for grasping position prediction can be expressed as Equation (13).
$$L_p = \frac{1}{N} \sum_{i}^{N} \mathrm{L1\,loss}\left(p_i - \hat{p}_i\right) \tag{13}$$
where $p_i$ and $\hat{p}_i$ represent the ground truth and predicted values of the grasp quality score, respectively.
The loss for prediction of the grasping angle can be expressed as Equations (14) and (15).
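$$L_{\cos 2\theta} = \frac{1}{N} \sum_{i}^{N} \mathrm{L1\,loss}\left(\cos 2\theta_i - \cos 2\hat{\theta}_i\right) \tag{14}$$
$$L_{\sin 2\theta} = \frac{1}{N} \sum_{i}^{N} \mathrm{L1\,loss}\left(\sin 2\theta_i - \sin 2\hat{\theta}_i\right) \tag{15}$$
where $\cos 2\theta_i$ and $\cos 2\hat{\theta}_i$ represent the ground truth and predicted values of the grasp angle score, respectively. Similarly, $\sin 2\theta_i$ and $\sin 2\hat{\theta}_i$ are also the ground truth and predicted values of the grasp angle score.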
The loss for the predicted grasp width can be expressed as Equation (16).
$$L_w = \frac{1}{N} \sum_{i}^{N} \mathrm{L1\,loss}\left(w_i - \hat{w}_i\right) \tag{16}$$
where $w_i$ and $\hat{w}_i$ represent the ground truth and predicted values of the grasp width score, respectively.
Thus, the total loss for grasping detection can be expressed as the function in Equation (17):
$$L_{total} = L_p + L_w + L_{\cos 2\theta} + L_{\sin 2\theta} \tag{17}$$
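The loss terms of Equations (12) to (17) can be written out directly. The sketch below is illustrative only: it operates on flat Python lists rather than tensors, and the dictionary keys and function names are hypothetical. Note that the piecewise form of Equation (12) is what deep learning frameworks usually call the smooth L1 loss.

```python
def smooth_l1(x):
    """Piecewise loss of Equation (12): quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def mean_loss(targets, predictions):
    """Average of Equation (12) over N paired values (Equations (13) to (16))."""
    return sum(smooth_l1(t - p) for t, p in zip(targets, predictions)) / len(targets)


def total_loss(target, pred):
    """Sum of the four map losses (Equation (17)).

    `target` and `pred` are dicts with keys 'pos', 'width', 'cos', 'sin',
    each holding a flat list of per-pixel values.
    """
    return sum(mean_loss(target[k], pred[k]) for k in ("pos", "width", "cos", "sin"))
```

The four terms are added with equal weight, as in Equation (17); any re-weighting would be a departure from the text.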

5. Experimental Setup and Materials

5.1. Experiment Platform

The robotic grasping platform is depicted in Figure 5. The system comprises a UR5 robotic arm, a RealSense D435i camera, and a parallel gripper (Robotiq-2F85). The RealSense D435i is mounted at the end of the robotic arm, and the camera is connected to the computer. Additionally, the UR5 robotic arm is connected to the controller along with the parallel gripper, and they are connected to the PC via an Ethernet cable. Communication between the PC and the controller is achieved through sockets. We use the hand–eye calibration method [30] to obtain the transformation matrix and the camera’s intrinsic parameters; the process is achievable within the Robot Operating System (ROS). PyTorch serves as the deep learning framework for both model training and prediction, running on an NVIDIA RTX3090 GPU (Nvidia, Santa Clara, CA, USA) with Ubuntu 18.04. For optimization, we employed the Adam optimizer with a learning rate set at 0.001. Additionally, we adopted a data split ratio of 9:1 for training and testing purposes.
Figure 5. Robotic platform and grasping objects.
We randomly selected a variety of unknown objects for grasping tests, which included household items and adversarial objects. The household items encompassed 16 different categories, such as mice, adhesive tapes, charging cables, screwdrivers, fruit models, medications, toiletries, beverages, etc. These items exhibited diverse colors, shapes, and sizes. To comprehensively validate the robot’s grasping generalization capability, we randomly selected adversarial objects characterized by abstract geometric properties and uncertain surface features. The test objects are shown in Figure 5. In real-world grasping experiments, objects are randomly placed on a tabletop to create experimental conditions of single-object, multi-object, and cluttered scenes. The main challenge lies in the uncertainty in object shapes, sizes, placements, and quantities. Therefore, the model trained on the Jacquard dataset was used to infer grasp configurations in these scenes.
Specifically, the robot moved to an initial position approximately 0.5 m away from the tabletop. The camera captured a depth image, which was then input into FAGD-Net. The model generated grasp configurations, each including grasp position, angle, and width. Then, the grasp configuration with the highest confidence was transformed into the robot coordinate system. Finally, the robot adjusted the gripper pose based on the predicted grasp position and angle, approached the target, and closed the gripper to grasp the object based on the predicted width. A successful grasp was recorded when the object was grasped and lifted at least 10 cm.

5.2. Dataset and Processing

The Jacquard Grasping Dataset [33] is a comprehensive resource in the field of robotic grasping, offering rich grasp-related information. It comprises 54,000 RGB-D images with a resolution of 1024 × 1024 pixels. The dataset includes 1,181,330 unique grasp annotations for training purposes and requires no pre-processing.
The Cornell Grasping Dataset [34] is a renowned dataset in robotic grasping. It comprises 1035 RGB-D images capturing 240 distinct real-world objects, with images at a resolution of 640 × 480 pixels. Among these, there are 5110 positive grasps and 2909 negative grasps annotated within the dataset.
However, the Cornell Grasping Dataset is relatively small in scale. To address this limitation, we pre-processed the data by augmenting it with random crops, zooms, and rotations.

5.3. Evaluation Index

The evaluation metrics utilized in this study are widely recognized and have been extensively applied to both the Cornell and Jacquard datasets. Our evaluation criteria consist of two conditions as expressed in Equation (18). Firstly, the Intersection over Union (IoU) between the labeled grasp and the predicted grasp must be greater than 25%. Additionally, the angle deviation between the grasping direction of the predicted grasp and the actual label must be less than 30°.
$$
\begin{cases}
\left|A_p - A_L\right| < 30^\circ \\
\dfrac{\left|G_p \cap G_L\right|}{\left|G_p \cup G_L\right|} > 0.25
\end{cases}
\tag{18}
$$
where $A_p$ is the grasping prediction angle, $A_L$ is the grasping label angle, $G_p$ is the grasping prediction, $G_L$ is the grasping label, $G_p \cap G_L$ represents the intersection of the grasping prediction and the grasping label, and $G_p \cup G_L$ represents the union of the grasping prediction and the grasping label.
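A minimal sketch of the two-part criterion in Equation (18) follows. It makes two simplifying assumptions that are not taken from the text: each grasp region is given as a set of covered pixel coordinates, so that the IoU becomes a set ratio, and the angle difference is wrapped with a period of 180° because a parallel-jaw grasp is unchanged by a half turn. All names are hypothetical.

```python
def angle_difference_deg(a_pred, a_label):
    """Smallest difference between two grasp angles in degrees (180-degree periodic)."""
    diff = abs(a_pred - a_label) % 180.0
    return min(diff, 180.0 - diff)


def iou(pixels_pred, pixels_label):
    """Intersection over Union of two sets of (row, col) pixel coordinates."""
    union = pixels_pred | pixels_label
    if not union:
        return 0.0
    return len(pixels_pred & pixels_label) / len(union)


def is_correct_grasp(a_pred, a_label, pixels_pred, pixels_label):
    """A prediction counts as correct if both conditions of Equation (18) hold."""
    return (angle_difference_deg(a_pred, a_label) < 30.0
            and iou(pixels_pred, pixels_label) > 0.25)
```

In practice the regions would be rasterized from the rotated grasp rectangles of the prediction and the label; that step is left out here to keep the example short.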

6. Results and Analysis

6.1. Grasp Detection in the Jacquard Dataset

In this task, the Jacquard dataset was used, with 90% of the images used for training and the remaining 10% for grasping testing. We resized the images to 240 × 240, and during network training, the batch size was set to 8.
In Figure 6, we visualize the detection results on the Jacquard dataset, which include representations of robot grasping points and grasping line segments, along with a heatmap indicating the confidence of grasping for each pixel and the corresponding grasping angles and widths. In experimental testing, our model achieves an accuracy of 96.5% on the Jacquard dataset. The visualized grasp configurations lie on regions of the objects that the robot can readily grasp. Additionally, our model is lightweight, with approximately 260 K parameters compared with the 1.8 million parameters of GR-CNN [6]. This is because GR-CNN repetitively stacks residual blocks in the middle of the network to enhance the model’s ability to extract grasp-related features, but this approach comes at the expense of efficiency and parameter count.
Figure 6. Grasping detection results in the Jacquard dataset.
As recorded in Table 1, our method outperforms works [4,6,33] by 12.5%, 1.9%, and 22.3%, respectively. These methods have relatively simple network architectures and do not emphasize the extraction of grasp-related features sufficiently. For instance, GG-CNN [4] only uses ordinary convolutions for down-sampling features and then up-samples them to output grasp configurations. Furthermore, our model outperforms the latest QQGNN [22] by 1.5%. Although QQGNN employs depth-wise separable convolution to make the grasping model lightweight, this approach may limit grasp-related feature learning as it only allows each channel to interact with its corresponding convolution kernel during the depth-wise convolution stage while neglecting inter-channel correlations.
Table 1. Comparison of results in the Jacquard dataset.
Some other methods adopt attention mechanisms to efficiently extract relevant grasp features. Our model outperforms the methods in [10,11,12,14] by 1.9%, 0.8%, 1.7%, and 0.3%, respectively. The work [10] uses the self-attention mechanism of transformers to obtain global features. However, this approach may inadvertently discard certain local features, thus resulting in suboptimal performance. On the other hand, the works [11,12] employ the Squeeze-and-Excitation (SE) attention mechanism. While it is effective in isolating important channel features, it overlooks spatial information within the network and lacks the ability to handle multi-scale features. Although AAGDN [14] focuses on spatial position information, it neglects the importance of interactions among spatial positions as a whole. Moreover, the limited receptive field of 1 × 1 convolution kernels hinders the modeling and utilization of local cross-channel interactions and contextual information. In contrast, our network addresses these issues through the Res-EMA module, which effectively adjusts the importance of feature channels while preserving precise spatial information within the channels. As a result, our grasp detection network achieves an accuracy of 96.5% on the Jacquard Grasping Dataset.

6.2. Grasp Detection in the Cornell Dataset

To mitigate potential overfitting during model training on the Cornell dataset, we integrated Dropout regularization and employed the Adam optimizer with a learning rate set at 0.001 while maintaining a batch size of 8.
In Figure 7, we visualize the grasping detection results, where the first row displays the detected grasp points and grasp lines. It is evident that each grasp configuration is located within the graspable region. The subsequent rows show heatmaps of the grasp configuration for each pixel in the image, with higher-confidence grasp positions appearing closer to red. As a result, our model achieves a grasping accuracy of 98.9%. In addition, we conducted inference speed tests on the Cornell dataset, which required only 23 milliseconds of inference time per image, thus ensuring real-time performance for robot applications.
Figure 7. Grasping detection results in the Cornell dataset.
In Table 2, we present a comparison of our test results with those of other algorithms on the Cornell dataset. Compared with the well-known GG-CNN [4], our method exhibits a performance improvement of 25.9%. Our method also outperforms GR-CNN [6] by 1.2%. The main reason for the improvement lies in the other algorithms’ relatively simple network architectures and limited capability in feature extraction. Finally, our algorithm surpasses the state-of-the-art algorithms [10,11] by 0.9% and 0.7%, respectively. These results indicate that our model is also competitive on the Cornell dataset.
Table 2. Comparison of results in the Cornell dataset.

6.3. Ablation Experiment

6.3.1. FFPM and Res-EMA Ablation Tests

To assess the impact of each module on network performance, we conducted ablation studies using the Jacquard dataset, in which four sets of different tests were carried out.
In test 1, our network did not incorporate any additional modules. In test 2, we added the Feature Fusion Pyramidal Module (FFPM) without including the Residual Efficient Multi-Scale Attention (Res-EMA) module. In test 3, we included the Res-EMA module without the FFPM. Finally, in test 4, both the Res-EMA module and FFPM were integrated into the network.
Table 3 presents the results of the model ablation experiment. Comparing test 2 with test 1, the addition of the FFPM resulted in a 1.1% increase in grasp detection accuracy, demonstrating its effectiveness in enhancing performance. Similarly, comparing test 3 with test 1, the inclusion of the Res-EMA module led to a 1.2% improvement. In test 4, where both modules were utilized simultaneously, a 2.1% increase in performance was observed compared with test 1, suggesting that the combined use of both modules yielded the best grasping detection performance.
Table 3. The results of the ablation study.
Figure 8 shows sample grasp detection results with and without the Res-EMA module. Adding the Res-EMA module leads to more accurate detection of grasp positions.
Figure 8. The Res-EMA module ablation study. (a) The results with the Res-EMA module. (b) The results without the Res-EMA module.

6.3.2. Comparison among Various Attention Mechanisms

To further validate the advantages of our proposed Res-EMA attention mechanism, we incorporated two alternative attention mechanisms, Residual Squeeze-and-Excitation (Res-SE) and Residual Coordinate Attention (Res-CoA) [11,14], into our model in the same manner. Additionally, we included a baseline without any attention mechanism, denoted as the residual block. The positions of the Res-CoA and Res-SE modules within the network are depicted in Figure 9.
Figure 9. Different attention mechanisms.
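For readers unfamiliar with the comparison modules, the following is a minimal NumPy sketch of squeeze-and-excitation channel reweighting wrapped in a residual connection. It illustrates the general SE idea only: the weight shapes, the reduction ratio `r`, and the placement of the skip connection are assumptions and may differ from the Res-SE module used in these experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def res_se(x, w_reduce, w_expand):
    """x: feature map of shape (C, H, W). Returns a map of the same shape."""
    squeezed = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)  # excitation: linear + ReLU -> (C // r,)
    weights = sigmoid(w_expand @ hidden)           # linear + sigmoid -> per-channel weights
    reweighted = x * weights[:, None, None]        # scale every channel by its weight
    return x + reweighted                          # residual (skip) connection

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4                            # hypothetical sizes and reduction ratio
x = rng.standard_normal((C, H, W))
w_reduce = rng.standard_normal((C // r, C))
w_expand = rng.standard_normal((C, C // r))

y = res_se(x, w_reduce, w_expand)
print(y.shape)
```

Note that this form of attention produces one weight per channel and therefore carries no spatial information, which is the limitation that coordinate attention and multi-scale attention variants aim to address.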
Training was conducted for 50 epochs on the Jacquard dataset in a consistent environment. After each epoch, we evaluated grasping performance on approximately 5499 previously unseen images. To ensure fairness, all models shared the same network architecture apart from the attention module. The experimental results are presented in Figure 10. Among the residual block, Res-CoA, Res-SE, and our Res-EMA modules, only Res-EMA reaches a peak accuracy of 96.5%. This indicates that the Res-EMA module is better at focusing on grasp-related features, which improves the detector’s ability to identify the optimal grasping region.
Figure 10. Experimental results of various attention mechanisms.
In Figure 11, the first column displays the grasp detection results with our Res-EMA module, the second column shows the results with the Res-CoA module, the third column presents the results with the Res-SE module, and the last column shows the results of the baseline with residual blocks and no attention mechanism.
Figure 11. Grasping detection results with different attention mechanisms. The first to fourth columns depict grasping detection using our Res-EMA module, the Res-CoA module, the Res-SE module, and residual blocks, respectively.
The grasps produced with the Res-EMA module are more accurate and easier for the robot to execute. The other attention modules also contribute to effective grasping, but their precision is comparatively lower, and the model with only residual blocks performs the worst. For the small clamp in the last column of the fourth row of Figure 11, the detected grasp configuration is invalid because this model fails to extract effective grasp-related features. In contrast, the model with the Res-EMA module efficiently extracts spatial location features to ensure precise grasp positions. Moreover, extracting features at multiple scales helps the robot grasp objects of different sizes.

6.3.3. Comparison among Different Feature Fusion Methods

We conducted experimental evaluations of the Feature Fusion Pyramidal Module (FFPM). As the encoder progressively reduces the resolution, certain grasp-related features may be neglected and gradually lost. The decoder may then fail to recover crucial information during up-sampling, resulting in suboptimal performance. Common approaches to this issue are feature addition and feature concatenation, in which low-level features are passed to the decoder through skip connections to compensate for features disregarded during encoding. While feature addition and concatenation are effective, we argue that the proposed FFPM preserves grasp-related features more effectively than either. The mechanisms of feature addition and concatenation are illustrated in Figure 12.
Figure 12. Different feature fusion methods.
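To make the two baseline fusion schemes concrete, the sketch below applies them to an encoder feature map and a decoder feature map of the same spatial size. The shapes are hypothetical. The point is only that addition keeps the channel count while concatenation doubles it, so the layer that follows a concatenation must accept more input channels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical (channels, height, width) feature maps.
encoder_feat = rng.standard_normal((32, 56, 56))   # low-level features from the encoder
decoder_feat = rng.standard_normal((32, 56, 56))   # decoder features up-sampled to the same size

# Feature addition: element-wise sum, shape unchanged.
fused_add = encoder_feat + decoder_feat

# Feature concatenation: stack along the channel axis.
fused_cat = np.concatenate([encoder_feat, decoder_feat], axis=0)

print(fused_add.shape)
print(fused_cat.shape)
```

Addition requires both inputs to have identical shapes and mixes the two sources irreversibly, whereas concatenation keeps them separate and leaves the mixing to the next learned layer.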
We trained for 50 epochs on the Jacquard dataset, evaluating the model on the test set after each epoch. The highest accuracy achieved during evaluation is depicted in Figure 13. We compared three methods: the FFPM, feature addition, and feature concatenation. The FFPM performed best, achieving an accuracy of 96.5%, whereas feature addition and feature concatenation achieved lower accuracy, indicating that the FFPM is better at preserving grasp-related features.
Figure 13. Experimental results of different feature fusion methods.
In Figure 14, we visualize the grasp detection results using different feature fusion methods. The FFPM yields more precise grasp configurations; specifically, the grasp positions generated with the FFPM closely align with the object’s center of mass, facilitating more effective grasping. The FFPM is effective because it acts as a bridge between the encoder and decoder, preserving grasp-related features and mitigating their loss as the encoder deepens.
Figure 14. Grasping detection results with different feature fusion methods. The first to third columns depict grasping detection using our FFPM, feature addition, and feature concatenation, respectively.

6.4. Robotic Grasping

6.4.1. Single-Object and Multi-Object Scenes

In the robotic grasping task, we first conducted a single-object grasping test on 16 unknown household objects varying in shape, size, and color. Each object was placed at random locations within the robot’s workspace. In total, we performed 160 grasping attempts, of which 152 were successful, giving a grasping success rate of 95.0%. These results demonstrate the effectiveness of the proposed FAGD-Net in grasping unknown household objects. We compared the success rate of our method with those of classical and state-of-the-art algorithms, as shown in Table 4. The comparison indicates that our method achieves a competitive success rate in single-object grasping tasks.
Table 4. Results on single objects.
Furthermore, we conducted grasping tests on adversarial objects with abstract shapes and irregular surfaces. We attempted 90 grasps, of which 82 were successful, giving a success rate of 91.1%. These tests demonstrate that our grasp detection model performs well even on objects with peculiar shapes. Figure 15 illustrates grasping in single-object scenarios, showing the grasp detection results for both household and adversarial objects, followed by accurate grasping by the robot. Additionally, we visualize the grasp configurations as heatmaps to present the feature maps intuitively. The grasp heatmaps show that the model effectively perceives graspable regions in single-object scenes, with grasp confidence varying across the pixels of the object. Finally, the robot executes the grasp configuration with the highest confidence. These results show that FAGD-Net can effectively grasp previously unseen objects in single-object scenarios.
Figure 15. Single-object grasping experiment.
To assess the effectiveness of our grasp detection model in multi-object scenarios, we randomly selected objects to construct a scene with multiple targets, in which each object was randomly placed on a tabletop. During each trial, the robot identified and executed the grasp with the highest confidence level. A total of 120 grasping attempts were conducted, resulting in 112 successful grasps and achieving an average success rate of 93.3%. The robot demonstrated high-performance grasping even in multi-object scenarios, facilitated by the model’s pixel-level learning, which effectively extracts relevant grasp features and enables the model to generalize well even in the presence of multiple objects. The grasp detection results in a multi-object scenario are depicted in Figure 16, where different colors represent varying confidence levels. We visualize the feature maps of the multi-object scenario to observe the distribution of grasp configurations at each pixel. In the grasp heatmap, it can be observed that the model effectively generalizes to multi-object scenarios, perceiving grasp configurations for each pixel. Finally, the robot selects the grasp configuration with the highest confidence for execution.
Figure 16. Grasping in multi-object scenes.

6.4.2. Robot Grasping in Cluttered Scenes

To further validate the effectiveness of our grasp detection algorithm in challenging cluttered environments, we conducted grasping tests by randomly placing 10 to 15 different objects to create a cluttered scenario. The robot executed grasping tasks in these cluttered environments. As shown in Figure 17, in 200 attempts at cluttered grasping, our system achieved an impressive success rate of 91% (182/200). This can be attributed to the model’s pixel-level grasp configuration learning, enabling the extraction of grasp-relevant features and facilitating generalization. Thus, the model can effectively infer grasp configurations in cluttered scenes, and the robot also executes the optimal grasp based on the inference results. The experimental results demonstrate the effectiveness of our grasp detection model even when the targets are cluttered.
Figure 17. Robot grasping in different cluttered scenes.
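The success rates reported for the robot experiments follow directly from the attempt counts given in the text. The short script below recomputes them as a consistency check; the dictionary keys are descriptive labels chosen here, not names from the paper.

```python
# (successful grasps, total attempts) as stated in Section 6.4.
trials = {
    "single-object (household)": (152, 160),
    "single-object (adversarial)": (82, 90),
    "multi-object": (112, 120),
    "cluttered": (182, 200),
}

# Success rate in percent for each scenario.
rates = {name: 100.0 * ok / total for name, (ok, total) in trials.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

Rounded to one decimal place, the four rates are 95.0%, 91.1%, 93.3%, and 91.0%, matching the figures quoted in the text.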
Furthermore, Table 5 presents a comparative analysis of the performance of different algorithms when handling objects in cluttered scenes. The proposed method outperforms existing similar methods and effectively grasps unknown objects in cluttered settings. The main reason for the suboptimal performance of these methods is insufficient feature extraction. The network design in [4] is overly simplistic, which limits the expressive capability of the model. The researchers in [8] utilized the Region of Interest (ROI) feature extraction method from Fast R-CNN. However, the ROI extractor resizes each candidate box to a fixed size, which can cause information loss or deformation and thus affect subsequent feature extraction. The studies [19,40] used a cascaded-model approach, in which errors may accumulate from stage to stage, potentially resulting in larger final output errors than our single-model approach. Although the grasp success rate of [41] is close to ours, its feature extraction network uses six dense blocks, each containing convolutional modules, whereas our backbone employs only four convolutional modules. The method in [42] uses two branches in the grasping network, one of which is dedicated to generating bounding boxes; if the detector fails to detect objects, the grasp becomes ineffective, degrading performance. Our model overcomes the shortcomings of the aforementioned methods.
Table 5. Results in cluttered scenarios.

7. Discussion

We conducted comprehensive experiments to evaluate the effectiveness of our grasp detection algorithm, FAGD-Net. First, we achieved accuracies of 96.5% and 98.9% on the Jacquard and Cornell grasp datasets, respectively, showcasing the robustness of our approach. Comparative analysis with existing algorithms further highlights the advantages of FAGD-Net.
Subsequently, we conducted ablation studies to validate the contributions of the proposed Res-EMA module and FFPM. Additionally, we introduced Res-SE and Res-CoA modules based on the Squeeze-and-Excitation (SE) and Coordinate Attention (CoA) mechanisms from previous grasp works [13,14]. Comparisons with these modules and with the plain residual block demonstrated the effectiveness of the Res-EMA module. Furthermore, we compared feature addition and feature concatenation with our FFPM to highlight its advantages.
Finally, real-world robot grasping experiments were conducted, achieving success rates of 95.0% for single-object scenes, 93.3% for multi-object scenes, and 91.0% for cluttered scenes.
In the tests, we conducted grasping experiments using diverse objects that varied in size, orientation, and type, and FAGD-Net demonstrated strong generalization capability. FAGD-Net is also applicable in other environments, such as tabletops with textured backgrounds and low-light settings, mainly because textured backgrounds and low-light conditions do not severely affect depth information acquisition.
Although FAGD-Net demonstrated satisfactory grasping capabilities, some issues were identified. Occasional collisions with other objects occurred during the grasping process, leading to the accidental removal of an object from the scene. Additionally, instances of grasping multiple objects simultaneously were observed, as depicted in Figure 18. There are two main types of failed grasps: (1) when the robot approaches the object to be grasped, the gripper is blocked by cluttered objects, leaving insufficient space for the parallel-jaw gripper; (2) the grasped target slips from the parallel-jaw gripper because its center of gravity is unstable. Further improvements and optimizations are required to address these challenges and enhance the overall performance of the algorithm.
Figure 18. Suboptimal grasping results.
It is worth noting that we select the grasp configuration with the highest confidence inferred from the image. Attributes of the target, such as its density and rigidity, are not accounted for in the model. The current focus of this study is primarily on object grasping and placing. We will further study and address these limitations in the future.
Additionally, the grasped objects in this study have varied shapes that make them well suited to grasp configurations for parallel grippers. In some scenarios, such as objects with flat surfaces or objects that are easily deformed, vacuum or magnetic grippers may be more efficient.

8. Conclusions

This work proposes a high-performance grasp detection model named FAGD-Net for predicting optimal grasp configurations. The model contains two core modules, the Res-EMA module and the FFPM. The Res-EMA module adjusts the importance of feature channels while preserving accurate spatial information features, and the FFPM effectively addresses the issue of overlooked or lost grasp-related features during the down-sampling process in the encoder. We conducted experiments on public grasp datasets and robot experimental platforms to validate the approach. Our proposed model achieves accuracy rates of 98.9% and 96.5% on the Cornell and Jacquard datasets, respectively. Moreover, we conducted robot grasping experiments in various scenes, achieving success rates of 95.0% for single-object scenes, 93.3% for multi-object scenes, and 91.0% for cluttered scenes. The experimental results demonstrate the effectiveness and advancement of the proposed model.

Author Contributions

Conceptualization, X.Z.; methodology, X.L.; software, X.Z. and X.L.; validation, X.L., Q.L., and X.Z.; formal analysis, T.G. and Y.S.; data curation, X.Z. and X.L.; writing—original draft preparation, X.Z., X.L., and Q.L.; writing—review and editing, X.Z. and X.L.; visualization, X.L. and X.Z.; supervision, H.H.; project administration, X.Z. and Q.L.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 61703356, in part by the Natural Science Foundation of Fujian Province under Grants 2022J011256 and 2020J01285, and in part by the Xiamen Natural Science Foundation (3502Z20227215).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, Y.; Wu, X.; Geng, P.; Li, Z. Evolution strategies learning with variable impedance control for grasping under uncertainty. IEEE Trans. Ind. Electron. 2019, 66, 7788–7799.
  2. Li, G.; Li, N.; Chang, F.; Liu, C. Adaptive Graph Convolutional Network with Adversarial Learning for Skeleton-Based Action Prediction. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 1258–1269.
  3. Solowjow, E.; Ugalde, I.; Shahapurkar, Y.; Aparicio, J.; Mahler, J.; Satish, V.; Goldberg, K.; Claussen, H. Industrial Robot Grasping with Deep Learning using a Programmable Logic Controller (PLC). In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 20–21 August 2020; pp. 97–103.
  4. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201.
  5. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv 2017, arXiv:1703.09312.
  6. Kumra, S.; Joshi, S.; Sahin, F. Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021.
  7. Teng, Y.; Gao, P. Generative Robotic Grasping Using Depthwise Separable Convolution. Comput. Electr. Eng. 2021, 94, 107318.
  8. Zhang, H.; Lan, X.; Bai, S.; Zhou, X.; Tian, Z.; Zheng, N. ROI-based Robotic Grasp Detection for Object Overlapping Scenes. In Proceedings of the 2019 IEEE International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4768–4775.
  9. Lenz, I.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724.
  10. Wang, S.; Zhou, Z.; Kan, Z. When transformer meets robotic grasping: Exploits context for efficient grasp detection. IEEE Robot. Autom. Lett. 2022, 7, 8170–8177.
  11. Yu, S.; Zhai, D.-H.; Xia, Y.; Wu, H.; Liao, J. SE-ResUNet: A novel robotic grasp detection method. IEEE Robot. Autom. Lett. 2022, 7, 5238–5245.
  12. Cao, H.; Chen, G.; Li, Z.; Lin, J.; Knoll, A. Residual squeeze-and-excitation network with multi-scale spatial pyramid module for fast robotic grasping detection. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13445–13451.
  13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  14. Zhou, Z.; Zhu, X.; Cao, Q. AAGDN: Attention-Augmented Grasp Detection Network Based on Coordinate Attention and Effective Feature Fusion Method. IEEE Robot. Autom. Lett. 2023, 8, 3462–3469.
  15. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
  16. Chu, F.; Xu, R.; Vela, P.A. Real-world multiobject, multigrasp detection. IEEE Robot. Autom. Lett. 2018, 3, 3355–3362.
  17. Yan, Y.; Tong, L.; Song, K.; Tian, H.; Man, Y.; Yang, W. SISG-Net: Simultaneous Instance Segmentation and Grasp Detection for Robot Grasp in Clutter. Adv. Eng. Inform. 2023, 58, 102189.
  18. Suwoyo, H.; Hidayat, T.; Jia-nan, F. A Transformable Wheel-Legged Mobile Robot. Int. J. Eng. Contin. 2023, 2, 27–39.
  19. Liu, D.; Tao, X.; Yuan, L.; Du, Y.; Cong, M. Robotic Objects Detection and Grasping in Clutter based on Cascaded Deep Convolutional Neural Network. IEEE Trans. Instrum. Meas. 2021, 71, 1–10.
  20. Zhang, H.; Zhou, X.; Lan, X.; Li, J.; Tian, Z.; Zheng, N. A real-time robotic grasping approach with oriented anchor box. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 3014–3025.
  21. Cao, H.; Chen, G.; Li, Z.; Feng, Q.; Lin, J.; Knoll, A. Efficient grasp detection network with Gaussian-based grasp representation for robotic manipulation. IEEE/ASME Trans. Mechatron. 2022, 28, 1384–1394.
  22. Fu, K.; Dang, X. Light-Weight Convolutional Neural Networks for Generative Robotic Grasping. IEEE Trans. Ind. Inform. 2024, 10, 3353841.
  23. Tian, H.; Song, K.; Li, S.; Ma, S.; Yan, Y. Lightweight Pixel-Wise Generative Robot Grasping Detection Based on RGB-D Dense Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 1–12.
  24. Li, W.; Lambert-Garcia, R.; Getley, A.C.M.; Kim, K.; Bhagavath, S.; Majkut, M.; Rack, A.; Lee, P.D.; Leung, C.L.A. AM-SegNet for additive manufacturing in situ X-ray image segmentation and feature quantification. Virtual Phys. Prototyp. 2024, 19, e2325572.
  25. Ma, H.; Han, G.; Peng, L.; Zhu, L.; Shu, J. Rock thin sections identification based on improved squeeze-and-Excitation Networks model. Comput. Geosci. 2021, 152, 104780.
  26. Qi, J.; Liu, X.; Liu, K.; Xu, F.; Guo, H.; Tian, X.; Li, M.; Bao, Z.; Li, Y. An improved YOLOv5 model based on visual attention mechanism: Application to recognition of tomato virus disease. Comput. Electron. Agric. 2022, 194, 106780.
  27. Shaar, F.; Yılmaz, A.; Topcu, A.E.; Alzoubi, Y.I. Remote Sensing Image Segmentation for Aircraft Recognition Using U-Net as Deep Learning Architecture. Appl. Sci. 2024, 14, 2639.
  28. Fan, Z.; Liu, K.; Hou, J.; Yan, F.; Zang, Q. JAUNet: A U-shape Network with Jump Attention for Semantic Segmentation of Road Scenes. Appl. Sci. 2023, 13, 1493.
  29. Liu, Y.; Bai, X.; Wang, J.; Li, G.; Li, J.; Lv, Z. Image Semantic Segmentation Approach Based on DeepLabV3 Plus Network with an Attention Mechanism. Eng. Appl. Artif. Intell. 2024, 127, 107260.
  30. Tsai, R.Y.; Lenz, R.K. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. IEEE Trans. Robot. Autom. 1989, 5, 345–358.
  31. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Computer Vision—ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part IV; Springer: Cham, Switzerland, 2016; pp. 630–645.
  33. Depierre, A.; Dellandréa, E.; Chen, L. Jacquard: A Large Scale Dataset for Robotic Grasp Detection. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3511–3516.
  34. Yun, J.; Moseson, S.; Saxena, A. Efficient grasping from RGBD images: Learning using a new rectangle representation. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 3304–3311.
  35. Zhou, X.; Lan, X.; Zhang, H.; Bai, S.; Tian, Z.; Zhang, Y.; Zheng, N. Fully convolutional grasp detection network with oriented anchor box. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7223–7230.
  36. Ainetter, S.; Fraundorfer, F. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13452–13458.
  37. Chen, L.; Huang, P.; Li, Y.; Meng, Z. Edge-dependent efficient grasp rectangle search in robotic grasp detection. IEEE/ASME Trans. Mechatron. 2020, 26, 2922–2931.
  38. Li, Y.; Huang, P.; Ma, Z.; Chen, L. A Context-Free Method for Robust Grasp Detection: Learning to Overcome Contextual Bias. IEEE Trans. Ind. Electron. 2021, 69, 13121–13130.
  39. Laili, Y.; Chen, Z.; Ren, L.; Wang, X.; Deen, M.J. Custom Grasping: A Region-Based Robotic Grasping Detection Method in Industrial Cyber-Physical Systems. IEEE Trans. Autom. Sci. Eng. 2022, 20, 88–100.
  40. Yu, Y.; Cao, Z.; Liu, Z.; Geng, W.; Yu, J.; Zhang, W. A Two-Stream CNN with Simultaneous Detection and Segmentation for Robotic Grasping. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 1167–1181.
  41. Asif, U.; Tang, J.; Harrer, S. Densely supervised grasp detector (DSGD). In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8085–8093.
  42. Li, T.; Wang, F.; Ru, C.; Jiang, Y.; Li, J. Keypoint-based robotic grasp detection scheme in multi-object scenes. Sensors 2021, 21, 2132.
