Datasets. We employ the REAL275 [
16] dataset for training and testing the proposed network. REAL275 consists of RGB-D images of objects captured in real-world scenes, divided into multiple categories, each containing several instances. The objects exhibit diverse poses, appearances, and materials, effectively reflecting the complexity found in practical applications. The dataset comprises 8000 real-world images, and this authenticity and diversity are crucial for evaluating the model's performance in real-world applications.
Implementation Details. To validate the effectiveness of the proposed network, we follow the methodologies of GPV-Pose [
14] and HS-Pose [
20] in the experimental setup. In the input stage, the target regions segmented by Mask R-CNN are back-projected into three-dimensional point clouds, from which 1028 points are uniformly sampled as the network input. The data augmentation strategies and loss functions of these baselines are retained. The network is implemented with the PyTorch deep learning framework and runs on a computer equipped with two NVIDIA GeForce RTX 3090 GPUs. During training, the Ranger optimizer is employed with a cosine learning rate schedule and weight decay. The model is trained for 150 epochs with a batch size of 16. Regarding hyperparameter settings, in the statistical attention module the channel dimension is reduced from 1024 to 512.
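The cosine learning-rate schedule can be sketched as follows; `base_lr` here is an illustrative placeholder rather than the paper's stated value, and the Ranger optimizer itself comes from a third-party package:

```python
import math

def cosine_lr(epoch: int, total_epochs: int = 150, base_lr: float = 1e-4) -> float:
    """Cosine annealing: the rate decays smoothly from base_lr at epoch 0
    toward 0 at total_epochs. base_lr = 1e-4 is an illustrative placeholder,
    not the configuration used in the paper.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# halfway through the 150 epochs the rate has halved
print(cosine_lr(75))  # ≈ 5e-05
```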
In terms of the choice of order, an interesting property emerges: changing the order does not alter the features themselves but only the way the attention weights are computed from them. The features extracted by the network are fixed, and statistical attention merely computes higher-order statistics of these features (which can be understood as different ways of expressing the same features) without modifying them. The order can therefore be varied between training and inference. Accordingly, we fix a single order during training, while during validation we test the available orders in detail and select the best-performing one for the final result.
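A minimal sketch of how order-k statistics could drive channel weights while leaving the features untouched; the moment stacking and sigmoid gate below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def statistical_weights(feats: np.ndarray, max_order: int) -> np.ndarray:
    """Channel weights from the first `max_order` statistics of `feats`.

    feats: (C, N) array of C channels over N points. The features are only
    read, never modified; raising `max_order` changes how the weights are
    computed, not the features themselves.
    """
    mean = feats.mean(axis=1, keepdims=True)              # 1st-order statistic
    stats = [mean[:, 0]]
    for k in range(2, max_order + 1):
        stats.append(((feats - mean) ** k).mean(axis=1))  # k-th central moment
    pooled = np.stack(stats).mean(axis=0)                 # fuse the orders
    return 1.0 / (1.0 + np.exp(-pooled))                  # sigmoid gate in (0, 1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 1028))
before = feats.copy()
w = statistical_weights(feats, max_order=7)
assert np.array_equal(feats, before)                      # features unchanged
```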
4.1. Ablation Studies
Impact of different components on performance. In
Table 1, we investigate the effect of various components on model performance. The first row shows the baseline model, GPV-Pose built on 3D-GC layers. Replacing the original 3D-GC layer with the HS layer, as seen in the second row, yields substantial improvements: gains of 14.5, 12.3, 13.6, and 9.4 on the 5°2 cm, 5°5 cm, 10°2 cm, and 10°5 cm metrics, respectively. This enhancement is attributed to the HS layer's ability to exploit global geometric relationships within features, enabling better handling of complex object geometries. The final row shows further improvements achieved by integrating the SA module alongside the HS layer. The final model achieves mAP scores of 49.1, 57.5, 72.2, and 84.5 on the 5°2 cm, 5°5 cm, 10°2 cm, and 10°5 cm metrics, respectively, with peak mAP scores of 79.8 for 2 cm and 61.2 for 5°. The SA module enhances the model's ability to capture long-range dependencies and fine-grained differences between objects by exploiting higher-order statistical information.
The impact of statistical attention positions. This section evaluates how the insertion position of statistical attention affects model performance, as shown in
Table 2. The first row reports the baseline model, obtained by replacing the original 3D-GC of GPV-Pose with the HS layer. Statistical attention is then inserted at five different positions: after the global features in the encoder (Position1), before the global features in the encoder (Position2), before the convolutional block with dimension 1024 in the pose regression module (Position3), after that block (Position4), and after the first convolutional block with dimension 256 in the pose regression module (Position5), as illustrated in
Figure 1. From
Table 2, it can be observed that Position1, Position3, and Position4 exhibit the best overall performance. The performance at other positions is slightly lower.
The impact of dimensions in statistical attention on performance. This section explores how changing the channel dimension in statistical attention with a 1D convolution affects model performance. The initial input channel dimension is D = 1024. A 1D convolution is then used to decrease or increase D, decoupling channel correlations and yielding a changed dimension D′.
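The dimension change amounts to a kernel-size-1 Conv1d, i.e., a per-point channel-mixing matrix multiply; a sketch with illustrative random weights:

```python
import numpy as np

def conv1d_1x1(feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 Conv1d as a channel-mixing matrix multiply.

    feats: (D, N) with D input channels over N points; weight: (D', D).
    A kernel-size-1 Conv1d changes only the channel dimension, here used
    to move D = 1024 down to D' = 512 (or up, e.g., to 2048).
    """
    return weight @ feats

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 1028))          # 1028 sampled points, 1024 channels
w = rng.normal(size=(512, 1024)) * 0.01    # reduce channels 1024 -> 512
y = conv1d_1x1(x, w)
assert y.shape == (512, 1028)
```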
Table 3 presents the impact of various dimensions on model performance, including 256, 512, 1024, and 2048. It can be observed that the model achieves the best overall performance when the dimension is reduced to 512. Under the remaining metrics, the mAP scores are 83.0, 82.0, and 75.0, comparable to the baseline model. Specifically, for the 5°2 cm, 5°5 cm, 10°2 cm, and 10°5 cm metrics, the scores are 47.7, 56.5, 70.9, and 84.3, respectively, and the model achieves 79.7 and 60.0 for object translation and rotation (2 cm and 5°). Overall, these results surpass the baseline model and the other configurations, demonstrating the best performance.
The impact of the number of statistical attention modules. The previous ablation on placement revealed that models with attention at Position1, Position3, and Position4 achieved the best results. In this section, we further compare the impact of the number of statistical attention modules using these three positions. The results, as shown in
Table 4, cover adding two modules (Position1+3, Position1+4, and Position3+4) and three modules (Position1+3+4). Overall, the best results are obtained with Position1+4, i.e., statistical attention after the global features in the encoder (Position1) and after the convolutional block with dimension 1024 in the pose regression module (Position4), as illustrated in
Figure 1.
The impact of statistical attention orders. This section evaluates the impact of using different orders of statistical attention on model performance, as shown in
Table 5. First, the second row of the table shows that using only first-order statistics already yields a significant improvement, since the first-order statistic (the global mean) effectively summarizes each position across the whole feature map. As the order of the statistics increases, performance generally trends upward; the model reaches its best result on the 10°5 cm metric, an improvement of 0.5 over using only first-order statistics. Note that the order was varied only during inference; a fixed order was used during model training. Finally, increasing the order further brings no additional gains and can even degrade performance, possibly because excessively high-order statistical information harms the model. We therefore use all statistics of order less than or equal to 7.
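The inference-time order sweep described above can be sketched as a simple capped argmax; the per-order scores below are purely illustrative numbers, not results from the paper:

```python
def pick_best_order(scores, max_order=7):
    """Select the statistic order (capped at max_order) with the best
    validation score. As in the ablation, the order is swapped only at
    inference time; training uses one fixed order throughout.
    """
    return max((k for k in scores if k <= max_order), key=scores.__getitem__)

# hypothetical per-order validation mAPs (illustrative numbers only)
scores = {1: 84.2, 3: 84.5, 5: 84.6, 7: 84.7, 9: 84.4}
assert pick_best_order(scores) == 7        # orders above 7 are excluded
```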
Comparison with different attention modules on the object pose estimation task. In this section, we compare SAPENet with several classic attention mechanisms, including SE, CBAM, and ECA. For a fair evaluation, we utilize the same backbone network and place the attention modules at identical positions. The results, presented in
Table 6, demonstrate that SAPENet outperforms all the other attention methods, achieving mAP scores of 83.1, 82.1, and 74.7 on the respective metrics. Specifically, SAPENet attains 49.2, 57.6, 72.7, and 84.7 on the 5°2 cm, 5°5 cm, 10°2 cm, and 10°5 cm metrics, and 80.3 and 61.4 for object translation and rotation (2 cm and 5°), respectively. The superior performance of SAPENet can be ascribed to its integration of higher-order statistical information, unlike previous attention modules. This capability enables SAPENet to model relationships more effectively, capturing long-range dependencies while also attending to fine-grained differences between objects.
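For intuition about the gap, compare a mean-only channel gate in the spirit of SE (a real SE block additionally passes the pooled vector through a two-layer FC bottleneck, omitted here for brevity) with statistical attention, which also injects higher-order moments:

```python
import numpy as np

def mean_only_gate(feats: np.ndarray) -> np.ndarray:
    """Channel gate driven by the global mean alone, a first-order
    statistic, in the spirit of SE (the real SE block adds a two-layer
    FC bottleneck after pooling). Statistical attention differs by also
    feeding higher-order moments into the gate.
    """
    return 1.0 / (1.0 + np.exp(-feats.mean(axis=1)))

x = np.zeros((4, 16))
g = mean_only_gate(x)
assert np.allclose(g, 0.5)   # zero-mean features gate to sigmoid(0) = 0.5
```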
Comparison of the parameters and computational complexity between the proposed method and HS-Pose. Table 7 presents a comparison of the parameter count and computational cost between our method and HS-Pose. The first row of the table shows HS-Pose, which has 6.1 M parameters and a computational cost of 25.5 G as measured by us. Our method adds 0.3 M parameters and 0.7 G of computation relative to HS-Pose. Despite these modest increases, performance improves correspondingly. Overall, our method achieves a reasonable balance between cost and performance.
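Parameter totals like those in Table 7 can be reproduced by summing weight-tensor sizes; the layer shapes below are a hypothetical example, not the actual SAPENet layers:

```python
import math

def param_count_millions(layer_shapes) -> float:
    """Sum the sizes of a model's parameter tensors, reported in millions
    (M), the unit used in Table 7."""
    return sum(math.prod(shape) for shape in layer_shapes) / 1e6

# hypothetical example: a 1x1 Conv1d reducing 1024 -> 512 channels
# (weight of shape (512, 1024, 1) plus a bias of shape (512,))
print(round(param_count_millions([(512, 1024, 1), (512,)]), 3))  # 0.525
```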
4.2. Comparison with State-of-the-Art Methods
Comparison with the state-of-the-art methods on the REAL275 dataset is shown in
Table 8. The upper part of the table lists methods that use RGB-D input during inference, while the lower part lists methods that perform pose estimation from depth only. Our SAPENet achieves the best performance on four metrics, 5°2 cm, 5°5 cm, 10°2 cm, and 10°5 cm, with scores of 49.2, 57.6, 72.7, and 84.7, respectively. Among the RGB-D methods in the upper part, classic approaches such as NOCS, DualPoseNet, SPD, CR-Net, and SGPA are compared. NOCS achieves the highest mAP of 84.9, likely owing to its use of point-cloud-based shape priors. Interestingly, our method outperforms it on the other metrics without the benefit of shape priors, demonstrating the effectiveness of our approach. Overall, SGPA achieves the best results in the upper part of the table, but a gap remains relative to our method.
Moving to the lower part of the table, which covers depth-only methods, we compare recent approaches such as FS-Net, SAR-Net, RBP-Pose, GPV-Pose, and HS-Pose. GPV-Pose achieves the highest mAP of 83.0, with our method trailing by only 0.9, while our method outperforms GPV-Pose on most other metrics. In addition, HS-Pose achieves the second-highest results on the 5°2 cm, 5°5 cm, 10°2 cm, and 10°5 cm metrics; however, a gap remains relative to our method, with SAPENet outperforming HS-Pose by 4.1 on the 10°2 cm metric. These results demonstrate the effectiveness of the proposed SAPENet for pose estimation tasks. Finally, we visualize the results of our approach in
Figure 3.