Article

WoodenCube: An Innovative Dataset for Object Detection in Concealed Industrial Environments

School of Mathematical Sciences, Zhejiang University of Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(18), 5903; https://doi.org/10.3390/s24185903
Submission received: 23 August 2024 / Revised: 6 September 2024 / Accepted: 10 September 2024 / Published: 11 September 2024
(This article belongs to the Section Sensing and Imaging)

Abstract

With the rapid advancement of intelligent manufacturing technologies, the operating environments of modern robotic arms are becoming increasingly complex. In addition to the diversity of objects, there is often a high degree of similarity between the foreground and the background. Although traditional RGB-based object-detection models have achieved remarkable success in many fields, they still struggle to detect targets whose textures resemble the background. To address this issue, we introduce the WoodenCube dataset, which contains over 5000 images of 10 different types of blocks. All images are densely annotated with object-level categories, bounding boxes, and rotation angles. We also propose a new evaluation metric, Cube-mAP, to more accurately assess the detection performance on cube-like objects. In addition, we develop a simple yet effective framework for WoodenCube, termed CS-SKNet, which captures strong texture features in the scene by enlarging the network’s receptive field. The experimental results indicate that CS-SKNet achieves the best performance on the WoodenCube dataset, as evaluated by the Cube-mAP metric. We further evaluate CS-SKNet on the challenging DOTAv1.0 dataset, where the consistent improvement demonstrates its strong generalization capability.

1. Introduction

With the rapid development of warehousing, logistics, internet, and artificial intelligence technologies, robots have been widely applied in various industrial scenarios [1,2,3,4,5]. As advanced manufacturing demands increasingly higher performance, industrial robotic arms are no longer limited to simple pick-and-place tasks [6,7,8,9]. They now also need to quickly detect object types and accurately localize them to efficiently complete grasping tasks [10,11]. To mitigate the high costs associated with depth cameras, more and more manufacturers have turned to RGB cameras for object recognition and localization [12]. However, a key challenge in this transition is improving the accuracy of object detection using RGB cameras [13]. Some researchers have attempted to directly apply existing object-detection algorithms [14,15], such as convolutional neural networks (CNNs [13]), to the task of object recognition for robotic arms, combined with hand–eye calibration techniques [16] (e.g., eye-in-hand or eye-to-hand) to facilitate grasping. While traditional object-detection algorithms perform well in tasks where targets are clearly visible with high contrast [17,18,19], they face difficulties in scenes where foreground and background textures are similar. In 2020, Fan et al. conducted the first systematic study on concealed object detection (COD) [20], providing a crucial foundation for addressing object detection in complex environments. This pioneering work has subsequently garnered increasing attention from scholars [21,22,23]. However, existing COD tasks predominantly focus on natural images [20,24,25,26], with a scarcity of similar tasks and datasets in the industrial domain. This shortage severely limits the capability of robotic arms to accurately identify and grasp target objects in complex industrial environments, thereby constituting a critical challenge in industrial automation processes.
Since robotic arms require precise knowledge of both the center coordinates and the rotation angle of the target for successful grasping, traditional mAP, a key evaluation metric in object detection, may be insufficient. mAP evaluates detection accuracy by first calculating the IoU (Intersection over Union) [27] between predicted and ground truth bounding boxes, then determining the accuracy based on a given IoU threshold. However, if the center pixels of the predicted and ground truth boxes are roughly the same but the angle differs significantly, mAP may still classify the detection as accurate. For robotic arms, such rotational discrepancies are critical, as they directly affect the precision and stability of the grasp process. Many researchers have proposed improved algorithms based on IoU [28,29,30]. One innovative approach involves modeling the rotated bounding box using a 2D Gaussian distribution [31]. By calculating the Gaussian Wasserstein distance, this method approximates the loss incurred by the non-differentiable rotational IoU, aligning model learning more closely with accuracy measurement and resolving issues related to loss inconsistency. Nevertheless, a notable challenge emerges in the context of square objects: their Gaussian model becomes isotropic (equal variances with zero off-diagonal covariance), so rotating the box leaves the distribution unchanged and the computed loss stays at 0 even as the IoU changes.
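The square-object failure mode can be checked numerically. The sketch below is our own illustration (not code from [31]), using NumPy/SciPy; the covariance constant $w^2/12$ is one common box-to-Gaussian conversion and does not affect the conclusion. It computes the Gaussian Wasserstein distance between a unit square and the same square rotated by 45° and obtains essentially zero, even though the IoU between the two boxes is far from 1.

```python
import numpy as np
from scipy.linalg import sqrtm

def gwd_squared(m1, S1, m2, S2):
    # Squared 2-Wasserstein distance between Gaussians N(m1, S1) and N(m2, S2).
    s2h = sqrtm(S2)
    cross = sqrtm(s2h @ S1 @ s2h)
    return float(np.sum((m1 - m2) ** 2) + np.real(np.trace(S1 + S2 - 2 * cross)))

def box_cov(w, h, theta):
    # Covariance of a w x h box modeled as a Gaussian; isotropic whenever w == h.
    R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    return R @ np.diag([w**2 / 12.0, h**2 / 12.0]) @ R.T

m = np.zeros(2)
print(gwd_squared(m, box_cov(1, 1, 0.0), m, box_cov(1, 1, np.pi / 4)))  # ~0: rotation is invisible
```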
To solve these issues, we introduce WoodenCube, a dataset specifically focused on concealed object detection in industrial robotic arm grasping scenarios. This dataset was collected in the facade track wooden cube scene, as shown in Figure 1a, featuring up to 10 types of wooden blocks randomly scattered on the surface with similar textures. Figure 1b illustrates three different types of wooden blocks. To enhance annotation efficiency and accuracy, we propose a semi-automatic annotation method based on SAM [32], called Cube Semi-automatic Segment Anything (CS-SAM). Given the high texture similarity between the foreground and background in industrial settings, we initially provide a horizontal prediction box as the initial detection range for SAM. We then use the SAM model to obtain a high-precision mask. Further, by combining actual robotic arm operations and hand–eye calibration techniques, we can directly obtain an initial detection range. Integrating this technique with the SAM model significantly improves the efficiency and accuracy of the annotation process.
Inspired by the work of Yang et al. [31] and Llerena et al. [33], we introduce a novel evaluation method, G/2-ProbIoU, tailored for scenarios with a large number of square-like features in practical applications. We first convert the rotated bounding box into a 2D Gaussian distribution, and then we perform a carefully designed stretching operation on the Gaussian distribution to ensure it does not become isotropic. Finally, by calculating the Bhattacharyya coefficient between the two Gaussian distributions, we accurately measure the overlap between the two bounding boxes. Based on the G/2-ProbIoU threshold, we propose a metric called Cube-mAP to comprehensively evaluate the detection and recognition performance of rotated bounding boxes. By applying this novel Cube-mAP evaluation metric to current models, we can more precisely measure model performance in object detection tasks and efficiently guide model performance optimization. Especially in handling rotated objects, Cube-mAP can capture rotational errors that traditional evaluation methods might overlook, leading to significant performance improvements in this crucial dimension. This improvement enhances the model’s practicality while establishing a solid foundation for future research and applications in the field.
In exploring robotic arm grasping tasks, we inevitably encounter challenges in accurately estimating the poses of target objects. Although existing object-detection models perform well within certain frameworks [34,35,36], they often struggle with predicting rotational components, which is a key factor restricting precise grasping. Inspired by concepts from LSKNet [37] and feature pyramid networks, we propose a multi-scale pyramid-shaped large-kernel convolution network, the Cross-Scale Selective Kernel Network (CS-SKNet). By expanding the network’s receptive field, CS-SKNet captures the rich and distinct texture features that are present in the scene, thereby leveraging the multi-scale feature extraction advantages of FPN and additionally enhancing the model’s performance in complex scenarios.
The core function of the CS selection sub-block in the network is to dynamically adjust its receptive field according to actual needs. This functionality is primarily achieved through the uniquely designed CS-SK core module, which integrates pyramid-shaped large-kernel convolutions and a spatial kernel selection mechanism, allowing the network to flexibly adjust its spatial resolution and contextual information capture capability when confronted with inputs of varying scales and complexities. The MLP sub-block, on the other hand, plays a crucial role in channel mixing and feature refinement within the neural network. It comprises an initial convolution layer, followed by group convolution layers, the GELU activation function, and an additional convolution layer. Collectively, these operations not only enhance the interaction and fusion between features but also refine feature representations through nonlinear transformations, significantly improving the quality and diversity of the resulting feature maps.
The main contributions of our paper are summarized as follows:
  • This paper addresses the challenges of industrial robots in recognizing and localizing objects in complex scenes: We introduce the WoodenCube dataset, which explores the detection of objects with textures similar to the background in industrial scenarios. The dataset comprises 5113 densely annotated images of 10 different types of blocks, significantly improving annotation efficiency and accuracy through a semi-automatic annotation method, CS-SAM.
  • To tackle the issue that mAP cannot effectively evaluate rotation scales due to the covariance of Gaussian distribution for nearly square features being zero, we propose G/2-ProbIoU and define Cube-mAP based on this function to more accurately assess the detection performance of models on cube-like objects.
  • To address the challenges of traditional convolutional methods in distinguishing objects with similar textures, we design a multi-scale pyramid-shaped large-kernel convolutional network, CS-SKNet. This network expands the receptive field to capture strong texture features in the scene, retaining the multi-scale feature extraction advantages of FPN while further enhancing the model’s performance in complex scenarios.
  • Extensive experiments conducted on the WoodenCube dataset and DOTAv1.0 dataset demonstrate the competitive performance of the proposed CS-SKNet model in terms of rotated object detection accuracy. On the WoodenCube dataset, our model achieved a Cube-mAP score of 72.64%. On the DOTAv1.0 dataset, our model achieved an mAP of 79.17% while maintaining low parameter count and FLOPs.

2. Related Works

2.1. Grasp Detection

Object grasping is a fundamental problem with widespread applications in industry, agriculture, and services. Although traditional manual teaching methods can achieve efficient task execution, they become impractical when frequent changes to robot programming are required due to environmental or other factors [38]. Among deep-learning-based grasp-detection methods, the use of RGB-D image input to detect graspable rectangles has gained popularity. Lenz et al. [39] proposed a cascaded network method that first eliminates unlikely grasps, then uses a larger network to evaluate the remaining grasps. Redmon et al. [40] proposed a different network structure that directly regresses the grasping pose in a single step, making it faster and more accurate. Beyond RGB-D images, researchers have also explored point cloud-based approaches for grasp detection, which offer a different perspective on the 3D environment. Qi et al. [41] first proposed PointNet, which learns features directly from raw point cloud input, and Qin et al. [42] extended its use to predict grasping poses in 3D space. However, despite the promising results, these methods still face significant computational challenges, limiting their widespread adoption in industrial settings.
Moreover, since the introduction of AlexNet [43,44], 2D image object detection has become increasingly mature, with models such as Fast-RCNN [45,46] and YOLO [18,47]. Liu et al. [48] proposed a grasp-pose-detection method based on a cascaded convolutional neural network, using Mask R-CNN to extract object grasp features and candidate bounding boxes. Karaoguz et al. [49] employed transfer learning based on CNN object-detection architectures to achieve grasp pose detection from RGB images. However, deep learning methods based on 2D images primarily focus on recognizing everyday objects, while industrial environments are more complex.
Concealed object detection (COD) [20] aims to detect objects that are visually highly integrated with their surrounding environment. This technology has been widely applied in areas such as species conservation [50], medical image segmentation [51], and industrial defect detection [52]. In 2020, Fan et al. released the first large-scale public dataset, COD10K [20], advancing the field of concealed object detection. This release has inspired further research in related disciplines. For example, Mei [21] proposed a distraction-aware framework for camouflaged object segmentation, which has potential applications in identifying transparent materials in natural scenes as well. Similar issues exist in the operational scenarios of industrial robotic arms, where complex environments require target detection algorithms to identify objects with highly similar foreground and background. Currently, the main camouflaged object detection datasets are CAMO [25], CHAMELEON [53], and COD10K [20], primarily consisting of images from everyday scenes. Notably, however, the industrial sector still lacks relevant large-scale datasets. To bridge this gap, we contribute a novel dataset specifically tailored for concealed object detection in industrial settings, along with the development of effective object-detection algorithms tailored for industrial applications.

2.2. Bounding Box Regression

Bounding box regression (BBR) is a fundamental component of object-detection systems, which predicts the coordinates of bounding boxes around objects in an image. The effectiveness of BBR directly impacts the overall performance of object-detection models. Various loss functions have been proposed to improve the accuracy and robustness of BBR, including those introduced by Felzenszwalb et al. [54] (2009), Girshick et al. [55] (2014), Beery et al. [56] (2020), and Wu et al. [57] (2020). IoU (Intersection over Union) is the most widely used loss function, owing to its ability to accurately describe the match between the predicted box and the ground truth box. However, its limitation becomes evident when the overlap is zero, as it fails to accurately describe their positional relationship, especially the rotational relationship. To address this, Rezatofighi et al. proposed GIoU [28], which considers the containment relationship and spatial distribution between bounding boxes. However, GIoU loses effectiveness when the ground truth box completely covers the predicted box. To tackle this issue, Zheng et al. proposed DIoU [29], which improves scale inconsistency by considering the distance between the center points of bounding boxes. However, when the center point of the predicted box coincides with that of the ground truth box, DIoU degenerates into the original IoU. Further, Zheng et al. proposed CIoU [58], which simultaneously considers the center point distance and aspect ratio. However, the aspect ratio defined in CIoU is relative rather than absolute.
Yang et al. [31] proposed a method that utilizes a two-dimensional Gaussian distribution to model rotated bounding boxes. By calculating the Gaussian Wasserstein distance, this approach approximates the loss caused by the non-differentiable rotated IoU, aligning model learning with accuracy measurement and addressing the inconsistency in the loss. However, if the object to be detected is a square, the resulting Gaussian distribution is isotropic and therefore unchanged by rotation, so the computed loss remains 0 while the IoU varies.
Recognizing the limitations of previous Gaussian-based methods, particularly in handling square objects, we introduce G/2-ProbIoU, a novel approach that evaluates the overlap between two Gaussian-distributed bounding boxes. This method incorporates a carefully designed stretching operation specifically tailored to handle squares effectively. We further propose Cube-mAP, a new evaluation metric based on G/2-ProbIoU, to comprehensively assess the detection and recognition performance of rotated bounding boxes.

2.3. Attention Mechanisms

Attention mechanisms [59] represent a simple yet powerful approach to enhancing neural representations across a wide range of tasks. Channel attention mechanisms, such as SE blocks [60], leverage global average information to reweigh feature channels. Spatial attention modules, including GENet [61], GCNet [62], and SGE [63], bolster the network’s capacity to model contextual information through spatial masks. CBAM [64] and BAM [65] integrate both channel and spatial attention. Additionally, kernel selection emerges as an adaptive and efficient dynamic context modeling technique. CondConv [66] and dynamic convolution [67] utilize parallel kernels to adaptively aggregate features across multiple convolutional kernels.
SKNet [68] introduces multiple branches with different convolutional kernels and selectively combines them along the channel dimension. SCNet [69] innovates with self-calibrated convolutions, enhancing convolutional transformations at each layer. ResNet [70] proposes a modular architecture that applies channel attention to different network branches, leveraging their success in capturing cross-feature interactions and learning diverse representations.
Building upon the above works, LSKNet [37] presented a lightweight detection backbone network. By weighting features processed through large-kernel convolutions and spatially merging them, it dynamically adjusts spatial receptive fields, enabling better modeling of contextual information across various object ranges in remote sensing scenes. RepLKNet [71], as a purely CNN architecture, achieves a kernel size of 31 × 31 for large-kernel convolutions, significantly exceeding the commonly used 3 × 3. Research has shown that CNNs employing large-kernel convolutions possess more effective receptive fields compared to those with smaller kernels, subsequently introducing greater shape biases into the network. Furthermore, the integration of large-kernel convolutions with residual structures effectively boosts the model’s performance. The FPN [72] model leverages the intrinsic multi-scale, pyramid hierarchical structure of deep convolutional networks to construct feature pyramids with almost no additional cost. Drawing inspiration from these methods, we propose the CS-SKNet backbone network, which incorporates the benefits of large-kernel convolutions within a multi-scale pyramid-like structure. This design aims to expand the network’s receptive field and effectively capture strong-texture features within the scene.

3. WoodenCube Dataset

Figure 2 shows that the WoodenCube dataset in this paper provides data with varying heights, apertures, and exposures under various complex textures. Each cube is placed on a baseboard with a texture similar to the cube itself, and there is a significant resemblance between certain types, such as Semi2Hole4 and Semi2Hole5, which undoubtedly increases the difficulty of object detection. The dataset includes multiple views of each block, ensuring completeness and facilitating future work in block coding. This section will describe the main characteristics of the WoodenCube dataset and how we were able to construct such a dataset in a short period.

3.1. Data Collection

To construct a high-quality dataset, we integrated a high-resolution 5-megapixel RGB camera with a precise six-axis robotic arm, as shown in Figure 3. The detailed specifications of the equipment are listed in Table 1. To ensure the comprehensive coverage and completeness of the dataset, we flexibly adjusted key parameters, including shooting distance, exposure, and gain through programming.
Furthermore, we incorporated various cubes as subjects, each having similar basic trajectory planes but possessing unique characteristics. Additionally, for every adjustment in shooting distance, we meticulously operated the camera’s focus adjustment ring to ensure each image captured met the highest clarity standards.

3.2. Data Annotation

Ultimately, we obtained the WoodenCube dataset, which comprises 5113 RGB images with a resolution of 2448 × 2048 pixels, covering 30 different target wooden cubes. Figure 4 illustrates the specific number of frames for each class and their distribution.
The annotation format of the WoodenCube dataset is "$x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4$, class, difficulty". Currently, many publicly available datasets are annotated manually, a process that requires considerable manpower and resources. Moreover, due to the differences among annotators and their subjectivity, manual annotation introduces a significant amount of uncertainty. Therefore, we propose the use of the Segment Anything Model (SAM) segmentation algorithm for the semi-automatic annotation of the dataset. In the SAM segmentation algorithm [32], directly using the entire image as the segmentation scope does not yield the expected results and can result in significant errors.
To ensure that the algorithm can achieve high-accuracy masks for specific objects, we first annotate several reference points on the image of wooden cubes. Although this improves the segmentation results, in scenes with strong textures, the edges of the wooden cubes and the background textures are very similar, leading to some errors in the segmented mask at the edges. Additionally, when only one reference point is annotated, it fails to guarantee complete coverage of the entire cube by the mask; while annotating multiple reference points yields better results, it significantly increases the annotation workload. Therefore, this paper proposes providing a horizontal bounding box to annotate the object in the image. Experimental results demonstrate that annotating a horizontal bounding box as the detection range for the object, prior to applying the SAM algorithm for segmentation, achieves the highest mask accuracy. Figure 5 illustrates the effects of the three methods above of assisted annotation using the SAM algorithm.
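For reference, a minimal sketch of the box-prompted SAM step described above, using the public segment-anything API; the checkpoint path, image file, and box coordinates are placeholders, not values from the paper.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and bind it to one WoodenCube image (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("cube_scene.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A coarse horizontal box around one cube serves as the prompt; SAM returns a
# high-precision binary mask restricted to that region.
prompt_box = np.array([900, 700, 1150, 950])            # x0, y0, x1, y1 (illustrative)
masks, scores, _ = predictor.predict(box=prompt_box, multimask_output=False)
mask = masks[0]                                         # H x W boolean mask
```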
After obtaining a mask for a given wooden cube using the SAM algorithm, note that this mask is essentially a collection of pixel coordinates. We first compute the convex hull of this point set and then determine the minimum-area bounding rectangle of the convex hull to fit the rotated anchor box corresponding to the wooden cube.
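A minimal sketch of this post-processing step, using OpenCV's convex hull and minimum-area rectangle routines (the function and variable names are ours):

```python
import numpy as np
import cv2

def mask_to_rotated_box(mask):
    """Convert a binary SAM mask into a rotated anchor box via convex hull +
    minimum-area bounding rectangle, as described above."""
    ys, xs = np.nonzero(mask)
    points = np.stack([xs, ys], axis=1).astype(np.int32)   # the mask as a point set
    hull = cv2.convexHull(points)                          # convex hull of the points
    rect = cv2.minAreaRect(hull)                           # ((cx, cy), (w, h), angle)
    corners = cv2.boxPoints(rect)                          # 4 corners -> x1,y1,...,x4,y4 label
    return rect, corners
```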
The method described above is the semi-automatic annotation approach based on Segment Anything that is designed and employed in this paper. Although this approach significantly reduces the manpower and resources required for annotating the dataset, it is not without limitations. For instance, interference from the sides of the wooden cubes can markedly increase the error of the computed rotated anchor box. Furthermore, certain strong textures or interfering textures on the background or other surfaces of the wooden cubes can cause considerable displacement of the rotated anchor box, as illustrated in Figure 6. Despite employing a range of sophisticated post-processing techniques, such as denoising, it remains challenging to fully guarantee the accuracy of the annotations. Therefore, a swift manual screening is necessary to ensure the acquisition of high-quality dataset labels. Thus, in this industrial setting, a high-precision method for rotated object detection is urgently needed.

3.3. Cube-mAP Evaluation Method

Currently, mAP (mean average precision) is an important and commonly used evaluation metric in object detection. It first computes the overlap between predicted and ground truth bounding boxes using Intersection over Union (IoU [27]). A higher overlap indicates closer proximity to the ground truth box. Then, based on a given IoU threshold, it determines whether the bounding box is TP (true positive), FN (false negative), FP (false positive), or TN (true negative). The formula for IoU calculation is as follows:
$$\mathrm{IoU} = \frac{A \cap \tilde{A}}{A \cup \tilde{A}}$$
where $A$ and $\tilde{A}$ represent the areas of the ground truth and predicted bounding boxes, respectively. In rotated object detection, as shown in Figure 7a, the distribution of the ground truth bounding box and the predicted bounding box is such that, when the centers of the two boxes coincide and their rotation angles differ by 45°, the IoU is already 0.7071. Generally, when the threshold is set at 0.7, the situation depicted in Figure 7a would be considered a true positive (TP). However, such a large rotation angle does not align with our expectations. Therefore, using mAP as the evaluation metric for rotated object detection is not sufficiently objective, especially for datasets containing mostly square-shaped rotated objects.
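The 0.7071 value can be reproduced with a quick geometric check; the snippet below (a sketch using shapely, which we assume is available) rotates a unit square by 45° about its center and computes the exact IoU:

```python
from shapely.geometry import box
from shapely.affinity import rotate

square = box(-0.5, -0.5, 0.5, 0.5)              # unit square centered at the origin
rotated = rotate(square, 45, origin='center')   # same square, rotated by 45 degrees

iou = square.intersection(rotated).area / square.union(rotated).area
print(iou)   # ~0.7071, despite the 45-degree orientation error
```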
Inspired by Yang et al. [31] and Llerena et al. [33], we model rotated bounding boxes as Gaussian distributions and utilize the G/2-ProbIoU metric to evaluate the overlap between two bounding boxes. An overview of the algorithm is presented in Algorithm 1. First, we transform a rotated bounding box $B(x, y, w, h, \theta)$ into a two-dimensional Gaussian distribution $\mathcal{N}(m, \Sigma)$ [31,33]. Thus, evaluating the overlap between two rotated bounding boxes $B(x_1, y_1, w_1, h_1, \theta_1)$ and $B(x_2, y_2, w_2, h_2, \theta_2)$ is equivalent to assessing the difference between the two two-dimensional Gaussian distributions $\mathcal{N}(m_1, \Sigma_1)$ and $\mathcal{N}(m_2, \Sigma_2)$. For convenience of calculation, we can decompose the two-dimensional Gaussian distribution into an expression in terms of the principal axes through eigenvector decomposition. However, the rotated bounding boxes of the WoodenCube dataset exhibit strong square-like features, leading to Gaussian distributions that are close to isotropic, i.e., approximately circular. In this case, the values of the various IoUs tend to be relatively high, exceeding typical IoU thresholds. Under the original calculation method of mAP, such detections would be categorized as true positives (TP), neglecting other important factors such as rotation angles. Therefore, it is useful to stretch the Gaussian distribution to prevent it from approaching an isotropic Gaussian distribution; this stretching operation is reversible. For simplicity, we stretch the Gaussian distribution along the principal y-axis, doubling its eigenvalue in that direction (replacing $\lambda_2$ with $2\lambda_2$), and obtain the covariance matrix of the transformed Gaussian distribution. The formula below shows how we diagonally orthogonalize the transformed covariance matrix.
$$\Sigma' = \begin{pmatrix} a & c \\ c & d \end{pmatrix} = R_\alpha \begin{pmatrix} \lambda_1 & 0 \\ 0 & 2\lambda_2 \end{pmatrix} R_\alpha^{T} = \begin{pmatrix} \lambda_1 \cos^2\alpha + 2\lambda_2 \sin^2\alpha & \frac{1}{2}(\lambda_1 - 2\lambda_2)\sin 2\alpha \\ \frac{1}{2}(\lambda_1 - 2\lambda_2)\sin 2\alpha & \lambda_1 \sin^2\alpha + 2\lambda_2 \cos^2\alpha \end{pmatrix}$$
where $R_\alpha = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$ is the two-dimensional rotation matrix, and $\lambda_1$ and $2\lambda_2$ are the eigenvalues of the covariance matrix $\Sigma'$; they are also the representations of $a$ and $b$ in the new (principal-axis) coordinates. Since we have already transformed the original two-dimensional Gaussian distribution into an expression along the principal axes, the method for computing the covariance matrix of the rotated bounding box is consistent with that of the horizontal bounding box. Here, we derive the method using the rotated bounding box as an example. For the rotated bounding box, the region $\Omega$ degenerates into a rectangular region centered at $(x_c, y_c)$ with width $w$ and height $2h$. We integrate along the principal axes to compute the covariance matrix for $x$ and $y$.
$$\Sigma = \frac{1}{w h} \int_{-h}^{h} \int_{-\frac{w}{2}}^{\frac{w}{2}} \begin{pmatrix} x^2 & x y \\ x y & y^2 \end{pmatrix} dx\, dy = \begin{pmatrix} \frac{w^2}{6} & 0 \\ 0 & \frac{2 h^2}{3} \end{pmatrix}$$
Thus, we have
$$a = \frac{w^2}{6}, \qquad b = \frac{2 h^2}{3}, \qquad c = 0$$
Algorithm 1. G/2-ProbIoU
  • Algorithm Overview:
  • A method for calculating the IoU of two rotated boxes with stretched Gaussian distributions.
  • Input: $B(x_i, y_i, w_i, h_i, \theta_i)$, $i = 1, 2$.
  • Output: $\text{G/2-ProbIoU}$.
 1: $a_i = \frac{w_i^2}{6}$, $b_i = \frac{2 h_i^2}{3}$, $c_i = 0$.
 2: $\Sigma_i = R_{\theta_i} \begin{pmatrix} a_i & c_i \\ c_i & b_i \end{pmatrix} R_{\theta_i}^{T}$.
 3: $\Sigma_i = \begin{pmatrix} a_i \cos^2\theta_i + b_i \sin^2\theta_i & (a_i - b_i)\sin\theta_i \cos\theta_i \\ (a_i - b_i)\sin\theta_i \cos\theta_i & a_i \sin^2\theta_i + b_i \cos^2\theta_i \end{pmatrix}$; its entries are used as $a_i$, $c_i$, $b_i$ in the following steps.
 4: $B_1 = \frac{1}{4} \cdot \frac{(a_1 + a_2)(y_1 - y_2)^2 + (b_1 + b_2)(x_1 - x_2)^2}{(a_1 + a_2)(b_1 + b_2) - (c_1 + c_2)^2}$.
 5: $B_2 = \frac{1}{2} \cdot \frac{(c_1 + c_2)(x_2 - x_1)(y_1 - y_2)}{(a_1 + a_2)(b_1 + b_2) - (c_1 + c_2)^2}$.
 6: $B_3 = \frac{1}{2} \cdot \ln\!\left(\frac{(a_1 + a_2)(b_1 + b_2) - (c_1 + c_2)^2}{4\sqrt{(a_1 b_1 - c_1^2)(a_2 b_2 - c_2^2)}}\right)$.
 7: $D_B = B_1 + B_2 + B_3$.
 8: $B_C = e^{-D_B}$.
 9: $H(p, q) = \sqrt{1 - B_C(p, q)}$.
10: $\text{G/2-ProbIoU}(p, q) = 1 - H(p, q)$.
Two rotated bounding boxes are transformed into two Gaussian distributions, denoted as $p \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $q \sim \mathcal{N}(\mu_2, \Sigma_2)$. According to the above theory, we stretch the Gaussian distributions $p$ and $q$, keeping the mean vectors $\mu_1$ and $\mu_2$ unchanged.
Llerena et al. [33] use the Bhattacharyya coefficient to measure the overlap between two distributions, deriving an analytical expression for the Bhattacharyya coefficient in terms of the Bhattacharyya distance in the two-dimensional case; the Hellinger distance is then employed to gauge the similarity between two probability distributions. Under the theoretical framework of this paper, we can similarly utilize the Hellinger distance to measure the similarity between the Gaussian distributions $p$ and $q$ and obtain analogous analytical expressions. The Bhattacharyya coefficient $B_C$ quantifies the overlap between distributions $p$ and $q$, and the Bhattacharyya distance $D_B$ between them follows from the logarithmic definition of the Bhattacharyya coefficient.
$$B_C(p, q) = \int_{\mathbb{R}^2} \sqrt{p(x)\, q(x)}\, dx, \qquad D_B(p, q) = -\ln B_C(p, q)$$
Then, an analytical expression for the Bhattacharyya distance $D_B$ can be obtained:
$$D_B = \frac{1}{8} (\mu_1 - \mu_2)^{T} \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \ln \frac{|\Sigma|}{\sqrt{|\Sigma_1| |\Sigma_2|}}$$
where $\Sigma = \frac{1}{2}(\Sigma_1 + \Sigma_2)$. Thus, we can derive an analytical expression for the Bhattacharyya coefficient in terms of the Bhattacharyya distance:
$$B_C = e^{-D_B}$$
We use the Hellinger distance to measure the similarity between the Gaussian distributions $p$ and $q$; it can be regarded as the square root of the complement of the Bhattacharyya coefficient, and its values range from 0 to 1.
$$H(p, q) = \sqrt{1 - B_C(p, q)}, \qquad 0 \le H(p, q) \le 1$$
Then, we use $1 - H(p, q)$ to measure the similarity between the Gaussian distributions $p(x)$ and $q(x)$, denoted as G/2-ProbIoU.
$$\text{G/2-ProbIoU}(p, q) = 1 - H(p, q)$$
According to Figure 7b, for two squares with overlapping centers, the IoU curve is mostly above the G/2-ProbIoU curve, with the lowest point of the IoU curve being 0.1 higher than G/2-ProbIoU. The period of IoU variation concerning rotation angle is π / 2 , while for G/2-ProbIoU, it is π , which also reflects the idea of stretching Gaussian distributions for G/2-ProbIoU.
Returning to the distribution of true bounding boxes and predicted bounding boxes as shown in Figure 7a, when the centers of the two bounding boxes coincide, and the rotation angles differ by 45°, the IoU is 0.7071, while G/2-ProbIoU is only 0.6586. At a typical threshold of 0.7, the scenario depicted in Figure 7a would not be counted as a true positive (TP), which aligns with the expected behavior for detecting rotated objects resembling squares.
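Putting the pieces of Algorithm 1 together, the following sketch (our own NumPy transcription, not the authors' released code) computes G/2-ProbIoU from two rotated boxes $(x, y, w, h, \theta)$; for two unit squares sharing a center and differing by 45°, it returns approximately the 0.6586 quoted above.

```python
import numpy as np

def g2_probiou(box1, box2, eps=1e-7):
    """G/2-ProbIoU between two rotated boxes (x, y, w, h, theta), theta in radians."""
    def cov(w, h, theta):
        # Principal-axis (stretched) variances a = w^2/6, b = 2h^2/3, then rotate.
        a0, b0 = w**2 / 6.0, 2.0 * h**2 / 3.0
        c, s = np.cos(theta), np.sin(theta)
        return (a0 * c**2 + b0 * s**2,      # a
                a0 * s**2 + b0 * c**2,      # b
                (a0 - b0) * s * c)          # c

    x1, y1, w1, h1, t1 = box1
    x2, y2, w2, h2, t2 = box2
    a1, b1, c1 = cov(w1, h1, t1)
    a2, b2, c2 = cov(w2, h2, t2)

    denom = (a1 + a2) * (b1 + b2) - (c1 + c2) ** 2 + eps
    B1 = 0.25 * ((a1 + a2) * (y1 - y2) ** 2 + (b1 + b2) * (x1 - x2) ** 2) / denom
    B2 = 0.5 * (c1 + c2) * (x2 - x1) * (y1 - y2) / denom
    B3 = 0.5 * np.log(denom / (4.0 * np.sqrt(max((a1 * b1 - c1**2) * (a2 * b2 - c2**2), eps)) + eps))
    D_B = B1 + B2 + B3                         # Bhattacharyya distance
    B_C = np.exp(-D_B)                         # Bhattacharyya coefficient
    H = np.sqrt(np.clip(1.0 - B_C, 0.0, 1.0))  # Hellinger distance
    return 1.0 - H

print(g2_probiou((0, 0, 1, 1, 0.0), (0, 0, 1, 1, np.pi / 4)))  # ~0.6586
```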
Below, based on mAP and the previously introduced G/2-ProbIoU, we construct a new evaluation metric called Cube-mAP. The calculation method is as follows:
First, compute the G/2-ProbIoU value between the predicted bounding box and the ground truth bounding box. Next, determine whether the predicted bounding box is categorized as TP (True Positive), FN (False Negative), FP (False Positive), or TN (True Negative) based on a given G/2-ProbIoU threshold. In this case, we set the threshold to 0.7, meaning that if the G/2-ProbIoU of the predicted box is greater than 0.7, it is considered a TP; otherwise, it is an FP. FP represents the number of falsely detected negative samples, and FN represents the number of missed positive samples. Then, we calculate the P (Precision) and R (Recall) values for each category using the following formulas:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Subsequently, we plot the corresponding P-R curve for each category. Using the 11-point interpolation method, we calculate the AP for each category. This involves taking the maximum Precision value at 11 points of Recall from 0 to 1 (0.0, 0.1, 0.2, …, 1.0) and averaging these values:
$$AP = \frac{1}{11} \sum_{r \in \{0.0, 0.1, \ldots, 1.0\}} \max_{\tilde{r} \ge r} \text{Precision}(\tilde{r})$$
Finally, the Cube-mAP is obtained by averaging the AP values of all categories.
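As an illustration of the computation described above (an illustrative sketch with hypothetical helper names, not the authors' evaluation code), the 11-point interpolated AP of one category can be computed as follows; Cube-mAP is then the mean of the per-category APs, with TP/FP decided by the G/2-ProbIoU threshold of 0.7 instead of the usual IoU.

```python
import numpy as np

def ap_11_point(recalls, precisions):
    """11-point interpolated AP for one category's P-R curve."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):          # r = 0.0, 0.1, ..., 1.0
        mask = recalls >= r
        ap += (precisions[mask].max() if mask.any() else 0.0) / 11.0
    return ap

# Example with a toy P-R curve for a single category.
print(ap_11_point([0.2, 0.5, 0.8, 1.0], [1.0, 0.9, 0.7, 0.5]))
```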

4. CS-SKNet Architecture

The overall architecture of CS-SKNet is shown in Figure 8. It is composed of a series of repeated CS-SKNet blocks, where each CS-SKNet block consists of two key sub-blocks: the CS selection sub-block and the multi-layer perceptron (MLP) sub-block, drawing inspiration from LSKNet [37], RepLKNet [71], and FPN [72].
As shown in Figure 9, the role of the CS selection sub-block is to dynamically adjust the network’s receptive field as needed, which is achieved through the CS-SK core module. This core module comprises a series of pyramid-like large-kernel convolutions and spatial kernel selection mechanisms, which will be detailed later. The MLP sub-block, on the other hand, is used for channel fusion and feature refinement. It consists of operations, including the first fully connected layer, group convolution layer, GELU activation function, and the second fully connected layer. Combining these operations effectively enhances the quality and diversity of feature representation.

4.1. Multi-Scale Pyramid-like Large-Kernel Convolution Block

In order to achieve high-precision localization and classification of goods in industrial scenes using the WoodenCube dataset proposed in this paper, we aim to fully utilize the extensive contextual information in the images to be detected. Therefore, we propose a multi-scale pyramid-like large-kernel convolution in the CS selection sub-block. Unlike the FPN [72] pyramid hierarchical structure, this pyramid-like large-kernel convolution does not change the feature maps’ size or the number of channels at each layer. Although the dimensions of the feature maps remain unchanged, under the influence of this multi-scale large-kernel convolution block, we can generate large receptive fields at different scales, which facilitates subsequent spatial kernel selection.
For convolutional layers, we know that the theoretical receptive field size calculation formula is as follows
$$RF_0 = 1, \qquad RF_i = RF_{i-1} + (k_i' - 1) \cdot S_{i-1}, \qquad i \ge 1$$
where $k_i'$ denotes the effective kernel size of the $i$-th layer and $S_{i-1}$ denotes the product of the strides of the first $i-1$ layers (with $S_0 = 1$).
$$k_i' = k_i + (k_i - 1)(d_i - 1)$$
$$S_i = \prod_{j=1}^{i} s_j = S_{i-1} \cdot s_i$$
where $k_i$ represents the actual size of the convolution kernel, $d_i$ represents the dilation rate of the $i$-th convolutional layer, and $s_i$ represents the stride of the $i$-th convolutional layer.
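A small sketch of this recursion (the kernel/dilation/stride triples in the example are illustrative, not the exact CS-SKNet configuration):

```python
def effective_kernel(k, d):
    # k' = k + (k - 1)(d - 1): effective size of a dilated kernel
    return k + (k - 1) * (d - 1)

def receptive_field(layers):
    """Theoretical receptive field of a stack of (kernel, dilation, stride) layers."""
    rf, stride_prod = 1, 1          # stride_prod = product of strides of preceding layers
    for k, d, s in layers:
        rf += (effective_kernel(k, d) - 1) * stride_prod
        stride_prod *= s
    return rf

# Three stacked convolutions with growing dilation reach a receptive field of 63,
# the size found most effective on WoodenCube (Table 5).
print(receptive_field([(5, 1, 1), (7, 3, 1), (9, 5, 1)]))  # 63
```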
In the CS selection sub-module, we employ a pyramid-shaped large-kernel convolution, followed by the addition of the two feature maps produced by the final large-kernel convolution before outputting them to the subsequent model. This implies that the theoretical receptive field of the output is not simply equal to that of the final large-kernel convolution layer alone. This design has two advantages. First, simplifying the original N feature maps into two feature maps effectively reduces the complexity of the model, decreases the number of parameters, and enhances the model’s computational speed. Second, it enables the full integration of information across different receptive fields, thereby obtaining features with richer contextual information.

4.2. A Plug-and-Play MLP Based on Group Convolution

The MLP sub-module constructed in this paper comprises a first fully connected layer, a group convolutional layer, a GELU activation function, and a second fully connected layer. Figure 10 illustrates the structure of the MLP sub-module. In the MLP sub-module, we incorporate fully connected layers before and after the sub-module to better capture abstract features in the input data and enhance the model’s expressive capability. Introducing fully connected layers before and after the feature map can reduce the computational cost of this sub-module, improve the model’s generalization ability, and ensure that the scale and number of channels of the feature map remain unchanged before and after entering the sub-module, thus achieving plug-and-play functionality.
The core of MLP based on group convolution lies in integrating group convolution into the MLP sub-module. Group convolution was first introduced in AlexNet due to hardware resource constraints at the time of training. Training the entire AlexNet network on a single GPU was not feasible. Therefore, the authors distributed the convolution operation across multiple GPUs and merged the results from these GPUs, thus giving rise to group convolution. In the MLP sub-module of this paper, the feature maps with dimensions H, W, and c 1 after the first fully connected layer are input into the group convolution layer.
We maintain the dimensions H, W, and c 1 of the input feature maps while dividing them into g groups based on the number of channels, resulting in each group having dimensions H, W, and c 1 / g . Correspondingly, the convolutional kernels remain the same size, with input channels of c 1 / g , and each group has c 2 / g convolutional kernels. We apply independent convolutional kernels to each group of feature maps, perform standard convolution operations, and concatenate the resulting g groups of feature maps with dimensions H, W, and c 2 / g to obtain an output feature map with dimensions H, W, and c 2 . Through the use of group convolution, the number of model parameters in the MLP sub-module can be reduced by a factor of g.
$$H \cdot W \cdot \frac{c_1}{g} \cdot \frac{c_2}{g} \cdot g = \frac{H W c_1 c_2}{g}$$
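A hedged PyTorch sketch of this MLP sub-block as we read it (1×1 convolutions standing in for the fully connected layers; the hidden width, group count, and 3×3 grouped kernel are assumptions, not values given in the paper):

```python
import torch
import torch.nn as nn

class GroupConvMLP(nn.Module):
    """fc -> grouped conv -> GELU -> fc; spatial size and channel count are preserved,
    so the block is plug-and-play."""
    def __init__(self, channels, hidden, groups=8, kernel_size=3):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)            # first "fully connected" layer
        self.gconv = nn.Conv2d(hidden, hidden, kernel_size,
                               padding=kernel_size // 2, groups=groups)  # group convolution
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)            # second "fully connected" layer

    def forward(self, x):
        return self.fc2(self.act(self.gconv(self.fc1(x))))

# The grouped 3x3 layer has g times fewer weights than its ungrouped counterpart:
dense = nn.Conv2d(128, 128, 3, padding=1, bias=False)
grouped = nn.Conv2d(128, 128, 3, padding=1, groups=8, bias=False)
print(dense.weight.numel(), grouped.weight.numel())   # 147456 vs 18432 (8x fewer)

x = torch.randn(1, 64, 32, 32)
print(GroupConvMLP(64, 128)(x).shape)                 # torch.Size([1, 64, 32, 32])
```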

5. Experiments

5.1. Datasets and Implementation Details

After collecting and annotating the WoodenCube dataset, we divided the dataset into training, validation, and test sets using an 8:1:1 ratio. All experiments were conducted on an RTX 4090 GPU with a batch size of 4. During the training phase, all images in WoodenCube were resized to 1024 × 1024 pixels. The evaluation was performed on the test set of WoodenCube, and all reported FLOPs in this paper were calculated based on input images of size 1024 × 1024.
To validate the generality of our approach, we also conducted experiments on the DOTAv1.0 [74] public dataset. The DOTAv1.0 dataset consists of 2806 aerial images captured by various sensors and platforms, with image sizes ranging from 800 to 4000 pixels. The images display objects of various scales, orientations, and shapes, annotated by domain experts, covering 15 common object categories. The images in the DOTAv1.0 dataset are fully annotated, totaling 188,282 instances, with each instance marked using quadrilaterals. The training, validation, and test sets of the DOTAv1.0 dataset consist of 1411, 458, and 937 images, respectively. All images are cropped to a size of 1024 × 1024 pixels. For multi-scale training, images are resized to 0.5×, 1.0×, and 1.5× of their original size before cropping, with a 500-pixel overlap.
The evaluation model in this study is built on the Oriented RCNN [75] framework, implemented under the mmrotate [76] framework, and all models are trained on the training and validation sets and tested on the test set. Unlike LSKNet [37], this experiment did not adopt a pre-trained backbone strategy but instead trained these models from scratch, with an initial learning rate of 0.0002 and a weight decay of 0.05. During training, we utilized horizontal, vertical, and diagonal flips, as well as random polygon rotation, as data augmentation methods, and employed an exponential moving average (EMA) to smooth the model weights. This paper evaluates the WoodenCube dataset using the Cube-mAP evaluation metric.
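For concreteness, a minimal sketch of this optimization setup (the optimizer type and the EMA decay below are our assumptions; the paper specifies only the initial learning rate, the weight decay, and the use of EMA):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # placeholder for the detector built in mmrotate

# Initial learning rate 0.0002 and weight decay 0.05, as stated above.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.05)

# Exponential moving average of the model weights (decay value is illustrative).
ema_model = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, cur, n: 0.999 * avg + 0.001 * cur)
```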

5.2. Main Result

In this section, we present the experiments conducted on the WoodenCube dataset and the DOTAv1.0 dataset to demonstrate the rationality of the Cube-mAP metric and the feasibility of the CS-SKNet model. We compared our approach with several current state-of-the-art models and mainstream frameworks, all of which are well-established and widely recognized in the field. As shown in Table 2, Cube-mAP provides a highly accurate evaluation on the WoodenCube dataset, whereas mAP saturates at a score of 100 under many frameworks and is therefore not listed in the table. From the visualization results in Figure 11, it can be observed that the CS-SKNet model can accurately detect the rotational components of the wooden cubes, whereas the S2A-Net [77] and LSKNet [37] models, although proficient in locating the center points of the cubes, still have room for improvement in detecting rotational components.
Using the DOTAv1.0 dataset, we compared our approach with 12 state-of-the-art methods, as shown in Table 3. Our CS-SKNet achieved an mAP of 79.17%. It is worth noting that CS-SKNet achieved an inference speed of 31.9 FPS on a single RTX 4090 for images sized at 1024 × 1024. From the visualization results on DOTAv1.0 in Figure 12, we can observe that the CS-SKNet model proposed in this paper performs well in detecting both small and large objects. Furthermore, the CS-SKNet model, equipped with a large receptive field, can effectively capture environmental information surrounding the targets, thereby minimizing false detections.
Overall, the CS-SKNet model and Cube-mAP metric provide an effective solution for object detection tasks in industrial scenarios and advance the field’s development. CS-SKNet maintains high accuracy while keeping parameter counts low and inference speeds high, making it highly promising for practical applications.

5.3. Ablation Study

5.3.1. The Rationality of Cube-mAP

To evaluate the universality of the Cube-mAP metric, we conducted performance evaluations on the WoodenCube dataset using different detection frameworks. This includes the two-stage detection framework Oriented RCNN [75] and the one-stage detection frameworks S2A-Net [77] and R3Det [36], among other currently popular frameworks.
The results in Table 2 indicate that the Cube-mAP metric performs well in evaluating the detection performance on square datasets. Because mAP yields perfect scores of 100 under many frameworks, we did not list it in the table; it fails to accurately assess the detection performance on square-like objects, particularly in datasets with a single background type, such as our WoodenCube dataset. Thus, the Cube-mAP metric provides a meaningful evaluation method for real-time robotic assembly of wooden cubes in scenarios like the one presented in this paper, where robots are tasked with picking up and assembling cubes in an environment where the foreground and the background are similar.

5.3.2. Large Kernel Decomposition

In Table 4, we explore several different strategies for decomposing large kernels and compare their impact on frames per second (FPS) and Cube-mAP. We found that the optimal configuration for our proposed CS-SKNet model is to decompose the large kernel into three grouped convolution kernels connected in series.
This decomposition maintains detection accuracy while also ensuring computational efficiency. Table 5 shows that receptive fields that are too small or too large degrade the performance of the CS-SKNet model; on the WoodenCube dataset, a receptive field size of approximately 63 was determined to be the most effective, with a rate of 33.3 frames per second. This is sufficient to meet the needs of the industrial robotic arm gripping scenarios considered in this paper.

5.3.3. The Feasibility of CS-SKNet

To validate the performance improvements of the proposed CS-SKNet under different frameworks, we conducted a large number of experiments on the WoodenCube and DOTAv1.0 datasets. The focus of the experiments was to compare the performance of CS-SKNet with the classic ResNet-50 [70] and ResNet-101 [70] backbone networks, as shown in Table 2. On the WoodenCube dataset, we compared CS-SKNet with ResNet-50 in three detection frameworks: R3Det [36], Gliding Vertex [34], and Rotated FCOS [35]. For instance, in the Rotated FCOS framework, CS-SKNet achieves a 12% improvement in Cube-mAP compared to ResNet-50 and a 15% improvement over ResNet-101. Moreover, the increase in parameters and FLOPs across different detection frameworks is minimal compared to ResNet-101, demonstrating CS-SKNet’s adaptability and efficiency in multi-faceted wooden block scenarios.
On the DOTAv1.0 dataset, we also compared the detection performance of CS-SKNet and ResNet-50, as shown in Table 6. The results indicate that CS-SKNet significantly outperforms ResNet-50 on this dataset. Across all detection frameworks, CS-SKNet’s mAP is consistently about 4% higher than ResNet-50’s, with only a minimal increase in parameters and FLOPs.

6. Conclusions

We propose the WoodenCube dataset, comprising 5113 industrial scene images with similar foreground and background, featuring 10 different types of building blocks. Each image has been densely annotated with object-level categories, bounding boxes, and rotation angles. To facilitate the creation and annotation of this dataset, we introduced a semi-automatic annotation method, CS-SAM, which annotates a horizontal bounding box as the detection range for the object. Additionally, for near-square objects, we innovatively proposed the G/2-ProbIoU and Cube-mAP evaluation metrics, effectively addressing the issue that the Gaussian distribution of a square box is insensitive to rotation. Furthermore, the multi-scale pyramid large-kernel convolution structure, CS-SKNet, designed in this study expands the network’s receptive field to capture strong texture features in the scene, achieving high-precision localization and classification on the WoodenCube dataset. To further test the model’s generalization capability, we conducted extensive experiments on the DOTAv1.0 dataset, achieving an mAP of 79.17% while maintaining a low parameter count and computational load (FLOPs).

Author Contributions

C.W.: conceptualization, funding acquisition, methodology, investigation, formal analysis, and writing—original draft preparation. S.L.: methodology, software, investigation, formal analysis, writing—original draft preparation, and validation. T.X.: software, investigation, formal analysis, resources, and visualization. X.W.: validation, visualization, and data curation. J.Z.: conceptualization, funding acquisition, resources, supervision, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the “Pioneer” and “Leading Goose” R and D Program of Zhejiang Province, Grant Nos. 2023C01125, 2023C01130, 2024C01147, 2024C01059, 2024C01093.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Wu, J.; He, C.; Zhang, S. Intelligent warehouse robot path planning based on improved ant colony algorithm. IEEE Access 2023, 11, 12360–12367. [Google Scholar] [CrossRef]
  2. Yang, C.; Yuan, B.; Zhai, P. Actor-Hybrid-Attention-Critic for Multi-Logistic Robots Path Planning. IEEE Robot. Autom. Lett. 2024, 9, 5559–5566. [Google Scholar] [CrossRef]
  3. Li, F.; Kim, Y.C.; Lyu, Z.; Zhan, H. Research on Path Planning for Robot Based on Improved Design of Non-standard Environment Map with Ant Colony Algorithm. IEEE Access 2023, 11, 99776–99791. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Ren, J.; Jin, Q.; Zhu, Y.; Mo, Z.; Chen, Y. Design of Control System for Handling, Sorting, and Warehousing Robot Based on Machine Vision. In Proceedings of the 2023 5th International Symposium on Robotics & Intelligent Manufacturing Technology (ISRIMT), Changzhou, China, 22–24 September 2023; pp. 375–383. [Google Scholar]
  5. Prawira, I.F.A.; Habbe, A.H.; Muda, I.; Hasibuan, R.M.; Umbrajkaar, A. Robot as Staff: Robot for Alibaba E-Commerce Warehouse Process. In Proceedings of the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26–28 April 2023; pp. 1619–1623. [Google Scholar]
  6. Mnyusiwalla, H.; Triantafyllou, P.; Sotiropoulos, P.; Roa, M.A.; Friedl, W.; Sundaram, A.M.; Russell, D.; Deacon, G. A bin-picking benchmark for systematic evaluation of robotic pick-and-place systems. IEEE Robot. Autom. Lett. 2020, 5, 1389–1396. [Google Scholar] [CrossRef]
  7. Wong, C.C.; Tsai, C.Y.; Chen, R.J.; Chien, S.Y.; Yang, Y.H.; Wong, S.W.; Yeh, C.A. Generic development of bin pick-and-place system based on robot operating system. IEEE Access 2022, 10, 65257–65270. [Google Scholar] [CrossRef]
  8. Surati, S.; Hedaoo, S.; Rotti, T.; Ahuja, V.; Patel, N. pick-and-place robotic arm: A review paper. Int. Res. J. Eng. Technol. 2021, 8, 2121–2129. [Google Scholar]
  9. Yu, F.; Kong, X.; Yao, W.; Zhang, J.; Cai, S.; Lin, H.; Jin, J. Dynamics analysis, synchronization and FPGA implementation of multiscroll Hopfield neural networks with non-polynomial memristor. Chaos Solitons Fractals 2024, 179, 114440. [Google Scholar] [CrossRef]
  10. Zheng, Z.; Ma, Y.; Zheng, H.; Gu, Y.; Lin, M. Industrial part localization and grasping using a robotic arm guided by 2D monocular vision. Ind. Robot. Int. J. 2018, 45, 794–804. [Google Scholar] [CrossRef]
  11. Abdullah-Al-Noman, M.; Eva, A.N.; Yeahyea, T.B.; Khan, R. Computer vision-based robotic arm for object color, shape, and size detection. J. Robot. Control 2022, 3, 180–186. [Google Scholar] [CrossRef]
  12. Gao, M.; Jiang, J.; Zou, G.; John, V.; Liu, Z. RGB-D-based object recognition using multimodal convolutional neural networks: A survey. IEEE Access 2019, 7, 43110–43136. [Google Scholar] [CrossRef]
  13. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  14. Luo, Z.; Tang, B.; Jiang, S.; Pang, M.; Xiang, K. Grasp detection based on faster region cnn. In Proceedings of the 2020 5th International Conference on Advanced Robotics and Mechatronics (ICARM), Shenzhen, China, 18–21 December 2020; pp. 323–328. [Google Scholar]
  15. Yu, Y.; Cao, Z.; Liu, Z.; Geng, W.; Yu, J.; Zhang, W. A two-stream CNN with simultaneous detection and segmentation for robotic grasping. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1167–1181. [Google Scholar] [CrossRef]
  16. Jiang, J.; Luo, X.; Luo, Q.; Qiao, L.; Li, M. An overview of hand–eye calibration. Int. J. Adv. Manuf. Technol. 2022, 119, 77–97. [Google Scholar] [CrossRef]
  17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  18. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  19. Song, J.; Gao, S.; Zhu, Y.; Ma, C. A survey of remote sensing image classification based on CNNs. Big Earth Data 2019, 3, 232–254. [Google Scholar] [CrossRef]
  20. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042. [Google Scholar] [CrossRef]
  21. Mei, H.; Ji, G.P.; Wei, Z.; Yang, X.; Wei, X.; Fan, D.P. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781. [Google Scholar]
  22. He, C.; Li, K.; Zhang, Y.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22046–22055. [Google Scholar]
  23. Bi, H.; Zhang, C.; Wang, K.; Tong, J.; Zheng, F. Rethinking camouflaged object detection: Models and datasets. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 5708–5724. [Google Scholar] [CrossRef]
  24. Lyu, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously Localize, Segment and Rank the Camouflaged Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11586–11596. [Google Scholar]
  25. Le, T.N.; Nguyen, T.V.; Nie, Z.; Tran, M.T.; Sugimoto, A. Anabranch Network for Camouflaged Object Segmentation. J. Comput. Vis. Image Underst. 2019, 184, 45–56. [Google Scholar] [CrossRef]
  26. Yan, J.; Le, T.N.; Nguyen, K.D.; Tran, M.T.; Do, T.T.; Nguyen, T.V. MirrorNet: Bio-Inspired Camouflaged Object Segmentation. IEEE Access 2021, 9, 43290–43300. [Google Scholar] [CrossRef]
  27. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM international Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
  28. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
  29. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  30. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  31. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
  32. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 3992–4003. [Google Scholar]
  33. Llerena, J.M.; Zeni, L.F.; Kristen, L.N.; Jung, C. Gaussian bounding boxes and probabilistic intersection-over-union for object detection. arXiv 2021, arXiv:2106.06072. [Google Scholar]
  34. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  35. Li, Z.; Hou, B.; Wu, Z.; Ren, B.; Yang, C. FCOSR: A simple anchor-free rotated detector for aerial object detection. Remote Sens. 2023, 15, 5499. [Google Scholar] [CrossRef]
  36. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3163–3171. [Google Scholar] [CrossRef]
  37. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 16794–16805. [Google Scholar]
  38. Kober, J.; Peters, J. Imitation and reinforcement learning. IEEE Robot. Autom. Mag. 2010, 17, 55–62. [Google Scholar] [CrossRef]
  39. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [Google Scholar] [CrossRef]
  40. Redmon, J.; Angelova, A. Real-time grasp detection using convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1316–1322. [Google Scholar]
  41. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 77–85. [Google Scholar]
  42. Qin, Y.; Chen, R.; Zhu, H.; Song, M.; Xu, J.; Su, H. S4g: Amodal single-view single-shot SE(3) grasp detection in cluttered scenes. In Proceedings of the Conference on Robot Learning, PMLR, Cambridge, MA, USA, 16–18 November 2020; pp. 53–65. [Google Scholar]
  43. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  44. Alom, M.Z.; Taha, T.M.; Yakopcic, C.; Westberg, S.; Sidike, P.; Nasrin, M.S.; Van Esesn, B.C.; Awwal, A.A.S.; Asari, V.K. The history began from alexnet: A comprehensive survey on deep learning approaches. arXiv 2018, arXiv:1803.01164. [Google Scholar]
  45. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  46. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  47. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  48. Liu, D.; Tao, X.; Yuan, L.; Du, Y.; Cong, M. Robotic objects detection and grasping in clutter based on cascaded deep convolutional neural network. IEEE Trans. Instrum. Meas. 2021, 71, 1–10. [Google Scholar] [CrossRef]
  49. Karaoguz, H.; Jensfelt, P. Object detection approach for robot grasp detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4953–4959. [Google Scholar]
  50. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 248–255. [Google Scholar]
  51. Chen, G.; Liu, S.J.; Sun, Y.J.; Ji, G.P.; Wu, Y.F.; Zhou, T. Camouflaged object detection via context-aware cross-level fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6981–6993. [Google Scholar] [CrossRef]
  52. Bhajantri, N.U.; Nagabhushan, P. Camouflage defect identification: A novel approach. In Proceedings of the 9th International Conference on Information Technology (ICIT’06), Bhubaneswar, India, 18–21 December 2006; pp. 145–148. [Google Scholar]
  53. Skurowski, P.; Abdulameer, H.; Błaszczyk, J.; Depta, T.; Kornacki, A.; Kozieł, P. Animal camouflage analysis: Chameleon database. Unpubl. Manuscr. 2018, 2, 7. [Google Scholar]
  54. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef]
  55. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  56. Beery, S.; Wu, G.; Rathod, V.; Votel, R.; Huang, J. Context r-cnn: Long term temporal context for per-camera object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13075–13085. [Google Scholar]
  57. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10186–10195. [Google Scholar]
  58. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  59. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  60. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  61. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 9423–9433. [Google Scholar]
  62. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
  63. Li, Y.; Li, X.; Yang, J. Spatial group-wise enhance: Enhancing semantic feature learning in cnn. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 316–332. [Google Scholar]
  64. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  65. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  66. Yan, H.; Li, Z.; Li, W.; Wang, C.; Wu, M.; Zhang, C. ConTNet: Why not use convolution and transformer at the same time? arXiv 2021, arXiv:2104.13497. [Google Scholar]
  67. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11027–11036. [Google Scholar]
  68. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  69. Liu, J.J.; Hou, Q.; Cheng, M.M.; Wang, C.; Feng, J. Improving convolutional networks with self-calibrated convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10093–10102. [Google Scholar]
  70. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  71. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11965. [Google Scholar]
  72. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  73. Stephan, P.; Heck, I.; Krauß, P.; Frey, G. Evaluation of Indoor Positioning Technologies under industrial application conditions in the SmartFactoryKL based on EN ISO 9283. IFAC Proc. Vol. 2009, 42, 870–875. [Google Scholar] [CrossRef]
  74. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  75. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3500–3509. [Google Scholar]
  76. Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. Mmrotate: A rotated object detection benchmark using pytorch. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10–14 October 2022; pp. 7331–7334. [Google Scholar]
  77. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  78. Guo, Z.; Liu, C.; Zhang, X.; Jiao, J.; Ji, X.; Ye, Q. Beyond bounding-box: Convex-hull feature adaptation for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 14–19 June 2021; pp. 8788–8797. [Google Scholar]
  79. Lang, S.; Ventola, F.; Kersting, K. DAFNe: A one-stage anchor-free approach for oriented object detection. arXiv 2021, arXiv:2109.06148. [Google Scholar]
  80. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8231–8240. [Google Scholar]
  81. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2844–2853. [Google Scholar]
  82. Wang, J.; Yang, W.; Li, H.C.; Zhang, H.; Xia, G.S. Learning center probability map for detecting objects in aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4307–4323. [Google Scholar] [CrossRef]
  83. Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 677–694. [Google Scholar]
Figure 1. (a) A faceted rail-track wooden-cube scene, where the floor and the blocks share the same material and the blocks are randomly arranged on the wooden board. (b) A bird's-eye view of three different types of blocks; the texture of the blocks to be detected is very similar to that of the baseboard.
Figure 2. Single cube samples from WoodenCube. The wooden cube material is the same as the background; both are made of oak wood.
Figure 3. Data collection equipment. The left shows the MV-CS050-10UC industrial camera from Hikvision, while the right depicts the KUKA KR6 R900-2 robot.
Figure 4. Class distribution of WoodenCube dataset.
Figure 5. Comparison of the fitting effects of three auxiliary annotation methods. The two left images show annotation with 1 and 4 reference points, respectively; the top-right image uses the entire image as the reference area, and the bottom-right image uses a green horizontal box as the reference area. The green annotated points and box are obtained through manual annotation, while the red box is obtained by SAM combined with the minimum bounding rectangle of the convex hull.
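The SAM-assisted annotation described in Figure 5 can be approximated with off-the-shelf tools: prompt SAM with the manually clicked reference points, take the convex hull of the predicted mask, and fit the minimum-area rotated rectangle. The sketch below follows that recipe; the checkpoint path, model type, and helper name are illustrative assumptions rather than the paper's actual pipeline.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path, model type, and click coordinates are illustrative placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def rotated_box_from_clicks(image_bgr, clicks):
    """Prompt SAM with manually clicked foreground points, then fit the
    minimum bounding rectangle to the convex hull of the predicted mask."""
    predictor.set_image(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    masks, _, _ = predictor.predict(
        point_coords=np.asarray(clicks, dtype=np.float32),
        point_labels=np.ones(len(clicks), dtype=np.int32),
        multimask_output=False,
    )
    ys, xs = np.nonzero(masks[0])                        # pixels of the predicted mask
    hull = cv2.convexHull(np.stack([xs, ys], 1).astype(np.float32))
    (cx, cy), (w, h), angle = cv2.minAreaRect(hull)      # rotated box (cx, cy, w, h, angle)
    return cx, cy, w, h, angle
```

Interfering texture points that leak into the mask enlarge the convex hull and therefore the fitted box, which is the failure mode illustrated in Figure 6.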
Figure 6. The influence of interfering texture points on the fitting of the resulting rotated anchor boxes.
Figure 7. The performance of IoU and G/2-ProbIoU on square-like datasets containing mostly square-shaped objects. (a) The relationship between IoU and G/2-ProbIoU when two bounding boxes are rotated 45° with overlapping centers. (b) The variation in IoU and G/2-ProbIoU with the rotation angle when the centers of the bounding boxes overlap.
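As a concrete illustration of panel (a), two identical unit squares sharing a centre but differing by a 45° rotation already have an IoU of only about 0.71, even though the predicted size and position are perfect. A minimal check with shapely (our own illustrative snippet, not code from the paper):

```python
from shapely.geometry import Polygon
from shapely.affinity import rotate

# A unit square and an identical square rotated 45 degrees about the same centre.
box = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
box_rot = rotate(box, 45, origin="centroid")

iou = box.intersection(box_rot).area / box.union(box_rot).area
print(round(iou, 3))  # ~0.707: plain IoU heavily penalizes a perfectly sized and
                      # positioned square detection once the angle is off by 45 degrees.
```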
Figure 8. Overall framework of CS-SKNet.
Figure 9. CS selection sub-block.
Figure 10. The structure of the multi-layer perceptron.
Figure 11. Visualization comparison of three methods on the WoodenCube dataset. (a–c) Results corresponding to the S2A-Net, LSKNet, and CS-SKNet models, respectively.
Figure 12. Visualization comparison of three methods on the DOTAv1.0 dataset. (a–c) Results corresponding to the OrientedRCNN, LSKNet, and CS-SKNet models, respectively.
Table 1. Equipment conditions.

| Equipment | Details |
| Industrial camera | The MV-CS050-10UC, a second-generation industrial area-scan RGB camera, utilizes Sony's IMX264 CMOS chip with a resolution of 2448 × 2048. It transmits uncompressed images in real time via a USB 3.0 interface, with a maximum frame rate of up to 60 fps. |
| Robot | The KUKA KR6 R900 sixx six-axis robot weighs approximately 52 kg, with a maximum payload capacity of 6 kg. It has a maximum motion range of 901.5 mm and a pose repetition accuracy (ISO 9283 [73]) of ±0.03 mm. |
Table 2. Comparison of CS-SKNet, ResNet-50 [70], and ResNet-101 [70] backbones under different detection frameworks on WoodenCube.

| Frameworks | Backbone | Cube-mAP | Params (M) | FLOPs (G) |
| Oriented RCNN [75] | ResNet-50 | 75.38 | 41.14 | 211.4 |
| | ResNet-101 | 72.89 | 60.13 | 289.32 |
| | CS-SKNet | 74.13 | 51.59 | 292.9 |
| Rotated Faster RCNN [46] | ResNet-50 | 72.90 | 41.13 | 211.3 |
| | ResNet-101 | 71.92 | 60.13 | 289.19 |
| | CS-SKNet | 70.44 | 51.59 | 294.6 |
| R3Det [36] | ResNet-50 | 75.69 | 41.90 | 335.7 |
| | ResNet-101 | 79.12 | 60.78 | 411.12 |
| | CS-SKNet | 79.60 | 48.61 | 419.5 |
| Gliding Vertex [34] | ResNet-50 | 70.87 | 41.14 | 211.3 |
| | ResNet-101 | 70.96 | 60.13 | 289.19 |
| | CS-SKNet | 71.53 | 51.59 | 294.6 |
| Rotated FCOS [35] | ResNet-50 | 47.92 | 31.91 | 206.7 |
| | ResNet-101 | 44.52 | 50.9 | 284.55 |
| | CS-SKNet | 60.75 | 42.41 | 293.2 |
| S2A-Net [77] | ResNet-50 | 59.24 | 38.60 | 197.6 |
| | ResNet-101 | 56.86 | 57.57 | 275.01 |
| | CS-SKNet | 57.26 | 45.31 | 281.4 |

Note: Bold indicates the best performance.
Table 3. Comparison with state-of-the-art methods on the DOTA-v1.0 dataset with multi-scale training and testing.

| Method | mAP | #P (M) | FLOPs (G) | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC |
| R3Det [36] | 76.47 | 41.9 | 336 | 89.8 | 83.77 | 48.11 | 66.77 | 78.76 | 83.27 | 87.84 | 90.82 | 85.38 | 85.51 | 65.57 | 62.68 | 67.53 | 78.56 | 72.62 |
| CFA [78] | 76.67 | - | - | 89.08 | 83.20 | 54.37 | 66.87 | 81.23 | 80.96 | 87.17 | 90.21 | 84.32 | 86.09 | 52.34 | 69.94 | 75.52 | 80.76 | 67.96 |
| DAFNe [79] | 76.95 | - | - | 89.4 | 86.27 | 53.70 | 60.51 | 82.04 | 81.17 | 88.66 | 90.37 | 83.81 | 87.27 | 53.93 | 69.38 | 75.61 | 81.26 | 70.86 |
| S2A-Net * [77] | 71.41 | 38.6 | 198 | 88.32 | 74.50 | 49.39 | 72.84 | 78.36 | 80.48 | 87.04 | 90.83 | 74.66 | 83.17 | 49.64 | 60.50 | 70.27 | 66.00 | 45.17 |
| SCRDet [80] | 72.61 | - | - | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 |
| RoI Trans * [81] | 75.57 | 55.1 | 225 | 87.60 | 82.12 | 55.18 | 73.93 | 77.05 | 82.35 | 87.50 | 90.85 | 78.62 | 84.36 | 60.86 | 58.48 | 76.78 | 74.98 | 62.80 |
| G.V. [34] | 75.02 | 41.1 | 198 | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.86 | 57.32 |
| Oriented RCNN * [75] | 75.05 | 41.1 | 211 | 88.49 | 79.14 | 53.32 | 77.34 | 76.93 | 82.67 | 87.98 | 90.85 | 77.41 | 83.35 | 59.56 | 64.26 | 75.48 | 68.83 | 60.14 |
| CenterMap [82] | 76.03 | 41.1 | 198 | 89.83 | 84.41 | 54.60 | 70.25 | 77.66 | 78.32 | 87.19 | 90.66 | 84.89 | 85.27 | 56.46 | 69.23 | 74.13 | 71.56 | 66.06 |
| CSL [83] | 76.17 | 37.4 | 236 | 90.25 | 85.53 | 54.64 | 75.31 | 70.44 | 73.51 | 77.62 | 90.84 | 86.15 | 86.69 | 69.6 | 68.04 | 73.83 | 71.10 | 68.93 |
| LSKNet * [37] | 78.96 | 31.0 | 174 | 89.03 | 83.12 | 56.76 | 79.32 | 78.40 | 84.66 | 87.97 | 90.91 | 85.62 | 85.04 | 63.68 | 65.89 | 77.94 | 79.31 | 76.73 |
| CS-SKNet (Ours) | 79.17 | 51.3 | 293 | 88.77 | 83.56 | 56.87 | 80.71 | 78.93 | 84.68 | 87.93 | 90.88 | 85.29 | 87.25 | 64.22 | 66.09 | 77.96 | 79.33 | 75.15 |

Note: * represents the results reproduced in this paper. Bold indicates the best performance.
Table 4. The effects of the number of decomposed large kernels on the inference FPS and mAP were examined through experiments conducted on the WoodenCube dataset.

| (k, d) Sequence | Num. | RF | FPS | Cube-mAP |
| (47,1) | 1 | 47 | 23.1 | 72.30 |
| (7,1), (9,5) | 2 | 47 | 38.3 | 72.26 |
| (5,1), (7,3), (9,3) | 3 | 47 | 33.7 | 72.62 |

Note: Bold indicates the best performance.
Table 5. The effectiveness of the key design components of CS-SKNet was examined through experiments conducted on the WoodenCube dataset.

| (k1, d1) Sequence | (k2, d2) Sequence | (k3, d3) Sequence | RF | FPS | Cube-mAP |
| (3,1) | (5,2) | (7,3) | 29 | 34.7 | 71.58 |
| (3,1) | (5,3) | (7,4) | 39 | 34.7 | 72.65 |
| (5,1) | (7,3) | (9,3) | 47 | 33.7 | 72.62 |
| (5,1) | (7,4) | (9,3) | 53 | 32.9 | 73.26 |
| (7,1) | (9,2) | (11,4) | 63 | 33.3 | 74.13 |
| (7,1) | (9,3) | (11,4) | 71 | 33.0 | 72.80 |

Note: Bold indicates the best performance.
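The RF column in Tables 4 and 5 is consistent with the standard receptive-field rule for a stack of dilated convolutions, RF = 1 + Σᵢ (kᵢ − 1)·dᵢ. A quick check (the helper below is ours, written only to verify the tabulated values):

```python
def receptive_field(sequence):
    """RF of a stack of (kernel, dilation) convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((k - 1) * d for k, d in sequence)

print(receptive_field([(47, 1)]))                  # 47  (Table 4, row 1)
print(receptive_field([(7, 1), (9, 5)]))           # 47  (Table 4, row 2)
print(receptive_field([(5, 1), (7, 3), (9, 3)]))   # 47  (Table 4, row 3)
print(receptive_field([(7, 1), (9, 2), (11, 4)]))  # 63  (Table 5, best Cube-mAP)
```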
Table 6. Comparison of CS-SKNet and ResNet-50 [70] backbones under different detection frameworks on DOTAv1.0.

| Frameworks | Backbone | mAP | Params (M) | FLOPs (G) |
| Oriented RCNN [75] | ResNet-50 | 75.05 | 41.14 | 211.4 |
| | CS-SKNet | 79.17 | 51.33 | 292.9 |
| RoI Trans [81] | ResNet-50 | 75.57 | 55.13 | 225.3 |
| | CS-SKNet | 78.74 | 76.32 | 306.8 |
| S2A-Net [77] | ResNet-50 | 71.41 | 38.60 | 197.6 |
| | CS-SKNet | 76.84 | 45.31 | 281.4 |
| R3Det [36] | ResNet-50 | 69.55 | 41.90 | 335.7 |
| | CS-SKNet | 75.08 | 48.61 | 419.5 |

Note: Bold indicates the best performance.
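The Params (M) and FLOPs (G) columns in Tables 2 and 6 are standard model-complexity figures. The paper does not state which counter was used, so the fvcore-based sketch below, with a dummy 1024 × 1024 input, is only one plausible way to reproduce numbers of this kind; detector wrappers may require their own analysis tooling, such as the scripts shipped with mmrotate [76].

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

def complexity(model: torch.nn.Module, input_shape=(1, 3, 1024, 1024)):
    """Return (parameters in millions, FLOPs in billions) for one dummy forward pass."""
    dummy = torch.randn(input_shape)
    params_m = parameter_count(model)[""] / 1e6        # total parameter count
    flops_g = FlopCountAnalysis(model, dummy).total() / 1e9
    return params_m, flops_g
```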
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
