1. Introduction
6D object pose estimation is crucial for industrial robotics [1]. As the manufacturing industry moves towards High-Mix, Low-Volume (HMLV) production, robots must adapt to diverse products without human intervention. In the past, robots were limited to fixed repetitive actions, but HMLV manufacturing requires more flexibility [2]. 6D object pose estimation enables robots to interact with diverse objects in different compositions. This eliminates the need for costly and inflexible fixtures, resulting in more adaptable and precise systems. Despite its significance and the extensive history of research in this area, 6D object pose estimation remains an open problem and an active area of research [3].
Datasets are fundamental to academic research, especially in fields such as computer vision. They form the foundation for creating and validating new theories and technologies. Prominent benchmarks, like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4], Common Objects in Context (COCO) [5], the KITTI Vision Benchmark Suite (KITTI) [6], and the Benchmark for 6D Object Pose Estimation (BOP) [7], have become synonymous with progress in image recognition, object detection, and, more recently, 6D object pose estimation. These benchmarks rely on real-world datasets to evaluate the effectiveness of different methods, measure progress, and compare advancements. In addition, learning-based approaches have outperformed traditional techniques in many computer vision tasks [8]. Datasets play a critical role in training these methods. The extent and depth of the data used in training are crucial factors in these systems’ ability to generalize to new, real-world scenarios.
Creating high-quality datasets for 6D object pose estimation is a difficult task. Labeling the six degrees of freedom that define an object’s pose is a complex, time-consuming process that requires great attention to detail. Moreover, it is challenging to ensure the accuracy of these labels. Synthetic data have emerged as a potential alternative to manual labeling. Recent advances in computer graphics make it feasible to generate large amounts of labeled data, and some 6D object pose estimation datasets adopt this strategy [9,10]. However, the resulting images often fail to capture essential aspects of their real-world counterparts, such as object materials, scene compositions, lighting, and camera viewpoints. Variations between synthetic and real-world data can significantly hamper model performance, a phenomenon known as the reality gap [11]. A proposed solution to the reality gap is domain randomization [12], in which parts of the data generation process are randomized. With enough variability, real-world data may appear to the model as just another variation. However, generating enough variability to envelop real-world data points is non-trivial. Many variations are wasteful and can unnecessarily increase the task difficulty [13]. Matching the distribution of real-world data through synthetic data generation is a significant challenge. Additionally, generating high-quality synthetic data requires significant computational power and storage capacity. Balancing the quality and quantity of data while keeping computational and storage costs in check requires careful planning and strategic decision-making. It is difficult to determine which variations are valuable, as existing datasets do not provide information about the variations used in the data generation process. Currently, no datasets exist that enable sim-to-real research on these variations for industrial settings.
In recent years, there has been significant progress in 6D object pose estimation, and progress on current datasets is slowing [8]. Therefore, it is crucial to focus on more challenging conditions to maintain the momentum of technological progress and ensure that research remains relevant and capable of addressing complex real-world applications. Many existing datasets for 6D object pose estimation have limitations that make them less applicable to industrial scenarios. The Benchmark for 6D Object Pose Estimation (BOP) [7] unifies 12 datasets, the majority of which focus on household objects, which differ significantly from industrial ones [14]. Household objects are often non-symmetric and richly textured, making them easier to identify and localize. Industrial objects, in contrast, are often symmetrical and lack discernible textures, making their poses more challenging to estimate accurately. Additionally, the reflectiveness of many industrial objects and environmental factors like scratches, dust, or dirt can further complicate this process. Other datasets incorporate textureless industrial objects [15,16] but focus on objects with limited reflectivity. However, reflective materials are widespread; for example, most high-precision machine components are made of shiny metals. These objects are particularly interesting because depth sensors cannot capture them correctly [17,18]. Furthermore, most current 6D pose estimation datasets rely on single-view images, leading to appearance and depth ambiguities [19]. In many industrial settings, however, multi-view approaches can be used to circumvent these ambiguities: images can be captured from different viewpoints using multi-camera or eye-in-hand setups, enriching the available information.
To summarize, existing datasets present a significant sim-to-real gap, provide no information about the variations used in data generation, focus on single-view scenes, and are of limited relevance to industrial use cases. We aim to overcome these limitations through the following contributions:
A collection of over 30,000 real-world images depicting industrial metal objects. These images showcase objects of varying shapes, materials, composition types, carriers, and lighting conditions. They were captured using different cameras and from multiple angles;
A collection of over 500,000 synthetic images. We provide synthetic counterparts for all real-world images, along with additional images that were generated by varying environmental conditions, such as lighting, object materials, object positions, and carriers in a controlled manner.
Figure 1 shows an example of a real-world image and its synthetic counterpart.
Figure 2 illustrates the challenges posed by highly reflective, textureless objects. Although 3D sensors and methods trained on large datasets can effectively handle textured, non-reflective objects, they face significant difficulties with industrial parts and backgrounds that lack texture and are highly reflective.
In addition to our dataset, we provide an open-source tool that enables users to easily label 6D object poses for multi-view data. Correspondences between image pixels and object CAD points can be marked to calculate coarse object poses. The tool then overlays the object CAD model on the image, allowing users to fine-tune the 6D pose label using 3D positioning tools. It handles multi-view images by jointly labeling all views, which improves accuracy and reduces labeling effort.
Our dataset has significant value for various industry-relevant research problems. The close correspondence between synthetic and real-world data, together with the controlled variations, will facilitate sim-to-real research. Our dataset also supports research on several vital computer vision tasks involving industrial objects. We focus our discussion on 6D object pose estimation, but the dataset also applies to object detection, instance segmentation, novel view synthesis, 3D reconstruction, and active perception.
As mentioned above, we also provide an open-source tool for labeling 6D object poses in multi-view data. Many specialized tools already exist for labeling computer vision data. Prominent platforms, such as V7 [21], Labelbox [22], Scale AI Rapid [23], SuperAnnotate [24], Dataloop [25], Supervise.ly [26], and Segments.ai [27], have established themselves as key players in this field. Each platform offers a unique set of features tailored to specific annotation needs, such as classification, object detection, key points, and segmentation. However, these tools do not support labeling 6D object poses, which involves annotating the three-dimensional position and orientation of objects in space. This complex task requires a higher degree of precision and understanding of spatial relationships than standard annotation tools typically provide. Furthermore, most of these platforms, with the exception of Segments.ai, do not emphasize the consistency of labels across different viewpoints and sensor types. This consistency, however, is crucial for developing robust computer vision models that can reliably interpret data from various perspectives and sensors, which is key for many real-world robotics applications.
Only a few tools are available that enable the precise labeling of 6D object poses, highlighting a niche but significant area in the field of data annotation for computer vision.
Table 1 gives an overview of existing tools for labeling 6D object poses.
6D-PAT [28] closely resembles our tool in functionality but does not support labeling multi-view images with consistent labels across views. DoPose [29] has limited capabilities and also lacks support for multi-view data. This feature is essential for creating datasets in which the same object is observed from various angles, ensuring consistent and accurate labels in 3D, which, in turn, is crucial for robotics applications. LabelFusion [30] relies heavily on 3D reconstruction and Iterative Closest Point (ICP) algorithms. However, these methods are not suitable for images of reflective objects, a limitation that significantly reduces the tool’s applicability in industrial contexts where reflective surfaces are common.
Our tool addresses these specific gaps. It is unique in its capability of labeling multi-view-consistent 6D object poses, which is critical for achieving spatial accuracy in 3D. Additionally, it introduces a wide range of unique features that streamline the labeling process, making it more versatile and efficient for complex 6D pose labeling tasks. This is especially important in scenarios where multi-view consistency and precision are paramount, such as in high-precision robotics tasks.
2. Materials and Methods
First, we introduced a data format that is based on the BOP format [7] and is specifically designed for multi-view 6D object pose data with controlled conditions. Next, we established a system that allowed us to collect real-world data with different cameras and under different conditions. With this system, we gathered a large real-world dataset of industrial metal objects. To label the collected data, we developed a specialized tool dedicated to the annotation of 6D object poses. Finally, we detailed our approach for producing high-fidelity synthetic data that closely replicates real-world conditions, along with controlled variants.
2.1. Data Format
We adopted and extended the BOP format [7]. The goal of BOP is to capture the state of the art in 6D object pose estimation. The benchmark provides 12 datasets in a unified format that incorporates camera intrinsics and extrinsics. However, for multi-view datasets, the provided per-image object labels often do not match in a shared world frame. In addition, no information is provided about the conditions in which scenes were captured. We extended the format to include this information: object labels in a joint world frame were added, along with information about the lighting, carrier, and object composition type. Additionally, we included technical drawings that can be used to construct each of the objects, so they can be reproduced for real-world experiments. The resulting folder structure is shown in Figure 3. All distances are recorded in mm, coordinate systems are right-handed, and matrices and vectors are flattened, following the BOP convention.
Object CAD models are stored in PLY format in the models folder. models_info.json contains information about the size and symmetries of each object, as shown in Listing 1. Continuous symmetries are described by axis and offset values, discrete symmetries by transformation matrices.
Listing 1. models_info.json contains information about the CAD models.
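Purely as an illustration of the structure of Listing 1, an entry following the BOP conventions could look like the Python sketch below. The field names mirror the BOP format, but all numeric values are placeholders rather than actual dataset values.

import json

# Hypothetical models_info.json entry with placeholder values. Each object id
# maps to its bounding-box extents, its diameter (largest point-to-point
# distance, in mm), and its symmetry annotations.
example_models_info = {
    "1": {
        "diameter": 120.0,
        "min_x": -40.0, "min_y": -40.0, "min_z": -25.0,
        "size_x": 80.0, "size_y": 80.0, "size_z": 50.0,
        # continuous symmetry: rotation about the given axis through `offset`
        "symmetries_continuous": [{"axis": [0, 0, 1], "offset": [0, 0, 0]}],
        # discrete symmetries: flattened 4x4 transformation matrices
        "symmetries_discrete": [
            [-1, 0, 0, 0,   0, -1, 0, 0,   0, 0, 1, 0,   0, 0, 0, 1]
        ],
    }
}
print(json.dumps(example_models_info, indent=2))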
Technical drawings that can be used for manufacturing the various parts are contained in the construction folder. Scenes are organized by camera type (both real and virtual) and identified by a unique id. scene_camera.json contains camera intrinsics and extrinsics for each image, as shown in Listing 2.
Listing 2. scene_camera.json contains camera intrinsics and extrinsics for each image.
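Again purely for illustration, a scene_camera.json entry following the BOP conventions might look as sketched below; the intrinsic and extrinsic values are placeholders. The snippet also shows how the flattened matrices can be reshaped.

import numpy as np

# Hypothetical scene_camera.json entry (placeholder values). Matrices are
# flattened row-major and translations are in mm, per the BOP convention.
scene_camera = {
    "0": {
        "cam_K": [1075.0, 0.0, 320.0,  0.0, 1075.0, 240.0,  0.0, 0.0, 1.0],
        "cam_R_w2c": [1, 0, 0,  0, 1, 0,  0, 0, 1],   # world-to-camera rotation
        "cam_t_w2c": [0.0, 0.0, 500.0],               # world-to-camera translation
        "depth_scale": 0.1,
    }
}

entry = scene_camera["0"]
K = np.array(entry["cam_K"]).reshape(3, 3)            # 3x3 intrinsic matrix
R_w2c = np.array(entry["cam_R_w2c"]).reshape(3, 3)
t_w2c = np.array(entry["cam_t_w2c"]).reshape(3, 1)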
Object poses are recorded in scene_gt.json and scene_gt_world.json, as shown in Listings 3 and 4. For each image, scene_gt.json contains object poses relative to the camera reference frame. As poses are consistent between different viewpoints, we also record object poses in a shared global reference frame (scene_gt_world.json).
scene_info.json contains information about the data generation process, as presented in Listing 5. For each image, we record the lighting conditions, the carrier used, the type of composition (heterogeneous or homogeneous), and the camera viewpoint. This information allows selecting subsets of the data based on specific variations.
Listing 3. scene_gt.json contains object poses (relative to the camera) for each image.
Listing 4. scene_gt_world.json contains object poses relative to a global world reference frame. These poses are valid for all images.
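The per-image poses in scene_gt.json and the shared world-frame poses in scene_gt_world.json are related through the camera extrinsics stored in scene_camera.json. A minimal sketch of this relation is given below; the function name and variables are ours and not part of the format.

import numpy as np

def world_to_camera(R_w2c, t_w2c, R_m2w, t_m2w):
    """Map an object pose from the shared world frame into a camera frame.

    All rotations are 3x3 matrices and all translations are 3x1 vectors in mm,
    so that x_world = R_m2w @ x_model + t_m2w and x_cam = R_w2c @ x_world + t_w2c.
    """
    R_m2c = R_w2c @ R_m2w
    t_m2c = R_w2c @ t_m2w + t_w2c
    return R_m2c, t_m2c

# Applying this to a pose from scene_gt_world.json with the extrinsics of every
# viewpoint reproduces the corresponding per-image poses in scene_gt.json.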
Listing 5. scene_info.json contains information about the data generation process for each image.
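Because the capture conditions are recorded per image, subsets of the data can be selected programmatically. The sketch below illustrates this idea; the key names and condition values are assumptions for illustration and should be checked against the actual scene_info.json files.

import json

# Select all images of a scene captured under a given lighting condition.
# "lighting" and "night" are hypothetical field/value names used only to
# illustrate the selection mechanism described above.
with open("scene_info.json") as f:
    scene_info = json.load(f)

night_images = [
    image_id
    for image_id, info in scene_info.items()
    if info.get("lighting") == "night"
]
print(f"{len(night_images)} images captured under nighttime lighting")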
2.2. Real-World Data
We used different camera sensors (Table 2) mounted on an industrial robot to capture multi-view images of diverse scenes. Figure 4 shows the setup for collecting real-world data, and Figure A1 shows images of each of the sensors. All cameras were attached to a Fanuc M20ia robot using a custom-designed end-effector, as shown in Figure 5. With a repeatability of 0.1 mm, the robot was used to change camera viewpoints accurately. We further stabilized camera poses by canceling backlash using the Fanuc instruction IRVBKLSH. We calibrated all cameras using ChArUco targets, undistorted all images, and performed hand–eye calibration [31]. We used high-precision aluminum-LDPE calibration targets with DICT_5x5 dictionary codes and 15 mm-wide checkers. ChArUco detection and calibration were implemented using OpenCV [32]. Each scene was captured with every camera from 13 different viewpoints. Figure A2 shows images of each of the viewpoints. These viewpoints were spread out on a hemisphere and oriented towards the center of the object carrier, as shown in Figure 6.
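The calibration code itself is not part of the dataset; as a rough sketch, the intrinsic-calibration step with OpenCV's aruco module could look as follows (classic pre-4.7 API). The board dimensions, marker size, and the specific DICT_5X5 variant are assumptions; only the 15 mm checker size comes from the description above, and the hand–eye calibration step is not shown.

import glob
import cv2

# Assumed board layout: 7x5 squares, 15 mm checkers, 11 mm markers,
# DICT_5X5_100 dictionary (the exact values used for the dataset may differ).
aruco_dict = cv2.aruco.Dictionary_get(cv2.aruco.DICT_5X5_100)
board = cv2.aruco.CharucoBoard_create(7, 5, 0.015, 0.011, aruco_dict)

all_corners, all_ids, image_size = [], [], None
for path in glob.glob("calibration_images/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        continue
    count, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray, board)
    if count > 3:
        all_corners.append(ch_corners)
        all_ids.append(ch_ids)

# Intrinsics and distortion coefficients; the latter are used to undistort
# all captured images before further processing.
ret, K, dist, rvecs, tvecs = cv2.aruco.calibrateCameraCharuco(
    all_corners, all_ids, board, image_size, None, None
)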
Diverse scenes were created by varying object shapes, materials, carriers, compositions, and lighting. Figure A3 shows example images with different objects, carriers, compositions, and lighting types. The selected variations were chosen to reflect realistic conditions in industrial settings, specifically in metal manufacturing factories. We used six object categories, shown in Figure 7, with different shapes and material properties. Most objects (Figure 7a–d) were highly reflective and exhibited real-world features like scratches and saw patterns; these objects represent raw or half-finished workpieces. For comparison, we also included less reflective (Figure 7e,f) and textured objects (Figure 7e), representing finished workpieces. We used three object carriers, as shown in Figure 8. In manufacturing environments, objects are usually transported in bins or on pallets, with or without cardboard. Parts are stacked in various compositions with different levels of occlusion. Lighting conditions were varied by controlling Nanguan Luxpad 43 lights. We replicated both daylight and nighttime conditions, with and without factory lighting.
2.3. Labeling Tool
To label the recorded real-world data and, more generally, any dataset stored in the presented format, we developed a tool for the quick and accurate labeling of 6D object poses that are consistent across multiple views. Labeling 6D object poses on 2D images can be time-consuming and tedious. An object’s six degrees of freedom interact in intricate ways, making it difficult for annotators to bridge the gap between the 2D image and the object’s actual 3D position and orientation. In addition, labelers must understand 3D perspectives and correctly handle object occlusions. Achieving 3D accuracy is crucial for industrial robotics applications, where sub-millimeter precision is often required. However, minor errors made during labeling in pixel space can lead to significant errors in 3D space. For example, moving an object along the camera’s z-axis can change its 3D position considerably while causing only minimal pixel changes. We address these challenges by providing an intuitive user interface and allowing for the joint labeling of multi-view images.
Figure 9 shows the labeling tool.
2.3.1. Workflow
Figure 10 shows the workflow of our labeling tool.
After loading a dataset, users can browse the available scenes and select one. To create a new object label, users first select the object category. Then, an initial pose can be calculated after marking at least four correspondences between image pixels and points on the object CAD model.
Figure 11 shows the interface for marking correspondences.
Once this initial pose has been calculated, users can fine-tune it using the interactive controls provided by the visualization. These controls allow for changing the 6D object pose, including translation and rotation along all axes or planes. Additionally, users can incrementally update the pose by moving fixed steps along all axes. The alpha and color of each object can be changed to improve the visualization.
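The exact solver used by the tool is not specified here; conceptually, the coarse initial pose follows from a Perspective-n-Point (PnP) solution over the marked 2D-3D correspondences. The self-contained sketch below illustrates this with OpenCV, using simulated correspondences and placeholder intrinsics.

import cv2
import numpy as np

# CAD-model points (in mm) standing in for the points a user would mark on the
# object model; at least four correspondences are needed, six are used here.
model_points = np.array([[0, 0, 0], [40, 0, 0], [40, 40, 0],
                         [0, 40, 0], [20, 20, 15], [10, 30, 25]], dtype=np.float64)

K = np.array([[1075.0, 0.0, 320.0],
              [0.0, 1075.0, 240.0],
              [0.0, 0.0, 1.0]])               # placeholder intrinsics
dist = np.zeros(5)                            # images are already undistorted

# For this example, the marked pixel locations are simulated by projecting the
# model points with a known pose; in the tool, they come from user clicks.
rvec_true = np.array([0.1, -0.2, 0.3])
tvec_true = np.array([10.0, -5.0, 600.0])     # mm
image_points, _ = cv2.projectPoints(model_points, rvec_true, tvec_true, K, dist)

# Coarse object-to-camera pose from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)                    # 3x3 rotation; tvec is in mm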
The entire dataset can be saved at any moment.
It is important to note that object poses are defined per scene rather than per image. As a result, users can refine the pose of the same object from different views, and each adjustment stays consistent across all views, significantly improving the 3D accuracy of the 6D object poses. This also speeds up the labeling process for multi-view data.
2.3.2. Technical Details
The tool is a web-based application that uses Python for the back end and Angular for the front end. The back end exposes a RESTful API and relies on OpenCV and SciPy for all calculations regarding 6D object poses. All 3D visualizations and interactions use Three.js. To ensure a perfect match between the provided RGB images and the 3D visualizations, high-quality object models and accurate camera intrinsics and extrinsics are necessary.
2.4. Synthetic Data
Using a Unity project, we generated high-fidelity synthetic data in which real-world conditions were carefully simulated. We adopted the real-world camera intrinsics and extrinsics and modeled the objects and carriers. Environment maps of the different real-world lighting conditions were created by combining bracketed images: we used a Ricoh Theta S 360° camera to capture the environment at different exposures and combined the resulting images into a single HDRI environment map to light the virtual scene. We mimicked real-world features like scratches and saw patterns by synthesizing textures from real-world images of example objects. These images were relit and cropped for use as albedo textures, and a normal map was created by estimating normal vectors from them. A texture synthesis algorithm was then used to create new variations of the captured textures. This algorithm works in two steps. First, random patches are taken from an example image. Next, to reduce seams between copied patches, pixels search their neighborhood for similar pixels and update their value accordingly. This step is repeated multiple times, reducing the radius of the neighborhood with each pass until it consists of a single pixel. This procedure generates unique and realistic appearances for every virtual object.
Figure 12 shows a close-up of multiple generated object textures. Our synthetic data generation setup resulted in synthetic images that resemble their real-world counterparts.
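The implementation used to generate the dataset textures is not reproduced here; the toy sketch below only illustrates the two-step idea described above (copy random patches from an example image, then repeatedly pull each pixel towards similar pixels in a shrinking neighborhood). The patch size, radii, and blending rule are our own choices for illustration and will differ from the actual implementation.

import numpy as np

def synthesize_texture(example, out_size=128, patch=32, radii=(8, 4, 2, 1), seed=0):
    """Toy two-step texture synthesis: random patch copying followed by
    neighborhood-based smoothing with a shrinking radius."""
    rng = np.random.default_rng(seed)
    h, w, c = example.shape                   # example must be at least patch x patch
    out = np.empty((out_size, out_size, c), dtype=np.float32)

    # Step 1: tile the output with patches copied from random locations of the example.
    for y in range(0, out_size, patch):
        for x in range(0, out_size, patch):
            sy = int(rng.integers(0, h - patch + 1))
            sx = int(rng.integers(0, w - patch + 1))
            py, px = min(patch, out_size - y), min(patch, out_size - x)
            out[y:y + py, x:x + px] = example[sy:sy + py, sx:sx + px]

    # Step 2: move every pixel towards its most similar neighbors, shrinking the
    # neighborhood radius each pass; this softens the seams between patches.
    for r in radii:
        updated = out.copy()
        for y in range(out_size):
            y0, y1 = max(0, y - r), min(out_size, y + r + 1)
            for x in range(out_size):
                x0, x1 = max(0, x - r), min(out_size, x + r + 1)
                nb = out[y0:y1, x0:x1].reshape(-1, c)
                d = np.linalg.norm(nb - out[y, x], axis=1)
                nearest = nb[np.argsort(d)[:4]]          # 4 most similar pixels
                updated[y, x] = 0.5 * out[y, x] + 0.5 * nearest.mean(axis=0)
        out = updated
    return out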
A large amount of additional data were generated by varying object material, composition, and lighting types. Object materials were generated using the texture synthesis algorithm described above, applied to additional example images. We used Unity’s physics engine to generate physically plausible object poses, starting from random initializations. Lighting was varied by randomly selecting an environment map from a predefined set. As we were in full control of the data generation process, ground-truth labels were obtained at no extra cost. Generating one scene took 2 min using an NVIDIA GeForce RTX 2080 Ti.
4. Conclusions
We present a diverse dataset of industrial, reflective objects with both real-world and synthetic data. Real-world data were obtained by recording multi-view images of scenes with varying object shapes, materials, carriers, compositions, and lighting conditions. This resulted in over 31,200 images, accurately labeled using a new public tool that reduces the time and effort needed to label 6D object poses for multi-view data. Synthetic data were obtained by carefully simulating the real-world setup and varying environmental conditions in a controlled and realistic way, resulting in over 553,800 synthetic images.
The close resemblance of our synthetic images to their real-world counterparts, along with the controlled variations in the data generation process, makes our dataset valuable for sim-to-real research.
Our dataset is also useful for several computer vision tasks, including 6D pose estimation, object detection, instance segmentation, novel view synthesis, 3D reconstruction, and active perception. These tasks are crucial for various industrial applications such as automated assembly, bin picking, quality control, welding, painting, (de-)palletizing, and machine tending.
In future work, we aim to broaden the adoption of our dataset within key academic benchmarks like BOP, which will enhance its visibility and utility for comparative studies. Additionally, we plan to explore methodologies that utilize multi-view data, a relatively under-explored area in current research. Given the complexity of many industrial settings, this multi-view approach holds significant promise for developing more robust and capable solutions. By focusing on these two strategic areas, we intend not only to elevate the dataset’s academic prominence but also to unlock new potential in practical applications.