1. Introduction
Transparent and specular objects are omnipresent and belong to the optically uncooperative objects in the visible spectral range (VIS). Typical representatives are glass objects, e.g., glass walls or glass flasks, and transparent or translucent plastic parts, e.g., clear orthodontic aligners or car headlights. Typical areas of application are as follows: (a) human–robot interaction, e.g., the confident detection of visually uncooperative objects [1]; (b) autonomous robot navigation, e.g., collision avoidance with glass walls; (c) laboratory automation, e.g., grasping visually uncooperative objects [2,3,4,5]; (d) the medical sector, e.g., 3D reconstruction of clear orthodontic aligners; (e) autonomous waste sorting and recycling; and (f) augmented reality [6]. In these use cases, there are two main tasks:
- Locating optically uncooperative objects. This includes object segmentation [7,8] and object pose estimation [9,10,11].
- Accurately estimating the depth of optically uncooperative objects. This includes accurate and reliable depth estimation, also known as deep depth completion [2,12,13,14], 3D reconstruction methods [3,15,16], and stereo vision [17,18,19].
This paper describes the current challenges in the stereo depth estimation of transparent and specular objects and presents a new measurement principle for the acquisition of real ground truth data sets.
Conventional 3D sensors in the VIS and near-infrared (NIR) spectral ranges are not suitable for the perception of transparent, translucent, and reflective surfaces [2,6,13,17,19,20], since stereo matching, i.e., the search for corresponding points in the left and right image, is error prone [18,19] (an illustrative sketch of this correspondence search is given after challenge (B) below). These limitations are described in detail in Section 2.1. To overcome them, data-driven approaches are applied, namely artificial intelligence (AI)-based stereo matching methods [17,18,21] or monocular depth estimation [22,23,24]. In this way, known (uncooperative) objects that were seen during training can be perceived without object preparation (also called in-distribution objects). However, there are currently two challenges, (A) and (B), for deep stereo methods on visually uncooperative surfaces.
- (A) These methods require large training and test data sets with ground truth disparity maps. Synthetic or real data sets can be used. Real data sets, unlike synthetic data sets [2], capture the environment most realistically, but they are difficult [2], very time consuming, and expensive to create [18,25]. This is why hardly any real ground truth data sets exist. The most complex part is the generation of the ground truth, the so-called annotation. To this end, optically uncooperative objects are prepared (e.g., with a diffusely reflective coating) so that they can be detected optically in the VIS or NIR spectral range [17,18,26,27,28]. Figure 1a shows that the manipulated surface can thereby be captured three-dimensionally. This technique is very elaborate and highly time consuming due to the object preparation process [18,25]; see Figure 1a. The process also includes the considerable effort of placing the prepared objects exactly where the unprepared objects previously stood [25], as well as possible object cleaning afterwards. Moreover, object preparation is not permissible for many objects, such as historical glass objects.
- (B) Transparency awareness is a corner case for deep stereo matching networks [17,19]. Furthermore, current deep stereo matching approaches—regardless of the challenge with transparent objects—are generally tied to a specific data set and generalize poorly to others because key factors diverge between data sets. Three such key factors are unbalanced disparity distributions, divergent textures, and divergent illuminations [20,29].
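To make the classical correspondence search referenced above concrete, the following minimal sketch (Python, with placeholder file names and untuned parameters; it is not part of this paper's pipeline) computes a disparity map with a standard semi-global block matcher. On texture-poor, transparent, or specular regions, such a matcher typically rejects or mis-estimates correspondences, which motivates both the data-driven methods and the data sets discussed here.

```python
# Minimal sketch of classical correspondence search on a rectified stereo pair.
# File names and matcher parameters are illustrative placeholders.
import cv2
import numpy as np

left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching (SGBM); transparent and specular surfaces violate
# its photo-consistency assumption, so many pixels receive no valid disparity.
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # search range in pixels, must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,         # smoothness penalties for small/large disparity changes
    P2=32 * 5 * 5,
    uniquenessRatio=10,
)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed point -> px
invalid = disparity < 0  # matches rejected by the matcher
print(f"rejected correspondences: {100.0 * invalid.mean():.1f} %")
```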
The performance of deep stereo matching networks depends strongly on the performance of the training data [29]. (The definition and key role of data set performance are described in detail in Section 2.2.) For deep stereo matching and monocular depth estimation, data sets with ground truth are needed. Compared to real data sets, synthetic data can be produced cheaply, in large quantities, and without complex surface preparation; therefore, more synthetic data sets are available than real ones [6]. However, synthesizing a representative data set of transparent objects and artifacts such as specular highlights and caustics requires very high-quality rendering and 3D models [2]. Real data sets are preferable to synthetic data sets for the following reasons:
- Real data are authentic and reflect the real world;
- Real data contain the errors, inaccuracies, and inconsistencies of real sensors, which trained models must learn to cope with;
- Real data represent human behavior and complex interactions better.
Nevertheless, real-world ground truth data sets for optically uncooperative objects are difficult to obtain [2] and remain time consuming and expensive [18,25] due to the necessary object preparation (see Section 2.1). Consequently, hardly any real-world data sets are available for stereo systems suitable for disparity estimation (Table 1) or for monocular depth estimation (Table 2).
Table 1 shows an overview of real-world (non-synthetic) stereo data sets with transparent and specular objects suitable for disparity estimation. Liu et al. [25] created the Transparent Object Data Set (TOD) for pose estimation and depth estimation. The generation of its ground truth is very time consuming and elaborate: the ground truth depth of each transparent object was acquired in an additional step with an opaque twin in the same position as the previously captured transparent object. The challenging part is placing the opaque twin at exactly the same position as the transparent object. Ramirez et al. [18] created the first real stereo data set for transparent and reflective surfaces, named Booster. Its ground truth disparity maps are obtained by additionally preparing the scene: all non-Lambertian surfaces in the scene are painted or sprayed so that textures can be projected onto them. Our TranSpec3D data set is, to the best of our knowledge, the first real (stereo) data set for visually uncooperative materials generated without object preparation (e.g., coating with white titanium dioxide powder). This shortens and simplifies the ground truth acquisition process by eliminating both the object preparation and the accurate placement of a prepared opaque twin object (cf. [18,25]). With our novel measuring principle, we overcome challenge (A). For this purpose, we additionally use a thermal 3D sensor developed by Landmann et al. [30] to generate real ground truth data. With this additional sensor, real ground truth data can be acquired quickly and easily, with little effort and without object preparation; see Figure 1b.
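As an illustration of how ground truth disparities could, in principle, be derived from such an additional 3D sensor, the following sketch projects a point cloud that is already registered to the left camera of the rectified stereo pair into the image plane and converts depth to disparity via d = f·B/Z. The function name, the calibration inputs, and the simple per-pixel z-buffer are assumptions for illustration only and do not reproduce the exact processing pipeline of this paper; a point-wise projection also yields only a sparse map, whereas a dense map requires rasterizing a surface mesh (cf. the TMRP method mentioned in the Conclusions).

```python
# Hypothetical sketch: turn a ground-truth point cloud (already registered to the
# left rectified camera) into a sparse disparity map using d = f * B / Z.
import numpy as np

def points_to_disparity(points_cam, K, baseline_m, height, width):
    """points_cam: (N, 3) points in the left-camera frame (metres).
    K: 3x3 intrinsic matrix of the rectified left camera.
    baseline_m: stereo baseline in metres."""
    disp = np.zeros((height, width), dtype=np.float32)
    zbuf = np.full((height, width), np.inf, dtype=np.float32)

    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    valid = Z > 0
    u = np.round(K[0, 0] * X[valid] / Z[valid] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * Y[valid] / Z[valid] + K[1, 2]).astype(int)
    z = Z[valid]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Keep only the nearest point per pixel (simple z-buffer against occlusions).
    for ui, vi, zi in zip(u, v, z):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui] = zi
            disp[vi, ui] = K[0, 0] * baseline_m / zi  # disparity in pixels
    return disp
```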
Figure 1. Two techniques to three-dimensionally record transparent, translucent, and reflective surfaces [31]. Object: fist-shaped glass flacon with metal-covered plastic cap. (a) State-of-the-art technique: using an active VIS 3D sensor requiring object preparation (diffuse reflective coating) [11,17,18,27,28]. (b) Alternative measurement technology: using a thermal 3D sensor [30] without object preparation. Wavelength of the stereo system λ; measuring time t; number of fringes N (sequential fringe projection).
Figure 1b shows an alternative method published by Landmann et al. [30] for measuring freeform surfaces with high accuracy and without any object preparation. The object surface is locally heated by only a few Kelvin while a heat pattern is generated on it. The surface itself emits this heat pattern, which is recorded by thermal cameras. As in VIS or NIR, the camera images are evaluated and the 3D shape is reconstructed. The fully automatic 3D reconstruction takes place within seconds [30]. Three disadvantages of this technology, however, are the high hardware costs, the necessary safety-related enclosure, and the longer measurement time compared to conventional stereo systems. In addition, objects with very high thermal conductivity (good thermal conductors) are not measurable. Nevertheless, the measurement time can be reduced at higher cost, and there is still development potential to remedy some of these disadvantages.
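For readers unfamiliar with fringe projection, the following rough sketch shows the standard N-step phase-shift evaluation used in classical (VIS) fringe projection profilometry, in which a wrapped phase is recovered per pixel and then used for correspondence search and triangulation between the two cameras. It is included for illustration only; the actual processing of the thermal 3D sensor by Landmann et al. [30] is not reproduced here and may differ.

```python
# Rough illustration of an N-step phase-shift evaluation (N >= 3), as used in
# classical fringe projection; not the implementation of the thermal 3D sensor.
import numpy as np

def wrapped_phase(images):
    """images: list of N frames of sinusoidal fringes shifted by 2*pi/N each.
    Returns the wrapped phase per pixel in (-pi, pi]."""
    n = len(images)
    shifts = 2.0 * np.pi * np.arange(n) / n
    stack = np.stack([img.astype(np.float64) for img in images], axis=0)
    num = np.sum(stack * np.sin(shifts)[:, None, None], axis=0)
    den = np.sum(stack * np.cos(shifts)[:, None, None], axis=0)
    # For I_k = A + B*cos(phi + delta_k): phi = atan2(-sum(I_k*sin), sum(I_k*cos)).
    return np.arctan2(-num, den)
```

The wrapped phase images of both cameras then serve as a dense correspondence feature for triangulation, analogous to VIS/NIR fringe projection.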
The main contributions of our work are as follows:
- We introduce a novel measurement principle, TranSpec3D, to generate the first-ever real data set with ground truth for transparent and specular objects without object preparation (e.g., object painting or powdering). The absence of object preparation greatly simplifies the creation of the data set, both in terms of object reusability and time, as there is no need to prepare the objects or to produce opaque twins, including drying and accurately placing the unprepared and prepared objects (cf. [2,12,18,25]). In addition, the surface of the object is not manipulated (cf. [18,25]). For data set generation, any conventional 3D sensor is supplemented by a thermal 3D sensor developed by Landmann et al. [30]. The thermal 3D sensor captures the optically uncooperative objects three-dimensionally without time-consuming object preparation. This measurement principle can be used to generate real monocular as well as stereo data sets, which can be applied, e.g., to monocular depth estimation (depth-from-mono) [4,23,24] or deep stereo matching [21].
- Based on this new measurement principle, we created a new real-world (non-synthetic) stereo data set with ground truth disparity maps, named TranSpec3D. Our data set is available at https://QBV-tu-ilmenau.github.io/TranSpec3D-web (accessed on 6 September 2023).
6. Conclusions
In various applications, such as human–robot interaction, autonomous robot navigation, or autonomous waste recycling, the perception or 3D reconstruction of transparent and specular objects is required. However, such objects are optically uncooperative in the VIS and NIR ranges. The capture of transparent surfaces is still a corner case in stereo matching [18,19]. This is also reflected in the fact that most deep stereo matching networks perform worse on transparent and other visually uncooperative objects [20,29]. One reason is that generating real data sets with ground truth disparity maps is very time consuming and costly due to the necessary object preparation (or an additional opaque twin), which is reflected in the small number of available data sets (Table 1).
For this reason, we introduce our novel measurement principle TranSpec3D, which accelerates and simplifies the generation of a stereo or monocular data set with real measured ground truth for transparent and specular objects, without the state-of-the-art object preparation [18,25] or an additional opaque twin [2,14]. In contrast to conventional techniques that require object preparation [18,25], opaque twins [2,14], or 3D models [12,32] to generate the ground truth, we obtain the ground truth using an additional thermal 3D sensor developed by Landmann et al. [30]. The thermal 3D sensor captures the optically uncooperative objects three-dimensionally without time-consuming object preparation. With our measurement principle, the time and effort required to create a data set are greatly reduced: both the time-consuming object preparation and the precise placement of transparent and opaque twin objects are eliminated. In addition, the surface of the object is not manipulated, which means that sensitive objects, e.g., historical glass, can also be captured. Another special feature is the generalizability of the measurement principle: any conventional 3D sensor in VIS or NIR (sensor 2) can be extended with the thermal 3D sensor (sensor 1). For example, sensor 2 can be an RGB stereo system (with a parallel or convergent camera arrangement) to obtain a stereo data set with ground truth disparity values (for deep stereo matching [21]), or an RGB-D sensor to obtain a monocular data set with ground truth depth values (for monocular depth estimation [4,23,24]). In addition, there is high development potential to optimize the thermal 3D sensor and make this technology as accessible as possible.
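To make this generalizability concrete, combining sensor 1 and sensor 2 essentially requires a one-time extrinsic calibration between the two sensors, after which the thermal ground truth can be expressed in the coordinate frame of the conventional sensor and projected into its cameras. The following sketch uses hypothetical names and an assumed rotation/translation pair; it is an illustration under these assumptions, not the calibration procedure of this paper.

```python
# Hypothetical sketch: register ground-truth points from the thermal 3D sensor
# (sensor 1) into the frame of the conventional VIS/NIR sensor (sensor 2) using
# a one-time extrinsic calibration (R, t). After registration, the points can be
# projected to obtain ground-truth depth (RGB-D case) or disparity (stereo case).
import numpy as np

def register_to_sensor2(points_sensor1, R, t):
    """points_sensor1: (N, 3) points in the sensor 1 frame.
    R: 3x3 rotation, t: (3,) translation mapping sensor 1 -> sensor 2."""
    return points_sensor1 @ R.T + t

def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Stereo case: convert a metric depth map to disparities via d = f * B / Z."""
    disp = np.zeros_like(depth_m, dtype=np.float32)
    valid = depth_m > 0
    disp[valid] = focal_px * baseline_m / depth_m[valid]
    return disp
```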
We apply this measurement principle to generate our data set TranSpec3D. For this, we use a conventional NIR 3D sensor and the thermal 3D sensor [30]. To enlarge the data set naturally (data augmentation), we record each scene with different NIR emitter positions. After data collection, data analysis and the annotation of ground truth disparities take place. To ensure that the ground truth disparity map has the same resolution as the stereo images, the Triangle-Mesh-Rasterization-Projection (TMRP) method [39] is used. Our TranSpec3D data set consists of stereo imagery (raw as well as undistorted and rectified) and ground truth disparity maps. We capture 110 different objects (transparent, translucent, or specular) under different illumination environments, thus increasing the data set and creating different reflections on the surfaces (natural data augmentation). We categorize the 148 captured scenes by surface material (top category) and complexity (subcategory) (Figure 11), which allows a balanced split of the data set into training/validation/test sets (Figure A3). In total, the data set contains 1037 image sets (each consisting of a stereo image pair and a disparity map) and is available at https://QBV-tu-ilmenau.github.io/TranSpec3D-web (accessed on 6 September 2023). We present the advantages and disadvantages of our method by comparing our TranSpec3D data set with the Booster data set [18] (cf. Table 4).
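As a simplified illustration of how such a category-balanced split can be produced, the following sketch stratifies scenes by their (material, complexity) category before splitting. Category labels, split ratios, and the grouping by scene are illustrative assumptions; the exact split procedure of this paper is not reproduced here.

```python
# Simplified, hypothetical sketch of a category-balanced train/val/test split.
import random
from collections import defaultdict

def balanced_split(scenes, ratios=(0.7, 0.15, 0.15), seed=0):
    """scenes: list of dicts with keys 'id', 'material', 'complexity'.
    Each (material, complexity) group is split with the same ratios so that all
    categories are represented in the training, validation, and test sets."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for scene in scenes:
        groups[(scene["material"], scene["complexity"])].append(scene)

    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(ratios[0] * len(members))
        n_val = round(ratios[1] * len(members))
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, val, test
```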