1. Introduction
Robots are increasingly used in domestic applications, which differ significantly from industrial settings. Cleaning robots and robot assistants have to operate in a priori unknown environments, where, in contrast to industrial environments, objects can be found at arbitrary positions. Such applications require detecting and manipulating objects under varying lighting conditions and in the presence of occlusion. For example, when a user asks a robot for a mug, the robot should open the door to another room, open the cabinet, and find the mug. This task requires the robot to deal with various articulated objects, which can be rotational (doors) or translational (drawers).
Numerous mobile robots utilize a static model of the environment by uploading a pre-existing map. Other robots generate environment maps dynamically during operation by employing sensors and advanced algorithms for localization and mapping. Very often, the provided or obtained model is assumed to be static. However, the configuration of the articulated objects can change, and the robot should exploit the properties of articulated objects during the execution of the mission. Determining how the robot should interact with articulated objects in the environment is a challenging task. Present-day solutions predominantly rely on probabilistic methods and require executing specific sequences of movements to ascertain an object’s model [
1,
2]. This approach requires time-consuming actions and may damage articulated objects.
In contrast, in this research, we focus on detecting articulated objects and estimating their properties from RGB-D images only. This includes detecting drawers, cabinet fronts, and handles, and estimating the axes of rotation and translation. The goal is to obtain accurate information about articulated objects before the robot starts interacting with the environment. The objective of this study is to develop a system capable of building an environment model, represented by a point cloud, that incorporates information about articulated objects, specifically cabinets and drawers. The practical application of this system lies in enhancing the autonomy of mobile robots. The primary motivation behind this research is to eliminate the need for robots to interact with objects in order to understand their kinematic structure, which mitigates the risk of damage during interaction and reduces the time required for model building. To accomplish this objective, the system relies solely on a single RGB-D frame to generate the model. No supplementary information regarding the object’s location within the environment is required.
  1.1. Related Work
  1.1.1. Detection and Estimation of Articulated Objects
Recent methods for the detection of articulated objects utilize RGB-D images and neural networks. Staszak et al. [
3] proposed an RGB-D-image-based approach for estimating articulated objects like cabinets and drawers. The method uses a pair of RGB-D images, assuming that the object state changes while the robot remains stationary. The SSD object detector [
4] is employed to detect object positions in the input image. The parameters of the joint are determined by extracting a point cloud from the bounding box area and removing static points. Initial joint parameters are estimated using the RANSAC algorithm and then optimized using Particle Swarm Optimization. This approach effectively explains data in both images based on the input hypothesis [
3]. The method presented in [
3] successfully determines the position and configuration of articulated joints and objects. It was validated using the RBO dataset [
5], which includes sequences of interactions with articulated objects. However, the method is sensitive to incomplete object visibility, noisy depth data, and object occlusion, all of which reduce the accuracy of state estimation. Additionally, the system requires robot interaction with the object to estimate its model, which takes time and introduces a risk of damaging the object. In this work, we focus on estimating the information about articulated objects from a single view only.
Object interactions provide an alternative approach for building environment models with articulated objects, as discussed by Hausman et al. [
1]. This method involves sensor fusion, combining object tracking from a vision system and measurements from the robot manipulator’s sensors to reduce uncertainty in the articulated object model and parameters. The authors present a graph probabilistic model of robot interaction that includes the articulated object model, its parameters, measurements of the object’s position with six degrees of freedom, measurements from the manipulator’s sensors, and the action taken. The vision system uses noisy measurements, which are projected onto the evaluated model using inverse kinematics. The estimated configuration is relative to the original observation and lies on the model’s surface. A simple kinematics function is employed to determine the position corresponding to the assumed configuration. The sensor model is approximated by a two-dimensional normal distribution centered at a zero vector, depending on the translational and rotational distance between the model projection and the observations. The manipulator sensor model is based on short translational movements, measured in binary to indicate successful execution. The authors compare two approaches for selecting actions that reduce model uncertainty: entropy reduction and information gain. The study findings indicate that greedily minimizing entropy after each step is not the optimal method to reduce the entropy of the proposed models and their parameters. Instead, the information gain approach, using Kullback–Leibler divergence, is more effective.
The environment model can also be built using a factor graph, as proposed by Desingh et al. [
6]. This method addresses the computational complexity of algorithms based on particle filters. The authors use a graph to reduce complexity, taking a 3D point cloud and a geometric model as input. The problem is represented as an undirected Markov Random Field (MRF), in which hidden variables are inferred to maximize the joint probability. Inference involves message exchange between variables until convergence. The best estimate determines the positions of object parts, which together form the overall position estimate. The Pull Message Passing for Nonparametric Belief Propagation (PMPNBP) algorithm is used for inference; it evaluates samples and approximates mixtures efficiently [
6].
In [
2], the robot acquires two point clouds. The first point cloud is obtained before the interaction of the robot with the object. Then, the robot employs control techniques to actively engage with the object through a concise one-step action, followed by capturing a subsequent point cloud. By utilizing these two point clouds, the robot generates a comprehensive physics model. Then, the simulated model is utilized to obtain the control trajectory for the robot. In [
7], two RGB-D images and a fully convolutional neural network architecture are used to estimate the configuration of the articulated object. A similar approach presented in [
8] identifies potential articulated objects by leveraging affordance prediction. An affordance network produces an affordance map, based on which the robot actively explores the scene and interacts with the objects to generate articulated motions. The articulation properties are then deduced from visual observations before and after each interaction. Then, the neural-based method from [
9] simultaneously predicts the 3D geometry and articulation model of the object. In contrast, our method focuses on extracting information about articulated objects from a single view.
  1.1.2. Neural Networks for Object Detection
In general, object detection is the task of identifying and locating objects in an image [
10]. It is more challenging than object recognition, as it involves finding the position of objects in the image and assigning appropriate labels and probabilities [
10]. The region proposal block selects promising regions for further processing, providing bounding boxes and objectness scores [
10]. Region proposals can be generated using methods such as Selective Search or the Region Proposal Network (RPN) used in Faster R-CNN [
11]. These methods generate object candidates efficiently [
11]. YOLO [
12] and SSD [
4] detectors adopt a grid-based approach, treating position estimation as regression to achieve high speed.
The detected object is classified based on the detector output and visual features inside the bounding box. In neural-network-based systems, object detection, feature extraction, and classification are usually handled by convolutional neural networks, which extract features and generalize well. The most common backbones for feature extraction are ResNet [
13], MobileNet [
14], EfficientNet [
15], and EfficientNetV2 [
16], as well as transformer-type networks such as ViT [
17], BEiT [
18], DeiT [
19], and Swin [
20].
  1.1.3. Neural Networks for Scene Segmentation
In the framework proposed in this article, we utilize various architectures for image segmentation. Segmentation divides an image into coherent regions based on similar characteristics [
21]. Unlike classification, which assigns a class to the whole image, segmentation assigns a class to each pixel [
21]. The network architecture for image segmentation typically consists of an encoder and a decoder [
22]. The U-Net is a popular segmentation network with a symmetrical architecture [
23]. It captures the context in the image and precisely locates regions using feature map fusion [
23]. DeepLab employs atrous convolutions to extract context efficiently [
24], while FastFCN replaces them with a Joint Pyramid Upsampling module for higher-resolution feature maps [
25].
  1.2. Approach and Contribution
The main contributions of this article include the following:
- neural-based handle detection, front detection and segmentation, and rotational axis detection modules; 
- data collection method, training, and systematic comparison between various detectors and segmentation methods; 
- architecture of the system that builds the full model of the articulated objects from single RGB-D images. 
We utilize state-of-the-art Deep Neural Network models and re-train them to fulfill the presented tasks.
  2. Materials and Methods
The architecture of the proposed method for the detection and estimation of articulated objects is presented in 
Figure 1.
The purpose of the system is to build a point-cloud-based environment model that takes into account the articulated object’s position, the handle, and the axis of rotation or translation. The entire system was created in ROS Noetic (Robot Operating System), a meta-operating system used to develop robotic software. It consists of five main components (
Figure 1): a Kinect camera, a handle detector, a front detector, a module that determines axes of rotation, and a module that builds the environment model. The diagram depicts the tasks of each node and their relationships. The Kinect camera publishes RGB and depth images, which are converted into a point cloud. The handle and front detector modules operate solely on RGB images. The prediction results from both detectors are forwarded to the rotation axis detection module and the scene model building node. The rotation axis detection module determines the rotation axis of rotating fronts in the RGB image and publishes the prediction results to the module that builds the scene model.
The detected objects are utilized to construct an environment model, incorporating information about articulated objects. Based on the handles and fronts found in the image, their positions in 3D space are estimated. For translational fronts, the positions of the translation axes are estimated using the normal vectors of the front surfaces.
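The step from a detection in the image to a position in 3D space can be illustrated with a pinhole back-projection. The following is a minimal sketch under stated assumptions rather than the system's ROS implementation: the function names `backproject` and `box_center_3d`, the use of the median depth inside the box, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical choices made for illustration.

```python
from statistics import median


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with depth in metres to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def box_center_3d(box, depth, fx, fy, cx, cy):
    """Estimate the 3D position of a bounding box centre.

    box   = (u_min, v_min, u_max, v_max), inclusive pixel bounds
    depth = 2D list indexed as depth[v][u], in metres, 0 meaning "no reading"
    The median of the valid depths inside the box is an assumption of this
    sketch, chosen because it tolerates missing and noisy depth pixels.
    """
    u_min, v_min, u_max, v_max = box
    values = [
        depth[v][u]
        for v in range(v_min, v_max + 1)
        for u in range(u_min, u_max + 1)
        if depth[v][u] > 0
    ]
    if not values:
        return None  # no depth available inside the box
    z = median(values)
    return backproject((u_min + u_max) / 2, (v_min + v_max) / 2, z, fx, fy, cx, cy)
```

In practice the intrinsics would come from the camera calibration published alongside the depth image, and the depth image must be registered to the RGB image for the pixel coordinates to agree.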
  2.1. Handle Detection Module
The handle detection module aims to identify the bounding boxes of handles in the RGB image and publish the prediction results. To achieve this, an object-detection neural network is employed. Several networks were tested, including SSD [
4] with an Inception V2 encoder [
26], EfficientDet [
15], CenterNet architecture [
27] with a ResNet-v1-50 encoder [
13], and CenterNet with a ResNet-v1-101 encoder. The goal is to select the most efficient architecture for handle detection.
The CenterNet architecture represents objects as a single point, denoting the center of their bounding box, instead of utilizing a list of proposed rectangular regions. The network employs keypoint estimation to locate the center point and regress the object’s size, location, and orientation. This approach is claimed to be simpler, faster, and more accurate compared to detectors based on a list of proposed regions. In the articulated object information system, the handle detector takes an RGB image as input and produces a list of bounding boxes, class membership information, and class probability predictions. This prediction format facilitates the determination of handle locations in three-dimensional space, enabling easier robot interaction with the object.
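The center-point representation can be made concrete with a simplified decoding step. This sketch is not the CenterNet implementation: it assumes a single-class heatmap and a per-pixel size map already at image resolution, and it omits the sub-pixel offset head and the output stride of the real network. The function name `decode_centers` and the default score threshold are hypothetical.

```python
def decode_centers(heatmap, sizes, score_threshold=0.3):
    """Turn a centre heatmap and a size map into bounding boxes.

    heatmap[y][x] is the centre score in [0, 1].
    sizes[y][x] is the regressed (width, height) of a box centred at (x, y).
    Returns a list of (x_min, y_min, x_max, y_max, score).
    """
    height, width = len(heatmap), len(heatmap[0])
    boxes = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < score_threshold:
                continue
            # Keep only local peaks in the 3 x 3 neighbourhood.
            neighbours = [
                heatmap[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            if score < max(neighbours):
                continue
            box_w, box_h = sizes[y][x]
            boxes.append((x - box_w / 2, y - box_h / 2,
                          x + box_w / 2, y + box_h / 2, score))
    return boxes
```

Because each object is a single heatmap peak, no list of region proposals has to be scored and pruned, which is the simplification the paragraph above refers to.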
  2.1.1. Dataset
To train the handle detector, a suitable dataset must be collected. This dataset should contain images with labeled handles, specifying their positions as bounding boxes. Unfortunately, no publicly available dataset provides a sufficient number of labeled images with handles. Therefore, a new dataset was created for the training, validation, and testing of the neural networks. It consists of publicly available images and images captured in different indoor environments. This diversity makes the data more representative of the environments in which the robot is expected to operate.
  2.1.2. Training
All images in the dataset are resized to a resolution of 640 × 480 px to match the images from the Kinect camera. Sample images with applied bounding boxes for handles are shown in 
Figure 2. The dataset details are summarized in 
Table 1. The network training utilized the training and validation dataset described in 
Table 1. The training images underwent heavy augmentation. The localization loss was the L1 norm (Manhattan distance), and the classification loss was the Focal Loss [
28]. A prediction with an Intersection over Union (IoU) greater than 0.7 was considered a correct detection. The mean Average Precision (mAP) metric, calculated as the average area under the precision–recall curve over all classes, was used for final model evaluation. The optimization employed the Adam (Adaptive Moment Estimation) algorithm, which adapts the step size of each parameter using running estimates of the first and second moments of the gradient. The learning rate followed a cosine decay schedule [
29], with a base value of 0.001 and a warm-up phase of 1500 steps with a learning rate of 0.00025. Pre-trained weights from the COCO dataset [
30] were used as initial weights for the model, enabling fine-tuning specifically for the handle detection task. The training was scheduled for 20,000 steps.
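The learning-rate schedule described above can be sketched as follows. The values 0.001, 0.00025, 1500, and 20,000 come from the text; the shape of the warm-up (a linear ramp from the warm-up rate to the base rate) and the decay to zero at the final step are assumptions of this sketch, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, base_lr=0.001, warmup_lr=0.00025,
                  warmup_steps=1500, total_steps=20000):
    """Cosine decay with warm-up (illustrative, see assumptions above)."""
    if step < warmup_steps:
        # Assumed linear ramp from warmup_lr up to base_lr.
        return warmup_lr + (base_lr - warmup_lr) * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    # Half a cosine period: base_lr at progress 0, zero at progress 1.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Starting below the base rate limits large updates to the pre-trained weights in the first steps, and the smooth decay avoids the abrupt drops of a step schedule.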
  2.2. Front Detection and Segmentation Module
The goal of the front detection and segmentation module is to detect the fronts of cabinets and drawers. The detected objects are represented by bounding boxes, and within these boxes, segmentation is performed using neural networks to determine the pixels belonging to the articulated object.
  2.2.1. Dataset
To train a front detector and perform segmentation of the detected fronts, a suitable dataset needs to be collected. This dataset should include images with labeled fronts, along with annotated bounding boxes, segmentation masks, and assigned front classes (rotational or translational). Unfortunately, no publicly available dataset provides a sufficient number of such annotations. Therefore, a new dataset was created for the training, validation, and testing of the neural networks. Example annotations of the fronts of articulated objects from the training dataset are shown in 
Figure 2b.
In certain cases, distinguishing between rotational and translational objects is challenging. This situation arises when the handle is located at the central point of the front, and the front is taller than it is wide or has a square shape. Determining whether the rotation axis is on the right or left side becomes difficult, as shown in 
Figure 3a. Another scenario occurs when it is challenging to determine whether the rotation axis is on the right, left, or bottom of the front, as depicted in 
Figure 3b. Additionally, the front’s appearance may not match its actual class. We decided not to include these images in the training set because these objects require additional robot–object interaction to determine their class. Sample images used for training are displayed in 
Figure 2, and an overview of the dataset is presented in 
Table 2.
  2.2.2. Training
A loss function with three components was used for the front detection module:
- the position of the detected window, evaluated using the smooth L1 loss, which is less sensitive to outliers than the L2 loss; 
- the classification of the detected window, evaluated using the standard cross-entropy loss; 
- the segmentation of the detected window, evaluated using the binary cross-entropy of the segmentation mask. 
All components were given equal weights, and a prediction with an IoU greater than 0.5 was considered a correct detection. Stochastic Gradient Descent with momentum was employed as the optimization algorithm for weight updates. The initial learning rate was set to 0.001 and was reduced whenever the cost function calculated on the validation set approached a plateau. The model was initialized with weights pre-trained on the COCO dataset [
30]. The network training was scheduled for 500 epochs, with an early stopping criterion checked after each epoch to halt the training process when the cost function on the validation set no longer decreased.
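The three equally weighted components listed above can be written out for a single detection. This is a plain-Python sketch for illustration, not the training code: the function names, the `beta` transition point of the smooth L1 loss, and the `eps` clamp are assumptions, and inputs are flat lists rather than tensors.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def cross_entropy(class_probs, true_class):
    """Cross-entropy for one sample given predicted class probabilities."""
    return -math.log(class_probs[true_class])


def binary_cross_entropy(mask_probs, mask_target, eps=1e-7):
    """Mean per-pixel binary cross-entropy of a flattened mask."""
    total = 0.0
    for p, t in zip(mask_probs, mask_target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(mask_probs)


def front_loss(box_pred, box_true, class_probs, true_class,
               mask_probs, mask_true):
    """Sum of the three components with equal weights, as stated in the text."""
    return (smooth_l1(box_pred, box_true)
            + cross_entropy(class_probs, true_class)
            + binary_cross_entropy(mask_probs, mask_true))
```

The linear tail of the smooth L1 term is what limits the influence of a badly localized box: an error of 3 contributes 2.5, where a squared error would contribute 4.5.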
  2.3. Rotational Axis Detection Module
The task of this module is to identify the axis of rotation in an RGB image and publish the prediction results. In the proposed method, rotational axis detection is treated as an object segmentation problem. Therefore, a neural network for object segmentation was chosen, namely the U-Net architecture [
23], which utilizes EfficientNetB0 as an encoder.
The output of this component is four lists containing the coordinates of the two points lying on the ends of the axes, three lists containing the A, B, and C coefficients of the straight lines that correspond to the axes found, and a list containing the index of the front to which a given axis of rotation has been assigned. All lists have the same length, equal to the number of predictions remaining after processing. The results prepared in this way will allow subsequent estimation of the axis of rotation in three-dimensional space.
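The output format described above can be illustrated with a short sketch that derives the A, B, and C coefficients of the line Ax + By + C = 0 from two endpoints and packs the results into parallel lists. The function names and dictionary keys are hypothetical, and the coefficients are left unnormalized because the text does not state a normalization.

```python
def line_through(p1, p2):
    """Coefficients (A, B, C) of the line A*x + B*y + C = 0 through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if (x1, y1) == (x2, y2):
        raise ValueError("endpoints must be distinct")
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def axes_to_lists(axes):
    """Pack axes into eight parallel lists of equal length.

    axes is a list of (top_point, bottom_point, front_index) tuples.
    """
    out = {key: [] for key in ("x1", "y1", "x2", "y2", "A", "B", "C", "front")}
    for p1, p2, front_index in axes:
        a, b, c = line_through(p1, p2)
        for key, value in zip(out, (p1[0], p1[1], p2[0], p2[1],
                                    a, b, c, front_index)):
            out[key].append(value)
    return out
```

For a vertical axis through (100, 50) and (100, 300), `line_through` gives (-250, 0, 25000), which is the line x = 100 up to scale.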
  2.3.1. Dataset
We created a custom dataset to train the network for rotational axis segmentation, because no publicly available dataset provides the required data for the training, validation, and testing of the neural networks. The dataset consists of images containing rotational fronts, along with rotation axis masks, handle masks, and rotational front masks. Ambiguous rotational axes were omitted during dataset creation. Therefore, the module does not estimate the axes of rotating objects with horizontal axes. Some of the images were obtained directly from the handle and front detection datasets. An overview of the dataset is presented in 
Table 3.
  2.3.2. Training
The network training was conducted on the training and validation sets described in 
Table 3. The Dice coefficient was used as the basis of the cost function to evaluate segmentation effectiveness. The Adam optimizer was utilized for training. The initial learning rate was set to 0.001 and, as in the training of the front detection and segmentation network, it was reduced when the cost function calculated on the validation set approached a plateau. The encoder was initialized with weights pre-trained on the ImageNet dataset, so that training did not start from random weights and the model converged faster. The training was scheduled for 50 epochs, with all training and validation images used in each epoch. To mitigate overfitting, early stopping was applied, halting the training when the cost function on the validation set ceased to decrease.
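A cost function built on the Dice coefficient is commonly written as one minus the coefficient, which is the form sketched below. The text does not give the exact formulation, so the `1 - Dice` form and the `smooth` term that keeps the ratio defined for empty masks are assumptions, as are the function names.

```python
def dice_coefficient(pred, target, smooth=1.0):
    """Dice coefficient between a flattened prediction and a binary target.

    pred may hold probabilities in [0, 1]; target holds 0 or 1.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


def dice_loss(pred, target, smooth=1.0):
    """Loss that reaches 0 when prediction and target overlap perfectly."""
    return 1.0 - dice_coefficient(pred, target, smooth)
```

An overlap-based loss suits thin structures such as axis masks, where the foreground covers only a small share of the image and a per-pixel loss would be dominated by background pixels.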
Next, a five-channel input is prepared for the neural network. The first three channels correspond to the RGB image (
Figure 2). The fourth channel represents the handle mask, which includes the detection of all handles in the image, even if these handles belong to translational fronts (
Figure 2). The last channel represents the mask of all rotational fronts for which handles are detected (
Figure 2).
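The five-channel input can be sketched with nested lists. The sketch assumes that the handle mask is obtained by filling the detected handle bounding boxes and that no channel scaling is applied; neither detail is stated in the text. The function names are hypothetical.

```python
def rasterize_boxes(boxes, height, width):
    """Binary mask with 1 inside every (x_min, y_min, x_max, y_max) box."""
    mask = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        for y in range(max(0, y_min), min(height, y_max + 1)):
            for x in range(max(0, x_min), min(width, x_max + 1)):
                mask[y][x] = 1
    return mask


def build_input(rgb, handle_boxes, front_mask):
    """Stack RGB, handle mask, and rotational front mask into H x W x 5.

    rgb[y][x] is an (r, g, b) tuple; front_mask[y][x] is 0 or 1.
    """
    height, width = len(rgb), len(rgb[0])
    handle_mask = rasterize_boxes(handle_boxes, height, width)
    return [
        [list(rgb[y][x]) + [handle_mask[y][x], front_mask[y][x]]
         for x in range(width)]
        for y in range(height)
    ]
```

Feeding the masks as extra channels lets the segmentation network condition its axis prediction on where handles and fronts were found, instead of having to rediscover them from the RGB channels alone.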
Postprocessing is applied to the output from the neural network to obtain the rotational axis model. First, noise is eliminated from the mask obtained from the CNN output using a morphological opening, which removes isolated pixels while preserving the size of larger objects. Next, the two endpoints of each identified pivot axis are located on the denoised mask. The Canny algorithm is utilized to find edges in the mask. Based on the detected edges, individual contours are extracted, with their upper and lower vertices representing the two endpoints of the axis of rotation in the image. This method is limited to rotational fronts with a vertical axis of rotation.
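The denoising and endpoint extraction can be sketched in plain Python. Note the substitution: the text uses the Canny algorithm followed by contour extraction, whereas this sketch groups mask pixels with a 4-connected flood fill, which yields the same upper and lower vertices for simple, roughly vertical blobs. The 3 × 3 structuring element and all function names are assumptions.

```python
from collections import deque


def _morph(mask, keep):
    """Apply `keep` (all or any) over the in-bounds 3 x 3 neighbourhood."""
    h, w = len(mask), len(mask[0])
    return [[int(keep(mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]


def opening(mask):
    """Erosion followed by dilation: removes specks, keeps larger blobs."""
    return _morph(_morph(mask, all), any)


def axis_endpoints(mask):
    """Return one (top_point, bottom_point) pair per connected blob."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    endpoints = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            queue, pixels = deque([(x0, y0)]), []
            seen[y0][x0] = True
            while queue:  # 4-connected flood fill
                x, y = queue.popleft()
                pixels.append((x, y))
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if (0 <= nx < w and 0 <= ny < h
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            top_y = min(p[1] for p in pixels)
            bottom_y = max(p[1] for p in pixels)
            top_xs = [x for x, y in pixels if y == top_y]
            bottom_xs = [x for x, y in pixels if y == bottom_y]
            endpoints.append(((sum(top_xs) / len(top_xs), top_y),
                              (sum(bottom_xs) / len(bottom_xs), bottom_y)))
    return endpoints
```

Taking the topmost and bottommost rows of each blob is what restricts the approach to vertical axes: for a horizontal axis these two rows would nearly coincide.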
Using the determined endpoints, the coefficients of the straight line forming the rotational axis are calculated. These coefficients are then used to assign the axis of rotation to a specific front. First, for each rotational front, a decision is made regarding which side of the front contains the axis of rotation. This decision is based on the distance of the handle from the side edges of the front mask: the axis of rotation is assumed to lie on the side whose edge is farther from the handle. The resulting front segmentation mask may not be perfect, so the edge alone cannot be relied upon to determine the axis of rotation. Subsequently, the distance between the center of this edge and each of the straight lines representing the detected axes of rotation is calculated. The axis assigned to a given front is the one that lies closest to the center of the edge on the rotating side of that front, provided that this distance falls below a certain threshold; otherwise, the front has no assigned axis of rotation. If the closest axis is already assigned to another front, it is reassigned only if the new front is closer to that axis than the front to which it was originally assigned. After analyzing all rotational fronts, rotation axis predictions that have not been assigned to any front are discarded.
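The assignment procedure can be sketched as follows. Fronts are simplified here to the bounding box of their mask plus the horizontal position of the handle, and the threshold `max_distance` is a free parameter because the text does not give its value. The function names and data layout are hypothetical.

```python
import math


def point_line_distance(point, line):
    """Distance from (x, y) to the line A*x + B*y + C = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)


def hinge_edge_center(front_box, handle_x):
    """Centre of the side edge that lies farther from the handle."""
    x_min, y_min, x_max, y_max = front_box
    x_side = x_min if (handle_x - x_min) > (x_max - handle_x) else x_max
    return (x_side, (y_min + y_max) / 2)


def assign_axes(fronts, lines, max_distance):
    """Map axis index -> front index.

    fronts: list of (front_box, handle_x); lines: list of (A, B, C).
    Axes missing from the result were not assigned and are discarded.
    """
    owner = {}  # axis index -> (front index, distance)
    for front_index, (box, handle_x) in enumerate(fronts):
        if not lines:
            break
        center = hinge_edge_center(box, handle_x)
        distances = [point_line_distance(center, line) for line in lines]
        axis_index = min(range(len(lines)), key=distances.__getitem__)
        distance = distances[axis_index]
        if distance > max_distance:
            continue  # front keeps no axis
        # Reassign an already taken axis only to a strictly closer front.
        if axis_index not in owner or distance < owner[axis_index][1]:
            owner[axis_index] = (front_index, distance)
    return {axis: front for axis, (front, _) in owner.items()}
```

In this sketch a front that loses its axis to a closer front is left without one, which matches the description above but is a simplification if the actual system retries with the next closest axis.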
  2.4. Articulated Object Model
After receiving the point cloud, the node incorporates information about articulated objects into the environment model. The node processes each object individually by reading its class and creating a vector of indices for the points within the object’s segmentation mask. A smaller point cloud containing only the detected object is extracted based on these indices. Using the RANSAC algorithm and iterative estimation, the node identifies the plane corresponding to the object within the extracted point cloud. It returns the indices of the points belonging to the plane and its coefficients, creating a point cloud solely consisting of points on the plane.
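The plane extraction step can be illustrated with a minimal RANSAC loop. This is a pure-Python stand-in written for illustration and not the routine the node actually calls; the distance threshold, iteration count, fixed seed, and function names are assumptions.

```python
import math
import random


def fit_plane(p1, p2, p3):
    """Plane (a, b, c, d) with unit normal through three points, or None."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm < 1e-12:
        return None  # the three points are (nearly) collinear
    a, b, c = nx / norm, ny / norm, nz / norm
    return (a, b, c, -(a * p1[0] + b * p1[1] + c * p1[2]))


def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Return (plane, inlier_indices) for the plane with the most inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

The returned indices play the role of the plane point indices mentioned above, and the first three coefficients form the plane normal. A common refinement, omitted here, is a least-squares refit on the inliers.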
For translational objects, the node computes the normal to the object’s surface, which determines its translation axis. The normals are estimated with the PCL library by applying Principal Component Analysis to the nearest neighbors of each point. The estimation is performed on a selected portion of the point cloud using the OpenMP interface for multi-threaded calculations, which improves efficiency. The normal vectors calculated for the selected portion are averaged to obtain a single normal vector representing the object’s translation axis.
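The final averaging step can be sketched as below, taking the per-point normals as already estimated. Flipping each normal to face the sensor before averaging is an assumption of this sketch, added because PCA determines a normal only up to sign and opposite normals would otherwise cancel. The function name and the default `toward_sensor` direction (camera looking along +z) are hypothetical.

```python
import math


def average_normal(normals, toward_sensor=(0.0, 0.0, -1.0)):
    """Average unit normals into a single unit vector.

    Each normal is first flipped so that it points into the same half-space
    as `toward_sensor`, then the vectors are summed and renormalized.
    """
    vx, vy, vz = toward_sensor
    sx = sy = sz = 0.0
    for nx, ny, nz in normals:
        if nx * vx + ny * vy + nz * vz < 0:
            nx, ny, nz = -nx, -ny, -nz
        sx, sy, sz = sx + nx, sy + ny, sz + nz
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    if norm == 0:
        raise ValueError("normals cancel out or the list is empty")
    return (sx / norm, sy / norm, sz / norm)
```

For a drawer front, the resulting vector gives the direction along which the drawer slides, expressed in the camera frame.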
Regarding rotational objects, the estimation of the rotation axis occurs when the axis prediction is available. The point cloud representing the frontal plane is assigned a color corresponding to its class and then added to the output point cloud containing the environment model and articulated object information. The estimation of handle positions takes place upon receiving the handle prediction results. Unlike the estimation of frontal positions, handle estimation does not involve calculating normals, and the point clouds with handle planes are not stored separately for later processing. They are directly included in the output point cloud representing the environment model.
Estimating the rotation axis of rotational objects is performed when the axis prediction on the RGB image becomes available. This estimation is done separately for each axis. A previously stored vector of point clouds containing clouds with rotational front planes is used. The rotation axis is estimated based on the index obtained from the axis prediction. The endpoints of the axis in 3D space are then estimated using the endpoints found in the RGB image. If an endpoint from the RGB image does not have a corresponding point in the point cloud, additional operations are performed to determine its position in space. Each point in the cloud is projected onto the RGB image, allowing for the calculation of the distance between the point and the axis line in the image, as well as the distance between the point and the predicted endpoint. The distances for each point are summed, and the point with the smallest summed distance in the RGB image is selected as the new endpoint of the rotation axis. The two endpoints are connected to form the rotation axis.
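The fallback used when an image endpoint has no corresponding cloud point can be sketched as a search over projected points. The pinhole projection and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions of this sketch, and the function names are hypothetical; the cost is the sum of the two image-space distances described above.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def select_endpoint(cloud, line, endpoint_px, fx, fy, cx, cy):
    """Pick the cloud point that best replaces a missing axis endpoint.

    line is (A, B, C) of the axis in the image; endpoint_px is the predicted
    endpoint (u, v). Returns the 3D point with the smallest summed distance.
    """
    a, b, c = line
    line_norm = math.hypot(a, b)
    best_point, best_cost = None, math.inf
    for point in cloud:
        if point[2] <= 0:
            continue  # behind the camera or invalid depth
        u, v = project(point, fx, fy, cx, cy)
        to_line = abs(a * u + b * v + c) / line_norm
        to_endpoint = math.hypot(u - endpoint_px[0], v - endpoint_px[1])
        cost = to_line + to_endpoint
        if cost < best_cost:
            best_point, best_cost = point, cost
    return best_point
```

Summing the two distances favors points that stay on the axis line even when they are not the nearest to the predicted endpoint, which keeps the reconstructed 3D axis aligned with the image prediction.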
  3. Results
  3.1. Handle Detection
Example results of handle detection are presented in 
Figure 4a. Four architectures for handle detection were tested. In 
Table 4, we show a numerical comparison of these models. It covers detector performance, calculated as mAP for Intersection over Union (
IoU) thresholds of 0.5 and 0.75, and the average inference time for a single image. Inference was performed on a machine equipped with a quad-core AMD Ryzen 3 3100 processor supporting eight threads.
The shortest single-image inference time was obtained by the network with the SSD Inception V2 architecture. At the same time, this network had the lowest mAP score for both IoU thresholds. The EfficientDet D0 network achieved mAP scores that were 73.5% and 109.9% higher, respectively, than those of the SSD Inception V2 network, at the cost of an inference time that was 169.8% longer. The best mAP value for the 0.5 threshold was obtained by the CenterNet network with ResNet101 as the encoder. This network, however, had the longest inference time of all the networks tested. With ResNet50 as the encoder, the mAP value for the 0.5 threshold was 0.43% lower, while the inference time was 37.49% shorter than with ResNet101. In addition, the network with ResNet50 achieved a 2.60% better mAP score for the 0.75 IoU threshold, which means that it localized the bounding boxes slightly more accurately. Relative to the EfficientDet D0 network, CenterNet ResNet50 achieved mAP scores that were 27.65% and 123.11% higher, but its processing time for a single image was 121.24% longer. From the results presented in 
Table 4, we conclude that the EfficientDet D0 network has a problem locating the surrounding rectangle accurately, as the mAP@IoU = 0.75 value is 70.84% worse than the mAP@IoU = 0.5 value.
In 
Figure 5, we compare the detections of the EfficientDet D0 and CenterNet ResNet50 networks on selected images from the test set to illustrate the differences in their performance. According to 
Table 4, the EfficientDet D0 network has a lower recall than the CenterNet network, which is manifested by the number of detected handles in 
Figure 5a,b. In 
Figure 5, we show a difference in the positioning accuracy of the rectangle surrounding the handle. The EfficientDet D0 network has trouble capturing the entire object, and sometimes the object bounding box is too large. Both networks are capable of generating false-positive predictions, but the CenterNet network gives them lower confidence. By setting a higher detection threshold, such predictions can be discarded while maintaining a large number of correct detections.
Handle detection plays a very important role in the environment model building system. In addition to influencing the position of the handle in 3D space, it also affects the accurate estimation of the axis of rotation, because one of the inputs of the axis estimation module is the mask of detected handles, which is then fed to the neural network. Handle detection also helps determine which side of the front is rotating. Taking this into account, it was necessary to choose a network with high detection sensitivity while maintaining high localization precision. For this reason, we decided to use the CenterNet network with ResNet50 as the encoder, which had a similar mAP score to CenterNet ResNet101, but the average inference time was significantly shorter.
  3.2. Front Detection and Segmentation
Example results of the front segmentation are presented in 
Figure 4b. The average inference time of a single image, on a machine equipped with a quad-core processor AMD Ryzen 3 3100 supporting eight threads, was 1.19 s. 
Table 5 and 
Figure 6 show the results of the detection performed on the test set described in 
Table 2. The mAP value for an IoU of at least 0.5 was 0.86 when the detection threshold was set to 0.5. Increasing the minimum detection threshold causes the mAP value to decrease, because many detected areas no longer exceed the threshold and are missed, which lowers detection sensitivity. Increasing the IoU threshold to 0.75 reduced the mAP score for each detection threshold relative to IoU = 0.50. For detection thresholds of 0.5, 0.75, and 0.95, mAP scores deteriorated by 9.30%, 8.43%, and 7.80%, respectively. However, the mAP values remain high, which means that the detector performs very well in detecting fronts, even under stricter requirements on the location of the bounding box.
It is also worth noting that for an IoU threshold of 0.75, the decrease in mAP between the detection thresholds of 0.5 and 0.95 is only 8.97%, while for IoU = 0.50 it is 10.47%. This means that as the minimum overlap threshold increases, low-probability detections are increasingly rejected. A similar relationship can be seen for the IoU = 0.90 threshold, where the decrease in mAP between the detection thresholds of 0.5 and 0.95 is only 2.56%. Looking at the mAP@IoU = 0.75 values from 
Table 4 and 
Table 5, it can be seen that the front detector does a better job of positioning the detected objects than the handle detector. The difference is due to the size of the objects to be detected and the shapes of these objects. The dimensions of the fronts are significantly larger than those of the handles, making them easier to find. Fronts also tend to have a similar rectangular shape, while the shape of handles often varies.
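The interplay between the detection threshold and the IoU threshold discussed above can be made concrete with a small matching routine. It is an illustrative sketch, not the evaluation code behind Table 5: it counts true positives, false positives, and false negatives at one operating point using greedy matching in descending score order, which is an assumed matching rule. The boxes in the usage note are made-up values.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def count_matches(detections, ground_truth, score_threshold, iou_threshold):
    """Return (true_positives, false_positives, false_negatives).

    detections is a list of (box, score). Detections below score_threshold
    are dropped; the rest are matched one-to-one with ground-truth boxes.
    """
    kept = sorted((d for d in detections if d[1] >= score_threshold),
                  key=lambda d: d[1], reverse=True)
    unmatched = list(ground_truth)
    true_positives = 0
    for box, _ in kept:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        # The text counts a detection as correct when IoU is greater than
        # the threshold, hence the strict comparison.
        if best is not None and iou(box, best) > iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    return true_positives, len(kept) - true_positives, len(unmatched)
```

With two ground-truth boxes and three detections scored 0.99, 0.8, and 0.6, raising the detection threshold from 0.5 to 0.95 removes the false positive but also drops a correct, lower-scored detection, which is the loss of sensitivity described above.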
In 
Figure 7, we show the importance of selecting an appropriate detection threshold. The detector should generate predictions as close to ideal as possible because their accuracy determines the position of the front in 3D space and the estimation of the axis of rotation or translation of the fronts. A high threshold should ensure that the number of false-positive detections will be low. However, it can generate false-negative predictions (
Figure 6), as increasing the detection threshold decreases the sensitivity of detection. False-negative detections, however, have less impact on system performance than false-positive detections. Taking this into account, the default detection threshold for the front detector is set to 0.95, but even this does not guarantee that false-positive samples will not be included in the system.
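The trade-off between false-positive and false-negative detections can be illustrated with a toy example. The detection scores, the correctness flags, and the function name `evaluate` below are invented for illustration and do not come from the test set; the sketch only shows how raising the detection threshold from 0.5 to 0.95 trades false positives for false negatives.

```python
# Hypothetical detections: (confidence score, whether it matches a real front).
detections = [
    (0.99, True),
    (0.97, True),
    (0.91, True),
    (0.80, False),
    (0.62, True),
    (0.55, False),
]
num_ground_truth = 4  # number of real fronts in this made-up scene


def evaluate(detections, threshold, num_ground_truth):
    """Count true positives, false positives, and false negatives at a threshold."""
    kept = [is_correct for score, is_correct in detections if score >= threshold]
    true_positives = sum(kept)
    false_positives = len(kept) - true_positives
    false_negatives = num_ground_truth - true_positives
    return true_positives, false_positives, false_negatives


for threshold in (0.5, 0.95):
    tp, fp, fn = evaluate(detections, threshold, num_ground_truth)
    print(f"threshold {threshold}: TP={tp}, FP={fp}, FN={fn}")
```

With these made-up scores, the threshold of 0.5 keeps every front but admits two false positives, while the threshold of 0.95 removes all false positives at the cost of two missed fronts.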
  3.3. Rotation Axis Estimation Module
Table 6 compares the number of detected rotational axes before and after processing the prediction from the neural network. Before processing, 79.55% of the rotational axes were correctly detected, while after processing only 61.02% remained. Before processing, a detection was counted as correct if it could be visually assigned to an existing front; after processing, it was counted as correct if the system successfully assigned the axis to an existing front. The number of false detections before processing is large: they accounted for 35.32% of all results. Prediction processing is designed to remove the axes that cannot be assigned to a rotational front. After processing, false predictions are defined as detections that were assigned to the wrong front, and they accounted for only 3.54% of all detections. The lower number of both correct and false detections after processing results from the strict procedure for assigning axes to fronts, which depends on the quality of the segmentation of the rotational front, the quality of handle detection, and the quality of the segmentation of the rotational axis itself.
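The percentages in this paragraph can be reproduced from the counts in Table 6. The sketch below assumes that the share of correctly detected axes is taken relative to the 313 axes of the test set listed in Table 3, and that the share of false detections is taken relative to all detections at the given stage; both denominators are inferred from the reported numbers rather than stated explicitly, and the names used are illustrative.

```python
TOTAL_TEST_AXES = 313  # rotational axes in the test set (Table 3), assumed denominator

# (correct detections, false-positive detections) copied from Table 6.
COUNTS = {
    "before postprocessing": (249, 136),
    "after postprocessing": (191, 7),
}


def summarize(correct, false_positive, total_axes):
    """Return (share of axes detected, share of false detections) in percent."""
    detected_share = 100.0 * correct / total_axes
    false_share = 100.0 * false_positive / (correct + false_positive)
    return round(detected_share, 2), round(false_share, 2)


summary = {
    stage: summarize(correct, false_positive, TOTAL_TEST_AXES)
    for stage, (correct, false_positive) in COUNTS.items()
}

for stage, (detected_share, false_share) in summary.items():
    print(f"{stage}: {detected_share}% of axes detected, {false_share}% false detections")
```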
 Figure 8 visually shows the results of neural network prediction for the rotation axis. The proposed method correctly associates the edges of the objects with rotational axes and removes incorrect detections from the neural network.
   3.4. Articulated Object Model
The average parameter estimation time for a single articulated object in 3D space is shown in 
Table 7. The most time-consuming step is the estimation of the axis of rotation of an object in 3D space. Estimating the parameters of the front takes 3.11% less time, and estimating the translation axis takes 8.30% less time than estimating the rotation axis. The translation-axis time was measured from the moment the plane of the translational front is determined in 3D space until the normal to this plane, which is the translation axis of the object, is found.
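As an illustration of the last step, the normal of a front plane can be obtained from points lying on that plane. The sketch below is a simplification that uses only three non-collinear points and a cross product; a practical system would more likely fit the plane to many points of the segmented front. The function names, the sample coordinates, and the convention of orienting the normal toward a camera placed at the origin are assumptions made for this example.

```python
import math


def subtract(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three points, oriented toward the origin."""
    normal = cross(subtract(p1, p0), subtract(p2, p0))
    length = math.sqrt(dot(normal, normal))
    if length == 0.0:
        raise ValueError("points are collinear; the plane is not defined")
    normal = tuple(component / length for component in normal)
    # Flip the normal so that it points from the front toward the camera.
    if dot(normal, p0) > 0.0:
        normal = tuple(-component for component in normal)
    return normal


# Hypothetical drawer front parallel to the image plane, 1.5 m from the camera.
translation_axis = plane_normal((0.0, 0.0, 1.5), (0.4, 0.0, 1.5), (0.0, 0.2, 1.5))
print(translation_axis)
```

For this made-up front the resulting axis points along the negative z direction of the camera frame, that is, the direction in which the drawer would be pulled out toward the camera.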
Table 8 shows the results of the detection of objects and their parameters on sequences from the test set. For translational objects, the system estimated the parameters of 96.88% of the fronts, 96.88% of the axes, and 97.50% of the handles. The total number of handles of translational objects is greater than the total number of fronts and axes of these objects because some objects have more than one handle. For rotational objects, the system estimated the parameters of 94.34% of the fronts, 86.79% of the axes (46 of 53), and 94.34% of the handles.
 Figure 9 shows the results of the system when the camera did not change its position between object states. In 
Figure 10, we show the results of the system when the camera changed position with respect to the observed objects. The planes of fronts and handles are represented as a point cloud, the colors of which depend on the object class. Rotation fronts take on shades of blue and purple, translation fronts take on shades of green, and handles are red. The translation and rotation axes in the environment model are visualized as straight lines, colored green and light blue, respectively.
   4. Discussion
Estimating the parameters of articulated objects without observing their various states is a challenging task, even for humans who have experience in understanding how such objects behave. When faced with a new and unfamiliar object, humans can make mistakes in predicting its nature, such as mistaking a cabinet for a drawer or incorrectly determining the rotation axis of a cabinet. Therefore, this task is even more difficult for computers.
However, the results presented in this paper demonstrate that it is indeed possible to create a system capable of building an environment model in the form of a point cloud using information about articulated objects without the need to observe them in different states. This achievement is made possible by leveraging the latest advancements in convolutional neural networks, combined with the development of a large dataset required for training the network and employing classical image processing methods. Our experience shows that around 2000 samples for rotational and translational fronts, 5000 samples of handles, and 1700 samples of rotational axes are sufficient to obtain a neural network that generalizes well to previously unseen examples. By utilizing neural networks, computers can emulate the behavior of humans who estimate the parameters of articulated objects by analyzing the shape, dimensions of the fronts, and handle positions and comparing them with past observations.
The results obtained on test sets illustrate that the system can successfully detect an object, identify the handles for manipulating the object, determine the object’s class, and estimate either the rotation axis or translation based on the assigned class. Remarkably, all of this can be achieved using a single RGB-D pair of images. Consequently, the model estimation does not necessitate interacting with the object or observing it in various states. This approach minimizes the risk of damaging an object with unknown parameters and enables faster determination of the object’s parameters.
The article introduces four modules. Two of them are independent: one is responsible for handle detection and the other for front detection and segmentation. The other two modules depend on the outputs of the preceding modules. The first dependent module detects rotation axes and uses the results of handle and front detection. The second dependent module constructs the environment model by enriching the point cloud with the outputs of the other three modules.
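The dependencies between the four modules can be written down as a small graph from which a valid execution order follows. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later); the module identifiers are illustrative names chosen for this example and are not the names used in the implementation.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules whose outputs it consumes.
DEPENDENCIES = {
    "handle_detection": set(),
    "front_detection_segmentation": set(),
    "rotation_axis_detection": {
        "handle_detection",
        "front_detection_segmentation",
    },
    "environment_model": {
        "handle_detection",
        "front_detection_segmentation",
        "rotation_axis_detection",
    },
}

# static_order() yields every module after all of its prerequisites.
execution_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(execution_order)
```

The two independent modules have no prerequisites and can run in either order (or in parallel), while the environment model is always built last.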
  5. Conclusions
The developed system can be applied to assistant robots working in indoor domestic environments. With its help, it is possible, for example, to build a map of the environment, enriched with information about articulated objects. The information stored in the map can be used by the robot to safely interact with articulated objects.
The output of the system can also serve as a prior for another module that estimates object parameters from observations, so that such a module converges faster. For example, a robot that knows beforehand that the object it wants to interact with is, with a certain probability, a rotational object will start with movements that confirm this hypothesis. The model then converges faster, so the task of opening the object is also completed faster, and the robot does not waste time checking other hypotheses. It also decreases the probability of damaging such objects.
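The effect of such a prior on convergence can be illustrated with a two-hypothesis Bayesian update. The sketch below is not the estimator of any particular interactive-perception method: the likelihood values, the 0.99 confidence target, and the function name `steps_to_confirm` are hypothetical and chosen only to show that a more informative prior reduces the number of confirming observations needed.

```python
def steps_to_confirm(prior, hit_likelihood=0.7, miss_likelihood=0.3,
                     target=0.99, max_steps=100):
    """Number of confirming observations until belief in 'rotational' reaches target.

    prior            -- initial probability that the object is rotational
    hit_likelihood   -- P(observation | object is rotational), assumed value
    miss_likelihood  -- P(observation | object is translational), assumed value
    """
    belief = prior
    for step in range(1, max_steps + 1):
        # Bayes' rule for two mutually exclusive hypotheses.
        numerator = belief * hit_likelihood
        belief = numerator / (numerator + (1.0 - belief) * miss_likelihood)
        if belief >= target:
            return step
    return None


uninformed_steps = steps_to_confirm(prior=0.5)  # no information from vision
informed_steps = steps_to_confirm(prior=0.9)    # vision suggests a rotational object
print(uninformed_steps, informed_steps)
```

Under these assumed likelihoods, the informed prior halves the number of exploratory movements needed to reach the same confidence.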
The developed system has some limitations. It is unable to detect articulated objects that do not have visible handles. This limitation is challenging to overcome because of ambiguity: without characteristic features such as handles, parameters such as the axis of rotation cannot be estimated from a single image, even by a human. The current system also ignores rotational objects whose axis of rotation is horizontal, because of problems that occur when creating the dataset. However, this problem can be eliminated by assigning such objects a certain probability of belonging to one of the two classes when annotating the training data. The system also has a problem with estimating the rotation axis when the handle does not lie on the side opposite the rotation axis. This limitation could also be addressed by assigning a probability to each candidate axis of rotation in such cases.
Future work includes utilizing the neural network to estimate the state of the articulated object during robot–object interaction in a scenario that is not affected by the aforementioned limitations. We also plan to utilize sequences of images with a recurrent neural network that first provides an initial guess of the parameters of articulated objects and then, during the interaction, adjusts the parameters to account for the response of the object.
   
  
    Author Contributions
Conceptualization, D.B. and A.M.; methodology, D.B. and A.M.; software, A.M.; validation, A.M., P.G. and K.M.; formal analysis, A.M. and D.B.; investigation, A.M.; writing—original draft preparation, K.M. and D.B.; writing—review and editing, A.M., P.G., K.M. and D.B.; visualization, A.M. and K.M.; supervision, D.B.; project administration, D.B.; funding acquisition, D.B. All authors have read and agreed to the published version of the manuscript.
Funding
The work was supported by the National Science Centre, Poland, under research project no. UMO-2019/35/D/ST6/03959.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Hausman, K.; Niekum, S.; Osentoski, S.; Sukhatme, G.S. Active articulation model estimation through interactive perception. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 3305–3312.
- Ma, L.; Meng, J.; Liu, S.; Chen, W.; Xu, J.; Chen, R. Sim2Real2: Actively Building Explicit Physics Model for Precise Articulated Object Manipulation. arXiv 2023, arXiv:2302.10693.
- Staszak, R.; Molska, M.; Mlodzikowski, K.; Ataman, J.; Belter, D. Kinematic Structures Estimation on the RGB-D Images. In Proceedings of the 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Vienna, Austria, 8–11 September 2020; pp. 675–681.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016. ECCV 2016. Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37.
- Martín-Martín, R.; Eppner, C.; Brock, O. The RBO dataset of articulated objects and interactions. Int. J. Robot. Res. 2019, 38, 1013–1019.
- Desingh, K.; Lu, S.; Opipari, A.; Jenkins, O.C. Factored Pose Estimation of Articulated Objects using Efficient Nonparametric Belief Propagation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7221–7227.
- Młodzikowski, K.; Belter, D. CNN-based Joint State Estimation During Robotic Interaction with Articulated Objects. In Proceedings of the 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 11–13 December 2022; pp. 78–83.
- Hsu, C.-C.; Jiang, Z.; Zhu, Y. Ditto in the House: Building Articulation Models of Indoor Scenes through Interactive Perception. arXiv 2023, arXiv:2302.01295.
- Jiang, Z.; Hsu, C.-C.; Zhu, Y. Ditto: Building digital twins of articulated objects from interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5616–5626.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
- Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
- Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10096–10106.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations ICLR, Vienna, Austria, 3–7 May 2021.
- Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. In Proceedings of the 10th International Conference on Learning Representations, ICLR, Online, 25–29 April 2022.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10347–10357.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
- Sultana, F.; Sufian, A.; Dutta, P. Evolution of Image Segmentation using Deep Convolutional Neural Network: A Survey. Knowl.-Based Syst. 2020, 201–202, 106062.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation. arXiv 2019, arXiv:1903.11816.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 24–30 June 2016; pp. 2818–2826.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Restarts. arXiv 2016, arXiv:1608.03983.
- Lin, T.-Y.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
  
    
  
  
Figure 1. Architecture of the proposed method for the detection and estimation of articulated objects that is applied to build a 3D model of the scene enhanced with information about articulated objects.
  
 
  
    
  
  
Figure 2. Example annotations of the handles (a) and fronts (b) of articulated objects from the training dataset.
  
 
  
    
  
  
Figure 3. Ambiguous articulation: objects before and after interaction for objects with vertical axes (a) and vertical and horizontal axes (b).
  
 
  
    
  
  
Figure 4. Example results of front handle detection (a) and surface segmentation (b). The object on which the estimation was performed is shown on the left, the ground truth in the center, and the prediction on the right.
  
 
  
    
  
  
Figure 5. Qualitative comparison between results obtained from EfficientDet D0 (a) and CenterNet with a ResNet50 encoder (b).
  
 
  
    
  
  
Figure 6. Dependence of mAP on the detection threshold and IoU.
  
 
  
    
  
  
Figure 7. Example segmentation results for different detection thresholds: 0.5 (a) and 0.95 (b).
  
 
  
    
  
  
Figure 8. Example joint detections obtained from the rotational axis estimation module.
  
 
  
    
  
  
Figure 9. Example results of the scene model building system for a static camera.
  
 
  
    
  
  
Figure 10. Example results of the scene model building system for a moving camera.
  
 
  
    
  
  
Table 1. Properties of the dataset used to train and verify neural networks for handle detection.

| Dataset | Number of Images | Number of Handles |
|---|---|---|
| Train | 1183 | 5145 |
| Validation | 280 | 1275 |
| Test | 100 | 581 |
| Sum | 1563 | 7001 |
      
 
  
    
  
  
Table 2. Properties of the dataset used to train and verify neural networks for front detection and segmentation.

| Dataset | Number of Images | Number of Rotational Fronts | Number of Translational Fronts |
|---|---|---|---|
| Train | 768 | 951 | 1560 |
| Validation | 181 | 243 | 388 |
| Test | 100 | 356 | 194 |
| Sum | 1049 | 1550 | 2142 |
      
 
  
    
  
  
Table 3. Properties of the dataset used to train and verify neural networks for axis detection and segmentation.

| Dataset | Number of Images | Number of Axes |
|---|---|---|
| Train | 603 | 1771 |
| Validation | 143 | 437 |
| Test | 100 | 313 |
| Sum | 846 | 2521 |
      
 
  
    
  
  
Table 4. Comparison between methods for handle detection. The best results are bolded.

|  | mAP@IoU = 0.50 | mAP@IoU = 0.75 | Average Inference Time [s] |
|---|---|---|---|
| SSD Inception V2 | 0.419 | 0.101 | **0.042** |
| EfficientDet D0 | 0.727 | 0.212 | 0.114 |
| CenterNet ResNet50 V1 | 0.928 | **0.473** | 0.253 |
| CenterNet ResNet101 V1 | **0.932** | 0.461 | 0.405 |
      
 
  
    
  
  
Table 5. Impact of front detection threshold on the mAP metric for different IoU values.

|  | Detection Threshold 0.5 | Detection Threshold 0.75 | Detection Threshold 0.95 |
|---|---|---|---|
| mAP@IoU = 0.50 | 0.86 | 0.83 | 0.77 |
| mAP@IoU = 0.75 | 0.78 | 0.76 | 0.71 |
| mAP@IoU = 0.90 | 0.39 | 0.38 | 0.38 |
      
 
  
    
  
  
Table 6. Detection of the rotational axis before and after postprocessing the results from the CNN. The best results are bolded.

|  | Correct Detections of Rotational Axis | False Positive Detections of Rotational Axis |
|---|---|---|
| before postprocessing | **249** | 136 |
| after postprocessing | 191 | **7** |
      
 
  
    
  
  
Table 7. Average parameter estimation time of articulated objects.

|  | Front Parameters Estimation | Handle Parameters Estimation | Rotation Axis Estimation | Translation Axis Estimation |
|---|---|---|---|---|
| Avg. processing time [s] | 0.0374 | 0.0167 | 0.0386 | 0.0354 |
      
 
  
    
  
  
Table 8. Number of parameters estimated for articulated objects in 3D.

|  | Translational Objects | Rotational Objects |
|---|---|---|
| Detected fronts/Total number of fronts | 31/32 | 50/53 |
| Detected axes/Total number of axes | 31/32 | 46/53 |
| Detected handles/Total number of handles | 39/40 | 50/53 |
      
 
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
      
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).