1. Introduction
Mobile robots, especially social or assistive robots, coexist with people in the environments where they are deployed. Such robots need to carry out some basic tasks. First, they need to know their position in the environment. Second, they have to move from one point to another autonomously, avoiding obstacles and without harming people or damaging objects. Finally, they interact with people and even work with them on specific tasks. The first two skills have been extensively studied in the literature, and quite robust solutions exist today. The third is a more complex skill, and many studies currently focus on it.
Generating autonomous behaviour in a robot faces several challenges. The robot not only has to be able to “survive” in any environment where it is deployed, but human–robot interaction also has to be as similar as possible to human–human interaction. Interaction does not refer only to communication; it also has to do with navigation and obstacle avoidance. All these aspects rely on one basic skill, people tracking, i.e., knowing where the people around the robot are at all times.
Tracking people is not only useful to improve the navigation skills of mobile robots but also to make their behaviour socially acceptable. Many solutions in the literature attempt to solve this problem, typically using both vision and range sensors, as shown in [
1]. Some researchers focus on Convolutional Neural Networks (CNNs) since they generalize better than traditional methods based on geometric features. It should be pointed out that most proposals use Red Green Blue Depth (RGB-D) cameras to detect people in the environment. For instance, the authors in [
2] propose a solution based on an RGB-D camera, combining RGB and depth data as input for a segmentation CNN. Other researchers combine data from several sensors. For instance, in [
3,
4], authors propose a method to train a CNN with data provided from both Laser Imaging Detection and Ranging (LIDAR) sensors and cameras. In [
5], the authors propose to combine 2D and 3D LIDAR data to train a Support Vector Machine (SVM) to detect pedestrians for autopilot systems.
However, the above approaches are computationally demanding when running onboard a robot. Different solutions have been proposed in the literature to track people using 2D LIDAR sensors, even though these sensors are traditionally used for autonomous navigation. Methods for robot navigation in crowded indoor environments have been reviewed in [
6]. The usage of the geometric features of the human legs, as well as the rate and phase of gait, was proposed in [
7] to enhance people tracking systems. Furthermore, in [
8], clustering and centre-point estimation, combined with walking centre-line estimation, speed, and step-length separation, are used to detect people with or without a walker. However, these approaches are not robust enough when dealing with occlusions or changes in gait speed. To address such issues, we presented People Tracking (PeTra) in [
9], a tool that locates people within the robot's surroundings using the information provided by a LIDAR sensor and a CNN.
PeTra is used in this work to validate our proposal. The first release was presented in [
10]. The system builds an occupancy map from the readings of a LIDAR sensor located 20 cm above the floor. The readings are processed by a CNN, which returns a second occupancy map segmenting the readings belonging to people close to the robot. From this second occupancy map, a centre-of-mass calculation provides the people's location estimates. A new PeTra release was presented in [
11], including not only a method to correlate location estimates for tracking people over time, but also an optimized CNN model that allows the system to work in real time.
PeTra performs well in the scenarios where it has been evaluated. However, its performance is worse in some specific locations, such as empty rooms without furniture, corridors, or scenarios with more than two people. In such cases, PeTra sometimes detects people where there are none, for instance, close to the walls. Thus, it may be convenient to refit PeTra's CNN to achieve better performance in such environments.
To obtain a supervised learning model with high accuracy, it is necessary to provide a large volume of data in the training phase. Generally, CNNs have been fitted using manually labelled data, but the labelling process is very time-consuming. Some researchers label data automatically using external hardware. External hardware may report positive results, but it also introduces some constraints related to the number of required devices or their measurement error. Despite these issues, a CNN fitted with such labelled data usually provides acceptable results. A popular alternative is using bootstrapping techniques, which are similar to self-training.
The use of CNNs has increased in recent years because of growing computing capabilities. They are applied in many fields, from research to industry. Depending on the task to be solved, supervised or unsupervised learning techniques may be used. We focus on supervised learning since it is the most popular approach. The training process requires a dataset gathering input data together with the expected outputs.
A fully functional network model requires a large volume of data to train it. The main issue has to do with gathering such a volume of data, because a laborious labelling process is needed. Labelling is usually done manually, supervising every piece of input data and labelling its corresponding output. This process is very time-consuming. Some researchers propose using external hardware to label output data automatically. These proposals have some constraints: on the one hand, it is very common that some “noise” is introduced because of the device's mean measurement error; on the other hand, hardware availability may be limited (for instance, if a mobile transceiver is required for every person in the scene), so it could be impossible to label every piece of data correctly. Bootstrapping techniques can help to deal with these issues.
Bootstrapping was first proposed in [
12], applied to word-sense disambiguation using unlabelled samples and a few labelled samples. An initial classifier was built using the labelled samples. Then, the unlabelled samples were classified, extracting new patterns that were used to build an enhanced classifier.
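The procedure generalizes beyond text: a small labelled set seeds a first model, the model's confident predictions on unlabelled data become new training samples, and the model is refitted. Below is a minimal sketch of this loop using a generic scikit-learn classifier; the confidence threshold, number of rounds, and all names are illustrative assumptions, not taken from [12].

```python
# Minimal self-training/bootstrapping loop sketch (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_fit(X_lab, y_lab, X_unlab, rounds=5, threshold=0.95):
    model = LogisticRegression().fit(X_lab, y_lab)   # initial classifier from the few labelled samples
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)         # classify the unlabelled samples
        sure = proba.max(axis=1) >= threshold        # keep only confident predictions
        if not sure.any():
            break                                    # no new patterns extracted this round
        y_new = model.classes_[proba[sure].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[sure]])    # grow the training set ...
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = X_unlab[~sure]
        model = LogisticRegression().fit(X_lab, y_lab)  # ... and build an enhanced classifier
    return model
```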
Bootstrapping has been applied in different research areas, for instance, in document analysis and recognition. The authors in [
13] used bootstrapping to resolve segmentation problems in the processing of music scores, achieving a 99.2% classification accuracy. The authors in [
14] proposed a scene text detection technique using bootstrapping and text border semantics for accurate localization of text in scenes, achieving an 80.1% f-score on the MSRA-TD500 dataset.
In healthcare, the authors in [
15] point out issues related to the segmentation of craniofacial cartilage images. Labelling such images is very challenging since only experts can differentiate cartilages. The authors proposed using self-training to fit a CNN and thereby achieve high segmentation accuracy. The authors in [
16] present a new prediction approach for imbalanced DNA-protein binding data; they use a bootstrap strategy to under-sample the negative data and balance the number of binding and non-binding samples. Results demonstrate that the method achieves high prediction performance and outperforms state-of-the-art sequence-based DNA-protein binding predictors.
In agricultural engineering, the authors in [
17] apply bootstrapping methods to refit a CNN that segments plant sections. The proposed CNN was initially fitted with only 30 manually labelled images.
In industry, the study presented in [
18] uses bootstrapping to predict the remaining useful life of a rolling bearing. Performance increased significantly when tested on different datasets and compared to MSCNN-based, BLSTM-based, and MLP-based models.
In the robotics field, bootstrapping techniques have been applied in several studies. The authors in [
19] propose a novel pipeline for object detection using bootstrapping that speeds up training 60-fold. They assess the effectiveness of the approach on a standard Computer Vision dataset (PASCAL VOC 2007 [
20]) and demonstrate its applicability to a real robotic scenario with the iCubWorld Transformations [
21] dataset. The authors of [
22] use bootstrapping to teach the Haru robot empathic behavioural responses from its interactions with people. The results show that this technique is an efficient tool to speed up the robot's learning compared to the online learning method initially used.
In this work, we propose using bootstrapping to refit PeTra. It is important to point out that, to fit PeTra's CNN, it is necessary to provide pairs of images. The first image shows an occupancy map built from all LIDAR readings (raw data), see
Figure 1c. The second image shows a second occupancy map of the same location but built from the LIDAR readings only belonging to people’s legs (labelled data), see
Figure 1d. To obtain labelled data, a beacon-based Real Time Location System (RTLS) was used in the first version of PeTra [
9]. The people in the scene carried a mobile transceiver that gathered the beacons' signals to estimate their location. The main drawback of using an RTLS to label data is the measurement error of the device: KIO has a ±30 cm average error, which is likely related to PeTra's identification problems in specific locations. This paper poses a new data labelling method for refitting PeTra's CNN by bootstrapping: the CNN has been refitted with data labelled by PeTra itself. As a result, the CNN achieves higher accuracy.
The remainder of the paper is organized as follows:
Section 2 describes the materials and evaluation methods used to carry out the research; results are presented and discussed in
Section 3 and
Section 4, respectively; finally, conclusions and future work are proposed in
Section 5.
2. Materials and Methods
A set of experiments was carried out to evaluate the accuracy of PeTra's refitted CNN. In this section, the main elements of the experiments are described in depth, as well as the methodology used to evaluate the accuracy of the refitted CNN.
2.1. Leon@Home Testbed
The experiments were conducted in the mock-up apartment known as Leon@Home Testbed [
23], shown in
Figure 2a, a certified testbed [
24] of the European Robotics League (ERL), located in the Robotics Group's lab at the University of León. Its main purpose is benchmarking service robots in a realistic environment. The apartment is a single-bedroom mock-up home built in an 8 m × 7 m space. Walls 60 cm in height divide it into a kitchen, living room, bedroom, and bathroom.
2.2. Orbi-One Robot
Orbi-One, shown in
Figure 2b, is a service robot manufactured by Robotnik [
25]. It accommodates several sensors, such as an RGB-D camera in the head and a Hokuyo LIDAR sensor in its mobile base. It also has a six-degrees-of-freedom arm attached to its torso. Inside, an Intel Core i7 CPU with 8 GB of RAM runs the Robot Operating System (ROS) framework [
26] in charge of managing the robot hardware.
2.3. KIO RTLS
KIO, a commercial RTLS manufactured by Eliko [
27] is shown in
Figure 2c. It was used to label the data with which PeTra's CNN was fitted for the first time. This RTLS calculates the location of a mobile transceiver, usually called a
tag, in a two- or three-dimensional space. KIO uses Radio Frequency Identification (RFID) beacons, usually known as
anchors, placed at known locations in the mock-up apartment of Leon@home Testbed. The anchors were positioned according to the method described in [
28]. To fit PeTra's CNN, people in the scene carried a mobile transceiver that gathered the beacons' signals to estimate their location. Then, the second occupancy map was built from the first one by cropping out the LIDAR readings away from people's legs, obtaining pairs of images similar to the ones shown in
Figure 1c,d. However, the location estimates provided by KIO have an average error of ±30 cm according to the manufacturer's specifications. The evaluation carried out in [
28] shows that the measurement error is higher in some areas and lower in others; on average, however, the manufacturer's claim holds.
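KIO's internal positioning algorithm is proprietary, but the underlying principle of any such RTLS is multilateration from anchor ranges. The sketch below illustrates that principle with a least-squares solver; the anchor coordinates and range values are invented example numbers, not the actual testbed deployment.

```python
# Generic 2D multilateration sketch (illustrative; not KIO's algorithm).
import numpy as np
from scipy.optimize import least_squares

anchors = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 7.0], [0.0, 7.0]])  # known anchor positions (m)
ranges = np.array([4.1, 5.3, 5.0, 3.9])                               # measured tag-anchor distances (m)

def residuals(p):
    # difference between candidate-position distances and measured ranges
    return np.linalg.norm(anchors - p, axis=1) - ranges

tag_xy = least_squares(residuals, x0=anchors.mean(axis=0)).x
print(tag_xy)  # estimated tag location; noisy ranges yield errors like the ~±30 cm discussed above
```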
2.4. PeTra
Figure 1 illustrates PeTra's operation, from the moment the robot gathers data from its LIDAR sensor until PeTra locates the people around the robot. PeTra first builds a 2D occupancy map from all the LIDAR readings, see
Figure 1c. This first occupancy map is processed by a CNN that returns a new occupancy map including only the readings matching people's leg-like patterns, see
Figure 1d. The CNN used by PeTra is based on the U-net architecture [
29]. This architecture was originally designed for biomedical image segmentation [
30]. By postprocessing the output of PeTra's CNN, a centre-of-mass calculation provides people's location estimates. Correlating these estimates with a Kalman filter allows tracking each person in the scene over time. A video showing PeTra's operation is available online [
31].
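The following sketch outlines the pipeline just described: building an occupancy map from a LIDAR scan and extracting location estimates as centres of mass of the segmented output. The grid size, resolution, and all function names are our assumptions, not PeTra's actual implementation (available as source code, see [35]).

```python
# Simplified occupancy-map construction and people localization sketch.
import numpy as np
from scipy import ndimage

GRID = 256   # occupancy map of GRID x GRID cells (assumed)
RES = 0.02   # metres per cell (assumed)

def occupancy_map(ranges, angle_min, angle_inc):
    """Build a 2D occupancy map from one LIDAR scan, robot at the centre."""
    ranges = np.asarray(ranges)
    angles = angle_min + angle_inc * np.arange(len(ranges))
    ok = np.isfinite(ranges)                       # drop out-of-range readings
    xs, ys = ranges[ok] * np.cos(angles[ok]), ranges[ok] * np.sin(angles[ok])
    rows = (ys / RES + GRID // 2).astype(int)
    cols = (xs / RES + GRID // 2).astype(int)
    grid = np.zeros((GRID, GRID), dtype=np.float32)
    inside = (rows >= 0) & (rows < GRID) & (cols >= 0) & (cols < GRID)
    grid[rows[inside], cols[inside]] = 1.0
    return grid

def locate_people(segmented):
    """Centre of mass of each connected blob in the CNN's output map."""
    blobs, n = ndimage.label(segmented > 0.5)
    return ndimage.center_of_mass(segmented, blobs, range(1, n + 1))
```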
PeTra has shown good performance in the scenarios where it has been evaluated. However, its performance is worse in some specific locations, such as corridors. In these cases, PeTra sometimes detects people where there are none. Thus, it may be necessary to refit PeTra's CNN to achieve better performance in such environments.
2.5. Data Gathering
The fitting of PeTra’s CNN was carried out by using a public dataset known as RRID:SCR_015743 [
32]. Data are available at the Robotics Group of the University of León’s website [
33]. The data were gathered in 14 different scenes [
10]. In all of them, the Orbi-One robot stood still while one or more people moved around it. Three different environments of the mock-up apartment of Leon@home Testbed were considered: the kitchen, the bedroom, and the living room.
The RRID:SCR_015743 dataset is composed of two releases: the first (v1) was published in November 2017 and the second (v2) in February 2018. Both releases consist of Rosbag files, a ROS feature that allows capturing the information gathered by the robot and recording it for further processing. The data for fitting PeTra's CNN were gathered in the mock-up apartment of Leon@home Testbed.
The first release of the RRID:SCR_015743 dataset contains 81 Rosbag files. It was used to fit PeTra's CNN for the first time. In this release, the data contained in the Rosbag files were labelled using KIO.
The second release of the RRID:SCR_015743 dataset contains 42 Rosbag files. It was used to refit PeTra's CNN. In this release, the data contained in the Rosbag files were labelled using PeTra's CNN fitted with the first release of the dataset.
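For illustration, the recorded scans can be read back from a Rosbag file with the ROS1 rosbag Python API, as in the sketch below. The file name and the '/scan' topic are assumptions; both depend on the actual recording setup.

```python
# Sketch of extracting LIDAR scans from a Rosbag file (ROS1 API).
import rosbag

scans = []
with rosbag.Bag('dataset_v2_scene_01.bag') as bag:   # hypothetical file name
    for topic, msg, t in bag.read_messages(topics=['/scan']):
        # each msg is a sensor_msgs/LaserScan: ranges plus angle metadata
        scans.append((t.to_sec(), msg.angle_min, msg.angle_increment, list(msg.ranges)))
```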
Moreover, a new dataset was created to evaluate PeTra's performance using both the CNN fitted with the first release of the RRID:SCR_015743 dataset (labelled by KIO) and the CNN refitted with the second release (labelled by PeTra).
The data for this dataset were gathered in the corridor of Leon@home Testbed and are available online (DOI: 10.5281/zenodo.4541258) [
34]. This dataset contains 25 Rosbag files, numbered 1–25, recorded in different locations with the Orbi-One robot standing still. Two types of Rosbag files were recorded. In 17 of them, numbered 1–17, people stood still in the scene, placed at known locations to obtain ground-truth data. The locations where people were placed for each Rosbag file are shown in
Figure 3b. An example of a real scene at gathering-data time is shown in
Figure 3a. The remaining 8 Rosbag files, numbered 18–25, were recorded without people in the scene to evaluate the True Negative rate.
2.6. Data Labelling by Bootstrapping
Supervised learning techniques require labelled datasets for fitting models. Data can be labelled either manually, which is time-consuming, or automatically, which reduces time but may introduce measurement errors. So far, the KIO system has been used to label the training data automatically; however, this method has several drawbacks, described below. This paper proposes a new data labelling method that uses PeTra itself to label the data later used to refit its CNN.
The main drawback we found when labelling with KIO has to do with the number of available tags. We have just two KIO tags; thus, we can record Rosbag files with at most two people in the scene. In contrast, PeTra can locate all the people in the scene.
As mentioned in
Section 2.5, PeTra was used to locate people in the scenes of the second release of the RRID:SCR_015743 dataset. PeTra's location estimates are used to label the occupancy maps. However, we did not use the entire dataset to refit the CNN; we selected the scenes where PeTra showed the best performance and discarded some Rosbag files for each scenario (see [
10] for details): 2, 11, and 14 for the kitchen; 2, 7, 11, and 13 for the bedroom; and 2, 5, 9, 11, and 14 for the living room. The remaining Rosbag files were used to refit the CNN.
As mentioned above, raw occupancy maps are built from the points detected by the LIDAR sensor, see
Figure 1c. PeTra’s estimates contain three points, as shown in
Figure 1e: two for the locations of the person's legs and a third one for the person's centre. To label the occupancy maps, we “draw” a 15 cm circle around each leg estimate. We assume that the LIDAR readings inside those circles belong to people. These readings are used to build the second occupancy map, see
Figure 1d.
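A minimal sketch of this labelling step follows, assuming the LIDAR readings and PeTra's leg estimates are 2D point arrays in the same coordinate frame; the names are ours, not PeTra's actual code (see [35]).

```python
# Circle-cropping labelling sketch: keep readings near leg estimates.
import numpy as np

LEG_RADIUS = 0.15  # metres, the 15 cm circle mentioned above

def label_readings(points, legs):
    """points: (N, 2) LIDAR readings; legs: (M, 2) leg centre estimates.
    Returns the readings assumed to belong to people."""
    keep = np.zeros(len(points), dtype=bool)
    for leg in legs:
        keep |= np.linalg.norm(points - leg, axis=1) <= LEG_RADIUS
    return points[keep]   # basis for the second (labelled) occupancy map
```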
2.7. Refitting Process
The CNN is the main component of PeTra, allowing it to locate people in the scene. The proposed experiment consists of fitting the CNN model used by the tool twice. The first fitting was done with the data labelled using the KIO RTLS devices, producing the first PeTra version, available and ready to use. This version has shown good performance, but it is worse in some specific locations, such as empty rooms without furniture or corridors. In such cases, PeTra sometimes detects people where there are none, usually close to the walls.
Once the first version is available, PeTra is used to label the data of a different dataset (v2), as described in
Section 2.5. The resulting dataset is used to perform the second fitting, resulting in a new neural network model for PeTra.
2.8. Convolutional Neural Network Fitting
PeTra's CNN was fitted on Caléndula, the High Performance Computing (HPC) cluster at Supercomputación Castilla y León (SCAYLE), which provides HPC services to research centres and companies in Castilla y León, Spain. Caléndula has 345 servers (over 7000 cores), 18.8 TB of memory, and an overall peak computation performance of 397 TFlops.
Specifically, the fitting was carried out on a server with two Xeon E5-2695 v4 processors (36 cores), 384 GB of RAM, two 200 GB hard drives, Infiniband FDR at 56 Gb/s, and 8 Nvidia V100 GPUs.
PeTra's CNN was developed using the Keras API for Python with TensorFlow as backend. To fit the CNN, pairs of images are needed, each composed of a raw occupancy map and its corresponding labelled occupancy map. In this case, 80% of the dataset was used to fit the CNN and the remaining 20% to test it. The CNN was trained for 30 epochs with a batch_size (number of images processed per iteration) of 128. The fitting process reports a precision score (the accuracy) and the loss, a scalar value that the fitting tries to minimize: the lower the loss, the higher the True Positive rate.
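The sketch below reproduces this fitting configuration in Keras (80/20 split, 30 epochs, batch_size 128). The toy U-net-style model and the file names are placeholders of our own, not PeTra's real architecture or data (see [29,35]).

```python
# Keras fitting sketch under the configuration described above.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

def build_toy_unet(shape):
    """A heavily reduced U-net-style model; PeTra's real network has more levels."""
    inp = keras.Input(shape=shape)
    c1 = layers.Conv2D(16, 3, activation='relu', padding='same')(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, activation='relu', padding='same')(p1)
    u1 = layers.UpSampling2D()(c2)
    m1 = layers.concatenate([u1, c1])                     # U-net skip connection
    out = layers.Conv2D(1, 1, activation='sigmoid')(m1)   # per-cell 'person' probability
    return keras.Model(inp, out)

X = np.load('raw_maps.npy')        # raw occupancy maps, e.g. (N, 256, 256, 1); hypothetical file
Y = np.load('labelled_maps.npy')   # corresponding labelled occupancy maps

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
model = build_toy_unet(X.shape[1:])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])                       # reports the accuracy and the loss
model.fit(X_train, Y_train, epochs=30, batch_size=128,
          validation_data=(X_test, Y_test))
```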
2.9. Evaluation
The evaluation was carried out in two ways. First, the accuracy score and loss value of both models were compared. The accuracy measures how close the model's predictions are to the true data. The loss value is the sum of the errors made for each example in the training or validation sets. The models were trained with 80% of the images in the dataset and tested with the remaining 20%.
Then, PeTra's performance was compared using both the CNN fitted with data labelled by the KIO device and the CNN refitted by bootstrapping. A new dataset was gathered specifically for this purpose, as described in
Section 2.5. Ground-truth data to evaluate the performance of both CNNs were obtained by locating people in the scenes in known positions, see
Figure 3b.
To evaluate the precision, each Rosbag file is played twice: first using PeTra with the CNN fitted with data labelled by the KIO device, and then using PeTra with the CNN refitted by bootstrapping. For each run, we need to know whether or not all the people in the scene were recognized properly. Specifically, we need to know how many people are in the scene, the number of people detected by PeTra, the number of people correctly detected, the number of people not detected, and the number of people wrongly detected. Such data allow us to build the confusion matrix used to visualize the performance of the algorithm.
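A small sketch of how such per-run results can be accumulated into confusion-matrix counts is shown below; the tuple layout and names are our assumptions, not PeTra's actual evaluation scripts.

```python
# Tallying confusion-matrix counts over the evaluation runs (sketch).
def tally(runs):
    """runs: iterable of (people_in_scene, correctly_detected, wrongly_detected).
    Returns (TP, FP, FN, TN) accumulated over all runs."""
    TP = FP = FN = TN = 0
    for present, correct, wrong in runs:
        TP += correct                  # people correctly detected
        FP += wrong                    # detections with nobody there
        FN += present - correct        # people missed
        if present == 0 and wrong == 0:
            TN += 1                    # empty scene, and no one was detected
    return TP, FP, FN, TN
```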
Moreover, to evaluate the overall performance of both CNNs, the following Key Performance Indicators (KPIs), obtained from the confusion matrix, are considered: Sensitivity, Specificity, Precision, Accuracy, and Matthews Correlation Coefficient (MCC). The Sensitivity score, see Equation (1), shows the rate of positive cases that were correctly identified by the algorithm. The Specificity, see Equation (2), measures the proportion of negative cases that were correctly identified by the algorithm. The Precision score, see Equation (3), shows the fraction of relevant instances among the retrieved instances. The Accuracy, see Equation (4), measures the proportion of correct predictions, both positive and negative cases, among the total number of cases examined. Finally, the MCC, see Equation (5), is used as a measure of quality, taking both true and false positive and negative cases into account.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}$$
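These KPIs follow directly from the confusion-matrix counts, as the helper below shows; it is a direct transcription of Equations (1)–(5).

```python
# KPI computation from confusion-matrix counts (Equations (1)-(5)).
import math

def kpis(TP, FP, FN, TN):
    sensitivity = TP / (TP + FN)                    # Equation (1)
    specificity = TN / (TN + FP)                    # Equation (2)
    precision = TP / (TP + FP)                      # Equation (3)
    accuracy = (TP + TN) / (TP + TN + FP + FN)      # Equation (4)
    mcc = (TP * TN - FP * FN) / math.sqrt(          # Equation (5)
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return sensitivity, specificity, precision, accuracy, mcc
```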
4. Discussion
The accuracy score of the CNN trained by applying the bootstrapping technique increased by 8.32% compared with the CNN trained with data labelled by KIO. In addition, the loss value also increased by 7.93% compared with the training done with KIO-labelled data.
The confusion matrices presented in
Section 3 provide an overall view of the differences between the two CNNs. As mentioned in
Section 2.9, the TP values match the number of people correctly detected by both CNNs, specifically, 21 and 21, respectively. The FP values match the number of people wrongly detected (the model detects someone although there is nobody at that location) by both CNNs, specifically, 16 and 3, respectively. The FN values match the number of people not detected by both CNNs, specifically, 21 and 21, respectively. This field has not improved because people are far away from the robot or the LIDAR readings are partially occluded. On the other hand, the TN value matches the number of cases where PeTra does not detect any people and there are actually no people in the scene. Such cases correspond to the scenes recorded in Rosbag files 18–25. According to
Figure 4, there are 4 cases where the CNN fitted with KIO does not return people's location estimates in scenes where there are no people, specifically, the scenes recorded in Rosbag files 18, 21, 22, and 23. In the other four cases, the CNN fitted with KIO returns people's location estimates where there are none. For the CNN refitted by bootstrapping,
Figure 5 shows that in all eight cases the system does not return people's location estimates for scenes where there are no people.
According to
Table 3, both CNNs have the same Precision value. However, the rest of the metrics are higher for the CNN refitted by bootstrapping. The Sensitivity value, an important measure for imbalanced data, is considerably lower for the CNN fitted with data labelled using the KIO device than for the CNN refitted by bootstrapping; this represents an increase of 54.16%. The Specificity value for the CNN refitted by bootstrapping is 72.43% higher than for the CNN fitted with data labelled using the KIO device. The Accuracy value is also lower for the CNN fitted with data labelled using the KIO device than for the CNN refitted by bootstrapping, representing an increase of 35.71%.
Finally, the MCC is the most important KPI to consider, since it combines the rate of True Positives correctly detected with a low rate of False Positives. The MCC value is lower for the CNN fitted with data labelled by the KIO device than for the CNN refitted by bootstrapping. The MCC ranges in the interval [−1, +1], with the extreme values reached in the case of fatal and perfect classification, respectively. This value is 65.97% higher for the CNN refitted by bootstrapping.
5. Conclusions
This paper presents a comparative study of fitting a CNN with data labelled by bootstrapping versus data labelled with an external ground-truth system. Specifically, this work compares the performance of the CNN used by the PeTra tool to track people close to a mobile service robot. The CNN's performance was evaluated with a new dataset gathered for this purpose. The CNN was fitted twice: first using training data labelled with the KIO RTLS device, and then using the PeTra tool itself to label the training data.
We analysed the KPIs of both models. The results show that labelling with PeTra improves four of the five KPIs described. The Sensitivity of the CNN trained by bootstrapping increases by 54.16% compared with the CNN trained with KIO-labelled data. The Specificity is 72.43% higher for the CNN trained by bootstrapping than for the CNN trained with KIO-labelled data. Furthermore, the Accuracy of the CNN trained by bootstrapping is 35.71% higher, and the MCC is 65.97% higher. Thus, this work allows us to assert that bootstrapping increases the accuracy of a CNN-based people tracking tool. The improvement is significant in the cases where there are no people in the scene (Rosbag files 18–25); in such cases, the CNN trained by bootstrapping does not detect any people where there are none. Therefore, using bootstrapping to label data is a good alternative to obtain a more accurate CNN. Moreover, this method is especially interesting in environments where an RTLS or a ground-truth system is not available.
Moreover, it is important to point out that PeTra allows labelling data for all the people in the scene, whereas the KIO device, like most commercial RTLSs, has a limited number of mobile transceivers. In addition, PeTra's CNN refitted by bootstrapping performs especially well in scenarios without people, avoiding false positives and possible incorrect robot behaviour. Similarly, it also improves significantly in scenarios with a larger number of people. Furthermore, our solution provides a labelling method suitable for scenarios where a ground-truth system is not available.
The source code of PeTra is available online under an open-source license [
35]. A docker image with all required software to test PeTra and to double-check the overall evaluation posed in this paper is also available online under an open-source license [
36].