Robust Person Identification and Following in a Mobile Robot Based on Deep Learning and Optical Tracking
Abstract
1. Introduction
- Teleoperated robots: These robots are capable of performing specific actions under the remote control of a human operator. They are mainly used in hazardous or high-precision environments [5,6]. Some advances have improved the teleoperation function by implementing feedback from the robot, such as haptic feedback [7] or VR (Virtual Reality) immersion, allowing the operator to sense the environment as if they were in the robot’s position;
- Autonomous robots: These robots are much more complex machines, distinguished by producing responses by themselves, independently of any remote operator. This autonomy is crucial in scenarios where factors such as the time elapsed while performing an action or the cost of maintaining a control link to the robot carry considerable weight in the design [8]. State-of-the-art techniques for these robots try to emulate human behavior, so that the robot’s actions can be performed autonomously with a certain degree of intelligence.
- The proposed system integrates three neural networks for person identification. These networks perform inferences on the images captured by the RGB-D sensor attached to the system as the robot’s sensing source. The inferences detect the different persons in the scene and distinguish them by means of a discriminant feature: their faces. All detection and identification tasks are based on neural networks, achieving high robustness and reliability;
- The system includes a person tracker based on optical flow. This tracker estimates the trajectory followed by each person detected by the robot between consecutive neural network updates, since predicting a person’s displacement takes considerably less time than running a new inference. As a result, the robustness of the entire system is improved compared to a version governed exclusively by neural inferences, which are sensitive to visual occlusions. The tracker also smooths the robot’s movements, ensuring more robustness in its observable behavior;
- The final system is based on a commercial off-the-shelf (COTS) low-consumption System-on-Module (SoM) mounted on a battery-powered mobile base. This embedded solution features a high-performance Graphics Processing Unit (GPU). The assembly can operate independently, without requiring an external computer to perform deep learning inferences or run algorithms in parallel.
2. Related Work
2.1. Visual Person Detection
- Reshape: the first task addressed by the network is to reshape the input image(s) to a fixed size on which the rest of the layers operate. In the case of an SSD detector, this shape is n × 300 × 300 × 3 (n being the batch size, as n images can be evaluated simultaneously by the neural network). This image size offers a good trade-off between performance and computational load;
- Base network: this first group of layers is reused from a typical image classification model, such as VGG-16 [21]. The first layers of this architecture are utilized in this design, truncated before the first classification layer. This way, the network can leverage the feature maps from the classification network to find objects inside the input image. Following the first part of the network, several convolutional layers are appended, decreasing in size, to predict detections at multiple scales. The base network can be different from VGG-16, such as a MobileNet [22], highly optimized for running on low-end devices limited in computing power;
- Box predictors: for each layer in the base network, an image convolution is performed, generating a small set (typically three or four) of fixed-size anchors, with varying aspect ratios for each cell on a grid over the activation map. These maps have different sizes, so the system can detect objects in different scales. The anchors are then convolved with small filters (one per depth channel), yielding confidence scores for each known class and offsets for the generated bounding box. These scores are passed through a softmax operation that compresses them into a probability vector;
- Postprocessor: as several detections might be triggered in the same area for different classes and scales, a non-maximum-suppression [23] operation is performed at the output of the network to retain only the best boxes under the combined criteria of detection score and Intersection over Union (IoU), which measures the overlap between two bounding boxes. A minimal sketch of this operation is given below.
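To make the postprocessing step concrete, the following is a minimal greedy NMS sketch in NumPy; the [x1, y1, x2, y2] box format and the threshold values are illustrative assumptions, not the exact settings of the SSD implementation used here.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.3):
    """Greedy NMS: keep the highest-scoring box, drop near-duplicate overlaps."""
    keep_mask = scores >= score_threshold
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(scores)[::-1]            # best detections first
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]  # discard boxes covering the same object
    return boxes[kept], scores[kept]
```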
- The coordinates of the object within the anchor (predicted by the YOLO detector for each anchor box);
- Objectness score, computed by logistic regression so as to maximize the probability of overlap with a ground-truth bounding box relative to any other prior anchor;
- 80 class scores, as the original implementation is trained on the COCO dataset, which contains 80 classes. These classes might overlap (e.g., “woman” and “person”); thus, the scores are computed by independent logistic classifiers and are not passed through a softmax operation. A sketch of how such a prediction vector can be decoded is given below.
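As a hedged illustration of the per-anchor output listed above, the sketch below decodes a single raw YOLOv3 prediction vector (4 box terms + objectness + 80 class scores). The cell/anchor bookkeeping of the full network is omitted and the function signature is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolo_prediction(raw, anchor_w, anchor_h, cell_x, cell_y, grid_size):
    """Decode one raw YOLOv3 prediction vector of length 4 + 1 + 80."""
    tx, ty, tw, th = raw[0:4]
    # Box centre is an offset inside its grid cell; size rescales the prior anchor.
    bx = (cell_x + sigmoid(tx)) / grid_size
    by = (cell_y + sigmoid(ty)) / grid_size
    bw = anchor_w * np.exp(tw)
    bh = anchor_h * np.exp(th)
    # Objectness comes from a logistic unit, not a softmax.
    objectness = sigmoid(raw[4])
    # The 80 COCO class scores are independent logistic classifiers,
    # so overlapping labels (e.g., "woman" and "person") can both score high.
    class_scores = sigmoid(raw[5:])
    return (bx, by, bw, bh), objectness, objectness * class_scores
```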
2.2. Person Identification
2.3. Person Following
2.4. Embedded Deployment
2.5. Related Systems
3. System Design
3.1. Perception Module
3.1.1. Camera
3.1.2. Neural Pipeline
- Person detection: several deep learning object detector models with different base network architectures and depths have been evaluated. Only architectures that achieve good performance on a mobile (low-power) device with a sufficiently fast inference time are considered. The SSD model [9], which employs a MobileNet for feature extraction, and the tiny YOLOv3 [26] model have been evaluated for this purpose. These models have already been trained and are openly accessible on GitHub (https://github.com/mystic123/tensorflow-yolo-v3, accessed on 25 October 2023) and on the TensorFlow Model Zoo [40];
- Face detection: this task can also be addressed with a detection neural network. Since the models mentioned above are not trained to detect faces, a single-class detection system is adopted instead: the two-stage neural network developed in faced [10]. Its class-specific YOLOv2-based detector provides fast and effective face detection. A video sequence comparing the accuracy of this system against a classical Haar cascade approach [15] can be seen at https://github.com/iitzco/faced (accessed on 25 October 2023);
- Face identification: once a person’s face has been detected, it can be used as a discriminating characteristic to establish their identity. A publicly available TensorFlow implementation (https://github.com/davidsandberg/facenet, accessed on 25 October 2023) of the FaceNet neural network [11] has been used for this task. This deep identification method converts the image of a face into a projection (or embedding) in a 128-dimensional vector space. The transformation is learned through a triplet-loss training procedure that projects similar faces as closely as possible while separating different faces as much as possible. As a channel-wise normalization step is carried out before running the picture through the network, two photographs of the same face yield very similar projections despite varying illumination conditions. A minimal sketch of how these embeddings can be compared is given after this list.
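The following minimal sketch illustrates how such FaceNet embeddings can be compared for identification; the gallery structure and the distance threshold are illustrative assumptions, not the exact values used in the deployed system.

```python
import numpy as np

def identify_face(embedding, gallery, threshold=1.0):
    """Match a 128-D face embedding against known identities.

    gallery: dict mapping a person's name to their reference embedding
             (candidate and references assumed L2-normalized by the network).
    Returns the closest identity, or None if no one is close enough.
    """
    best_name, best_dist = None, float("inf")
    for name, ref in gallery.items():
        dist = np.linalg.norm(embedding - ref)   # Euclidean distance in embedding space
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```

For instance, with a hypothetical gallery = {"alice": emb_a, "bob": emb_b}, identify_face returns the closest name only when its distance falls below the threshold; otherwise, the face is reported as unknown.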
- MSS (Minimum Segment Size): a subgraph (segment) is replaced by a TensorRT engine only if its number of nodes exceeds this threshold. Increasing this value makes the optimizer more selective, so that only the most heavily loaded portions of the network are optimized. Low values can result in an abnormally high overhead, leading to worse performance than if the original graph had been used;
- MCE (Maximum Cached Engines): TensorRT preserves a runtime cache of its engines to speed up loading them into the GPU. As there is relatively little memory available to create the cache, this parameter modifies the number of engines cached;
- Precision mode: The weights and parameters of the trained neural networks are typically handled as 64-bit floating point values. When the precision is decreased to 32-bit or 16-bit, the operations become substantially lighter while achieving comparable results. A more extreme method lowers the precision to 8-bit integers, which requires an additional quantization step because this range can only represent 256 distinct values; the quantization step analyzes each segment and computes the numeric range of its weights. A minimal sketch of how these parameters map onto the conversion call is given after this list.
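The sketch below shows how the three parameters above can be passed to a TF-TRT conversion, assuming the TensorFlow 1.x frozen-graph workflow (TrtGraphConverter); the graph file name, output node names, and parameter values are hypothetical placeholders, not the exact configuration of the deployed system.

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Load a previously frozen detection graph (path is a placeholder).
frozen_graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    frozen_graph_def.ParseFromString(f.read())

converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph_def,
    # Output nodes are kept in TensorFlow (names depend on the exported model).
    nodes_blacklist=["detection_boxes", "detection_scores", "detection_classes"],
    precision_mode="FP16",            # reduced precision, comparable accuracy
    minimum_segment_size=3,           # MSS: smallest subgraph worth replacing
    maximum_cached_engines=3,         # MCE: number of TensorRT engines kept in cache
)
trt_graph_def = converter.convert()   # returns the optimized GraphDef
```

Note that TrtGraphConverter belongs to the TF 1.x workflow; TF 2.x exposes an equivalent TrtGraphConverterV2 interface instead.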
3.2. Actuation Module
3.2.1. Optical Motion Tracker
3.2.2. PID Controllers
- Angular zone: The reference person has to be placed at the horizontal center of the image, with a margin of ±50 pixels on the sides;
- Linear zone: The reference person has to be kept at a distance of 1 m from the front of the robot, with a distance margin of ±30 cm (a sketch of a PID step honoring these dead zones follows this list).
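The sketch below illustrates how such dead zones can be combined with the PID controllers. The gains are taken from the tuning table reported later, while the reset-on-dead-zone behavior is an assumption for illustration, not necessarily the exact policy of the implemented controllers.

```python
class DeadZonePID:
    """Simple PID controller that stays silent inside a dead zone."""

    def __init__(self, kp, ki, kd, dead_zone):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dead_zone = dead_zone
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        if abs(error) <= self.dead_zone:
            # Inside the tolerance band: no correction, reset controller memory.
            self.integral = 0.0
            self.prev_error = error
            return 0.0
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt   # dt assumed > 0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Angular dead zone of 50 px, linear dead zone of 0.3 m (gains from the tuning table).
angular_pid = DeadZonePID(kp=0.005, ki=0.006, kd=0.0003, dead_zone=50)
linear_pid = DeadZonePID(kp=0.4, ki=0.05, kd=0.04, dead_zone=0.3)
```

In practice, the two controller outputs would be published as the linear and angular speed commands sent to the robot base.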
- ex: The linear error is computed using the depth image to estimate the distance from the robot to the person. Since the camera sensor registers the depth image onto the RGB one, the person’s coordinates can be reused in the depth image to find the distance of each pixel inside the bounding box of the reference person: the person depth map. As the box is likely to contain a significant region of background (especially if the person opens their arms, since the neural detection encompasses the entire body), the edges of the depth map are trimmed. Then, a 10 × 10 grid is sampled to obtain 100 uniformly distributed measurements of the person’s depth. To ensure that the background does not affect the range measurement, the median value is taken: even if some outlier points belonged to the background, they would have to make up more than 50% of the sampled set to pull the measurement away from the true range;
- ew: The angular error can be computed by taking into account that, if the robot and the person are aligned, the person’s bounding box will lie horizontally near the center of the image. Therefore, an error metric can be extracted by computing the difference between the horizontal coordinate of the image center and that of the center of the reference person’s bounding box (both error computations are sketched below).
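Both error computations can be summarized with the following sketch, assuming a depth image registered to the RGB frame (values in meters), a bounding box given as (x1, y1, x2, y2), and an illustrative trimming ratio.

```python
import numpy as np

def linear_error(depth_image, box, target_distance=1.0, trim=0.2):
    """ex: median depth over a 10x10 grid inside a trimmed person box, minus the setpoint."""
    x1, y1, x2, y2 = box
    # Trim the box edges so that background regions weigh less.
    dx, dy = int((x2 - x1) * trim / 2), int((y2 - y1) * trim / 2)
    roi = depth_image[y1 + dy:y2 - dy, x1 + dx:x2 - dx]
    # Sample a uniform 10x10 grid of depth values inside the trimmed box.
    rows = np.linspace(0, roi.shape[0] - 1, 10).astype(int)
    cols = np.linspace(0, roi.shape[1] - 1, 10).astype(int)
    samples = roi[np.ix_(rows, cols)]
    person_distance = float(np.median(samples))  # median is robust to background outliers
    return person_distance - target_distance

def angular_error(image_width, box):
    """ew: horizontal offset of the box centre from the image centre (pixels)."""
    x1, _, x2, _ = box
    box_center_x = (x1 + x2) / 2.0
    return image_width / 2.0 - box_center_x
```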
3.3. Software Architecture
- Main: The purpose of this thread is to continuously draw the output image, compute the errors and suitable responses, and send them to the robot. Notably, this thread does not process every frame in the sequence, as its rate depends on the drawing time and the computation time of the response. It works asynchronously, fetching the latest frame from the tracker thread;
- Network controller: this thread handles the three neural networks of the pipeline, running sequential inferences on them. These neural networks are deployed on the GPU of the Jetson board; therefore, this thread can be seen as the one that interacts with the GPU to pass, retrieve, and transform the tensors from the networks;
- Tracker: This thread must inherently iterate faster than the neural infrastructure. However, including it in the main thread would hurt its performance, as its speed would be limited by the image drawing and by the publication of responses on the speed topics; it is therefore extracted into a dedicated thread. The simplicity of the Lucas–Kanade tracker makes it fast to execute, but it would be pointless to track a person several times before a new image arrives from the camera. To avoid this, the thread is rate-limited to 30 Hz, equal to the frame rate of the camera sensor. As this is the fastest thread, and the tracker must have access to every image from the camera, it is the first component to receive the images from the source, in a 30 Hz synchronous manner (the rest of the components can fetch the images asynchronously from the tracker whenever they need them). A minimal sketch of this tracker loop is given after this list;
- ROSCam: This component, responsible for fetching the images from the source (a ROSBag or the Xtion camera), is not explicitly deployed as a thread. However, since it works through subscribers when synchronous mode is required, the ROS API for Python (rospy) automatically deploys these subscribers on independent threads.
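A minimal sketch of the tracker loop described above, using OpenCV’s pyramidal Lucas–Kanade implementation with a 30 Hz rate limit; the frame/box accessor callbacks (get_latest_frame, get_latest_box, publish_box) and the point-seeding and box-update policies are simplified assumptions, not the exact implementation.

```python
import time
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def tracker_loop(get_latest_frame, get_latest_box, publish_box, rate_hz=30.0):
    """Track feature points inside the person box between neural detections."""
    period = 1.0 / rate_hz
    prev_gray, points = None, None
    while True:
        start = time.time()
        frame = get_latest_frame()                      # BGR image from the camera
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        box = get_latest_box()                          # last box from the neural pipeline
        if prev_gray is None or points is None or len(points) == 0:
            # (Re)seed Shi-Tomasi corners inside the person box.
            x1, y1, x2, y2 = box
            mask = np.zeros_like(gray)
            mask[y1:y2, x1:x2] = 255
            points = cv2.goodFeaturesToTrack(gray, maxCorners=50, qualityLevel=0.01,
                                             minDistance=5, mask=mask)
        else:
            new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                             points, None, **LK_PARAMS)
            ok = status.flatten() == 1
            good = new_points[ok]
            if len(good) > 0:
                # Shift the box by the median displacement of the tracked points.
                shift = np.median(good - points[ok], axis=0).flatten()
                x1, y1, x2, y2 = box
                publish_box((int(x1 + shift[0]), int(y1 + shift[1]),
                             int(x2 + shift[0]), int(y2 + shift[1])))
            points = good.reshape(-1, 1, 2)
        prev_gray = gray
        time.sleep(max(0.0, period - (time.time() - start)))  # 30 Hz rate limit
```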
4. Results and Discussion
4.1. Person Detection Experiment
4.2. Face Detection Experiment
4.3. Face Recognition Experiment
4.4. TensorRT Optimization Experiment
4.5. Motion Tracker Experiment
4.6. Complete System Experiment
4.7. Discussion of Results
5. Conclusions
- Implement multimodal tracking using sensor fusion. The depth data of the person also provides positional information, and bringing this information into the tracker can potentially lead to better performance;
- Implement a probabilistic tracker, such as an EKF (Extended Kalman Filter), that relies on the person’s trajectory. This approach may avoid confusion between two persons when they cross each other, or help the system follow a person’s trajectory even if the person is temporarily lost. In addition, it can mitigate problems inherent to optical flow, such as spurious motion when a person moves only part of their body;
- Add a navigation component to the robot. If the robot is equipped with a laser scanner, it can detect possible obstacles between the robot and the person. Thus, a simple planning algorithm, such as VFF (Virtual Force Field), can be combined with this system to avoid collisions while the robot moves.
- Incorporate more complex identification networks into the system for better adaptation to changes in a person’s appearance/behavior.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Deng, L.; Hinton, G.; Kingsbury, B. New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8599–8603. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
- Martínez-Olmos, P. Deep Learning Course: Convolutional Neural Networks; University Lecture; Springer: Berlin, Germany, 2020. [Google Scholar]
- Potel, J. Trial by Fire: Teleoperated Robot Targets Chernobyl. In Proceedings of the IEEE Computer Graphics and Applications; IEEE: Piscataway, NJ, USA, 1998; Volume 18, pp. 10–14. [Google Scholar] [CrossRef]
- Berkelman, P.; Ma, J. The University of Hawaii Teleoperated Robotic Surgery System. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 29 October–2 November 2007; pp. 2565–2566. [Google Scholar] [CrossRef]
- Okamura, A.M. Methods for haptic feedback in teleoperated robot-assisted surgery. Ind. Robot. Int. J. 2004, 31, 499–508. [Google Scholar] [CrossRef]
- Girimonte, D.; Izzo, D. Artificial Intelligence for Space Applications. In Proceedings of the Intelligent Computing Everywhere; Schuster, A.J., Ed.; Springer: London, UK, 2007; pp. 235–253. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Lecture Notes in Computer Science; Springer: Berlin, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
- Itzcovich, I. Faced: CPU Real Time Face Detection Using Deep Learning. Available online: https://towardsdatascience.com/faced-cpu-real-time-face-detection-using-deep-learning-1488681c1602 (accessed on 25 October 2023).
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
- Lucas, B.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 81. [Google Scholar]
- Shi, J. Good Features to Track. In Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600. [Google Scholar]
- Gockley, R.; Forlizzi, J.; Simmons, R. Natural Person-Following Behavior for Social Robots. In Proceedings of the HRI ‘07: ACM/IEEE International Conference on Human-Robot Interaction, Arlington, VA, USA, 10–12 March 2007; pp. 17–24. [Google Scholar] [CrossRef]
- Viola, P.; Jones, M. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. 1–511. [Google Scholar] [CrossRef]
- Molina-Moreno, I.M.; González-Díaz, I.; Díaz-de-María, F. Efficient Scale-Adaptive License Plate Detection System. IEEE Trans. Intell. Transp. Syst. 2019, 20, 2109–2121. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2013, arXiv:1311.2524. [Google Scholar]
- Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In ECCV 2014: Computer Vision; Springer: Cham, Switzerland, 2014; Volume 8691. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. arXiv 2017, arXiv:1705.02950. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
- Li, P.; Wu, H.; Chen, Q. Color distinctiveness feature for person identification without face information. Procedia Comput. Sci. 2015, 60, 1809–1816. [Google Scholar] [CrossRef]
- Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
- Johnston, B.; Chazal, P. A review of image-based automatic facial landmark identification techniques. EURASIP J. Image Video Process. 2018, 2018, 86. [Google Scholar] [CrossRef]
- Gottumukkal, R.; Asari, V. An improved face recognition technique based on modular PCA approach. Pattern Recognit. Lett. 2004, 25, 429–436. [Google Scholar] [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar]
- Weinberger, K.Q.; Blitzer, J.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207–244. [Google Scholar]
- Islam, M.J.; Hong, J.; Sattar, J. Person-following by autonomous robots: A categorical overview. Int. J. Robot. Res. 2019, 38, 1581–1618. [Google Scholar] [CrossRef]
- Islam, M.J.; Fulton, M.; Sattar, J. Towards a generic diver-following algorithm: Balancing robustness and efficiency in deep visual detection. arXiv 2018, arXiv:1809.06849. [Google Scholar] [CrossRef]
- Eirale, A.; Martini, M.; Chiaberge, M. Human-Centered Navigation and Person-Following with Omnidirectional Robot for Indoor Assistance and Monitoring. Robotics 2022, 11, 108. [Google Scholar] [CrossRef]
- Ghimire, A.; Zhang, X.; Javed, S.; Dias, J.; Werghi, N. Robot Person Following in Uniform Crowd Environment. arXiv 2022, arXiv:2205.10553. [Google Scholar]
- Condés, I.; Cañas, J.M. Person following Robot Behaviour using Deep Learning. In Proceedings of the 19th International Workshop of Physical Agents (WAF 2018), Madrid, Spain, 22–23 November 2018; pp. 147–161. [Google Scholar]
- Condés, I.; Cañas, J.M.; Perdices, E. Embedded Deep Learning Solution for Person Identification and following with a Robot. In WAF 2020: Advances in Physical Agents II; Springer: Cham, Switzerland, 2020; Volume 1285. [Google Scholar] [CrossRef]
- TensorFlow. TensorFlow Object Detection: Model Zoo. Available online: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md (accessed on 25 October 2023).
- González-Díaz, I.; Díaz-de-María, F. Adaptive multipattern fast block-matching algorithm based on motion classification techniques. IEEE Trans. Circuits Syst. Video Technol. 2008, 18, 1369–1382. [Google Scholar] [CrossRef]
- Åström, K.J.; Murray, R.M. Feedback Systems: An Introduction for Scientists and Engineers; Princeton University Press: Princeton, NJ, USA, 2004. [Google Scholar]
- Wada, K. labelme: Image Polygonal Annotation with Python. 2016. Available online: https://github.com/wkentaro/labelme (accessed on 25 October 2023).
- Google AI Blog. Accelerated Training and Inference with the TensorFlow Object Detection API. Available online: https://blog.research.google/2018/07/accelerated-training-and-inference-with.html (accessed on 25 October 2023).
| Gain | Linear | Angular |
|---|---|---|
| kp | 0.4 | 0.005 |
| ki | 0.05 | 0.006 |
| kd | 0.04 | 0.0003 |
| Metric | YOLO | SSD |
|---|---|---|
| IoU | 0.858 ± 0.068 | 0.926 ± 0.044 |
| Inference time (ms) | 35.003 ± 1.503 | 172.237 ± 8.791 |
| Frames with detection | 123 (17.06%) | 533 (73.93%) |
| Metric | Haar | Faced |
|---|---|---|
| IoU | 0.579 ± 0.202 | 0.559 ± 0.221 |
| Frames with detection | 248 (34.4%) | 266 (36.89%) |
| Parameter | YOLOv3 | SSD |
|---|---|---|
| Precision | FP16 | FP16 |
| MCE | 50 | 3 |
| MSS | 3 | 3 |
| Inference time (ms) | 39.768 | 15.922 |