**Citation:** Carro-Lagoa, Á.; Barral, V.; González-López, M.; Escudero, C.J.; Castedo, L. Alternatives for Locating People Using Cameras and Embedded AI Accelerators: A Practical Approach. *Eng. Proc.* **2021**, *7*, 53. https://doi.org/10.3390/engproc2021007053

Academic Editors: Joaquim de Moura, Marco A. González, Javier Pereira and Manuel G. Penedo

Published: 27 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**2. Person Location Method**

Pedestrian location is performed by processing the input images with a person detector that locates the persons present in the image. The output of the detector is then processed to determine the real-world position of each person. This last step also uses the camera calibration information and the vertical vanishing point.

#### *2.1. Person Detection*

The most common alternatives for person detection using convolutional neural networks (CNNs) are object detectors, which provide bounding boxes of the persons, and pose estimators, which provide the positions of the key body joints of each person. Cosma et al. [2] compared these two methods, obtaining better results with pose detection networks, which are more resistant to occlusions. Moreover, correct processing of the detected pose allows the position of the person's feet to be estimated more accurately, even when they do not appear in the image.

The PoseNet neural network [3] is used in this work. This model uses a bottom-up approach where all the keypoints of every person are first predicted using a CNN, providing a heatmap for every body part. Then, these keypoints are grouped into individuals using a custom greedy algorithm. This last step can fail if the image contains several persons close to each other, mixing the keypoints of two or more persons.

There are several pretrained PoseNet networks available with different CNN backbones and input resolutions. We selected the ResNet50 backbone with a 416×288 resolution as it provides a good balance between inference speed and reliability.
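The heatmap stage of the bottom-up approach can be illustrated with a minimal sketch: each body-part channel is reduced to its highest-scoring pixel, and low-confidence parts are discarded. This is a single-person simplification for illustration only; PoseNet's actual decoder handles multiple peaks per channel and the greedy grouping described above, and the function name and threshold here are assumptions.

```python
import numpy as np

def extract_keypoints(heatmaps, threshold=0.3):
    """Pick the highest-scoring location in each body-part heatmap.

    heatmaps: array of shape (H, W, K), one channel per body part.
    Returns, per part, a (row, col, score) tuple, or None if the best
    score falls below the reliability threshold.
    """
    _, W, K = heatmaps.shape
    keypoints = []
    for k in range(K):
        flat_idx = np.argmax(heatmaps[:, :, k])   # best pixel in channel k
        r, c = divmod(flat_idx, W)
        score = heatmaps[r, c, k]
        keypoints.append((r, c, float(score)) if score >= threshold else None)
    return keypoints
```

Thresholding at this stage is what later allows the post-processing step to work only with reliable keypoints.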

#### *2.2. Post-Processing of Person Keypoints and Projection to 2D Map*

The keypoints of each person are used to predict the position of the feet, even if they are not detected or they are occluded. Each keypoint has an associated score, allowing keypoints with low reliability to be discarded. With the reliable keypoints, our post-processing algorithm predicts the feet and head positions of each person, taking into account the proportions of the human body. These positions are estimated using the least squares method. The vertical vanishing point of the image is also taken into account to correct the inclination of people in the image, which depends on the camera perspective.
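The least-squares step can be sketched as follows: each reliable joint is modeled as lying at a known fraction of the body height along a straight body axis, and the feet and head points are recovered by fitting that axis. The fractional proportions below are illustrative anthropometric values, not the ones used in the paper, and the vanishing-point inclination correction is omitted from this sketch for brevity.

```python
import numpy as np

# Illustrative fractional heights of joints above the feet
# (0 = feet, 1 = top of head); the paper's exact proportions are not given.
BODY_FRACTION = {"ankle": 0.04, "knee": 0.28, "hip": 0.53,
                 "shoulder": 0.82, "head": 0.97}

def estimate_feet_head(keypoints):
    """Least-squares fit of the body axis from reliable keypoints.

    keypoints: dict mapping joint name -> (x, y) image position.
    Each joint is modeled as p = feet + f * axis, where f is that
    joint's fractional height. Returns (feet, head) image points.
    """
    names = [n for n in keypoints if n in BODY_FRACTION]
    f = np.array([BODY_FRACTION[n] for n in names])
    P = np.array([keypoints[n] for n in names], dtype=float)  # (N, 2)
    A = np.stack([np.ones_like(f), f], axis=1)                # (N, 2) design matrix
    coef, *_ = np.linalg.lstsq(A, P, rcond=None)              # intercept and slope per coord
    feet = coef[0]                  # the axis evaluated at f = 0
    head = coef[0] + coef[1]        # the axis evaluated at f = 1
    return feet, head
```

Constraining the fitted axis direction to pass through the vertical vanishing point, as the paper describes, would replace the free slope `coef[1]` with a direction derived from the vanishing point.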

Cosma et al. [2] used a similar method with the following differences: they only performed these calculations when the feet positions were not detected, and they attempted to determine the inclination of people without taking into account the vanishing point.

Once the feet position in the image is known, the information from the camera is used to determine the map position of each person. The correspondence between each pixel on the image and the 2D floor map coordinates can be calculated with a homography transformation. The homography matrix can be obtained from the position of, at least, four pixels and the map coordinates of each of them. This matrix can also be calculated from the camera projection matrix.
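The homography estimation from at least four pixel-to-map correspondences can be sketched with the standard direct linear transform (DLT); this is a generic, unnormalized version for illustration, not the paper's implementation (a production system would typically use a normalized DLT or a library routine such as OpenCV's `findHomography`).

```python
import numpy as np

def homography_from_points(img_pts, map_pts):
    """Estimate the 3x3 homography H mapping image pixels to floor-map
    coordinates from >= 4 correspondences via the DLT (no normalization)."""
    A = []
    for (x, y), (u, v) in zip(img_pts, map_pts):
        # Each correspondence contributes two rows of the linear system A h = 0.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)        # null-space vector = homography up to scale
    return H / H[2, 2]

def pixel_to_map(H, x, y):
    """Project an image pixel (e.g. the estimated feet position) to the map."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]             # dehomogenize
```

Once `H` is known for a camera, every estimated feet pixel can be projected to 2D floor-map coordinates with a single matrix product.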

In certain situations, when a person is very close to the camera and only the head is detected, the estimation of the feet position is very poor. This problem can be corrected by assuming that the person has an average height and using the known position of the camera, thus, providing a better estimation of the person's position.
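The average-height correction amounts to a ray-plane intersection: the ray through the detected head pixel is intersected with a horizontal plane at the assumed person height, and the intersection is dropped to the floor. The sketch below assumes the ray direction in world coordinates has already been obtained from the camera calibration; the function name and the 1.7 m default are illustrative assumptions.

```python
import numpy as np

def position_from_head_ray(cam_center, ray_dir, person_height=1.7):
    """Estimate a person's ground position when only the head is visible.

    cam_center: 3D camera position (z axis up, floor at z = 0).
    ray_dir: world-space direction of the ray through the head pixel
             (back-projected via the camera calibration; assumed given).
    person_height: assumed average height in metres (illustrative value).
    """
    cam_center = np.asarray(cam_center, dtype=float)
    ray_dir = np.asarray(ray_dir, dtype=float)
    # Solve cam_center.z + t * ray_dir.z == person_height for t.
    t = (person_height - cam_center[2]) / ray_dir[2]
    head = cam_center + t * ray_dir
    return np.array([head[0], head[1], 0.0])  # feet position on the floor
```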

#### **3. Experimental Results**

The CamLoc [2] and ICG Lab6 [4] datasets were processed with our person positioning system. Unlike other datasets that only provide the bounding boxes of each person, these datasets annotate the ground-truth position of each person on the map and provide the camera calibration information. This enables us to directly obtain the mean error of the estimated positions.

The CamLoc dataset contains only one person in several scenarios with varying levels of occlusion. Table 1 shows the obtained results with the CamLoc dataset. The mean error of the positions and the percentage of missing predictions are compared, showing that our system obtained better results with all the cameras.


**Table 1.** The results with the CamLoc dataset compared with the original results.

The ICG Lab6 dataset [4] consists of one room that is simultaneously recorded by four cameras. There are six scenarios where several persons perform different activities in the room. Table 2 shows the obtained results jointly with the results in [4]. In addition to the mean error, the detected true positives (TP), false positives (FP), and false negatives (FN, i.e., the missing detections) are shown.


**Table 2.** The results with the ICG Lab6 dataset; the mean error column only considers the error of the TP detections.

The ICG Lab6 method uses a tracking algorithm designed for this kind of scenario, with several cameras covering a common area, and obtains good results. Our results were obtained by merging nearby positions detected by the different cameras and then applying a simple tracking algorithm to filter out some FPs and FNs, considering only the positions of the detections and not the appearance of each person. Moreover, the results are also affected by the pose estimator's difficulty in distinguishing between people who are very close to each other.
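The cross-camera merging step can be sketched as a greedy distance-based clustering of the per-camera map positions; the exact merging rule and radius used in our system are not specified here, so the implementation and the 0.5 m default below are illustrative assumptions.

```python
import numpy as np

def merge_camera_positions(positions, radius=0.5):
    """Greedily merge map positions from multiple cameras that fall within
    `radius` metres of each other, averaging each cluster.

    positions: list of (x, y) floor-map coordinates from all cameras.
    Returns one averaged position per cluster.
    """
    pts = [np.asarray(p, dtype=float) for p in positions]
    merged = []
    used = [False] * len(pts)
    for i, p in enumerate(pts):
        if used[i]:
            continue
        cluster = [p]
        used[i] = True
        for j in range(i + 1, len(pts)):
            if not used[j] and np.linalg.norm(pts[j] - p) <= radius:
                cluster.append(pts[j])
                used[j] = True
        merged.append(np.mean(cluster, axis=0))
    return merged
```

Because this rule uses positions only, two people standing very close together may be merged into one detection, which is consistent with the appearance-related limitation noted above.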

#### **4. Conclusions**

We described our person location method based on computer vision techniques and presented the corresponding experimental results. The results showed the high accuracy that this kind of positioning system can provide. However, in complex scenarios, an adequate tracking algorithm that takes into account the appearance of each person is needed to obtain reliable results.

**Funding:** This work has been funded by the Navantia-UDC Joint Research Unit under Grant IN853B-2018/02, the Xunta de Galicia (by grant ED431C 2020/15, and grant ED431G 2019/01 to support the Centro de Investigación de Galicia "CITIC"), the Agencia Estatal de Investigación of Spain (by grants RED2018-102668-T and PID2019-104958RB-C42) and ERDF funds of the EU (FEDER Galicia 2014–2020 & AEI/FEDER Programs, UE).

**Conflicts of Interest:** The authors declare no conflict of interest.

