#### *3.1. System Architecture*

As the base for the fall detection approach, we used the assistance robot LOLA, designed entirely by our research team to monitor and help elderly people and anyone with a functional disability who lives alone. The main idea behind the LOLA robot was an assistive robot that could also serve as a rollator to support walking or as a table to transport objects, thanks to its shape: 80 cm high, 58 cm wide and 70 cm deep (Figure 2).

**Figure 2.** LOLA assistive robot.

The system is equipped with an Arduino Mega board, various sensors, a single RGB camera and a Raspberry Pi 3 B+. A connection to a server is also needed to handle the heavy workload, namely image processing and the fall detection algorithm. This connection can be Wi-Fi to a remote server or Ethernet to a laptop carried on the robot. The camera is mounted 76 cm above the floor and captures images of 640 × 480 pixels (Figure 3).

**Figure 3.** System-architecture overview.

#### *3.2. Deep Learning-Based Person Detection*

CNNs are among the most popular machine-learning algorithms at present, and they have repeatedly been shown to outperform other algorithms in accuracy and speed for object detection [28].

Algorithms for object detection using CNNs can be broadly categorized into two-stage and single-stage methods. Two-stage algorithms, based on classification, first generate many proposals or regions of interest from the image (body) and then classify those regions using the CNN (head). In other words, the network does not check the complete image; it only checks the parts of the image with a high probability of containing an object. Region-CNN (R-CNN), proposed by Ross Girshick in 2014 [29], was the first of this series of algorithms, which was later modified and improved, for example, as Fast R-CNN [30], Faster R-CNN [31], R-FCN [32], Mask R-CNN [33] and Light-Head R-CNN [34]. In contrast, single-stage algorithms, based on regression, do not use regions to localize the object within the image; they predict bounding boxes and class probabilities over the whole image at once. The best-known examples of this type are the Single Shot Detector (SSD), proposed by Liu et al. [35], and 'you only look once' (YOLO), proposed by Joseph Redmon et al. in 2016 [36]. YOLO has since been updated as YOLOv2/YOLO9000 [37] and YOLOv3 [38]. In this paper, we decided to apply the real-time object detection system YOLOv3 for person detection, as it has proven to be an excellent competitor to other algorithms in terms of speed and accuracy.

The YOLO network takes an image and divides it into an S × S grid. Each grid cell predicts B bounding boxes *bi*, *i* = 1, ... , *B*, and provides a confidence score *Confbi* for each of them, which reflects how likely it is that the box contains an object. Bounding boxes with this score above a threshold value are selected and used to locate the object, a person in our case. The bounding box position is the output of this stage for our algorithm.
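This selection step can be sketched as follows. The detection dictionaries and field names below are illustrative, not the actual YOLOv3 output format; only the thresholding logic reflects the text:

```python
# Hypothetical post-processing of YOLOv3 detections: keep only "person"
# boxes whose confidence score exceeds a threshold (field names are
# illustrative; real YOLOv3 wrappers expose equivalent data).

def filter_person_boxes(detections, conf_threshold=0.5):
    """detections: list of dicts with 'class', 'conf' and the box corners
    (x_left, y_top, x_right, y_down). Returns the boxes passed on to the
    fall/nonfall classification stage."""
    return [
        (d["x_left"], d["y_top"], d["x_right"], d["y_down"])
        for d in detections
        if d["class"] == "person" and d["conf"] >= conf_threshold
    ]

detections = [
    {"class": "person", "conf": 0.91, "x_left": 120, "y_top": 80,  "x_right": 260, "y_down": 430},
    {"class": "chair",  "conf": 0.88, "x_left": 300, "y_top": 200, "x_right": 420, "y_down": 460},
    {"class": "person", "conf": 0.32, "x_left": 500, "y_top": 100, "x_right": 560, "y_down": 300},
]
boxes = filter_person_boxes(detections)  # only the high-confidence person box remains
```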

#### *3.3. Learning-Based Fall/Nonfall Classification*

The effectiveness of SVM-based approaches for classification has been widely tested [39–41]. The SVM algorithm defines a hyperplane or decision boundary that separates the classes while maximizing the margin (the distance between the boundary and the nearest data points of each class). Support vectors are the training data points that define the decision boundary [42]. To find the hyperplane, a constrained minimization problem has to be solved, using optimization techniques such as the Lagrange multiplier method.

In the case of nonlinearly separable data, data points from initial space *Rd* are mapped into a higher dimensional space *Q* where it is possible to find a hyperplane to separate the points. With this, the classification-decision function becomes

$$f(\mathbf{x}) = \text{sgn}\left(\sum_{i=1}^{N_s} y_i \alpha_i K(\mathbf{x}, \mathbf{s}_i) + b\right) \tag{1}$$

where training data are represented by {*xi*, *yi*}, *i* = 1, ... , *N*, with *yi* ∈ {−1, 1}; *b* is the bias; *αi*, *i* = 1, ... , *N*, are the Lagrange multipliers obtained during the optimization process [43]; *si*, *i* = 1, ... , *Ns*, are the support vectors, i.e., the training points for which *αi* ≠ 0; and *K*(*x*, *xi*) is a kernel function. A Radial Basis Function (RBF) was used as the kernel in this study:

$$K(\mathbf{x}, \mathbf{x}_i) = e^{-\gamma \|\mathbf{x} - \mathbf{x}_i\|^2} \tag{2}$$

where *γ* is the parameter controlling the width of the Gaussian kernel.
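Equation (2) can be checked numerically. A minimal sketch (the feature values below are arbitrary, chosen only to exercise the formula):

```python
import math

def rbf_kernel(x, xi, gamma=0.5):
    # K(x, x_i) = exp(-gamma * ||x - x_i||^2), as in Eq. (2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

# Identical points give K = 1 (maximum similarity).
k_same = rbf_kernel([0.4, 0.2, 0.7], [0.4, 0.2, 0.7])

# Distant points give a value close to 0; larger gamma narrows
# the Gaussian and makes the decay faster.
k_far = rbf_kernel([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])  # exp(-0.5 * 3)
```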

The accuracy of the SVM classifier depends on the regularization parameter *C* and on *γ*. *C* controls the penalty associated with misclassified training samples, while *γ* defines how far the influence of a single training point reaches. Both parameters must therefore be optimized for each particular task, for example, by cross-validation.
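A minimal sketch of such a cross-validated grid search, using scikit-learn (not the authors' code; the feature vectors below are synthetic stand-ins for the [AR, NW, NB] features defined later, with label 1 = fall, 0 = nonfall):

```python
# Tune C and gamma for an RBF-kernel SVM by cross-validated grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, well-separated stand-ins for the [AR, NW, NB] feature
# vectors (all values in [0, 1]); repeated so every fold has both classes.
X = [
    [0.38, 0.45, 0.05], [0.40, 0.30, 0.10], [0.35, 0.25, 0.20],  # fall-like
    [0.04, 0.20, 0.35], [0.05, 0.15, 0.40], [0.06, 0.18, 0.30],  # nonfall-like
] * 4
y = [1, 1, 1, 0, 0, 0] * 4

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=4)
search.fit(X, y)
clf = search.best_estimator_  # SVM refit with the selected (C, gamma) pair
```

The same grid-search pattern applies to the real training set; only the grid values and fold count would change.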

The selection of the right features, or input parameters to the SVM, plays an important role in obtaining a high-performance classification algorithm. Several features are widely used in the literature, such as the aspect ratio (AR), change in AR (CAR), fall angle (FA), center speed (CS) or head speed (HS) [21,44,45]. However, after analyzing which parameters provide the best trade-off for the goals of our approach, and using only the bounding box data of the detected person, we defined the input feature vector for the SVM classifier as follows:

• Aspect ratio of bounding box, *ARi*:

$$AR_i = \frac{W_{bi}}{H_{bi}} \tag{3}$$

• Normalized bounding box width, *NWi*:

$$NW_i = \frac{W_{bi}}{W_{image}} \tag{4}$$

• Normalized bounding box bottom coordinate, *NBi*:

$$NB_i = 1 - \frac{Ydown_{bi}}{H_{image}} \tag{5}$$

where *Wbi* = *Xrightbi* − *Xleftbi* and *Hbi* = *Ydownbi* − *Ytopbi* are the width and height of bounding box *bi*, respectively, calculated from the bounding box position provided by YOLOv3, {*Xleftbi*, *Xrightbi*, *Ytopbi*, *Ydownbi*}, and *Wimage*, *Himage* are the width and height of the overall image. Point (0, 0) is at the top-left corner of the overall image. Parameter *NBi* measures the normalized distance from the bottom of the image to the lower edge of the bounding box. As the values of *NBi* and *NWi* lie between 0 and 1, in order to give *ARi* a similar weight, we needed to adjust its value as input to the SVM. We analyzed the data and found that *Wbi* was lower than 10*Hbi* in all cases, so we normalized *ARi* by 10 to obtain a feature in [0, 1]. Therefore, we considered a detection valid only if *Wbi* < 10*Hbi*.
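Equations (3)–(5) and the scaling above can be sketched directly from a bounding box (the example box coordinates are illustrative, not measured data; the image size matches the 640 × 480 camera resolution):

```python
def fall_features(x_left, y_top, x_right, y_down, w_image=640, h_image=480):
    """Build the SVM input vector [AR/10, NW, NB] from a YOLOv3 bounding
    box, per Eqs. (3)-(5); image origin (0, 0) is at the top-left corner."""
    w_b = x_right - x_left
    h_b = y_down - y_top
    assert w_b < 10 * h_b, "detection rejected: W_bi must be < 10 * H_bi"
    ar = (w_b / h_b) / 10.0       # aspect ratio, scaled into [0, 1]
    nw = w_b / w_image            # normalized bounding box width
    nb = 1.0 - y_down / h_image   # normalized bottom coordinate
    return [ar, nw, nb]

# Standing person: a tall, narrow box reaching near the image bottom.
standing = fall_features(280, 60, 360, 460)
```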

Parameter *ARi* is the most significant feature characterizing a fall. As can be seen from the examples in Figure 4a,b, a person standing upright has a small *ARi*, while the ratio is large for a person lying in a horizontal body orientation. However, this parameter alone is not enough: in some cases the person is in a lying position but this parameter does not show it, as when the body is lying in a vertical orientation with respect to the camera, shown in Figure 4c.

**Figure 4.** Aspect ratio. (**a**) Standing person *ARi* = 0.402. (**b**) Fallen person in horizontal-pose orientation position *ARi* = 3.810. (**c**) Fallen person in vertical-pose orientation position *ARi* = 0.751.

One of the main goals of the algorithm is the ability to differentiate between fallen people and resting situations. Figure 5 shows an example of how optical perspective works in cameras: the object size in the image (in pixels) depends on the real object size (in mm) and on the distance from the camera to the object [46].


When we compare a fallen person and a resting person at the same distance from the camera, the situation is the one shown in Figure 5b, where the resting person is the one in the higher position. As shown in Figure 6a, the *ARi* and *NWi* parameters are the same in both cases (same bounding box size); however, the *NBi* parameter differs (*NB*1 vs. *NB*2). For the same value *NB*1, the bounding box of a fallen person would have to be the red one (see Figure 6b).

Therefore, the proposed parameters *ARi*, *NWi* and *NBi* provide the information needed to differentiate those situations and, during the training stage, the SVM learns the relation between them in both cases (fall and resting position).
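This discrimination can be illustrated numerically (the box-bottom coordinates below are made-up values, not measured data): two equal-sized boxes at the same distance share *ARi* and *NWi*, so only *NBi* separates fall from rest.

```python
def nb(y_down, h_image=480):
    # Eq. (5): normalized distance from image bottom to the box's lower edge
    return 1.0 - y_down / h_image

# Same-sized boxes, same distance from the camera: only the bottom
# coordinate of the box differs between the two situations.
fallen_nb = nb(440)   # fallen person: box bottom near the floor line
resting_nb = nb(330)  # resting person: box bottom higher (e.g. on a sofa)
```

The SVM learns this relation from training data rather than from a hand-set threshold, since the *NBi* gap between the two cases shrinks with distance from the camera.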

Figure 7 illustrates the previous explanation with real images. It contains three pairs of images in which fallen and resting persons are at the same distance from the camera (1.5, 2 and 3 m away). Table 1 shows the parameters provided to the SVM in those situations. As can be seen in the table, each pair of images has a similar *NWi* parameter (slight differences are due to the persons not being at precisely the same distance from the camera). However, parameter *NBi* has a larger value in the nonfall situation because the body is in a higher position in the image.


**Table 1.** Input parameters to the support vector machine (SVM) from images in Figure 7.

**Figure 5.** Optical perspective. Image plane for same object at (**a**) different distances and (**b**) different heights.

**Figure 6.** Relation between the *NBi* parameter and the bounding box size. (**a**) Fallen and resting persons at same distance from camera. (**b**) Two fallen persons at different distances from camera.

**Figure 7.** Fall/nonfall detection 1.5, 2 and 3 m away (each pair of images are in the same column).
