**4. Activity Recognition Chain**

Figure 4 illustrates the procedure we followed to perform activity recognition. To simplify the illustration, we show the signal from one accelerometer axis for the smart watch and the RSSI from one beacon. During real-world operation, our system uses three accelerometer signals (one for each axis) and eight RSSI signals (one for each of the eight beacons deployed). For the smart watch, we have also experimented with both accelerometer and gyroscope signals, but this did not result in a noticeable improvement in performance.

**Figure 4.** Overview of the activity recognition chain implemented in our system.

The data acquisition phase is performed using our mobile application. In training mode, where incoming data must be labelled before they can be used to train the classifiers, the participant selects the activity they are performing from a list of available activities. This guarantees that the incoming data are labelled accordingly.

Our data segmentation approach uses a non-overlapping sliding window. As we discuss further in Section 6.1, we have evaluated our system with window sizes of 1 to 5 s in 1-s increments, as the window size has been shown to affect activity recognition performance [10,43,44]. We apply the same windowing mechanism to the BLE beacon data, as it has been shown to help mitigate multipath effects [45].
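
To make the windowing concrete, the following minimal Python sketch partitions a single signal stream into non-overlapping windows. The 50 Hz sampling rate, the one-minute stream and the `segment` helper are illustrative assumptions, not details of our implementation.

```python
import numpy as np

def segment(signal, sample_rate, window_seconds):
    """Split a 1-D signal into non-overlapping windows.

    Trailing samples that do not fill a complete window are discarded.
    """
    n = int(sample_rate * window_seconds)   # samples per window
    n_windows = len(signal) // n
    return signal[: n_windows * n].reshape(n_windows, n)

# Example: one minute of a single accelerometer axis sampled at 50 Hz
# (an assumed rate), segmented into 2 s windows of 100 samples each.
acc_x = np.random.randn(50 * 60)
windows = segment(acc_x, sample_rate=50, window_seconds=2)
print(windows.shape)  # (30, 100)
```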

For feature extraction from the smart watch accelerometer data, we have opted for two feature types:

- **Type 1:** the mean and standard deviation of each window (two features per axis);
- **Type 2:** an extended set of five features per axis.
These feature types have been shown to be well suited to human activity recognition [43,46,47]. For the beacon data, we used a single feature type, the mean and standard deviation of the RSSI within each window, based on our previous work on occupancy detection using BLE beacons [41,42]. As we use a three-axis accelerometer, the total number of smart watch features is six for Type 1 and 15 for Type 2. Similarly, since we have deployed eight beacons, the total number of beacon features is 16.
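
As an illustration of Type 1 feature extraction, the sketch below computes the mean and standard deviation of each window and tallies the resulting feature counts. The helper `type1_features` and the random placeholder windows are hypothetical; only the feature definitions and counts come from the text above.

```python
import numpy as np

def type1_features(window):
    """Type 1 features of one window: mean and standard deviation."""
    return [float(np.mean(window)), float(np.std(window))]

# Hypothetical windows from the segmentation stage (random placeholders).
w_ax, w_ay, w_az = (np.random.randn(100) for _ in range(3))
beacon_windows = [np.random.randn(20) for _ in range(8)]   # K = 8 beacons

# Smart watch Type 1 features: 2 per axis x 3 axes = 6.
watch_features = type1_features(w_ax) + type1_features(w_ay) + type1_features(w_az)

# Beacon features: 2 per beacon x 8 beacons = 16.
beacon_features = [f for w in beacon_windows for f in type1_features(w)]
```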

The next stage of our activity recognition chain is feature fusion [48,49]. In Section 6, we demonstrate that feature fusion significantly enhances the performance of our system. We must note, however, that our system can also operate using only the smart watch accelerometer data.

To better illustrate how feature fusion is implemented, let us define the RSSI value corresponding to beacon $i$ at time $t$ as $r_i^{(t)}$, where $i \in \mathbb{Z} \cap [1, K]$. In our case, there are $K = 8$ beacons.

Thus, at time $t$, the RSSI values corresponding to the eight beacons are $r_1^{(t)}, r_2^{(t)}, \dots, r_8^{(t)}$. Similarly, the accelerometer values for each axis at time $t$ are $a_x^{(t)}, a_y^{(t)}, a_z^{(t)}$.

In the data segmentation stage, the signals from each sensor are partitioned into non-overlapping data windows $w_s$, where $s$ denotes the type of sensor. Consequently, we have:

$$\begin{aligned} w_{r_1} &= (r_1^{(t_1)}, \dots, r_1^{(t_n)}) \\ &\;\;\vdots \\ w_{r_8} &= (r_8^{(t_1)}, \dots, r_8^{(t_n)}) \\ w_{a_x} &= (a_x^{(t_1)}, \dots, a_x^{(t_m)}) \\ w_{a_y} &= (a_y^{(t_1)}, \dots, a_y^{(t_m)}) \\ w_{a_z} &= (a_z^{(t_1)}, \dots, a_z^{(t_m)}) \end{aligned}$$

We must note that, since the transmission frequency of the BLE beacons and the sampling rate of the smart watch differ, the number of samples in the respective windows also differs, as denoted by $t_n$ and $t_m$. For each window, we extract a set of features, which are then fused into a single feature vector $\mathbf{x}$. For example, if we use the first feature type (mean and standard deviation) for the smart watch data, the fused feature vector for $K = 8$ beacons will be:

$$\mathbf{x} = \big( mean(w_{r_1}), std(w_{r_1}), \dots, mean(w_{r_8}), std(w_{r_8}), mean(w_{a_x}), std(w_{a_x}), mean(w_{a_y}), std(w_{a_y}), mean(w_{a_z}), std(w_{a_z}) \big)$$
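
A minimal sketch of this fusion step, assuming Type 1 features throughout and NumPy arrays for the windows (the function `fuse_features`, the variable names and the placeholder data are ours, for illustration only):

```python
import numpy as np

def fuse_features(beacon_windows, acc_windows):
    """Fuse per-window features into a single feature vector x.

    Features are ordered as in the equation above: the eight beacon
    windows first, followed by the three accelerometer-axis windows.
    """
    x = []
    for w in list(beacon_windows) + list(acc_windows):
        x.extend([np.mean(w), np.std(w)])
    return np.asarray(x)

# Hypothetical usage with K = 8 beacon windows and 3 axis windows:
beacon_windows = [np.random.randn(20) for _ in range(8)]
acc_windows = [np.random.randn(100) for _ in range(3)]
x = fuse_features(beacon_windows, acc_windows)
print(x.shape)  # (22,) -> 8 * 2 + 3 * 2 features
```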

The feature vector $\mathbf{x}$ is then used as the input to the classifier. For the classification of activities, we have chosen four classifiers that have been used successfully in human activity recognition research, as discussed in Section 2: k-Nearest Neighbours (KNN), Logistic Regression (LR), Random Forest (RF) and Support Vector Machines (SVM). We partitioned our dataset into an 80% training set and a 20% test set, and used 10-fold cross-validation for hyper-parameter tuning. For SVM, we chose the radial basis function (RBF) kernel, since the number of features is small compared to the number of instances and mapping the data to a higher-dimensional space improves classification performance [50].
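
As a sketch of how such an evaluation could be set up with scikit-learn (a library choice we make here purely for illustration; the hyper-parameter grid, the placeholder data and the five activity classes are likewise assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 500 fused feature vectors of length 22,
# labelled with 5 hypothetical activity classes.
X = np.random.randn(500, 22)
y = np.random.randint(0, 5, size=500)

# 80%/20% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 10-fold cross-validation on the training set for hyper-parameter
# tuning of an RBF-kernel SVM; the grid values are illustrative.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```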

We should note that, when the system is used in normal operation mode with a trained classifier residing on the server, as depicted in Figure 1, the mobile phone is responsible for the stages up to and including segmentation. The data are then transmitted to the server, where feature extraction, fusion and classification take place.
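
Under this split, the server-side stages amount to composing the earlier sketches. The function below is a hypothetical illustration, reusing `fuse_features` from the fusion sketch together with a trained scikit-learn classifier `clf`:

```python
def classify_window_set(beacon_windows, acc_windows, clf):
    """Server-side stages: feature extraction, fusion and classification.

    The phone has already acquired and segmented the raw streams; the
    server receives one set of windows per request and returns a label.
    """
    x = fuse_features(beacon_windows, acc_windows)  # extraction + fusion
    return clf.predict(x.reshape(1, -1))[0]         # classification
```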
