4.3.1. PDR

Since PDR requires neither additional equipment nor a pre-survey, it has a wide range of potential applications in the indoor positioning of pedestrians. It relies on the inertial sensors widely available in mobile devices, e.g., smartphones, to acquire information about the user's movements, which is then combined with the user's previous location to estimate the present position and, step by step, reconstruct the complete trajectory. The equation utilized for location estimation is as follows:

$$\begin{cases} x\_t = x\_{t-1} + SL\_t \sin \theta\_t \\ y\_t = y\_{t-1} + SL\_t \cos \theta\_t \end{cases} \tag{4}$$

where (*x<sub>t</sub>*, *y<sub>t</sub>*) is the pedestrian position at time *t*, *SL<sub>t</sub>* is the step length, and *θ<sub>t</sub>* is the heading direction of the pedestrian [40].
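As a concrete illustration, a single dead-reckoning update of Equation (4) can be sketched in Python; the step length and heading values in the usage below are arbitrary examples:

```python
import math

def pdr_update(x, y, step_length, heading_rad):
    """One dead-reckoning update per Equation (4): advance the previous
    position (x, y) by one detected step of length `step_length` along
    heading `heading_rad` (measured clockwise from the positive y-axis)."""
    return (x + step_length * math.sin(heading_rad),
            y + step_length * math.cos(heading_rad))

# Example: from the origin, one 0.7 m step straight along the y-axis.
x1, y1 = pdr_update(0.0, 0.0, 0.7, 0.0)
```

Applying this update once per detected step, with per-step length and heading estimates, yields the full trajectory.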

As mobile technology continues to evolve, a growing number of physical sensors are being built into smartphones; the various combinations of sensors can thus provide increasingly rich information, making PDR more feasible and accessible. A typical PDR system consists of three main components: step detection, step-length estimation, and heading estimation [41].

**Step detection.** As the most popular method for accurate step detection, peak detection is employed in this paper; it relies on the repeating fluctuation patterns of human movement. Using the smartphone's accelerometer to determine whether the pedestrian is stationary or walking is straightforward, as it directly reflects the acceleration of motion. The magnitude of the acceleration over the three dimensions *a<sub>x</sub>*, *a<sub>y</sub>*, *a<sub>z</sub>*, rather than the vertical component alone, is employed as the input for peak finding to improve accuracy, which can be expressed as:

$$a = \sqrt{a\_x^2 + a\_y^2 + a\_z^2} \tag{5}$$

where *a<sub>x</sub>*, *a<sub>y</sub>*, *a<sub>z</sub>* denote the three-axis accelerometer values of the smartphone [42]. A peak is detected when *a* is greater than a given threshold. To further enhance performance, a low-pass filter is applied to the magnitude to reduce signal noise. Because acceleration jitter can still produce spurious peaks, the falsely detected peak points need to be eliminated. Hence, an adaptive threshold technique based on the maximum and minimum acceleration is adopted to fit different motion states, together with a time-interval limitation between adjacent detected steps.
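The step-detection logic described above can be sketched as follows; the threshold value, the minimum step interval, and the use of a simple 3-point moving average in place of the low-pass filter are illustrative assumptions:

```python
import math

def detect_steps(accel_xyz, fs, threshold=10.8, min_interval=0.3):
    """Peak-based step detection sketch.
    accel_xyz: list of (ax, ay, az) samples in m/s^2; fs: sample rate (Hz).
    A step is counted at a local maximum of the acceleration magnitude
    that exceeds `threshold` (an illustrative value) and occurs at least
    `min_interval` seconds after the previously detected step."""
    mags = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in accel_xyz]
    # 3-point moving average standing in for the low-pass filter.
    smooth = [sum(mags[max(0, i - 1):i + 2]) / len(mags[max(0, i - 1):i + 2])
              for i in range(len(mags))]
    min_gap = int(min_interval * fs)    # time-interval limitation, in samples
    steps, last = [], -min_gap
    for i in range(1, len(smooth) - 1):
        is_peak = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        if is_peak and smooth[i] > threshold and i - last >= min_gap:
            steps.append(i)
            last = i
    return steps
```

An adaptive implementation would additionally update `threshold` from the recent maximum and minimum acceleration, as described above.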

**Stride length estimation.** Various linear and nonlinear methods have been proposed to estimate the step length, which varies from person to person because of differences in walking posture determined by factors including height, weight, and step frequency. It is therefore difficult to construct a single step-length estimation model that fits everyone precisely. Some researchers assume that the step length is a static value determined by the individual characteristics of each user. In contrast, the empirical Weinberg model estimates the stride length according to the dynamic movement state, which is closer to reality [43]. The model is given by:

$$SL = k \sqrt[4]{a\_{\text{max}} - a\_{\text{min}}} \tag{6}$$

where *k* is a dynamic coefficient concerned with the acceleration of each step, and *a<sub>max</sub>* and *a<sub>min</sub>* are the maximum and minimum accelerations of each step [44].
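A minimal sketch of the Weinberg estimate in Equation (6); the value of `k` used here is a placeholder, since in practice it is calibrated per user and motion state:

```python
def weinberg_stride(a_max, a_min, k=0.5):
    """Weinberg step-length estimate, Equation (6): SL = k * (a_max - a_min)^(1/4).
    a_max / a_min are the max/min acceleration magnitudes (m/s^2) within one
    detected step; k = 0.5 is an illustrative placeholder coefficient."""
    return k * (a_max - a_min) ** 0.25
```

The fourth root compresses the acceleration swing, so a larger vertical bounce yields a moderately longer estimated step.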

**Heading estimation.** Heading information is a critical component of the entire PDR implementation and strongly affects localization accuracy. To avoid the accumulative error of gyroscope-based direction estimation and the short-term direction disturbances of magnetometer-based estimation, a combination of the gyroscope and magnetometer is typically adopted for heading estimation [42]. The current magnetometer heading, the current gyroscope readings, and the previously fused heading are weight-averaged to form the fused heading; the weighting factor is adaptive and is based on the magnetometer's stability as well as the correlation between the magnetometer and the gyroscope [44]. As these sensors are already fused in the rotation vector obtained from the smartphone's rotation sensor, the heading change can be calculated via a rotation matrix transformed from the rotation vector [45]. The rotation vector is defined as [*x*, *y*, *z*, *w*], and the matrix is defined as *M*, *M* ∈ ℝ<sup>3×3</sup>. The heading direction over the three dimensions can be evaluated by:

$$M = \begin{bmatrix} M\_{11} & M\_{12} & M\_{13} \\ M\_{21} & M\_{22} & M\_{23} \\ M\_{31} & M\_{32} & M\_{33} \end{bmatrix} = \begin{bmatrix} 1 - 2y^2 - 2z^2 & 2xy - 2zw & 2xz + 2yw \\ 2xy + 2zw & 1 - 2x^2 - 2z^2 & 2yz - 2xw \\ 2xz - 2yw & 2yz + 2xw & 1 - 2x^2 - 2y^2 \end{bmatrix} \tag{7}$$

$$\theta = \begin{bmatrix} \arctan2(M\_{12}, M\_{22})\\ \arcsin(-M\_{32})\\ \arctan2(-M\_{31}, M\_{33}) \end{bmatrix} = \begin{bmatrix} \arctan2(2xy - 2zw, 1 - 2x^2 - 2z^2) \\ \arcsin(-2yz - 2xw) \\ \arctan2(2yw - 2xz, 1 - 2x^2 - 2y^2) \end{bmatrix} \tag{8}$$
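Equation (8) only uses five entries of the matrix in Equation (7), so the orientation angles can be computed directly from the rotation vector, assuming it is a unit quaternion:

```python
import math

def heading_from_rotation_vector(x, y, z, w):
    """Orientation angles from a unit rotation vector [x, y, z, w] via the
    matrix entries of Equation (7) and the angle formulas of Equation (8).
    Returns the three components of theta, in radians."""
    m12 = 2 * x * y - 2 * z * w
    m22 = 1 - 2 * x * x - 2 * z * z
    m31 = 2 * x * z - 2 * y * w
    m32 = 2 * y * z + 2 * x * w
    m33 = 1 - 2 * x * x - 2 * y * y
    return (math.atan2(m12, m22),   # heading (azimuth)
            math.asin(-m32),        # pitch
            math.atan2(-m31, m33))  # roll
```

For the identity quaternion [0, 0, 0, 1] all three angles are zero, and a pure rotation about the z-axis changes only the first (heading) component.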

### 4.3.2. Landmark Identification Model

Although PDR methods can estimate the location and trajectory of pedestrians, the low-cost inertial sensors built into smartphones provide poor-quality measurements, resulting in accuracy degradation. Moreover, cumulative error, including heading-estimation error caused by the gyroscope and step-length estimation error caused by the accelerometer, builds up during long-term positioning with PDR, making precise localization increasingly challenging. Therefore, it is necessary to prepare reference points whose correct positions are known in advance, so that the accumulated errors can be reduced when the user passes them. Spatial contexts, such as landmarks, can be chosen to calibrate the localization error based on inherent spatial information without additional deployment costs. In indoor positioning systems, a landmark is defined as a spatial point with salient features and semantic characteristics distinguishing it from its nearby environment, such as a corner, staircase, or elevator [27]. These features can be observed for identification in one sensor or a combination of different sensors as people pass through the landmark. The locations of these landmarks are represented by geographical coordinates or by relationships with other locations/areas where people perform specific and predictable activities. Changes in motion are reflected in the sensor readings, and different motions present different patterns; the specific activities that people perform when passing landmarks are likewise reflected in at least one sensor. Using the data of one sensor or a combination of data from multiple sensors, the changing pattern of a specific activity can be identified, and the landmark can then be recognized [46]. The identified landmark can be used as an anchor point to correct the obtained path and improve the quality of the calculated trajectory.

Landmark identification involves classifying the sequences of various sensor data recorded at regular intervals by sensing devices, usually smartphones, into well-defined landmarks, which has been extensively regarded as a problem of multivariate time-series classification. To address this problem, it is critical to extract and learn features comprehensively so as to determine the relationship between sensing information and movement patterns. In recent years, numerous features have been obtained in many studies from statistical aspects of the raw signals, such as variance, mean, entropy, kurtosis, and correlation coefficients, or from the frequency domain via cross-formal encodings of the signals, such as the Fourier transform and the wavelet transform [47]. Moreover, specific thresholds on different features have been analyzed for recognizing various kinds of landmarks. For instance, a threshold on the angular velocity produced by the gyroscope is usually used to detect corner landmarks, while changes in acceleration can be used to recognize stairs. Combinations of thresholds on various sensors, organized into a decision tree, can detect the standing motion state and further distinguish common landmarks such as corners, stairs, and elevators [48,49]. However, despite their high accuracy, the calculation, extraction, and selection of features from different sensors for various landmarks are heuristic (requiring professional knowledge and expertise in the specific domain), time-consuming, and laborious [47].

To facilitate feature engineering and improve performance, artificial neural networks based on deep learning techniques have been employed to conduct activity identification without hand-crafted feature extraction. Deep learning techniques have been applied with remarkable performance to solve practical problems in many fields, such as image processing, speech recognition, and natural language processing [50,51]. Many kinds of deep neural networks have been introduced and investigated to handle landmark identification, given the complexity and uncertainty of human movements. Among the applied networks, CNN and LSTM are widely adopted for activity recognition with high accuracy. A CNN is commonly separated into numerous learning stages, each of which consists of a mix of convolutional operations and nonlinear processing units, as follows:

$$h^k = \sigma(\sum\_{l \in L} g(\mathbf{x}^l, w^k) + b^k) \tag{9}$$

where *h<sup>k</sup>* is the latent representation of the *k*-th feature map of the current layer, *σ* is the activation function, *g* denotes the convolution operation, *x<sup>l</sup>* indicates the *l*-th feature map in the group of feature maps *L* obtained from the previous layer, and *w<sup>k</sup>* and *b<sup>k</sup>* express the weight matrix and the bias of the *k*-th feature map of the current layer, respectively [52]. In our model, rectified linear units (ReLU) were employed as the activation functions to subsequently conduct the nonlinear transformation and obtain the feature maps, denoted by:

$$\sigma(x) = \max(0, x) \tag{10}$$
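Equations (9) and (10) together describe one convolutional stage. A minimal pure-Python sketch (a single output feature map, "valid" 1D convolution, and illustrative weights) is:

```python
def conv1d_relu(x_maps, w, b):
    """One CNN stage per Equations (9)-(10).
    x_maps: input feature maps from the previous layer (each a list of floats);
    w: kernel weights, w[l][j] for input map l and kernel position j;
    b: scalar bias.  Returns one output feature map: a valid 1D convolution
    summed over the input maps, followed by ReLU, sigma(x) = max(0, x)."""
    k = len(w[0])                       # kernel width
    out_len = len(x_maps[0]) - k + 1
    h = []
    for t in range(out_len):
        s = b
        for l, xm in enumerate(x_maps):
            s += sum(xm[t + j] * w[l][j] for j in range(k))
        h.append(max(0.0, s))           # ReLU clips negative responses to zero
    return h
```

Each hidden unit only sees a `k`-sample window of the input, which is the local receptive field discussed below.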

More importantly, the convolution operation in a CNN can efficiently capture local spatial correlation features by limiting the receptive field of each hidden unit to be local [53]. However, a CNN considers each frame of sensor data as independent and extracts features from these isolated portions of data without considering the temporal context beyond the frame boundaries. Due to the continuity of the sensor data flow produced by the user's behavior, local spatial correlations and long-term temporal connections are both important for identifying landmarks [52]. LSTMs, as variants of vanilla recurrent neural networks (RNNs), are equipped with learnable gates that modulate the flow of information and control when to forget previous hidden states, allowing the neural network to effectively extract long-range dependencies from time-series sensor data [54]. The hidden state of the LSTM at time *t* is represented by:

$$h\_t = \sigma(w\_{i,h} \cdot x\_t + w\_{h,h} \cdot h\_{t-1} + b) \tag{11}$$

where *h<sub>t</sub>* and *h<sub>t−1</sub>* are the hidden states at time *t* and *t* − 1, respectively, *σ* is the activation function, *w<sub>i,h</sub>* and *w<sub>h,h</sub>* are the weight matrices between the corresponding parts, and *b* is the hidden bias vector. Standard LSTM cells only extract features from past movements, ignoring the future part. To comprehensively capture the information needed for landmark identification, a Bi-LSTM is applied to access the context in both the forward and backward directions [55].

Therefore, both Bi-LSTM and CNN are involved in capturing the spatial and temporal features of the signals for landmark identification. The architecture of the proposed landmark identification model is shown in Figure 2. It performs landmark recognition using residual concatenation for classification, following the Bi-LSTM and multi-head CNN stages. When preprocessed data segments from multiple sensors arrive, their inherent temporal relationships are extracted sequentially by two Bi-LSTM blocks, each consisting of a Bi-LSTM layer, a batch normalization (BN) layer, an activation layer, and a dropout layer. BN is a method used to improve training speed and accuracy by mitigating the internal covariate shift through normalization of the layer inputs, re-centering and re-scaling them [34]. Next, multi-head CNN blocks with varying kernel sizes follow to learn spatial features at various resolutions. Each convolutional block is made of four layers: a one-dimensional (1D) convolutional layer, a BN layer, an activation layer, and a dropout layer. To accommodate the three-dimensional input shape (samples, time steps, input channels) of the 1D convolutional layer, we retain the output of the hidden state in the Bi-LSTM layer. Then the acquired spatial and temporal features are combined, namely the concatenation of the outputs of the multi-head CNNs and Bi-LSTMs. To reduce the number of parameters and avoid overfitting, a global average pooling (GAP) layer, which has no parameters to optimize, is applied before combining the outputs, rather than a traditional fully connected layer [32]. Finally, the concatenated features are passed through a BN layer to re-normalize them before being fed into a dense layer with a softmax classifier that generates the probability distribution over the classes.
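A sketch of such an architecture in Keras (TensorFlow 2.x) is given below; the layer widths, kernel sizes, dropout rates, and class count are illustrative placeholders, not the exact hyperparameters of the proposed model:

```python
from tensorflow.keras import layers, models

def build_landmark_model(time_steps=128, channels=9, n_classes=5):
    """Bi-LSTM + multi-head CNN landmark classifier (illustrative sketch)."""
    inp = layers.Input(shape=(time_steps, channels))

    # Two Bi-LSTM blocks (Bi-LSTM -> BN -> activation -> dropout), keeping
    # the full hidden-state sequence so the 1D convolutions can follow.
    x = inp
    for _ in range(2):
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.3)(x)

    # Multi-head 1D CNN blocks with different kernel sizes, each ending in
    # global average pooling (GAP) instead of a fully connected layer.
    heads = []
    for k in (3, 5, 7):
        h = layers.Conv1D(64, kernel_size=k, padding="same")(x)
        h = layers.BatchNormalization()(h)
        h = layers.Activation("relu")(h)
        h = layers.Dropout(0.3)(h)
        heads.append(layers.GlobalAveragePooling1D()(h))

    # Residual-style concatenation: the pooled Bi-LSTM output joins the CNN heads.
    heads.append(layers.GlobalAveragePooling1D()(x))
    merged = layers.Concatenate()(heads)
    merged = layers.BatchNormalization()(merged)
    out = layers.Dense(n_classes, activation="softmax")(merged)
    return models.Model(inp, out)
```

The model can then be compiled with a categorical cross-entropy loss and trained on the segmented sensor dataset.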

**Figure 2.** Architecture of the landmark identification model.

### *4.4. Contact Awareness with Trajectory*

Exhalation and inhalation are constantly alternating respiratory activities (e.g., each breath consists of 2.5 s of continuous exhalation and 2.5 s of continuous inhalation), and droplets are continuously released from the respiratory tract during exhalation with a horizontal velocity in the same direction as the movement of the human. The particles exhaled at each moment continue to move forward, starting from the position of the user when they are expelled. The viral droplets exhaled by an infectious host are transported and dispersed by the ambient airflow before finally being inhaled by a susceptible person. Each exhalation lasts several seconds (e.g., 2.5 s), during which a person in motion can travel a long distance, so the initial positions of the expelled droplets cannot be estimated accurately in an indoor environment. Therefore, the complete exhalation period is divided into many short-term (e.g., 0.1 s) particle ejections. Because each interval is short, the continuous virus exhalation process can be converted into an instantaneous process, i.e., the virus is released instantly at the beginning of each interval. The virus-laden droplets expelled in different intervals maintain independent and identical motion patterns, and the initial positions of the particles released in each interval can be regarded as the locations of the person at the corresponding initial moments. The virus-containing particles maintain a uniform motion at the initial horizontal velocity (e.g., 3.9 m·s<sup>−1</sup>) during the first second and then mix instantaneously into the overall considered space; the droplets are assumed to be evenly distributed within the space they have covered. In the first movement phase of the exhaled droplets in each interval, the virus moves in the same direction as the person travels, which is called forward transmission. As for backward transmission, the initial velocity of the virus is generally faster than the speed of movement and the speed of the airflow, so in the first phase very few virus particles move in the opposite direction.
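The assumed two-phase droplet motion can be sketched as follows; the default initial velocity matches the 3.9 m·s⁻¹ example above, while the heading convention mirrors the PDR position update:

```python
import math

def droplet_front(release_x, release_y, heading_rad, t, v0=3.9):
    """Position of the leading edge of a droplet packet, t seconds after
    release, under the assumed motion pattern: uniform horizontal velocity
    v0 (m/s) along the exhaler's heading for the first second, after which
    the packet is treated as mixed and its front no longer advances."""
    travel = v0 * min(t, 1.0)           # forward transmission phase capped at 1 s
    return (release_x + travel * math.sin(heading_rad),
            release_y + travel * math.cos(heading_rad))
```

After the first second the packet is treated as well mixed over the covered space, so only the release position, heading, and elapsed time are needed per interval.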

The virus-loaded droplets exhaled by infectious people at different locations and times may meet somewhere at some time and jointly contribute to the local concentration. To precisely present the virus quanta concentration, the transmission of the virus particles from every exhalation source, regardless of origin and state, is assumed to follow the same pattern, in which the particles keep a constant initial velocity in the first second and then instantly mix into the overall space. The time it takes each packet of virus to reach the current point and its contribution to the present virus quanta are estimated with the help of spatial distance and velocity. Thus, the quanta concentration in an indoor area at time *t*, *q*(*t*, *ER<sub>q</sub>*), is measured by:

$$q(t, \, ER\_q) = \sum\_{i}^{i=N\_v} \left( \frac{ER\_q^i}{RR\_{iv} \cdot V(t^i)} \cdot \left(1 - e^{-RR\_{iv} \cdot t^i} \right) + \left(q\_0 \cdot \frac{e^{-RR\_{iv} \cdot T}}{V} + q\_0^i \cdot \frac{e^{-RR\_{iv} \cdot t^i}}{V} \right) \right) \tag{12}$$

where *RR<sub>iv</sub>* is the virus removal rate of the target space, *N<sub>v</sub>* is the number of virus packets generated in different places at different moments, *ER<sup>i</sup><sub>q</sub>* is the quanta emission rate of the infector at which the *i*-th virus packet is expelled, *T* is the time difference from the start of the experiment to the present, *t<sup>i</sup>* is the time difference between the current time and the originating time of the *i*-th virus packet, *V*(*t<sup>i</sup>*) is the volume of the space that the *i*-th virus packet has passed through since it was expelled, *q*<sub>0</sub> is the environmental virus quanta number, and *q<sup>i</sup><sub>0</sub>* is the virus exhaled by the infector that has spread evenly through the overall investigated space of volume *V*. Exhaled virus particles eventually become part of the environmentally well-mixed virus quanta, while different initial states induce different decays.
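A direct transcription of Equation (12) can be sketched as follows; the dictionary layout of the per-packet state and all numeric inputs are assumptions made for illustration:

```python
import math

def quanta_concentration(sources, rr_iv, big_t, q0, v_total):
    """Quanta concentration per Equation (12).
    sources: one dict per expelled virus packet i, with keys
      'er_q' (quanta emission rate ER_q^i),
      't'    (time t^i since that packet was expelled, s),
      'v'    (volume V(t^i) the packet has covered so far, m^3),
      'q0'   (that packet's initial well-mixed quanta contribution q_0^i);
    rr_iv: virus removal rate of the space; big_t: time T since the start
    of the experiment; q0: environmental quanta number; v_total: room volume V."""
    q = 0.0
    for s in sources:
        decay = math.exp(-rr_iv * s["t"])
        q += s["er_q"] / (rr_iv * s["v"]) * (1.0 - decay)   # emission term
        q += q0 * math.exp(-rr_iv * big_t) / v_total        # environmental decay
        q += s["q0"] * decay / v_total                      # per-packet decay
    return q
```

Each packet contributes an emission term limited by the volume it has covered, plus exponentially decaying well-mixed terms, matching the structure of the sum in Equation (12).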

### *4.5. Spatiotemporal Contact Awareness*

The algorithm of the proposed iSTCA, with smartphone-based, landmark-calibrated PDR technology, is detailed in Algorithm 1. The detailed procedures are as follows:

Firstly, the raw signals are acquired via the developed collection application and preprocessed, using the data preprocessing method introduced in Section 4.2, to create the dataset for training the landmark identification model.


Secondly, the landmark recognition model designed in Section 4.3.2 is trained and stored based on the dataset generated in the first step, to further support the PDR algorithm.

Thirdly, the target trajectory S is constructed by performing the landmark-calibrated PDR technique, including step detection, stride length estimation, heading determination, and landmark identification.

Fourthly, we obtain the initial state set *Q<sup>i</sup><sub>0</sub>* of the expelled particles in the *i*-th (*i* = 1, 2, 3 . . .) short-term period with the help of the calculated human movement trajectory S and the preset viral particle ejection interval *τ*. *Q<sup>i</sup><sub>0</sub>* defines the state of all particles emitted in the *i*-th interval *τ* and consists of three parts, *t*, *V*, and *q*, where *t* represents the elapsed time after being exhaled, *V* represents the spread coverage of the droplets due to airborne dispersion, and *q* represents the quanta concentration.

Fifthly, the state set *Q<sup>i</sup><sub>j</sub>* at the *j*-th interval after expulsion is derived from each *Q<sup>i</sup><sub>0</sub>* by applying the defined movement pattern of the considered particles.

Finally, the virus quanta concentration *q<sup>T</sup><sub>P</sub>* at the target position P at the target time T is obtained. The virus quanta contributed at P and T by the particles expelled in the various intervals is summed to estimate *q<sup>T</sup><sub>P</sub>*. Moreover, the virus quanta concentrations present at different locations at various times can be further evaluated.
