The mini-batch alignment technique, used in the pipeline shown in Figure 1, iteratively forgets differences between the PDFs of previous mini-batch distributions and permuted PDFs of mini-batch distributions. The subsections that follow explain the different components of this pipeline.
Our mini-batch alignment technique is inspired by the MMD loss [22] and by a specific use of the Wasserstein distance [23], because both techniques also involve a form of probability density estimation as an intermediate computation. The mini-batch alignment technique nevertheless differs from both: its loss definition does not reduce the distance between source and target distributions, and it does not generate samples based on their distance from an original distribution. The MMD loss, when minimized, brings hidden input sample source and target distributions together via empirical estimation. This empirical estimation maps input samples to a Reproducing Kernel Hilbert Space (RKHS) via a kernel feature map, computes feature means across all mapped input samples separately for the source and target distributions, and computes the feature mean distance via subtraction. The specific use of the Wasserstein distance involves learning a data augmentation that generates input samples whose probability distribution is at least a predefined Wasserstein distance away from the initial probability distribution [24].
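To make the empirical MMD estimate concrete, the following is a minimal numpy sketch that computes it with a Gaussian RBF kernel through the kernel trick instead of an explicit RKHS feature map; the bandwidth value and the random inputs in the usage lines are illustrative assumptions, not part of our pipeline.

```python
import numpy as np

def mmd_squared(source, target, sigma=1.0):
    """Biased empirical MMD^2 between source and target sample sets.

    Uses a Gaussian RBF kernel via the kernel trick instead of an
    explicit RKHS feature map; sigma is an illustrative bandwidth.
    """
    def rbf(a, b):
        # Pairwise squared Euclidean distances, then kernel values.
        d2 = (np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-d2 / (2.0 * sigma**2))

    # Squared distance between the RKHS feature means of both sets:
    # mean k(s, s') + mean k(t, t') - 2 mean k(s, t).
    return (rbf(source, source).mean() + rbf(target, target).mean()
            - 2.0 * rbf(source, target).mean())

# Illustrative usage on random hidden representations.
rng = np.random.default_rng(0)
src, tgt = rng.normal(0.0, 1.0, (64, 16)), rng.normal(0.5, 1.0, (64, 16))
print(mmd_squared(src, tgt))
```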
3.1. Data Pre-Processing
Data pre-processing takes in Wi-Fi–CSI data as complex-valued tensors whose axes include the transceiver link antenna axis A. These data are initially reduced from a given number of transceiver link antennas along the A axis to 2, such that the first antenna matrix contains far greater amplitudes than the second antenna matrix, thereby adhering to an amplitude requirement for extracting Doppler shifts [25]. Amplitude adjustments are made to both antenna matrices such that the Doppler shift information in the conjugate multiplication output has a much higher amplitude [26]. Conjugate multiplication between the antenna matrices is used to remove random phase offsets; this is possible because the antennas are connected to the same radio frequency oscillator. After conjugate multiplication, high-frequency noise and static components are filtered out with a low-pass and a high-pass Butterworth filter, respectively. The filters are 6th and 3rd order, respectively, with critical frequencies expressed in half-cycles per sample. Using Principal Component Analysis (PCA), the Butterworth filter output is reduced to the first principal component.
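The chain of steps above can be sketched as follows; this is a minimal numpy/scipy illustration in which the filter cutoff values and the SVD-based complex PCA are assumptions rather than the paper's exact configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess_link(csi_ant1, csi_ant2, low_wn=0.06, high_wn=0.002):
    """csi_ant1/csi_ant2: (T, S) complex CSI matrices (time x subcarrier)
    for the two retained antennas of one transceiver link. The critical
    frequencies low_wn/high_wn (half-cycles per sample) are placeholders.
    """
    # Conjugate multiplication cancels the random phase offsets shared by
    # antennas connected to the same radio frequency oscillator.
    conj_mult = csi_ant1 * np.conj(csi_ant2)

    # 6th-order low-pass removes high-frequency noise; 3rd-order high-pass
    # removes static (quasi-DC) components.
    lp = butter(6, low_wn, btype='lowpass', output='sos')
    hp = butter(3, high_wn, btype='highpass', output='sos')
    filtered = sosfiltfilt(hp, sosfiltfilt(lp, conj_mult, axis=0), axis=0)

    # Reduce to the first principal component across subcarriers; SVD is
    # used here because it handles complex-valued data directly.
    centered = filtered - filtered.mean(axis=0)
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    return centered @ vh[0].conj()   # (T,) complex principal component
```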
We consider two different data input types, i.e., Doppler Frequency Shift (DFS) and the Gramian Angular Difference Field (GADF) [13,27] applied to the Wi-Fi–CSI's amplitude values. According to Wang and Oates [27], GADF is another visual way to understand a time series: as time increases, corresponding values warp among different angular points on spanning circles, like water rippling. GADF preserves temporal dependency and contains temporal correlations [27]. DFS refers to the change in frequency of a wave caused by the relative motion between the wave source and the observer. In other words, GADF can be considered signal analysis in the time dimension, while DFS can be considered signal analysis in the frequency dimension across time. We consider both time and time-frequency dimension analysis because time dimension analysis may provide the necessary Wi-Fi–CSI value variation in situations in which micro-Doppler components are obfuscated by dominant Doppler components. However, in dominant Non-Line-Of-Sight (NLOS) scenarios (the LOS path to the receiver antenna is blocked, and there are no additional antennas at locations that alleviate this), receiver signal amplitude variation follows a Rayleigh fading distribution [28]. The observed Rayleigh distribution may have a high peak concentrated around a specific amplitude value without noteworthy tails. Therefore, Wi-Fi–CSI amplitude value variation may be quite low while there is a large, noticeable difference in observed frequency components.
To acquire DFS, we transform the first principal component of the Butterworth filter output into a matrix structure via the Short-Time Fourier Transform (STFT). This transformation is taken from the Widar3 [13] research project. The matrix structure represents energy (i.e., magnitude) distribution information over both a Doppler frequency bin dimension and a time dimension. The STFT output is zero-padded at the end, along the T axis, to 2000 time instants. STFT outputs for the same gesture and repetition under the same unique combination of domain factors, but belonging to different transceiver links, are depth-stacked together to form the input modality, in which B denotes the Doppler frequency bin dimension and the depth axis indexes the transceiver links. Data are served to the feature extractor block in the form of fixed-size mini-batches.
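A minimal scipy sketch of this step is given below; the sampling rate and STFT segment length are illustrative assumptions, and only the zero-padding target of 2000 time instants comes from the text.

```python
import numpy as np
from scipy.signal import stft

def dfs_matrix(pc, fs=1000, nperseg=256, t_target=2000):
    """Doppler spectrogram from one link's complex principal component.

    fs and nperseg are illustrative assumptions; only the zero-padding
    target of 2000 time instants comes from the text.
    """
    # A two-sided STFT of the complex signal keeps both positive and
    # negative Doppler frequency bins.
    _, _, zxx = stft(pc, fs=fs, nperseg=nperseg, return_onesided=False)
    energy = np.abs(zxx)  # (B, T'): magnitude per Doppler bin and time step

    # Zero-pad at the end, along the time axis, to a fixed length.
    pad = max(0, t_target - energy.shape[1])
    return np.pad(energy, ((0, 0), (0, pad)))

# Per-link DFS matrices for the same gesture/repetition are depth-stacked:
# modality = np.stack([dfs_matrix(pc) for pc in per_link_pcs], axis=-1)
```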
To acquire GADF, we first convert the complex numbers of the Butterworth filter output's first principal component into real numbers by taking the amplitude, which reduces computational complexity and storage requirements. The caveat is that this leads to a reduction in performance. Afterward, the principal component's amplitude vector is downsampled in the time dimension via Piecewise Aggregate Approximation (PAA) [29]. PAA involves splitting the principal component's amplitude vector into windows, computing the mean value across each window, and subsequently connecting the mean values. Then, the PAA output is scaled to the interval [−1, 1], followed by converting the scaled PAA output into the quasi-Gramian matrix GADF using Equation (2), in which the value-scaled PAA output appears as a vector of size n. When comparing the sizes of a GADF and a DFS matrix, we notice that the GADF matrix is much larger than the DFS matrix, since the number of Doppler frequency bins is limited. Therefore, the GADF matrix is downsampled by a factor q using bilinear interpolation. As a result, the overall DFS and GADF matrix sizes are the same, and the deep-learning models under test therefore have equal computational complexity during training. Moreover, during model training, we did not have enough disk storage capacity to store the datasets in GADF form without down-sampling. Down-sampling with bilinear interpolation first involves computing a width and height size-difference ratio. Second, for every difference-ratio-scaled coordinate in the downsampled image, the weighted sum over the 4 neighboring matrix coordinates in the original is computed. GADF matrices for the same gesture and repetition under the same unique combination of domain factors, but belonging to different transceiver links, are then depth-stacked together to form the input modality. Data are fed to the feature extractor block in the form of mini-batches.
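For reference, below is a minimal numpy sketch of the PAA and GADF construction, following the standard Wang and Oates formulation in which values scaled to [−1, 1] are mapped to angles via arccos and field entries are pairwise angular-difference sines; the PAA window size is an illustrative assumption. The subsequent factor-q bilinear downsample can then be applied with any image-resizing routine that supports bilinear interpolation.

```python
import numpy as np

def paa(x, window):
    """Piecewise Aggregate Approximation: mean over non-overlapping windows."""
    n = (len(x) // window) * window
    return x[:n].reshape(-1, window).mean(axis=1)

def gadf(amplitude, window=4):
    """Gramian Angular Difference Field of an amplitude time series."""
    x = paa(amplitude, window)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # scale to [-1, 1]
    # GADF[i, j] = sin(phi_i - phi_j) with phi = arccos(x); expanding the
    # sine difference avoids computing arccos explicitly.
    root = np.sqrt(np.clip(1 - x**2, 0.0, None))
    return np.outer(root, x) - np.outer(x, root)
```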
3.2. Feature Extractor
The goal of the feature extractor (i.e., backbone network) is to map the input modality to a latent representation consisting of features that are relevant to the activity/gesture recognizer and similarity discriminator blocks. Our feature extractor is based on mobile inverted bottleneck blocks [30] and squeeze-and-excitation blocks [31]. Block details are omitted but can be found in [30,31]. The listed hyperparameters were obtained via hyperband optimization [32], using an epoch budget of 600 per iteration, 5 iterations, and a model discard proportion of 3 per iteration.
Most hyperparameters of the feature extractor used together with DFS as the input type can be found in Table 1. The standard convolution layers denoted in Table 1 are initialized with random numbers drawn from a variance-scaled normal distribution and use neither in-between pooling nor bias addition [33]. We apply padding to keep input and output sizes the same (i.e., 'same' padding). The variance-scaled normal distribution uses a scale factor of 2 and 'fan_out' mode. The max. pooling layer denotes a global max. pooling layer (the pool size engulfs the entire input width/height, and the max. pooling operation is applied only once). None of the MobileV2 blocks use dropout regularization or batch normalization. MobileV2 block weights are initialized randomly by drawing from a variance-scaled normal distribution with a scale factor of 2 and 'fan_out' mode.
Most hyperparameters of the feature extractor used together with GADF as the input type can be found in Table 2. The start and end convolution operation weights are initialized with random numbers drawn from a variance-scaled normal distribution (these layers do not use bias addition). This distribution uses a scale factor of 2 and 'fan_out' mode. Both operations use padding to keep input and output sizes the same (i.e., 'same' padding). All max. pooling layers, except for the last one, also use padding to keep input and output sizes the same. The last max. pooling operation, which creates the latent representation, is a global max. pooling operation. None of the MobileV2 blocks use dropout regularization or batch normalization. MobileV2 block weights are initialized randomly by drawing from a variance-scaled normal distribution with a scale factor of 2 and 'fan_out' mode.
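To illustrate the building block both feature extractors share, the following is a minimal Keras sketch of a mobile inverted bottleneck block with squeeze-and-excitation and the variance-scaled initialization described above; the expansion factor, kernel sizes, and SE ratio are illustrative assumptions rather than the hyperparameters of Tables 1 and 2, and batch normalization and dropout are omitted in line with the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, initializers

def mbconv_se(x, filters, expand=6, se_ratio=0.25, stride=1):
    """Mobile inverted bottleneck block with squeeze-and-excitation."""
    init = initializers.VarianceScaling(scale=2.0, mode='fan_out')
    in_ch = x.shape[-1]
    mid = in_ch * expand

    # Point-wise expansion and depthwise convolution ('same' padding,
    # no bias, no batch normalization, per the text).
    h = layers.Conv2D(mid, 1, padding='same', use_bias=False,
                      kernel_initializer=init, activation=tf.nn.relu6)(x)
    h = layers.DepthwiseConv2D(3, strides=stride, padding='same',
                               use_bias=False, depthwise_initializer=init,
                               activation=tf.nn.relu6)(h)

    # Squeeze-and-excitation: squeeze to channel statistics, then
    # re-weight channels with learned gates.
    se = layers.GlobalAveragePooling2D()(h)
    se = layers.Dense(max(1, int(mid * se_ratio)), activation='relu',
                      kernel_initializer=init)(se)
    se = layers.Dense(mid, activation='sigmoid', kernel_initializer=init)(se)
    h = h * layers.Reshape((1, 1, mid))(se)

    # Linear projection; residual connection when shapes match.
    h = layers.Conv2D(filters, 1, padding='same', use_bias=False,
                      kernel_initializer=init)(h)
    if stride == 1 and in_ch == filters:
        h = layers.Add()([x, h])
    return h

# Illustrative usage:
# inputs = tf.keras.Input((64, 64, 16)); outputs = mbconv_se(inputs, 16)
```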
3.4. Similarity Discriminator
The similarity discriminator computes scalars representing the similarity between two feature vectors. In the case of mini-batch alignment, the feature vectors are probability density values sampled at a value interval between the outer distribution quantiles. Therefore, the feature vectors denote discrete 1D PDF approximations (referred to as 1D PDFs in the text). The discriminator's output is ignored during the inference stage. The discriminator uses a PDF estimation layer and a bilinear similarity layer. The input to the PDF estimation layer is a concatenation of the feature extractor output and the activity recognizer output F. Concatenation is preferred since the activity recognizer output contains domain-specific features but is still helpful to the activity-recognition task [11]. Per feature in F, the PDF estimation layer uses Kernel Density Estimation (KDE) to create a 1D probability distribution across values from different mini-batch samples (the KDE procedure is illustrated in mathematical form in Equation (3)). The PDF is sampled across a range of values in which the outer values are the zeroth and fourth distribution quantiles (i.e., the distribution's minimum and maximum). The equation assumes the existence of one value vector z. The smoothing parameter h should be set such that important PDF peaks are not obscured while spurious PDF peaks are filtered out. Equation (3) can be generalized to a multivariate situation by considering m to be a d-variate value coordinate vector, taking h to be a smoothing matrix, considering the entire value vector z, and making use of multivariate kernels.
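Concretely, the per-feature KDE of Equation (3) can be sketched as follows; the Gaussian kernel is an assumption, while the bandwidth of 0.335 and sampling resolution of 100 are the values reported for the PDF estimation layer below.

```python
import numpy as np

def feature_pdf(z, h=0.335, resolution=100):
    """Discrete 1D PDF approximation of one feature via Gaussian KDE.

    z: (N,) values of one feature across the mini-batch samples. The PDF
    is sampled at `resolution` evenly spread positions between the zeroth
    and fourth distribution quantiles (minimum and maximum of z).
    """
    m = np.linspace(z.min(), z.max(), resolution)
    # KDE: average of kernels centered on each mini-batch value, scaled
    # by the smoothing parameter h.
    k = np.exp(-0.5 * ((m[:, None] - z[None, :]) / h) ** 2)
    k /= h * np.sqrt(2 * np.pi)
    return k.mean(axis=1)   # (resolution,) sampled probability densities
```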
We consider artificially permuted PDFs per feature in F, created with the help of a diffeomorphism (the PDF permutation process per feature is illustrated in mathematical form in Equation (4)). A diffeomorphism is a smooth, differentiable, and invertible element-wise map function, denoted by g. The PDF estimation layer randomly picks among the following diffeomorphisms with equal probability: scaling by 2, shifting by −3, the reciprocal, a chain of shifting by 5 and scaling by 0.8, the sigmoid, the softplus, and the exponential function. Because 7 diffeomorphisms are considered in total, every diffeomorphism has a 14.3% chance of being picked. Picking is performed by randomly sampling an integer in a range whose minimum bound is 1 and whose maximum bound equals the total number of considered diffeomorphisms. Randomly picking among a large set of diffeomorphisms results in more PDF permutation variety and makes similarity loss maximization difficult; the difficulty stems from less bias towards a severely limited number of diffeomorphisms. The first-order derivative of the inverse map function can be generalized to a multivariate situation by substituting the first-order derivative with the determinant of the Jacobian matrix (i.e., the matrix of all first-order partial derivatives) [34]. We do not consider multivariate PDFs due to the non-linear scaling issues this caused during experimentation (see Section 5.5).
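The permutation itself is a change of variables: for a diffeomorphism g, the PDF of the transformed variable equals the original PDF weighted by the absolute first-order derivative of the inverse map. Below is a minimal numpy sketch; re-sampling the permuted PDF back onto the original positions via interpolation is an implementation assumption, since the text does not specify the re-sampling scheme.

```python
import numpy as np

def permute_pdf(m, pdf, g, g_inv_prime):
    """Push a sampled 1D PDF through a diffeomorphism g.

    m: positions at which `pdf` was sampled; g_inv_prime: first-order
    derivative of the inverse of g.
    """
    y = g(m)                              # transformed sample positions
    # Change of variables: p_Y(y) = p_Z(g^{-1}(y)) * |(g^{-1})'(y)|;
    # since y = g(m), p_Z(g^{-1}(y)) is simply `pdf` evaluated at m.
    pdf_y = pdf * np.abs(g_inv_prime(y))
    order = np.argsort(y)
    return np.interp(m, y[order], pdf_y[order])

# Example with one of the seven diffeomorphisms, scaling by scalar 2:
# g(v) = 2v, g^{-1}(y) = y/2, (g^{-1})'(y) = 1/2.
# permuted = permute_pdf(m, pdf, lambda v: 2 * v,
#                        lambda y: np.full_like(y, 0.5))
```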
The PDF estimation layer contains a bank (i.e., a tensor persistent across mini-batches and epochs) made of PDFs from previously encountered mini-batches. In the first two matrix indices, the PDFs and permuted PDFs of the currently considered mini-batch are stored. Prior to copying the PDFs into the first matrix index, the previous PDFs are assigned to a random other matrix index in the bank. The bank capacity C denotes the maximum number of other matrix indices. R is the PDF sampling resolution (i.e., the number of evenly spread PDF samples in the range between the zeroth and fourth distribution quantiles). Prior to training, the entire bank is randomly initialized. During training, the bank is fed to the bilinear similarity layer. The PDF estimation layer uses a bank capacity of 40, a smoothing (bandwidth) parameter of 0.335, and a PDF sampling resolution of 100. The bank and the square matrices per feature in F in the similarity layer are initialized with random numbers drawn from a variance-scaled uniform distribution with a scale factor of 1/3 and 'fan_out' mode. The similarity discriminator hyperparameters were determined via manual hyperparameter tuning.
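A minimal sketch of the bank bookkeeping described above follows; the exact index layout (current PDFs at index 0, permuted PDFs at index 1, previous PDFs at the remaining C indices) and the uniform random initialization are assumptions consistent with, but not spelled out by, the text.

```python
import numpy as np

class PDFBank:
    """Persistent PDF bank across mini-batches and epochs (sketch).

    Assumed layout: index 0 holds the current mini-batch PDFs, index 1
    the permuted PDFs, and indices 2..C+1 hold PDFs of previous
    mini-batches (C is the bank capacity, R the sampling resolution).
    """
    def __init__(self, n_features, capacity=40, resolution=100, seed=0):
        self.rng = np.random.default_rng(seed)
        # The entire bank is randomly initialized prior to training.
        self.bank = self.rng.uniform(
            size=(capacity + 2, n_features, resolution))

    def update(self, pdfs, permuted_pdfs):
        # Move the previous mini-batch's PDFs to a random "other" index
        # before overwriting the first two indices.
        slot = self.rng.integers(2, self.bank.shape[0])
        self.bank[slot] = self.bank[0]
        self.bank[0], self.bank[1] = pdfs, permuted_pdfs
        return self.bank   # fed to the bilinear similarity layer
```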
The bilinear similarity layer, per feature in F, learns a similarity function of the form s(p, q) = pᵀWq, where W is a square matrix. Its output for every feature in F contains one similarity scalar for the permuted PDF and C similarity scalars for the banked PDFs.
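As a sketch, per feature the bilinear similarity against the permuted and banked PDFs can be computed as follows; stacking the permuted PDF and the C banked PDFs into one comparison set is an illustrative assumption.

```python
import numpy as np

def bilinear_similarity(p, comparisons, W):
    """Bilinear similarity between one feature's current PDF and others.

    p: (R,) current PDF samples; comparisons: (C + 1, R) array stacking
    the permuted PDF and the C banked PDFs; W: (R, R) learned square
    matrix. Returns C + 1 similarity scalars s_i = q_iᵀ W p.
    """
    return comparisons @ (W @ p)
```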