1. Introduction
In the digital era, human beings require an advanced level of computer intelligence [1]. Human–computer interaction (HCI) is no longer restricted to the original hardware-related interaction. Smarter interaction techniques have gradually entered people's lives, namely a family of highly intelligent techniques based on voice recognition, face recognition, and gesture recognition [2]. Intelligent mechanisms help establish interaction between computers and humans, and the emergence of such highly suitable interaction approaches has become a major advancement trend in the present HCI domain. The main objective of HCI advancement is to make effective computers that adapt to and serve the requirements of humans [3], i.e., people-centered computing rather than forcing individuals to adapt to the computer. Gaining HCI data allows very efficient learning and the creation of smarter systems [4]. Machine learning (ML) is a significant branch of artificial intelligence (AI); it performs well in several domains and illustrates powerful research and development (R&D) potential.
With the advent of ML technology in HCI, machines have become very intelligent. Among researchers in both industry and academia aiming to develop ubiquitous computing, the most broadly discussed research concept regarding HCI has been human activity recognition (HAR) [5,6]. Recently, the amount of research on HAR has increased rapidly due to the extensive availability of sensors, improvements in power utilization, reductions in cost, and the resulting technological advances in ML approaches; Internet of Things (IoT) and AI data can now be live streamed [7]. The growth in HAR has facilitated practical implementations in several real-world domains, including the medical sector, tactical military applications, the detection of crime and violence, and sports science [8]. The extensive range of conditions to which HAR is applicable presents evidence that the domain holds powerful capabilities for enhancing living standards [9].
Mathematical methods applied to human activity data enable the recognition of a diversity of human activities, for instance, walking, running, sitting, standing, and sleeping. HAR mechanisms fall into two major groups: sensor-related systems and video-related systems. Time-series classification is a major difficulty in HAR, i.e., whenever the movements of individuals are forecast with the help of sensory data [10]. These tasks usually require accurately extracting features from raw data using signal processing approaches and deep field expertise to fit one of the ML models. Recent research has demonstrated the capability of deep learning (DL) techniques, including long short-term memory (LSTM) neural networks and convolutional neural networks (CNNs), to automatically extract meaningful attributes from raw sensor data and attain the most advanced outcomes [11,12].
This paper presents a new quantum water strider algorithm with a hybrid-deep-learning-based activity recognition (QWSA-HDLAR) model for HCI. The proposed QWSA-HDLAR technique employs a deep-transfer-learning-based neural-architecture-search network (NASNet) feature extractor to generate feature vectors. In addition, the presented QWSA-HDLAR model exploits a QWSA-based hyperparameter tuning process for the NASNet model. Finally, the classification of human activities is carried out using a hybrid convolutional neural network with a bidirectional recurrent neural network (HCNN-BiRNN) model. The experimental validation of the QWSA-HDLAR model is performed using two datasets, namely the KTH and UCF Sports datasets. In short, the major contributions are listed as follows:
An automated QWSA-HDLAR technique encompassing NASNet-based feature extraction, QWSA-based hyperparameter tuning, and HCNN-BiRNN-based classification is presented for the identification and classification of human activities in HCI. To the best of our knowledge, the presented QWSA-HDLAR technique does not exist in the literature.
The QWSA-based NASNet model is employed to extract feature vectors, where the QWSA helps accomplish enhanced classification results through the hyperparameter tuning process.
The performance of the QWSA-HDLAR technique is validated using two datasets, namely the KTH and UCF Sports datasets.
2. Related Works
In [13], a novel technique was devised for action recognition based on the fusion of DL features and shape features. A two-step approach is performed, from human extraction to action recognition. In the initial step, humans are extracted through a simple learning process, during which HOG features are derived from selected datasets. After choosing the most powerful features by means of entropy-controlled feature selection, linear support vector machine (LSVM) maximization and detection are executed. Secondly, geometric features are derived from the detected areas, and parallel DL features are derived from the original video frames. The obtained feature vector is classified through a cubic multiclass SVM. Jaoued et al. [14] recommend a new technique for HAR depending upon a hybrid DL method. The devised method is assessed on the challenging UCF101, KTH, and UCF Sports datasets.
Zheng et al. [15] examine the impact of segmentation techniques on DL method performance and compare four data transformation methods. The multichannel technique, which includes three overlapped color channels, generated the optimum performance. Additionally, the multichannel method was applied to three public datasets and generated satisfying outcomes for multisource acceleration data. Tanberk et al. [16] devise a hybrid deep method for understanding and interpreting videos, aiming at HAR. The devised architecture was built by combining dense auxiliary movement information and an optical flow approach on video datasets with the help of DL approaches. To the best of their knowledge, it was the first research on a new combination of an LSTM fed by auxiliary data and a 3D-CNN fed by optical flow on video frames for HAR.
Abdulazeem et al. [17] devise a structure with three main stages for HAR: preprocessing, pretraining, and recognition. This structure provides a set of new methods that are three-fold as follows: first, during the pretraining stage, a standard CNN is trained on a generic dataset to adjust the weights; next, this pretrained method is applied to the target dataset to perform the recognition procedure; and finally, the recognition stage exploits CNN and LSTM to apply five distinct architectures. Ronald et al. [18] devise iSPLInception, a DL method motivated by the Inception-ResNet architecture from Google, which not only attains high prediction accuracy but also utilizes few device resources. The researchers in [19] devise the late fusion of a HAR classifier and visual recognition. Vision is utilized to recognize the several screws assembled in a mock part, while HAR from body-worn inertial measurement units (IMUs) categorizes the actions performed while assembling the parts. CNN techniques are utilized in both classifier modes before several late fusion approaches are examined to estimate a concluding state.
3. The Proposed Model
In this paper, a novel QWSA-HDLAR model is developed for the recognition of human activities in the HCI environment. The proposed QWSA-HDLAR technique initially applies a NASNet model to derive a collection of feature vectors. Additionally, the presented QWSA-HDLAR model utilizes a QWSA-based hyperparameter tuning process to optimally choose the hyperparameter values for the NASNet model. Finally, the classification of human activities is carried out using the HCNN-BiRNN model.
3.1. Feature Extraction: NASNet Model
Primarily, the proposed QWSA-HDLAR technique exploits the NASNet model to derive a collection of feature vectors. Transfer learning from a pretrained network is one of the most influential and popular techniques for handling smaller datasets. A pretrained network is a network previously trained on a massive dataset, generally on an image-classification task, after which the architecture and weights are retained. If the primary dataset is sufficiently large and general, the feature set learned by the pretrained network can serve as a generic visual model and thus assist various computer vision tasks, even when the new task involves completely different classes from the primary task [20]. Transfer learning from a pretrained network is exploited in two ways: feature extraction and fine-tuning. Feature extraction uses the convolution base of the pretrained network to extract features from the new dataset and then trains a new classifier on top of the output.
Fine-tuning complements the feature extraction model: it involves unfreezing the final layers of the frozen convolution base used for feature extraction and then retraining the unfrozen layers together with the new classifier previously learned during feature extraction. Fine-tuning aims to adapt the pretrained model's most abstract features to make them more relevant to the new task. The following steps are involved in this study:
A pretrained NASNet is considered, and the classification base is detached.
The convolution base of pretrained models is frozen.
A new CNN-BiRNN classifier is added and trained on top of the convolution base of the pretrained network.
The top C layers of the convolution base of the pretrained network are unfrozen.
Finally, the unfrozen layers and the new classifier are trained together.
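For illustration, the feature-extraction pattern in the steps above can be sketched with a frozen base and a trainable head. The random-projection "base" and the toy two-class data below are hypothetical stand-ins for the pretrained NASNet convolution base and a real dataset; they illustrate the pattern only, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" base stand-in: a frozen random projection that is never
# updated, playing the role of the frozen NASNet convolution base.
W_base = rng.normal(size=(20, 8))

def extract_features(X):
    """Feature extraction: pass inputs through the frozen base."""
    return np.tanh(X @ W_base)

def train_head(X, y, epochs=200, lr=0.5):
    """Train a new logistic-regression head on top of the frozen base."""
    F = extract_features(X)                      # features from frozen base
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid prediction
        w -= lr * (F.T @ (p - y) / len(y))       # only the head is updated
        b -= lr * np.mean(p - y)
    return w, b

# Toy binary task: two Gaussian blobs standing in for two activity classes.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 20)),
               rng.normal(+1.0, 1.0, size=(50, 20))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_head(X, y)
acc = np.mean(((extract_features(X) @ w + b) > 0) == y)
```

Fine-tuning would differ only in that the top layers of `W_base` would also receive gradient updates after the head has converged.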
Equipped with engineering expertise and a large amount of computational power, Google launched NASNet [21] and cast the problem of searching for an optimal CNN model as a reinforcement learning (RL) problem. RL is a type of ML approach that allows an agent to discern the best action in virtual environments to attain a goal using feedback from its own experiences and actions. Furthermore, the concept was to search for the optimal grouping of parameters for a provided number of layers, searching over filter sizes, strides, output channels, etc. In the RL setting, the reward after every search action was the accuracy of the searched model on the provided datasets. In NASNet, only the general framework is predetermined; the cells or blocks are not designed by the researchers but are instead discovered through the RL search technique. The structure of the NASNet model is shown in Figure 1.
Furthermore, the number of early convolution filters and the number of motif repetitions N are free parameters utilized for scaling. The cells are named reduction and normal cells: a reduction cell is a convolution cell that returns a feature map whose width and height are reduced by a factor of 2, and a normal cell is a convolution cell that returns a feature map of the same dimensions. NASNet achieved advanced results in the ImageNet competition, but the computational power it requires far exceeds what a small company capable of using only common methodologies could provide.
3.2. Hyperparameter Tuning: QWSA Model
In this study, the QWSA-based hyperparameter tuning process optimally chooses the hyperparameter values of the NASNet model. Although the WSA performs well on the majority of problems, it occasionally becomes trapped in local optima and converges prematurely [22]. Here, the concept of quantum computing is taken into account. In quantum space, the locations of the male and female WSs cannot be determined exactly; therefore, the location of a WS must be described by the wave function $\psi(x, t)$, whose squared modulus determines the location of the WSs. This implies that the square of the modulus indicates the likelihood density of a WS appearing at location $x$ in space, and it can be expressed as follows [23]:

$$|\psi(x, t)|^2 \, dx = Q \, dx \quad (1)$$

In Equation (1), $Q$ defines a likelihood density function which fulfills the normalized condition:

$$\int_{-\infty}^{+\infty} Q \, dx = 1 \quad (2)$$

The location of a WS is then sampled using the Monte Carlo method, and its update formula is provided as follows:

$$x_i(t+1) = p_i(t) \pm \beta \, \big| mbest(t) - x_i(t) \big| \ln(1/u), \quad i = 1, 2, \ldots, N$$
$$p_i(t) = \varphi \, pbest_i(t) + (1 - \varphi) \, gbest(t)$$
$$mbest(t) = \frac{1}{N} \sum_{i=1}^{N} pbest_i(t)$$

From the expressions, $N$ defines the population size; $u$ and $\varphi$ signify arbitrary numbers lying within $[0, 1]$; $p_i(t)$ determines the local attraction point of the $i$-th WS at the $t$-th iteration, which defines the location of every WS as an arbitrary location between the global best and the individual best locations; $|mbest(t) - x_i(t)|$ signifies the weighted distance between the candidate and the mean optimum location of the population; $mbest(t)$ defines the mean value of the individual optimum locations of the WSs; $t$ determines the iteration count; $\beta$ denotes the shrinkage–expansion coefficient, which is exploited to control the individual convergence rate and, in many instances, is decreased gradually during the search; and $x_i(t)$ and $pbest_i(t)$ define the candidate and the individual optimum location, respectively. The flowchart of WSA is shown in Figure 2.
The following updating is implemented in the mating phase of the original WSA:

$$x_i(t+1) = \begin{cases} x_i(t+1), & \text{if } f\big(x_i(t+1)\big) < f\big(x_i(t)\big) \\ x_i(t), & \text{otherwise} \end{cases}$$

In these conditions, the new location is retained if it has a better outcome when compared to the previous location; otherwise, the previous location is kept, as explained in Algorithm 1.
Algorithm 1: Pseudocode of WSA
Inputs: The population size nws, the number of territories nt, and the maximal number of iterations MaxCycle
Outputs: The richest position of the WS and the objective value
Initialize the population randomly
Evaluate the fitness values of the WSs
While (the ending criterion is not satisfied) do
  Establish nt territories and assign the WSs to them
  For (each territory) do
    The male keystone sends mating ripples, and the designated female decides whether to respond with repulsive or attractive signals
    Update the location of the keystone based on the response of the female
    Calculate the new location for finding food to compensate for the energy consumed during mating
    If (the keystone cannot find food) then
      Forage for food resources and approach a food-rich territory
      If (the keystone cannot find food again) then
        The hungry keystone dies of starvation or is killed by the resident keystone of the new territory
        A matured larva replaces the killed keystone as its successor
      End if
    End if
  End for
End while
Return WS_optimal
The QWSA method derives a fitness function to achieve enhanced classifier performance, where a positive value indicates the superior outcome of a candidate solution. In this article, the reduction of the classifier error rate is regarded as the fitness function, as provided in Equation (9). The optimal solution has a minimal error rate, and poor solutions attain higher error rates:

$$fitness(x_i) = \text{Classifier Error Rate}(x_i) = \frac{\text{number of misclassified samples}}{\text{total number of samples}} \times 100 \quad (9)$$
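As a hedged illustration, the quantum-behaved (Monte Carlo) position update and the error-rate fitness of Equation (9) can be sketched as follows. The two-dimensional search space and the surrogate fitness surface are illustrative assumptions standing in for the actual NASNet hyperparameter space and classifier evaluation.

```python
import numpy as np

rng = np.random.default_rng(1)

def error_rate_fitness(y_true, y_pred):
    """Equation (9): classification error rate in percent; lower is better."""
    return np.mean(y_true != y_pred) * 100.0

def quantum_update(positions, pbest, gbest, beta=0.75):
    """Quantum-behaved Monte Carlo update: each candidate is resampled
    around a local attractor p lying between its personal best and the
    global best, with spread beta * |mbest - x| * ln(1/u)."""
    n, d = positions.shape
    mbest = pbest.mean(axis=0)                    # mean personal-best location
    phi = rng.random((n, d))
    p = phi * pbest + (1.0 - phi) * gbest         # local attractor per WS
    u = rng.random((n, d))
    sign = np.where(rng.random((n, d)) < 0.5, -1.0, 1.0)
    return p + sign * beta * np.abs(mbest - positions) * np.log(1.0 / u)

# Hypothetical surrogate "error rate" over two hyperparameters.
def surrogate_error(x):
    return np.sum((x - np.array([0.3, 0.7])) ** 2)

pos = rng.random((10, 2))
pbest = pos.copy()
pbest_f = np.array([surrogate_error(x) for x in pos])
for _ in range(100):
    gbest = pbest[np.argmin(pbest_f)]
    pos = quantum_update(pos, pbest, gbest)
    f = np.array([surrogate_error(x) for x in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
best = pbest[np.argmin(pbest_f)]
```

In the full model, `surrogate_error` would be replaced by training the NASNet model with the candidate hyperparameters and returning its validation error rate.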
3.3. Activity Recognition: HCNN-BiRNN Model
At the final stage, the classification of human activities is carried out using the HCNN-BiRNN model. The CNN-BiRNN hybrid model contains two major mechanisms: a BiRNN with an attention model [24] on the top half and a CNN with $L$ convolution layers on the bottom half. These components are jointly trained in an end-to-end manner. One sample $X$ is a real-valued vector of length $W$ ($W$ refers to the range dimension length); therefore, a 1D-CNN module is applied, in which the convolutional process occurs along the range dimension. For the initial layer, convolutional operations with stride length 1 apply $n_1$ filters of size $k_1 \times 1$ to $X$, resulting in the feature maps of layer 1, $C^{(1)}$. For the next $L - 1$ convolutional layers, convolution-pooling operations repetitively apply $n_l$ filters of size $k_l \times 1$ to $C^{(l-1)}$, and the feature map $C^{(L)}$ is obtained after the $L$-th convolution layer.
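The stacked convolution-pooling stage can be illustrated with a plain numpy sketch; the input length, filter counts, and kernel widths below are arbitrary illustrative choices, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def conv1d(x, filters):
    """Valid 1-D convolution with stride 1 followed by ReLU.
    x has shape (T, c_in); filters has shape (k, c_in, c_out)."""
    k, c_in, c_out = filters.shape
    T = x.shape[0] - k + 1
    out = np.empty((T, c_out))
    for t in range(T):
        out[t] = np.tensordot(x[t:t + k], filters, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

def max_pool1d(x, size=2):
    """Non-overlapping max pooling along the time dimension."""
    T = (x.shape[0] // size) * size
    return x[:T].reshape(-1, size, x.shape[1]).max(axis=1)

# One input sample of range-dimension length W = 64, single channel.
X = rng.normal(size=(64, 1))

# Layer 1: n1 = 8 filters of width k1 = 5, then pooling.
C1 = max_pool1d(conv1d(X, rng.normal(size=(5, 1, 8))))
# Layer 2: n2 = 16 filters of width k2 = 3, then pooling.
C2 = max_pool1d(conv1d(C1, rng.normal(size=(3, 8, 16))))
```

Each pooling halves the time dimension, which is exactly the shortening effect described in the text.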
Every convolution layer is followed by a pooling layer; therefore, the time dimension is shortened, and the temporal dependency increases as the number of convolution layers grows. After dropping the singleton dimension, $C^{(L)}$ is regarded as a series of length $T$ with an $n_L$-dimensional feature vector at every time step; for notational convenience, we denote the feature vector at the $t$-th time step by $x_t$. The forward recurrent neural network (RNN) reads $(x_1, \ldots, x_T)$ in its original order and produces a hidden state $\overrightarrow{h}_t$ at every time step, and the backward RNN reads the sequence in reverse order and generates $\overleftarrow{h}_t$, as follows:

$$\overrightarrow{h}_t = \sigma\big(W_{\overrightarrow{x}} x_t + W_{\overrightarrow{h}} \overrightarrow{h}_{t-1}\big)$$
$$\overleftarrow{h}_t = \sigma\big(W_{\overleftarrow{x}} x_t + W_{\overleftarrow{h}} \overleftarrow{h}_{t+1}\big)$$

From the expressions, $W_{\overrightarrow{x}}$ and $W_{\overleftarrow{x}}$ refer to the input-hidden weights, $W_{\overrightarrow{h}}$ and $W_{\overleftarrow{h}}$ denote the weights that connect the hidden layers, $d$ denotes the dimensionality of the hidden state, and $\sigma$ indicates the sigmoid function. Next, the concatenation of the forward and backward states creates $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. Consequently, every hidden state comprises data of the entire sequence, with stronger emphasis on the parts near the $t$-th step. In a BiRNN, the data at distant time steps may slowly be lost along the backward and forward propagation.
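A minimal numpy sketch of the forward and backward passes follows; the weight shapes, the omission of bias terms, and the toy dimensions are simplifying assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def birnn(X, Wxf, Whf, Wxb, Whb):
    """Bidirectional RNN: X has shape (T, n). Returns (T, 2d) hidden
    states, concatenating the forward and backward directions."""
    T = X.shape[0]
    d = Whf.shape[0]
    hf = np.zeros((T, d))              # forward states, original order
    hb = np.zeros((T, d))              # backward states, reverse order
    h = np.zeros(d)
    for t in range(T):
        h = sigmoid(X[t] @ Wxf + h @ Whf)
        hf[t] = h
    h = np.zeros(d)
    for t in reversed(range(T)):
        h = sigmoid(X[t] @ Wxb + h @ Whb)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)

T, n, d = 14, 16, 32                   # series length, feature size, hidden size
X = rng.normal(size=(T, n))
H = birnn(X,
          rng.normal(size=(n, d)) * 0.1, rng.normal(size=(d, d)) * 0.1,
          rng.normal(size=(n, d)) * 0.1, rng.normal(size=(d, d)) * 0.1)
```

Each row of `H` is the concatenated state $h_t$, so every time step carries a summary of both the past and the future of the sequence.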
To prevent this data loss and automatically focus on the discriminative time steps, an attention module is proposed, which, as a byproduct, is capable of relaxing the misalignment problem. In this technique, we adopt a multilayer perceptron to calculate the attention weight depending on the hidden state and define $c$ as an invariant feature vector, i.e., the weighted sum of the hidden states:

$$c = \sum_{t=1}^{T} \alpha_t h_t$$

The weight $\alpha_t$ is calculated using the following equation:

$$\alpha_t = \frac{\exp\big(v^{\top} \tanh(W_a h_t)\big)}{\sum_{t'=1}^{T} \exp\big(v^{\top} \tanh(W_a h_{t'})\big)}$$

Let $\{W_a, v\}$ be the parameters of the attention module; the weight $\alpha_t$ stands for the coefficient which scores the matching degree between the recognition task and the $t$-th hidden state. The invariant feature vector $c$ incorporates the data at each time step based on the discrimination of the hidden states. Given the invariant feature vector $c$, we adopt the softmax function to predict the label vector of the input sample $X$, as follows:

$$\hat{y} = \mathrm{softmax}(W_s c + b_s) \quad (15)$$

In Equation (15), $K$ denotes the class count, $\hat{y}_k$ represents the probability that $X$ belongs to the $k$-th class, and $\{W_s, b_s\}$ indicates the variables of the softmax classifier.
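For illustration, the attention pooling and the softmax prediction can be sketched as follows; the tanh-MLP scoring function and all dimensions here are assumptions for demonstration rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def attention_pool(H, Wa, v):
    """Score each hidden state with a one-layer tanh MLP, normalise the
    scores into weights, and return the weighted-sum feature vector c."""
    scores = np.tanh(H @ Wa) @ v       # one score per time step, shape (T,)
    alpha = softmax(scores)            # attention weights, sum to 1
    return alpha @ H, alpha            # invariant feature vector and weights

T, h, K = 14, 64, 6                    # time steps, hidden dim, classes
H = rng.normal(size=(T, h))            # BiRNN hidden states h_1..h_T
c, alpha = attention_pool(H,
                          rng.normal(size=(h, h)) * 0.1,
                          rng.normal(size=h))
y_hat = softmax(rng.normal(size=(K, h)) @ c)   # class-probability vector
```

The vector `alpha` shows which time steps the model attends to, and `y_hat` is the probability distribution over the activity classes.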
4. Performance Validation
The experimental result analysis of the QWSA-HDLAR model is performed using two datasets, namely the KTH dataset [25] and the UCF Sports dataset [26]. The KTH dataset includes 600 samples with six class labels, as given in Table 1. The UCF Sports dataset contains 1000 samples with ten class labels, as provided in Table 2.
Figure 1 demonstrates the confusion matrices produced by the QWSA-HDLAR model. With the entire dataset, the QWSA-HDLAR model recognized 99 samples under class 1, 97 samples under class 2, 95 samples under class 3, 99 samples under class 4, 98 samples under class 5, and 100 samples under class 6. Similarly, with 70% of the training (TR) dataset, the QWSA-HDLAR model identified 74 samples under class 1, 74 samples under class 2, 63 samples under class 3, 67 samples under class 4, 69 samples under class 5, and 63 samples under class 6.
Table 3 illustrates the overall HAR outcomes of the QWSA-HDLAR model on the test KTH dataset. The experimental output demonstrates that the QWSA-HDLAR model shows enhanced performance on all datasets. For instance, on the entire dataset, the QWSA-HDLAR model obtained an average accuracy of 99.33%, sensitivity of 98.00%, specificity of 99.60%, F-score of 97.99%, and an area under the receiver operating characteristic curve (AUROC) score of 98.80%. Eventually, with 70% of the TR dataset, the QWSA-HDLAR model attained an average accuracy of 99.21%, sensitivity of 97.63%, specificity of 99.52%, F-score of 97.62%, and an AUROC score of 98.58%. Meanwhile, on 30% of the testing (TS) dataset, the QWSA-HDLAR model reached an average accuracy of 99.63%, sensitivity of 98.80%, specificity of 99.78%, F-score of 98.83%, and an AUROC score of 99.29%, as illustrated in Figure 3.
The training and validation accuracies depicted by the QWSA-HDLAR technique over distinct epochs on the KTH dataset are demonstrated in Figure 4. The results confirm that the accuracies increase with the number of epochs. Additionally, the training accuracy appears to be superior to the testing accuracy.
The training and validation losses attained by the QWSA-HDLAR method on the test KTH dataset are reported in Figure 5. The figure shows that the QWSA-HDLAR technique results in lower training and validation loss values.
To demonstrate the enhanced performance of the QWSA-HDLAR model, a comparison study with existing methods [25,26,27] is performed on the KTH dataset in Table 4. The results imply that the gated recurrent neural network (GRNN) model shows poor performance with a lower accuracy of 85.85%, whereas the Gaussian mixture model with Kalman filter (GMM-KF) model attains a slightly enhanced accuracy of 90.47%. This is followed by the support vector machine with 3DCNN (SVM-3DCNN) and the CNN with convolutional autoencoder (CNN-CAE) models, which obtained improved accuracy values of 90.45% and 92.80%, respectively. Though the GMM-KFGRNN and SDL-HBC models resulted in reasonable accuracy values of 95.52% and 99.38%, the QWSA-HDLAR model gained a higher accuracy of 99.63%.
Figure 6 presents the confusion matrices produced by the QWSA-HDLAR approach on the UCF Sports dataset. The figure implies that the QWSA-HDLAR technique proficiently identifies all ten class labels on the applied data.
Table 5 exemplifies the overall HAR outcomes of the QWSA-HDLAR technique on the test UCF Sports dataset. The experimental output illustrates that the QWSA-HDLAR approach shows enhanced performance on all datasets. For example, on the entire dataset, the QWSA-HDLAR algorithm achieved an average accuracy of 99.06%, sensitivity of 95.30%, specificity of 99.48%, F-score of 95.30%, and an AUROC score of 97.39%. With 70% of the TR dataset, the QWSA-HDLAR approach reached an average accuracy of 99.06%, sensitivity of 95.40%, specificity of 99.48%, F-score of 95.28%, and an AUROC score of 97.44%. At the same time, on 30% of the TS dataset, the QWSA-HDLAR methodology reached an average accuracy of 99.07%, sensitivity of 95.26%, specificity of 99.48%, F-score of 95.13%, and an AUROC score of 97.37%.
The training and validation accuracies depicted by the QWSA-HDLAR methodology over distinct epochs on the UCF Sports dataset are demonstrated in Figure 7. The results confirm that the accuracies increase with the number of epochs. In addition, the training accuracy appears to be better than the testing accuracy.
The training and validation losses inferred by the QWSA-HDLAR technique on the test UCF Sports dataset are reported in Figure 8. The figure shows that the QWSA-HDLAR algorithm results in lower training and validation loss values.
To demonstrate the enhanced performance of the QWSA-HDLAR technique, a comparison study with recent methodologies [28,29] is performed on the UCF Sports dataset in Figure 9. The results imply that the AR-DT and LTP-HAR approaches show poor performance with lower accuracy values of 78.21% and 78.84%, respectively. This is followed by the average two-stream CNN and GMM-KFGRNN algorithms, which reached improved accuracy values of 88.30% and 88.51%, respectively. Though the DTR-DNN and GS-LOF techniques resulted in reasonable accuracy values of 95.83% and 95.54%, the QWSA-HDLAR method obtained a higher accuracy of 99.07%. The detailed results and discussion show that the proposed model exhibits effectual performance on HAR over the other models.