Sensors
  • Article
  • Open Access

4 July 2024

A Robust Deep Feature Extraction Method for Human Activity Recognition Using a Wavelet Based Spectral Visualisation Technique

1 Department of Computer Science and Engineering, University of Asia Pacific, Dhaka 1216, Bangladesh
2 Department of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu 965-8580, Japan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Human Activity Recognition Using Sensors and Machine Learning: 2nd Edition

Abstract

Human Activity Recognition (HAR), alongside Ambient Assisted Living (AAL), is an integral component of smart homes, sports, surveillance, and investigation activities. To recognize daily activities, researchers are focusing on lightweight, cost-effective, wearable sensor-based technologies, as traditional vision-based technologies compromise the privacy of the elderly, a fundamental right of every human. However, it is challenging to extract potential features from 1D multi-sensor data. Thus, this research focuses on extracting distinguishable patterns and deep features from spectral images by time-frequency-domain analysis of 1D multi-sensor data. Wearable sensor data, particularly accelerometer and gyroscope data, act as input signals of different daily activities, and provide potential information using time-frequency analysis. This potential time series information is mapped into spectral images called ‘scalograms’, derived from the continuous wavelet transform. The deep activity features are extracted from the activity image using deep learning models such as CNN, MobileNetV3, ResNet, and GoogleNet and subsequently classified using a conventional classifier. To validate the proposed model, the SisFall and PAMAP2 benchmark datasets are used. Based on the experimental results, the proposed model achieves its best performance for activity recognition, obtaining an accuracy of 98.4% for SisFall and 98.1% for PAMAP2, using Morlet as the mother wavelet with ResNet-101 and a softmax classifier, and outperforms state-of-the-art algorithms.

1. Introduction

The use of human activity recognition (HAR) is expanding rapidly. Identifying regular human activity, and abrupt changes in that activity, in real time can provide important information about the targeted concerns. Thus, HAR techniques are playing a significant role in elderly care, investigation activities, healthcare, sports, smart homes, surveillance activities, context-aware computing, athletics, and so on []. In recent research, artificial intelligence and robots are used to perceive the world around them, and to interact with humans and the environment. Therefore, HAR has become an integral part of robotics applications that aim to interact with humans. HAR is often regarded as a challenging task because humans perform numerous stationary and non-stationary activities in different ways, and these activities are often organized into several sub-activities that together define complex activities []. HAR data can generally be gathered in two ways, namely, sensor-based and image-video-based, where each has its own merits and demerits.
While a suitably placed camera can continuously collect accurate data, in many complex scenarios, data collection may be influenced by motion blur, viewing angle, objects in the way, and variations in illumination []. Vision-based activity classification also involves image processing technologies. However, there are significant problems with utilizing images to distinguish distinct activities, as this can infringe on consumer privacy, which is a major concern in several applications. For instance, simply setting up cameras to watch the actions of senior adults living in care facilities leads to major privacy difficulties. Researchers support the use of sensors over other methods because of their several advantages []. They are lightweight, easily portable, mountable at various locations, have low power consumption, and preserve privacy. Since the sensor placements are not fixed [], the sensors may produce changing data. We have consequently concluded that sensor data may be misleading in the absence of a very efficient classifier. Sensor-based technology relies on wearable electronics worn on the human body. Sensors on wearable technology can generate data depending on the direction of movement of the user. Accelerometers and gyroscopes, which are widely employed for activity identification, make up the majority of inertial sensors. Their integration is visible in wearable technology, including fitness trackers, Fitbits, smartphones, smartwatches, and more [].
Data from sensors are produced in the form of a chronological time series. Many researchers have used time-domain and, particularly, frequency-domain analysis to extract the intrinsic information for activity recognition. This information can generate relevant features that can be fed to machine learning and deep learning models to recognize the activity [,,,,]. Research outcomes indicate that recognition accuracy is quite good. However, time-domain analysis offers fine resolution in time only, with no frequency information available at a given moment.
The opposite is also true in the case of frequency-domain analysis []. Thus, each analysis on its own lacks part of the information needed when extracting the relevant statistics for activity recognition. Alternatively, time-frequency-domain analysis offers a much better combination of time- and frequency-domain features from the signal, supporting optimal performance in many applications. Some widely used representative time-frequency-domain methods include the short-time Fourier transform (STFT), the Wigner–Ville distribution, and the wavelet transform. The STFT shows good time resolution but poor frequency resolution []. The Wigner–Ville distribution is a time-frequency representation that provides the best possible time and frequency resolution []. However, it is also very sensitive to noise and can produce interference terms []. The wavelet transform is a more sophisticated time-frequency representation, which uses a set of basis functions called wavelets to decompose the signal into different frequency bands. Wavelets provide good time and frequency resolution simultaneously [], allowing for a more flexible and adaptive representation that captures both time and frequency information effectively. The key characteristic that sets the wavelet transform apart is its ability to localize in both domains, allowing for a more precise analysis of signal components with varying frequencies at different time intervals. It helps to analyze and process the nonlinearities in the signals that are often encountered in the real world. Signal data are transformed into image representations, such as spectrograms or scalograms, rather than relying on real images, because of the significant advantages this offers to deep learning models for activity recognition, primarily the models’ efficiency in feature extraction.
This approach proves particularly advantageous given the inherently temporal nature of sensor signals. Converting these sequential signals into images captures both temporal and frequency information effectively. This streamlined representation facilitates a more efficient process of feature extraction by deep learning models, enabling them to discern intricate temporal patterns and variations in human activities more proficiently. A noteworthy reference supporting this approach is the work of Ronao et al. [], which explores the widespread use of time-frequency representations for translating sensor signals into images. This underscores the efficacy of this method in capturing nuanced temporal patterns crucial for accurate activity recognition. In essence, utilizing image representations aligns with the strengths of deep learning, optimizing the model’s ability to extract meaningful features from the temporal dynamics of sensor data. Various methods have been devised, including the wavelet kurtogram, spectrogram, scalogram, and several other techniques rooted in the principles of the wavelet transform []. The most commonly used technique in time-frequency analysis is the wavelet kurtogram. It is derived by applying the wavelet transform to the signal and then computing the kurtosis at each scale to reveal the non-Gaussian behavior at different scales, helping identify such regions in the signal. Wavelet kurtograms can be used for HAR and are especially effective in detecting sharp changes in the signal. However, this sensitivity to quick shifts, which do not always mean that someone is performing a specific activity, prevents them from achieving good results for HAR. Alternative methodologies have been developed to overcome this limitation.
For instance, the spectrogram [] is especially valuable when analyzing signals with consistent features, offering a visual representation of a signal’s spectral content over time. It excels in the examination of stationary signals, where stable frequency components enable a clear depiction of spectral variations. However, HAR signals are typically non-stationary, involving dynamic and changing patterns. In such cases, the scalogram [] method becomes particularly useful for HAR applications. Its strength lies in its adaptability to non-stationary signals, providing a visual representation of time-varying frequency content in a manner that complements the dynamic nature of activities in HAR scenarios and is particularly useful for analyzing signals with non-stationary characteristics. On the other hand, recent research indicates that deep learning-based models exhibit exceptional performance in the realm of 2D space-image analysis, demonstrating a robust capability to classify various activities []. Leveraging this insight, scalograms emerge as a highly effective means to formulate the required spectral images [], comprehensively representing a signal’s time-varying frequency content. This information-rich representation aligns seamlessly with the strengths of deep learning methodologies, which excel in extracting intricate patterns from complex data. HAR is essential for smart technologies such as elderly care, surveillance, and personalized health monitoring. Cameras and sensors are two robust procedures for identifying the different activities. Relying solely on cameras for activity recognition poses challenges and generates privacy concerns. Researchers prefer sensor-based technologies for their portability, low power consumption, and privacy preservation. Wearable devices with accelerometers and gyroscopes provide valuable data for activity identification. 
Traditional time- and frequency-domain analyses of sensor data lack simultaneous information, prompting the adoption of time-frequency-domain methods, like wavelet transforms. However, for efficient feature extraction in deep learning models, transforming sensor data into image representations, such as scalograms, is favored. This approach optimizes the model’s ability to discern intricate temporal patterns. Scalograms, which are particularly effective in HAR scenarios, leverage deep learning’s strengths in analyzing 2D space-image data for accurate activity classification. Keeping this in mind, in this paper, firstly, the 1D data have been formulated and converted to scalogram images by time-frequency-domain analysis and the features extracted by deep learning; then, a conventional classifier is used to identify the human activity for use in the elderly care system. In summary, the key contributions of this research include:
  • This paper proposes a model that converts activity signals into scalogram images and processes the images into a 3-channel representation. Subsequently, the deep features are extracted from activity images generated from the scalogram using deep learning models.
  • An exploration of the optimal combination of mother wavelets for scalogram generation, coupled with deep learning models, was undertaken to enhance accuracy and efficacy in human activity detection.
  • Additionally, a comprehensive experiment was conducted using two renowned datasets focusing on HAR, encompassing a variety of daily activities. This experiment served to validate the proposed model in terms of its efficiency, robustness, and scalability.
In this paper, the model is described and a detailed literature review is provided in Section 2, delving into sensor-based data preferences and advanced techniques like the wavelet and continuous wavelet transform. The proposed model is then introduced in Section 3, leveraging convolutional neural networks (CNNs) such as ResNet, GoogLeNetV3, and AlexNet, emphasizing efficient learning through softmax. The experimental setup and analysis of results are presented in Section 4, covering dataset selection, preprocessing, and hyper-parameter tuning. The model performance metrics are interpreted, highlighting strengths and areas for improvement. In Section 5, the model is benchmarked against the existing literature, showcasing advancements. The paper concludes in Section 6 by synthesizing key findings and discussing future research directions, offering a holistic view of the contributions to HAR.

2. Literature Review

To date, numerous types of research have been conducted on activity recognition. Broadly, activity recognition can be divided into two parts: vision-based and sensor-based. Many researchers have concentrated on images and video, whereas other researchers have concentrated on sensor-based signals. However, most of them are activity class identification tasks where different artificial learning models, such as the Hidden Markov Model, machine learning, and deep learning approaches, are used for activity classification. Below, we first discuss some important research about visual media, followed by discussion of sensor-based signals.

2.1. Vision-Based HAR

In recent investigations into human activity recognition, researchers have explored diverse methodologies to enhance accuracy and address specific challenges faced by vision-based HAR. In a recent study by Hao et al., the researchers observed that color-depth local spatio-temporal features (CoDe4D) are generated from RGB-D videos. They evaluated the performance of CoDe4D features combined with the bag-of-features (BoF) encoding representation. The traditional SVM classifier was used for activity recognition. They applied their method in four datasets, namely, UTK Action3-D, Berkeley MHAD, and ACT4, where the activities were different from each other []. However, in many cases, a convolutional neural network (CNN) was used for activity classification. Koki et al. used ensembled transfer learning to recognize activity from a motionless image. To build feature fusion-based ensembling, four CNN branches were applied, and in each branch, an attention module was employed to gather contextual data from the feature map established by previously trained models. The final recognition output was constructed on three datasets: Stanford 40 actions, BU-101, and Willow-Actions. The recovered feature maps were concatenated and provided to a fully connected network []. Hazar et al. used a CNN model on video sequences of the UCF-ARG dataset. This worked on an offline phase and an inference phase where a scene stabilization step was performed. Classification happened in two phases: identifying the video frames and identifying the whole video arrangements []. In Disha Deotale et al., the authors identified some gaps such as limited accuracy, scalability, and applicability as the classification of a large number of frames creates large delays. To minimize the weakness, a sub-activity stitching model (GSAS) was proposed based on an innovative gated recurrent unit (GRU). 
It worked on two stages: GRU, a sub-activity identification, and a 1D convolutional neural network (CNN), a sub-activity edging on the untrimmed datasets []. Subsequently, Yujia Zhang et al. identified that traditional data augmentation is crucial for human activity classification. According to their study, traditional approaches like data cropping are generally responsible for bad sample generation. For this reason, overall performance is hampered. The authors thus proposed a non-complex operative method known as Siamese architecture and labeled it as a Motion-patch-based Siamese Convolutional Neural Network (MSCNN). They evaluated and experimented with well-known datasets such as UCF-101 and HMDB-51 []. Alongside machine and deep learning, Ivan et al. used a statistical Hidden Markov Model for the activity recognition of human activity. A body poses dictionary was used to obtain the spatial and temporal compositions of atomic human actions. Two complex activity datasets, the MSR-Action3D and the Composable Activities dataset, were used in this research [].

2.2. Sensor-Based HAR

These days, wearable sensors, artificial neural networks, and HAR technologies dramatically influence the quality of life every day. These technologies have attracted the attention of several organizations and researchers who wish to improve all elements of human life. Owing to sensor-based HAR’s increased privacy features and exponential growth, various effective supervised machine learning algorithms have been applied in studies. Sensor-based HAR makes use of a range of sensors, including Bluetooth, accelerometers, gyroscopes, and sound sensors.
Three inertial sensor units on upper and lower body limbs were employed by Attal et al. to research wearable sensor-based human activity recognition. Using the RF method as a wrapper for feature selection, they contrasted supervised (k-NN, SVM, GMM, RF) and unsupervised (k-Means, GMM, HMM) classification approaches []. The results showed that, for everyday activities recorded with three wearable accelerometers, HMM was the superior option among the unsupervised techniques, while k-NN performed best among the supervised techniques. Many supervised machine learning approaches have been employed in HAR with significant results []. Among these, deep learning stands out as being exceptionally effective and exact in pattern recognition, eliminating the need for manual feature construction. Because it can automatically learn relevant and high-level properties through end-to-end neural network training, it is well suited to HAR applications. In [], a deep recurrent neural network (DRNN) is employed to provide a framework for high-throughput human activity recognition from raw accelerometer data. Some recent papers show that converting sensor data into images before training with deep learning can lead to better performance on a variety of tasks, including footstep detection [], human activity recognition [], and fall detection []. The main reason for this improvement is that converting sensor data into images allows the deep learning model to learn spatial features from the data. Spatial features are important for many tasks, such as identifying objects and recognizing activities. Jaegyun Park et al. [] realized the importance of sensor-based human activity recognition using time series signals. The authors identified that traditional systems sample the raw signals with a single predefined interval, making it difficult to reach the desired accuracy because the appropriate sampling interval is not known in advance.
Thus, the authors proposed a novel multi-temporal sampling module that uses multiple sampling intervals simultaneously in the neural network, evaluated on the PAMAP2 dataset [].
Saeedeh Zebhi used two domain analysis techniques, WVT and 2D FFT (two-dimensional fast Fourier transform). After the sensor data were mapped onto images, a CNN was used to classify activities. The UCI HAR, MOTIONSENSE, MHEALTH, and WISDM datasets were used in this investigation []. Sakorn Mekruksavanich et al. used sensor data to tackle complex HAR problems using a deep neural network consisting of convolutional layers and residual networks with a squeeze-and-excite technique. The effectiveness of the model was tested using three publicly available datasets (WISDM-HARB, UT-Smoke, and UT-Complex) []. Wassila Dib presented a HAR study that measured the Received Signal Strength Indicator (RSSI) of several on-body channels using a threshold-based methodology. To identify both static and dynamic behaviors, the authors used typical machine learning techniques combined with unconventional statistical characterization factors [].
Imen et al. introduced CWT-CNN2D, a novel approach integrating the Continuous Wavelet Transform (CWT) with 2D Convolutional Neural Networks (CNNs) for recognizing human activities using IMU data from smartphones and wearables across various datasets. The CWT converts accelerometer and gyroscope signals into image representations, which are then fed into a 2D CNN for classification. Compared to baseline methods, such as RF and CNN-LSTM, CWT-CNN2D achieves an accuracy of 93.9% on the UCI HAR dataset. However, its performance on the PAMAP2 dataset is notably lower, achieving only 74.3% accuracy []. The study by Nadia et al. presents a CNN-MLP hybrid model for sensor-based human activity recognition, leveraging deep learning’s autonomous data insights. Achieving 97.14% accuracy on the UCI HAR dataset, it integrates CNN and MLP layers for effective feature extraction and pattern recognition []. Yasin et al. conducted a study on sensor-based human activity recognition using deep 1D-CNNs, evaluating their approach on two datasets, UCI-HAPT and PAMAP2, achieving accuracy rates of 98% and 90.27%, respectively. Their model leveraged raw accelerometer and gyroscope data, with parameter optimization for refining architecture and hyper-parameters. They highlighted the enhanced performance gained from integrating sensor data compared to using individual sensors alone []. Zebin et al. introduced a deep learning model for sensor-based human activity recognition (S-HAR), leveraging both inertial and stretch sensors. Tested on the w-HAR dataset, their model achieved high accuracy (97.68%), outperforming existing methods. This study highlighted the effectiveness of combining sensor data types and demonstrated strong generalization in activity classification [].

3. Proposed Methodology

The proposed HAR methodology consists of three main processes: data acquisition, time-frequency-domain analysis, and activity classification. Figure 1 illustrates the overall flowchart of the proposed methodology. The first part is the data collection process from a wearable sensor. The second part is time-frequency-domain analysis, which is needed because the dataset comprises time-domain signal data. We adopt a wavelet-based scalogram data analysis model for generating spectral images, enabling the identification of relevant features in the time-frequency domain. The third part is activity detection, which identifies the regular activity using a deep learning model.
Figure 1. Flowchart of the proposed human activity recognition methodology.

3.1. Sensor Data Acquisition

The research employs wearable sensors, particularly accelerometers and gyroscopes for generating activity data. The sensor data utilized in this study are sourced from two well-known publicly available datasets. Both the accelerometer and gyroscope sensors generate signals in three axes (X, Y, Z) for each activity recorded in the datasets. The accelerometer collects data from the body, measuring specific forces in three dimensions, while the gyroscope records rotational signals across the same axes. An IMU (Inertial Measurement Unit) integrates these sensors to report on a body’s specific force, linear acceleration, angular rate, and orientation, encompassing various daily activities. This paper focuses on daily activities characterized by linear acceleration, tilting, and vibration. In the analysis of daily life activities, features related to body twisting and rotational movements are deemed less critical. Therefore, accelerometer data were prioritized for feature extraction to enhance the efficiency of daily activity recognition. The accelerometer’s precise capture of linear acceleration aligns better with targeted movements, such as walking and running. Consequently, a more streamlined data processing pipeline is achieved and power consumption is reduced. The decision to utilize accelerometer data in generating scalogram images stems from a deliberate evaluation process. Gyroscope data encompass information about rotational movements and alterations in orientation. However, thorough analysis of related work revealed that these additional sensor inputs did not significantly enhance the recognition of the specific set of activities under consideration [,]. Figure 2 illustrates the variations in signals for various basic activities by showing three axes of the accelerometer sensor, including walking (Figure 2a), jumping (Figure 2b), running (Figure 2c), and standing (Figure 2d).
Figure 2. Sensor-generated signals for the X, Y, and Z axes from the accelerometer for different activities.

3.2. Time-Frequency Analysis and Scalogram Generation

Time-frequency analysis is a technique used to analyze signals whose frequency content varies over time. It provides a way to understand how the spectral characteristics of a signal change over time, which is crucial for non-stationary signals. A scalogram is a visual representation of this analysis, created by plotting the squared magnitude of wavelet coefficients in a time-frequency plane. This representation highlights the signal’s frequency components as they evolve, enabling detailed examination of complex signals in various fields, such as HAR, engineering, medicine, and finance. Using time-frequency-domain values rather than raw data for HAR offers significant advantages in feature extraction and classification performance. Raw sensor data, consisting of time series data, often fail to capture frequency-related information essential for distinguishing activities. Time-frequency analysis, such as the use of wavelet transforms, enables simultaneous examination of time and frequency characteristics, providing a comprehensive signal representation. This dual-domain analysis reveals detailed patterns and changes within the signal, enhancing the recognition of complex activities. The Fourier Transform (FT) represents a signal in the frequency domain but loses information about the time at which certain frequencies occur. This limitation becomes evident for non-stationary signals, where the physical properties and characteristic spectrum of the system change over time. While the FT spectrum is easily interpreted for stationary systems, the FT alone fails to directly correlate temporal signal modifications with the frequency features of the spectrum.
The FT represents the spectrum integrated over the acquisition time, making it challenging to capture the signal’s time-varying behavior. To overcome this limitation, methods that combine time- and frequency-domain analysis are needed to show the signal’s evolution in both domains. Windowed FT gives both spectral and temporal resolution, and it was among the first algorithms to work in the time-frequency plane. However, because the size of the time-frequency window is controlled by the window function employed, the windowed FT has a fixed time-frequency resolution. A superior option, however, is supplied by the wavelet transform (WT), which gives a time-frequency representation of a signal with variable time-frequency resolution.
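The contrast between the fixed resolution of the windowed FT and the variable resolution of the wavelet transform can be made concrete with a small calculation. The following is a minimal sketch, not part of the original method: the sampling rate, the window length, and the unit-scale time span and bandwidth (`dt_unit`, `df_unit`) are illustrative assumptions, and the function names are hypothetical.

```python
def windowed_ft_cell(window_len, fs):
    """Time span (s) and frequency-bin spacing (Hz) of a windowed FT.

    Both depend only on the window length, so they are the same at every frequency.
    """
    return window_len / fs, fs / window_len


def wavelet_cell(scale, dt_unit, df_unit):
    """Time span (s) and bandwidth (Hz) of a wavelet dilated by `scale`.

    Dilation stretches the time span and narrows the bandwidth by the same factor.
    """
    return dt_unit * scale, df_unit / scale


fs = 200.0  # assumed sampling rate in Hz (illustrative only)
stft_cell = windowed_ft_cell(100, fs)

# Assumed unit-scale wavelet: 0.1 s time span, 10 Hz bandwidth (illustrative only).
wavelet_cells = {a: wavelet_cell(a, 0.1, 10.0) for a in (0.5, 1.0, 2.0, 4.0)}

print("windowed FT cell (s, Hz):", stft_cell)
for a, (dt_a, df_a) in wavelet_cells.items():
    print(f"scale {a}: time span {dt_a:.2f} s, bandwidth {df_a:.2f} Hz")
```

Under these assumptions, small scales give short time spans with wide bandwidths, and large scales give long time spans with narrow bandwidths, whereas the windowed FT keeps one cell shape throughout.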

3.2.1. Continuous Wavelet Transform (CWT)

This paper uses Continuous Wavelet Transformation (CWT) to extract valuable time-frequency information from the signal data representing various human activities. It is a powerful tool for time-frequency analysis, particularly suited for capturing the nuanced dynamics of nonstationary signals where frequency content fluctuates over time. Unlike the Discrete Wavelet Transform (DWT), which has limited support in time [], the CWT excels in providing a detailed portrayal of signal features across multiple scales and frequency bands []. Leveraging its ability to efficiently analyze signal breakdown into numerous frequency bands, the CWT proves valuable in identifying subtle variations, trends, and shifts in human activity. Human activities have quickly varying characteristics and CWT is superior in time-frequency resolution, making CWT the preferred choice in this paper.
The CWT represents a real-valued function $S(t)$ through the following integral (1):
$$W_s(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} S(t)\, \Psi\!\left(\frac{t - b}{a}\right) dt \tag{1}$$
where $a > 0$ ($a \in \mathbb{R}^{+}$) is the scale and $b$ ($b \in \mathbb{R}$) is the translation value. The Discrete Wavelet Transform (DWT) follows a similar principle [], with the difference that the parameters $a$ and $b$ are discrete:
$$a = (a_0)^n, \quad b = k\, b_0 \tag{2}$$
As was previously noted, the CWT offers a great way to extract and study complex spectral information from a signal. The function $\Psi$ is referred to as a mother wavelet since it is continuous in both time and frequency. For every conceivable pair $(a, b)$, this mother wavelet is utilized to create a daughter wavelet:
$$\Psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \Psi\!\left(\frac{t - b}{a}\right) \tag{3}$$
Following that, the CWT is implemented:
$$W_s(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} S(t)\, \Psi\!\left(\frac{t - b}{a}\right) dt = \int_{-\infty}^{\infty} S(t)\, \Psi_{a,b}(t)\, dt \tag{4}$$
Formula (4) indicates the similarity between the signal in question and each of the daughter wavelets. An image with the $b$-value on the x-axis and the $a$-value on the y-axis can be used to illustrate these results. Mother wavelet functions must satisfy three additional requirements. First, the function has to have finite energy; that is, its squared norm (5) must be bounded.
$$\|\Psi\|^2 = \int_{-\infty}^{\infty} |\Psi(t)|^2\, dt < \infty \tag{5}$$
Second, the function has to be localized both in time and in frequency. Finally, the area under the curve has to be zero. Selecting an appropriate mother wavelet is crucial for effective feature extraction, considering the unique properties of different wavelets tailored to specific features. The choice is guided by the need to capture signal characteristics, with scales determining the frequency range for global or localized focus. The convolving process at each scale yields coefficients representing alignment, enhancing the overall feature extraction.
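The definitions above can be checked numerically. The following is a minimal sketch rather than the authors' implementation: it approximates the CWT integral by a Riemann sum using only the Python standard library, and it uses an unnormalised Mexican Hat function as an assumed example mother wavelet. The test signal, sampling step, scale values, and the helper names `integrate` and `cwt` are all illustrative assumptions.

```python
import math


def mexican_hat(u):
    """Example real-valued mother wavelet (unnormalised): (1 - u^2) * exp(-u^2 / 2)."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)


def integrate(f, lo, hi, n=4000):
    """Midpoint Riemann sum of f over [lo, hi] using n sub-intervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def cwt(signal, dt, scales, psi):
    """Riemann-sum approximation of W_s(a, b) = (1/sqrt(a)) * integral S(t) psi((t - b)/a) dt.

    Returns one row per scale a; each row holds one coefficient per translation b = k * dt.
    """
    n = len(signal)
    rows = []
    for a in scales:
        norm = dt / math.sqrt(a)
        rows.append([
            norm * sum(signal[i] * psi((i - k) * dt / a) for i in range(n))
            for k in range(n)
        ])
    return rows


# First requirement: finite energy. Third requirement: zero area under the curve.
energy = integrate(lambda u: mexican_hat(u) ** 2, -10.0, 10.0)
area = integrate(mexican_hat, -10.0, 10.0)

# Illustrative input: a 1 Hz cosine sampled every 0.05 s for 10 s.
dt = 0.05
signal = [math.cos(2.0 * math.pi * 1.0 * i * dt) for i in range(200)]
scales = [0.1, 0.25, 0.6]
coeffs = cwt(signal, dt, scales, mexican_hat)

# Strongest response per scale, measured away from the signal edges.
peaks = [max(abs(w) for w in row[50:150]) for row in coeffs]
best_scale = scales[peaks.index(max(peaks))]
print(f"energy={energy:.4f} area={area:.2e} best_scale={best_scale}")
```

For this example wavelet the energy integral has the closed form $\tfrac{3}{4}\sqrt{\pi}$ and the area is zero, so the numerical values should agree with both. Among the three assumed scales, the middle one matches the 1 Hz tone best, which illustrates how the coefficients measure the alignment between the signal and each daughter wavelet.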
Mother Wavelet: The success of the CWT relies on the choice of an appropriate mother wavelet function Ψ ( t ) , a mathematical function that is used to generate other wavelets through scaling and translation. Wavelets are small, localized functions that can be used to represent the different frequency components of a signal at different scales. Mother wavelets are typically chosen to have specific desirable properties, such as vanishing moments, which makes them well-suited for denoising applications, or orthogonality, which makes them easy to compute. There are several widely used mother wavelets, but in this paper, four of them—Morlet [], Mexican Hat [], Complex Morlet [], and Shannon []—are applied. The Morlet wavelet function is defined as (6):
ψ ( t ) = A · e t 2 2 σ 2 · e i 2 π f 0 t
The Morlet wavelet as a function of time t is represented by ψ ( t ) in this equation. The wavelet has unit energy because of the term A, which serves as a normalizing constant. The wavelet is temporally localized via the Gaussian envelope e t 2 2 σ 2 , whose width is set by σ . A complex exponential term is created by adding e i 2 π f 0 t , which produces an oscillatory component with a frequency of f 0 . The Morlet wavelet’s intrinsic complexity allows it to efficiently extract both the frequency and the temporal information from a signal. The Complex Morlet wavelet is defined as follows (7):
$$ \psi(x) = \frac{1}{\sqrt{\pi f_b}} \cdot \exp(2 \pi i f_c x) \cdot \exp\left(-\frac{x^2}{f_b}\right) $$
The function $\psi(x)$ captures the altered signal, incorporating both amplitude and phase information, comparable to the pitch and timing of a musical note. The term $\frac{1}{\sqrt{\pi f_b}}$ works as a volume control. The component $\exp(2 \pi i f_c x)$ functions as both a pitch shifter and time modifier, modulating $\psi(x)$ with a complex exponential that modifies the center frequency and creates phase shifts. The term $\exp(-x^2 / f_b)$ serves as a magnifying glass, adding a Gaussian window on $\psi(x)$ that zooms in on a specified time interval around $x = 0$. The width of this window, determined by $f_b$, controls the zoom level and focuses emphasis on a particular instant in the signal. Together, these components constitute a Complex Morlet wavelet.
The Mexican Hat wavelet, also termed the Ricker wavelet or the second derivative of a Gaussian, is commonly used in signal processing and image analysis. It is defined as (8):
$$ \psi(t) = A \cdot (1 - t^2) \cdot e^{-\frac{t^2}{2\sigma^2}} $$
In this instance, the wavelet’s width is determined by σ and A acts as a normalization constant. Because of its zero-crossings, this wavelet is commonly applied in signal processing applications like edge detection. The Shannon wavelet function is defined as (9):
$$ \psi(t) = \pi t \cdot \sin(2 \pi a t) \cdot \cos(2 \pi b t) $$
Here, π t is a linear term, and a and b control the frequencies of the sine and cosine components. This wavelet is associated with wavelet analysis and is known for its mathematical simplicity. It is used in applications where signals need to be analyzed in both frequency and time domains.
Among these, the Morlet wavelet of Equation (6) combines a Gaussian window with a complex sinusoidal wave, allowing it to effectively capture oscillatory patterns in the data. Its parameters $\sigma$ and $f_0$ can be adjusted to match the characteristics of the data under analysis. During convolution, the selected mother wavelet slides along the signal, and an inner product is computed at each position to obtain the coefficients. The resulting coefficients represent the alignment between the wavelet and the signal at each point and scale.
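As an illustration, three of the mother wavelets above can be evaluated directly from Equations (6)–(8). This is a minimal sketch rather than the implementation used in the paper: the function names and all default parameter values (`sigma`, `f0`, `fb`, `fc`, and the normalizing constant `A` set to 1) are assumptions chosen only for demonstration.

```python
import cmath
import math


def morlet(t, sigma=1.0, f0=1.0, A=1.0):
    """Equation (6): Gaussian envelope times a complex exponential.

    The defaults for sigma, f0 and A are illustrative assumptions.
    """
    envelope = math.exp(-t**2 / (2.0 * sigma**2))
    oscillation = cmath.exp(2j * math.pi * f0 * t)
    return A * envelope * oscillation


def complex_morlet(x, fb=1.5, fc=1.0):
    """Equation (7): fb sets the window width, fc the centre frequency.

    The defaults for fb and fc are illustrative assumptions.
    """
    norm = 1.0 / math.sqrt(math.pi * fb)
    return norm * cmath.exp(2j * math.pi * fc * x) * math.exp(-x**2 / fb)


def mexican_hat(t, sigma=1.0, A=1.0):
    """Equation (8): real-valued wavelet with zero-crossings at t = -1 and t = 1."""
    return A * (1.0 - t**2) * math.exp(-t**2 / (2.0 * sigma**2))


# Sample each wavelet on a small grid around t = 0.
grid = [i / 10.0 for i in range(-30, 31)]
morlet_magnitude = [abs(morlet(t)) for t in grid]
mexican_hat_values = [mexican_hat(t) for t in grid]
```

The magnitude of the Morlet samples traces only the Gaussian envelope, because the complex exponential has unit modulus; the oscillation appears in the real and imaginary parts.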

3.2.2. Spectral Image Generation

Images naturally exhibit spatial features and patterns, which are often arranged hierarchically, and this aligns with the strengths of CNNs. Thus, the spectral images of the signal are generated to extract the potential features using the convolution process for activity recognition. In this step, the resulting coefficients from the convolution in CWT are used to construct the scalogram images to find the patterns using CNNs. Scalogram offers a comprehensive two-dimensional depiction of how a signal’s frequency content changes over time. The x-axis of the scalogram corresponds to time, the y-axis indicates the frequency, and the intensity or color at each point represents the magnitude of the wavelet coefficients. The intensity or color mapping in a scalogram plays a pivotal role in conveying information about the amplitude or power of wavelet coefficients at distinct time and frequency coordinates. Brighter colors or greater intensity often signify higher amplitudes or powers, offering a clear visual representation of the signal’s strength at specific time-frequency locations, which helps to expose the characteristics of different activities. At each point within the time-frequency plane, the scalogram acts as a graphical manifestation of the wavelet transform’s response for a particular frequency and time scale. Strong responses at specific coordinates indicate a pronounced presence of that frequency during the corresponding time interval, facilitating the identification of critical temporal events or shifts in frequency components. The scalogram allows for visualizing localized features in both time and frequency. This visualization helps in identifying time-localized frequency components and patterns in the signal that may not be apparent in the time or frequency domain alone.
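The construction of a scalogram from CWT coefficients can be sketched as follows. The sketch assumes NumPy, a Morlet wavelet with illustrative parameters (`sigma = 1`, `f0 = 1`), scales expressed in samples, and a direct convolution instead of a dedicated wavelet library; the function names `cwt_scalogram` and `to_uint8` are hypothetical and not taken from the paper.

```python
import numpy as np


def morlet(t, sigma=1.0, f0=1.0):
    """Morlet wavelet of Equation (6) with A = 1 (assumed parameters)."""
    return np.exp(-t**2 / (2.0 * sigma**2)) * np.exp(2j * np.pi * f0 * t)


def cwt_scalogram(signal, scales, sigma=1.0, f0=1.0):
    """Return |CWT| with one row per scale and one column per time sample."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    scalogram = np.empty((len(scales), n))
    for row, s in enumerate(scales):
        # Sample the scaled wavelet on +/- 4 envelope widths (scale in samples).
        half = int(np.ceil(4.0 * sigma * s))
        k = np.arange(-half, half + 1)
        wavelet = morlet(k / s, sigma, f0) / np.sqrt(s)
        # Correlating with the wavelet equals convolving with its
        # conjugated, time-reversed copy.
        kernel = np.conj(wavelet)[::-1]
        full = np.convolve(signal, kernel, mode="full")
        # Keep the part aligned with the original time axis.
        scalogram[row] = np.abs(full[half:half + n])
    return scalogram


def to_uint8(scalogram):
    """Map coefficient magnitudes linearly to 0..255 grayscale intensities."""
    lo, hi = scalogram.min(), scalogram.max()
    if hi == lo:
        return np.zeros(scalogram.shape, dtype=np.uint8)
    return np.round(255.0 * (scalogram - lo) / (hi - lo)).astype(np.uint8)


# Example: a 10 Hz sine sampled at 200 Hz for two seconds.
fs = 200
time = np.arange(2 * fs) / fs
example = np.sin(2.0 * np.pi * 10.0 * time)
image = to_uint8(cwt_scalogram(example, scales=[5, 10, 20, 40]))
```

With these assumed parameters, a scale of s samples corresponds to a frequency of about f0 · fs / s, so the 10 Hz example responds most strongly in the row for scale 20.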

3.3. Image Generation for Deep Learning Model

The data for HAR usually contain X, Y, and Z sensor data, which represent movement behaviors by acceleration along three axes. This section describes the CWT-generated scalogram image formulation to address the motion features in a three-dimensional space. This paper applies a two-step process to prepare these images for the deep learning model: first, the scalogram images are converted to grayscale; second, the grayscale images are merged into a multi-channel representation resembling an RGB format. Gray scaling is a common preprocessing step in image analysis and computer vision tasks. The conversion from a color image to a grayscale image involves reducing the image’s dimensionality by representing each pixel with a single intensity value, indicating its brightness or darkness. The most straightforward and commonly used method for gray scaling is to take a weighted sum of the red, green, and blue color channels. The equation [] for this process is:
Y = 0.299R + 0.587G + 0.114B
In Equation (10), Y represents the grayscale intensity, and R, G, and B are the red, green, and blue color values of the original pixel. The coefficients 0.299, 0.587, and 0.114 are weights assigned to each color channel. The choice of weights in the grayscale conversion equation is influenced by the varying sensitivity of the human eye’s photo-receptors to different wavelengths, with higher sensitivity to green (around 555 nm), moderate sensitivity to red (around 650 nm), and lower sensitivity to blue (around 475 nm).
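The weighted sum of Equation (10) can be sketched in a few lines. The function names and the nested-list image layout (rows of `(R, G, B)` tuples) are assumptions made for illustration only.

```python
def to_gray(r, g, b):
    """Grayscale intensity of one pixel, following Equation (10)."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def image_to_gray(rgb_rows):
    """Convert an image stored as rows of (R, G, B) tuples to grayscale rows."""
    return [[to_gray(r, g, b) for (r, g, b) in row] for row in rgb_rows]


# A 1 x 3 image holding a pure red, a pure green and a pure blue pixel.
gray_row = image_to_gray([[(255, 0, 0), (0, 255, 0), (0, 0, 255)]])[0]
```

Because the three weights sum to one, a white pixel keeps its full intensity, while the pure green pixel comes out brighter than the pure red or blue one.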
Secondly, multi-channel activity images are generated by stacking the converted grayscale scalogram images of the X-, Y-, and Z-axis to produce the RGB images for activity recognition. The RGB activity image conveys the information of human motion in three-dimensional space. This processing step is crucial for feeding the data into deep learning models providing the model with a richer input. It enhances the ability to learn from different aspects of the activity simultaneously. This will help the model to capture complex patterns. Figure 3 illustrates a multi-channel representation of three-axis scalogram image generation.
Figure 3. Process of RGB image generation.
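The stacking step can be sketched with NumPy as below. The function name `stack_axes` and the toy image sizes are hypothetical; the sketch only assumes that the three grayscale scalograms share the same height and width.

```python
import numpy as np


def stack_axes(gray_x, gray_y, gray_z):
    """Stack three grayscale scalograms into one (height, width, 3) image.

    Channel 0 (R) holds the X-axis, channel 1 (G) the Y-axis and
    channel 2 (B) the Z-axis scalogram.
    """
    gray_x, gray_y, gray_z = (np.asarray(g) for g in (gray_x, gray_y, gray_z))
    if not (gray_x.shape == gray_y.shape == gray_z.shape):
        raise ValueError("all three scalograms must have the same shape")
    return np.stack([gray_x, gray_y, gray_z], axis=-1)


# Toy 4 x 5 scalograms with a constant intensity per axis.
x_img = np.zeros((4, 5), dtype=np.uint8)
y_img = np.full((4, 5), 128, dtype=np.uint8)
z_img = np.full((4, 5), 255, dtype=np.uint8)
activity_image = stack_axes(x_img, y_img, z_img)
```

Reordering the channels, for instance with `activity_image[..., [0, 2, 1]]`, exchanges the Y- and Z-axis planes without altering their content.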
Krizhevsky et al. [], Ciresan et al. [], and Howard et al. [], as well as many other researchers, demonstrated CNNs’ ability to learn rotation- and shift-invariant features, suggesting that they can adapt to variations in data presentation. This adaptability, along with the success of data augmentation techniques in image classification, supports the argument that swapping the channel order is unlikely to affect model performance significantly, provided the same order is used during both training and testing. For example, exchanging the colors assigned to the Y-axis and Z-axis will not affect the training and prediction results as long as the alteration is applied consistently to both the training and testing datasets. In the context of RGB images derived from three-frame tensor data, where each frame represents a layer and the data resolution is (x × y) × 3 (corresponding to the R, G, and B channels), deep learning models typically extract features from 2D representations of these frames. Whether the channels are mapped as (R, G, B) or altered to (B, G, R), the training process adapts to recognize patterns in the configuration used, and during testing the model applies the learned features to inputs with the same configuration. Therefore, performance metrics such as accuracy are expected to remain comparable, because the model’s internal feature representations align with the consistently altered input data, and the recognition capabilities are preserved regardless of the specific channel assignment. In this paper, the X-, Y-, and Z-axes are assigned to the R, G, and B channels, respectively.

3.4. Training Models for Extracting Deep Features

This section demonstrates the training process of deep learning models for extracting features from activity images, utilizing generated images for both training the models and extracting the features. To accomplish this, a selection of CNN architectures, including CNN, AlexNet, ResNet50, ResNet101, and MobileNetV3, are employed. Additionally, for classification tasks, the softmax classifier is utilized.
Following the training phase, the feature extraction process involves several key steps common across these architectures. Firstly, the activity image is forwarded through the chosen deep learning architecture, where convolutional layers apply filters to capture low-level features like edges and shapes. Subsequently, pooling layers downsample the data, reducing their dimensionality while retaining essential information. Then, fully connected layers amalgamate the extracted features from previous layers, forming higher-level representations of the image. Finally, the output of the last fully connected layer, situated before the softmax layer in classification tasks, serves as the feature vector. This vector encapsulates the critical characteristics of the input image, rendering it suitable for various tasks, including classification or further analysis. Advancements in deep learning have propelled the development of powerful activity models, particularly using Convolutional Neural Networks (CNNs). These models, designed for tasks such as image classification and visual recognition, leverage intricate architectures consisting of convolutional layers, ReLU activation functions, pooling layers, and fully connected layers. The training process involves backpropagation and optimization algorithms, dynamically adjusting filter weights to minimize the loss function. Among notable architectures, AlexNet introduced groundbreaking features, including ReLU activation functions and dropout regularization, setting standards for subsequent models. ResNet addressed challenges associated with increasing network depth by introducing skip connections and thus revolutionized the training of deep neural networks. Additionally, MobileNetV3, optimized for mobile hardware, balances between efficiency and accuracy, making it ideal for resource-constrained devices. This section explores the evolution and contributions of these deep learning models in the context of activity model training. 
In the realm of deep learning for tasks like activity recognition using wearable sensor data, each neural network architecture typically has specific requirements for the shape of input data it can accept. For instance, popular models, such as AlexNet, and ResNet variants like ResNet-101, have standardized input shapes that align with their design specifications. AlexNet typically expects inputs sized at [227, 227, 3]; on the other hand, ResNet models commonly use [224, 224, 3]. These dimensions correspond to the width, height, and number of color channels (RGB) of the input images or data representations. Therefore, to effectively utilize these models for activity recognition tasks, it is crucial to preprocess the sensor data into the appropriate shape that matches the requirements of the chosen deep learning architecture. This approach ensures that the models can process the data correctly and leverage their designed features optimally, thereby enhancing the accuracy and performance of activity recognition systems in applications such as smart homes and healthcare monitoring.
CNN: A Convolutional Neural Network (CNN) [] model is designed to extract the visual features of activity images for HAR. It consists of convolutional layers, ReLU activation functions, pooling layers, and fully connected layers, enabling automatic feature learning from raw data. Convolutional layers extract local features, ReLU introduces nonlinearity, pooling layers reduce spatial dimensions, and fully connected layers make classification decisions. Through backpropagation and optimization algorithms during training, filter weights are adjusted to minimize the loss function, resulting in state-of-the-art performance across visual recognition applications. The CNN architecture typically involves multiple convolutional layers with configurations like Conv2D employing 32/64/128/128 filters and 3 × 3 kernels, followed by ReLU activation and MaxPooling2D layers for spatial downsampling. A dropout layer with a rate of 0.5 is incorporated to mitigate overfitting. The resulting flattened output represents abstract features extracted from input images, crucial for capturing hierarchical patterns. These features are then classified using various classifiers, showcasing the effectiveness of CNNs in image analysis and recognition tasks.
AlexNet: The AlexNet [], introduced by Alex Krizhevsky et al., marked a significant advancement in deep learning by pioneering the use of Rectified Linear Unit (ReLU) activation functions and dropout regularization in CNNs. This architecture, designed for large-scale image datasets, comprises five convolutional layers with max-pooling, followed by three fully connected layers, incorporating dropout for model complexity control. With approximately 60 million parameters, the network features convolution sizes ranging from 11 × 11 in the first layer to 3 × 3 in subsequent layers. ReLU activation accelerates model convergence, while dropout mitigates overfitting. Data augmentation techniques, such as random flipping and cropping, enhance the diversity of the training dataset. Notably, overlapping pooling is employed to reduce spatial dimensions while preserving information. The meticulous design choices in AlexNet laid the foundation for subsequent advancements in deep learning architectures, establishing enduring standards. The feature extraction process in AlexNet begins with an input shape of [227, 227, 3]. The architecture starts with a Conv2D layer with 96 filters, employing an 11 × 11 kernel size and a stride of 4. Batch normalization and ReLU activation follow, with subsequent MaxPooling2D operations (pool size: 3 × 3, stride: 2) facilitating spatial downsampling. Two additional Conv2D layers with 256 filters and 5 × 5 kernel size, accompanied by ReLU activation, are followed by another MaxPooling2D layer. Three Conv2D layers with 384 filters, 3 × 3 kernel size, and ReLU activation create a hierarchy of features, culminating in an additional Conv2D layer with 256 filters and 3 × 3 kernel, followed by ReLU activation. Flattening the output into a one-dimensional vector condenses abstracted features learned by the convolutional layers, facilitating the representation of high-level patterns for classification models.
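The spatial downsampling of the first AlexNet layers can be checked with the standard output-size relation for convolution and pooling layers. The sketch below assumes zero padding, which the description above does not state; the function name is hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer.

    Uses floor((size + 2 * padding - kernel) / stride) + 1.
    """
    return (size + 2 * padding - kernel) // stride + 1


# First Conv2D layer: 227 x 227 input, 11 x 11 kernel, stride 4 (padding assumed 0).
after_conv1 = conv_output_size(227, kernel=11, stride=4)
# First MaxPooling2D layer: 3 x 3 pool, stride 2.
after_pool1 = conv_output_size(after_conv1, kernel=3, stride=2)
```

Under this assumption, the 227 × 227 input shrinks to 55 × 55 after the first convolution and to 27 × 27 after the first pooling layer.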
ResNet: After the groundbreaking success of AlexNet in the 2012 ImageNet competition, subsequent neural network architectures sought to further improve performance by increasing depth. However, this approach presented challenges, such as the vanishing/exploding gradient problem. The ResNet (Residual Network) [] architecture tackled this issue by introducing skip connections, forming residual blocks that facilitate the learning of residual mappings. ResNet focuses on learning residual mappings relative to the identity mapping, ensuring smoother gradient propagation and faster training. Each residual block typically consists of two 3 × 3 convolutional layers with Batch Normalization and ReLU activation, along with a 1 × 1 convolutional layer for deeper networks. Variants like ResNet-50, ResNet-101, or ResNet-152 address the trade-off between model complexity and computational efficiency, capturing intricate features while mitigating challenges associated with deep networks. The incorporation of skip connections enables practitioners to select a ResNet variant that balances model depth and computational efficiency for optimal performance. In the feature extraction process of ResNet-50, input images are standardized to a shape of [224, 224, 3]. The ResNet-50 model, with its 50-layer depth and innovative use of residual learning, initializes weights from the "imagenet" dataset and freezes all layers during feature extraction to leverage pre-trained knowledge and prevent overfitting. Feature extraction involves passing input images through the stacked layers of ResNet-50, including residual blocks designed for learning hierarchical features. The residual connections facilitate the direct flow of information through the network, addressing challenges associated with training very deep networks. Similarly, the ResNet-101 model extends the ResNet-50 architecture by incorporating additional layers, resulting in a deeper neural network with 101 layers.
The increased depth of ResNet-101 allows for a more intricate representation of hierarchical features. The abstract features extracted from the input images by ResNet-50 and ResNet-101 are then classified using conventional classifiers. However, ResNet-152 and deeper ResNet models are not tested due to the significant trade-offs in computation and memory requirements. Deeper networks require substantially more computational power to train, as each additional layer increases the data processing and calculations needed during training and inference. This results in longer training times and higher resource consumption. Additionally, deeper models demand more memory to store the increased number of weights and intermediate activations, which can quickly exceed the hardware capacity, especially on GPUs with limited VRAM. Furthermore, deeper networks typically need larger datasets to avoid overfitting; because the available data are limited, a deeper model such as ResNet-152 might not generalize well and could perform worse than shallower models. Hence, despite the theoretical potential for better accuracy, the practical constraints of computational power, memory, and dataset size led to the decision not to use ResNet-152 in this case.
MobileNetV3: MobileNetV3 [], introduced in 2019 as the latest iteration in the MobileNet series, signifies a significant breakthrough in mobile and embedded vision applications. Designed as an optimized version of EfficientNet for mobile hardware, MobileNetV3 addresses the critical challenge of balancing efficiency and accuracy in resource-constrained devices. Its architecture introduces innovative elements like the ’hardSwish’ activation function, an enhancement over ’Swish’. It incorporates techniques such as half-removed Squeeze and Excitation (SE) blocks alongside Neural Architecture Search (NAS) technology. These advancements empower MobileNetV3 to achieve superior results by strategically reducing network complexity, making it exceptionally well-suited for deployment on smartphones, tablets, and Internet of Things (IoT) devices where memory and power limitations are paramount. MobileNetV3’s flexibility is demonstrated through its robust performance across convolutional neural networks with varying layer counts, establishing it as a versatile solution compared to other architectures. In the feature extraction process of MobileNetV3, particularly utilizing the “small” variant, input images are standardized to a shape of [227, 227, 3]. The selection of MobileNetV3 small is based on its efficiency, leveraging depthwise separable convolution layers and linear bottleneck layers to create a lightweight and computationally efficient model adaptable to diverse applications. Additionally, the model incorporates features like a configurable width multiplier and L2 regularization, offering flexibility and enhanced generalization. The output tensor from MobileNetV3 small represents a concise set of abstract features extracted from the input images, leveraging depthwise separable convolutions for efficient feature learning.
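For reference, the hard-swish activation mentioned above is commonly written as x · ReLU6(x + 3) / 6, a piecewise-linear approximation of swish, x · sigmoid(x). The following sketch compares the two on scalar inputs; the function names are chosen for illustration.

```python
import math


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x):
    """Piecewise-linear approximation of swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Compare both activations on a few sample inputs.
samples = [-4.0, -1.0, 0.0, 1.0, 4.0]
comparison = [(x, hard_swish(x), swish(x)) for x in samples]
```

Hard-swish avoids the exponential in the sigmoid, which is the reason it is cheaper to evaluate on mobile hardware; it is exactly zero for inputs at or below -3 and equals the identity for inputs at or above 3.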

3.5. Classification with Softmax Classifier

In the proposed research, the softmax classifier is applied to classify the deep features extracted from the HAR scalogram images.
Softmax, also known as Multinomial Logistic Regression, is a widely used method for multiclass classification problems. It extends binary logistic regression to handle multiple classes by employing the softmax function to transform raw predictions into probabilities. Given a set of input features $x$ and a weight matrix $W$ along with a bias vector $b$, the unnormalized log probabilities for each class $i$ are calculated using the equation $z_i = W_i x + b_i$. Here, $W_i$ represents the $i$th row of the weight matrix $W$, and $b_i$ is the $i$th element of the bias vector $b$. The softmax function then converts these log probabilities into a valid probability distribution over all classes. For class $j$, the softmax function is defined as:
$$ \mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} $$
where K is the total number of classes. The final prediction is made by selecting the class with the highest probability. The training objective typically involves minimizing the cross-entropy loss, which measures the dissimilarity between the predicted probabilities and the true distribution of classes.
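The classification step can be sketched in plain Python as follows. The weights, bias, and feature values are made-up numbers for illustration, and subtracting the maximum logit before exponentiation is a common numerical-stability measure that is not part of the equation above.

```python
import math


def logits(W, x, b):
    """Unnormalized log probabilities z_i = W_i x + b_i for every class i."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def softmax(z):
    """Softmax over a list of logits, shifted by max(z) to avoid overflow."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy loss for a single sample with a one-hot target."""
    return -math.log(probs[true_class])


# Illustrative 3-class problem with a 2-dimensional feature vector.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
features = [1.0, 2.0]
probabilities = softmax(logits(W, features, b))
predicted_class = probabilities.index(max(probabilities))
loss = cross_entropy(probabilities, true_class=2)
```

The shift leaves the result unchanged, because the common factor cancels between the numerator and the denominator.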

4. Experiment and Result Analysis

The following section of this paper outlines the process of generating scalogram images and detecting activity using deep learning models.

4.1. Dataset Description

The research employs wearable sensors, particularly accelerometers and gyroscopes, for the generation of activity data. In this study, the SisFall [] and PAMAP2 [] datasets are utilized, which consist of recordings from wearable sensors worn by the subjects during various physical activities. The aim is to extract meaningful information for activity recognition and monitoring tasks.
The SisFall dataset was recorded with two accelerometers, each producing acceleration data in three axes (X, Y, Z), and a gyroscope that generates rotation signals in three axes (X, Y, Z) for each activity. Each file in the dataset comprises nine columns, one per axis of the three sensors, with the number of rows varying depending on the duration of the tests. The subjects generate the data by wearing a waist-worn sensor device. These sensors operate at a sampling frequency of 200 Hz, enabling detailed and accurate data capture. The dataset encompasses a comprehensive set of 19 Activities of Daily Living (ADL) performed by two age groups. The elderly group comprised 15 individuals (8 males aged 60–71 and 7 females aged 62–75), whereas the young adult group comprised 23 participants (both males and females aged 19–30). A list of these ADLs is provided in Table 1.
Table 1. Activity of SisFall dataset.
To further validate the proposed method, the PAMAP2 dataset is also used to evaluate its performance on a different set of real-world data. In the PAMAP2 dataset, 9 subjects participated in 18 activities while wearing IMU sensors and a heart rate monitor. An IMU (Inertial Measurement Unit) is a device that combines an accelerometer, gyroscope, and magnetometer to measure and report a body’s specific force, angular rate, and orientation. One IMU is attached to each of the chest, the dominant arm, and the dominant-side ankle. The dataset includes 12 different protocol activities and 6 optional activities. Each participant performed the protocol activities and some of the optional activities. Each file in the dataset contains one subject’s data in 54 columns: one for the timestamp, one for the activity label, and the remaining 52 for raw sensory data. These 52 columns comprise the heart rate and the data from the 3 IMUs, each contributing 17 columns. Each IMU records temperature, 13-bit 3D accelerometer data at a scale of ±16 g, 13-bit 3D accelerometer data at a scale of ±6 g, 3D gyroscope data, 3D magnetometer data, and orientation data, although the orientation data are not valid in this dataset. The sampling frequency is around 100 Hz. The activities are shown in Table 2.
Table 2. Activity of PAMAP2 dataset.
The activity names from the SisFall dataset have been shortened due to their detailed nature. In Table 3, the shortened forms of the activity details from the SisFall dataset are displayed. Within the scope of this work, an attempt is made to select activities of a similar nature. However, it is noteworthy that the SisFall dataset presents a more intricate array of activity combinations. Consequently, activities demonstrating superior outcomes are prioritized for initial inclusion. Detecting the remaining activities is reserved for future endeavors, where there is an aspiration to enhance activity detection effectiveness and precision.
Table 3. Short forms of activities of SisFall dataset.
For the dataset split, we randomly selected 70% for training, 20% for testing, and 10% for validation.
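A random 70/20/10 split of this kind can be sketched as follows. The function name, the fixed seed, and the use of integer sample indices are assumptions for illustration; the paper does not specify how the random selection was implemented.

```python
import random


def split_dataset(items, train_fraction=0.7, test_fraction=0.2, seed=0):
    """Shuffle items and split them into train, test and validation lists.

    The validation set receives whatever remains after the train and
    test portions (10% with the default fractions).
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = round(train_fraction * len(items))
    n_test = round(test_fraction * len(items))
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation


# Example with 100 sample indices.
train_ids, test_ids, val_ids = split_dataset(range(100))
```

Shuffling once and slicing keeps the three subsets disjoint, so no sample can appear in both the training and the evaluation data.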

4.2. Image Generation and Preprocessing

A crucial step in extracting time-frequency representations of the signals involves converting the raw sensor data into scalogram images using CWT. To enhance the signal representation, a diverse set of mother wavelets, a broad range of scales, and various sampling frequencies are employed. This comprehensive approach aims to capture the intricate details and varied frequency components present in the signals. The selection of mother wavelets, including the Morlet wavelet, Complex Morlet wavelet, Shannon wavelet, and Mexican Hat wavelet, allows for a nuanced exploration of different signal characteristics.
The application of these mother wavelets spans across a range of scales from 1 to 128, facilitating a thorough analysis of the signal’s time-frequency features. Additionally, considering different sampling frequencies, precisely at 50, 100, and 200 Hz, adds further granularity to the investigation. This multi-faceted strategy achieves a more refined and comprehensive representation of the signals in the study on HAR. Among them, the best outcome, i.e., proper representation and accurate classification results, is obtained with the frequency of 200 Hz and the Morlet wavelet as the mother wavelet.
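To relate the scale range to physical frequencies, a commonly used approximation maps a scale s to the pseudo-frequency f = f_c · f_s / s, where f_s is the sampling frequency and f_c is the centre frequency of the mother wavelet. The sketch below assumes scales measured in samples and uses f_c = 1 purely as a placeholder; the actual centre frequency depends on the wavelet and its parameters.

```python
def scale_to_frequency(scale, sampling_frequency, center_frequency=1.0):
    """Approximate frequency in Hz analysed by a wavelet at the given scale.

    center_frequency is the wavelet's centre frequency in cycles per
    sample at scale 1; the default of 1.0 is an illustrative assumption.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return center_frequency * sampling_frequency / scale


# Pseudo-frequencies for a few scales at the three sampling rates considered.
table = {
    fs: [scale_to_frequency(s, fs) for s in (1, 16, 128)]
    for fs in (50, 100, 200)
}
```

Small scales therefore correspond to high frequencies and large scales to low frequencies, and a higher sampling frequency shifts the whole scale range towards higher frequencies.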
As illustrated in Figure 4, the preprocessing pipeline involves converting the raw sensor data into scalogram images and subsequently converting them into grayscale. The grayscale conversion reduces the color space of the scalogram images from three channels (RGB) to a single channel representing intensity. The subsequent merging of the grayscale images into RGB channels enables us to leverage the information from multiple sensor axes. We create a multi-channel representation of the data by representing each sensor axis as a separate channel. This approach provides the opportunity to capture and incorporate different aspects of the activity patterns into the analysis. Each channel represents a specific axis, enabling the exploration of independent information from different directions or orientations captured by the wearable sensors. The multi-channel representation of the scalogram images in RGB format (R for the x-axis, G for the y-axis, and B for the z-axis) then facilitates the utilization of powerful deep learning models that operate on image data. Table 4 shows the processed images for different activities of the first accelerometer from the SisFall dataset with the Shannon, Morlet, Complex Morlet, and Mexican Hat mother wavelets, in that order. The choice of the mother wavelet function in wavelet transform analysis is crucial due to its significant impact on the accuracy and effectiveness of signal analysis tasks. Keng and Leong [] highlight how different mother wavelets can lead to varying decomposition results, potentially compromising the accuracy of signal analysis tasks. Maneesh et al. [] further explore how the properties of mother wavelets influence their selection, affecting the detection of specific features within signals. Polikar’s [] tutorial elucidates Heisenberg’s uncertainty principle, illustrating the trade-off between time and frequency resolution inherent in different mother wavelets.
Ultimately, the convolution process between the signal and the chosen mother wavelet determines how well relevant features are extracted, with some wavelets better suited than others for specific tasks. This variability underscores why different wavelet transforms yield differing results in accuracy and effectiveness, emphasizing the critical role of the mother wavelet function in signal analysis.
Figure 4. An example of activity image generation from sensor data.
Table 4. Generated scalogram images of some activities.

4.3. HAR Feature Extraction Using Deep Learning

In the proposed research for Human Activity Recognition (HAR), various deep learning architectures including CNN, AlexNet, ResNet50, ResNet101, and MobileNetV3 are employed for feature extraction from individual images. Each architecture offers unique parameters and characteristics crucial for extracting meaningful features. For instance, the input image size typically adheres to a standard dimension, such as 224 × 224 pixels with three color channels (e.g., RGB), ensuring uniformity across models. CNNs, AlexNet, and ResNet architectures follow a series of convolutional layers, often followed by pooling layers, which reduce the spatial dimensions of the feature maps while retaining important information. AlexNet, being a deeper architecture, encompasses five convolutional layers and three fully connected (FC) layers. ResNet50 and ResNet101, characterized by their residual learning approach, typically comprise multiple residual blocks, with ResNet101 being deeper. These architectures culminate in fully connected layers, where the extracted features are aggregated to form a feature vector. MobileNetV3, optimized for efficiency, employs pointwise convolutions, width and resolution multipliers, and squeeze-and-excite blocks. Following feature extraction, a final classification is performed using a softmax classifier, often leading to activity recognition results. The models are trained using consistent hyperparameters, such as the Adam optimizer, categorical cross-entropy loss function, a learning rate of 0.01, 15 epochs, and a batch size of 3, to maintain simplicity and consistency across the models.

4.4. Activity Recognition

In this section, the features extracted from the previous stage are utilized for activity classification using a softmax classifier on the evaluation dataset. To assess the performance, several metrics, including Recall (R), Accuracy (A), F1-score (F1), and Precision (P), are employed to assess the proposed model. Recall (Equation (12)) quantifies the ability of a classification model to capture and correctly identify all relevant cases (positives) among the total actual positives.
$$ R = \frac{TP}{TP + FN} \times 100 $$
Accuracy (Equation (13)) represents the percentage of correct predictions out of the total predictions made by the model, providing an overall assessment of its performance.
$$ A = \frac{TP + TN}{FP + TP + FN + TN} \times 100 $$
F1-score (Equation (14)) serves as a metric that balances precision and recall, calculated using their harmonic mean.
$$ F1 = \frac{2 \times (R \times P)}{R + P} \times 100 $$
Precision (Equation (15)) is a metric assessing the accuracy of positive predictions, defined as the ratio of true positives to the sum of true positives and false positives.
$$ P = \frac{TP}{TP + FP} \times 100 $$
These equations quantify the performance metrics used to evaluate the proposed model’s effectiveness in activity detection. After feature extraction, the extracted features from CNN, AlexNet, ResNet-50, and MobileNetV3 are employed as inputs for the softmax classifier. Each model is compiled using an appropriate loss function, such as categorical cross-entropy, and an optimizer, such as Adam, to facilitate training for the classification task. Additionally, data augmentation parameters, including rescaling, a rotation range of 40 degrees, width and height shift ranges of 0.2, a shear range of 0.2, a zoom range of 0.2, and horizontal flipping, are incorporated into the training pipeline to further improve model generalization and resilience to variations in input data.
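The four metrics of Equations (12)–(15) can be computed from the confusion counts as sketched below. The counts in the example are invented for illustration, and the F1-score is evaluated with recall and precision as fractions so that the final factor of 100 yields a percentage; that reading of Equation (14) is an assumption.

```python
def recall(tp, fn):
    """Equation (12): share of actual positives that were detected, in percent."""
    return tp / (tp + fn) * 100


def accuracy(tp, tn, fp, fn):
    """Equation (13): share of all predictions that were correct, in percent."""
    return (tp + tn) / (fp + tp + fn + tn) * 100


def precision(tp, fp):
    """Equation (15): share of positive predictions that were correct, in percent."""
    return tp / (tp + fp) * 100


def f1_score(tp, fp, fn):
    """Equation (14): harmonic mean of recall and precision, in percent."""
    r = tp / (tp + fn)  # recall as a fraction
    p = tp / (tp + fp)  # precision as a fraction
    return 2 * (r * p) / (r + p) * 100


# Invented confusion counts for one activity class.
tp, tn, fp, fn = 90, 870, 10, 30
report = {
    "recall": recall(tp, fn),
    "accuracy": accuracy(tp, tn, fp, fn),
    "precision": precision(tp, fp),
    "f1": f1_score(tp, fp, fn),
}
```

For a multi-class problem such as HAR, these counts are typically taken per class from the confusion matrix, treating the class of interest as positive and all others as negative.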
Altogether, five popular architectures are tested: a simple CNN, ResNet50, ResNet101, MobileNetV3, and AlexNet. Table 5 displays the overall performance of the different models with different classifiers. ResNet101 produces the best result with the Morlet wavelet and a scale range from 0 to 128 with SVM. This combination yields a precision of 98.2%, a recall of 97.0%, an F1-score of 97.6%, and an accuracy of 98.4% on the SisFall dataset. The CNN achieves its maximum accuracy of 95.1% with the Mexican Hat wavelet, and AlexNet performs best with the Shannon wavelet, achieving an accuracy of 93.2%. Both ResNet50 and MobileNetV3 attain their highest accuracy with the Morlet wavelet, achieving 94.9% and 95.8%, respectively. Figure 5 shows the confusion matrix of the SisFall dataset for the best combination, ResNet101 with the Morlet wavelet, providing a visual representation of the classification performance. Table 6 presents the detailed classification report of the SisFall dataset, including Precision, Recall, F1-score, and Accuracy for each class.
Table 5. The overall performance of different DL models with different mother wavelets in SisFall.
Figure 5. Confusion matrix of activity classification using ResNet-101 and Morlet wavelet with SisFall dataset.
Table 6. Precision, Recall, F1-score, and Accuracy for each class of SisFall.
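A per-class report of the kind summarized in Table 6 can be derived directly from a confusion matrix such as the one in Figure 5. The sketch below shows one way to do this, assuming rows hold the true classes and columns the predicted classes. The 3 × 3 matrix is made up for illustration and does not reproduce the study's results.

```python
def per_class_report(cm):
    """Compute precision, recall, F1 and accuracy (in %) for each class.

    `cm[i][j]` is assumed to count samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    report = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually another class
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append({
            "precision": 100 * precision,
            "recall": 100 * recall,
            "f1": 100 * f1,
            "accuracy": 100 * (tp + tn) / total,
        })
    return report


# Hypothetical confusion matrix for three activity classes (illustrative only).
cm = [
    [50, 2, 0],
    [3, 45, 2],
    [1, 4, 43],
]
report = per_class_report(cm)
```

Here each class is scored one-versus-rest, so the per-class accuracy counts every sample that is correctly assigned to, or correctly kept out of, that class.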
After achieving promising results on the SisFall dataset, the next phase of the study tested the same model architectures on the PAMAP2 dataset. This allowed us to examine the model's adaptability and generalization across distinct datasets and activity recognition scenarios. The following analysis reports the classification accuracy and the other evaluation metrics for each model architecture. Table 7 lists the evaluation metrics of all models for each classifier. On this dataset, ResNet101 again yields the best performance, achieving an Accuracy of 98.1%, an F1-score of 96.2%, a Precision of 98.1%, and a Recall of 95.3%. Among the other models, the CNN achieves its highest accuracy of 95.1% when paired with the Mexican Hat wavelet, while AlexNet peaks at an accuracy of 93.2% using the Shannon wavelet. ResNet50, coupled with the Morlet wavelet, achieves an accuracy of 94.9%, and MobileNetV3, with the same wavelet, attains an accuracy of 95.8%.
Table 7. The overall performance of different DL models with different mother wavelets in PAMAP2.
Figure 6 displays the confusion matrix for the best-performing combination, ResNet-101 with the Morlet wavelet. Table 8 presents the detailed classification report of the PAMAP2 dataset.
Figure 6. Confusion matrix of activity classification using ResNet-101 and Morlet wavelet with PAMAP2 dataset.
Table 8. Precision, Recall, F1-score, and Accuracy for each class of PAMAP2.

5. Comparison

Table 9 compares the performance of the proposed model on the PAMAP2 dataset with several existing studies. Imen et al. [38] achieved an accuracy of only 54.4% using a 2D CWT-CNN, and their best result of 0.622 was obtained with Random Forest. Zeng et al. [62] reported an accuracy of 89.9% using an LSTM with continuous temporal attention. Xi et al. [63] addressed complexity issues with deep dilated convolutional networks, reaching an accuracy of 93.2%. Qian et al. [64] used a Distribution-Embedded Deep Neural Network (DDNN) and achieved an accuracy of 93.40%. Sanchez et al. [65] employed an image representation approach with a CNN and obtained an accuracy of 0.967 for repetitive movements, although a lower value of 0.887 for posture recognition. In our experiments, a maximum accuracy of 98.1% on the PAMAP2 dataset was obtained using a combination of the Continuous Wavelet Transform (CWT) and ResNet-101, which is the highest accuracy among the image-representation approaches compared here for the PAMAP2 dataset.
Table 9. Comparison of results on PAMAP2 dataset.
Table 10 displays the results of existing papers alongside the result of the proposed model. Notably, there has been limited prior research on activity detection using image representation on the SisFall dataset. Abbas et al. [66] used a 4-2-1 1D spatial pyramid pooling approach with Haar wavelet features and employed several classifiers, including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forests, and eXtreme Gradient Boosting (XGB). They reported an F1-score of 94.67% for fall and activity detection. However, while their model was trained on the SisFall dataset, the evaluation was conducted on their own proprietary dataset. Al-Majdi et al. [67] adopted a CNN-LSTM-based method and achieved a maximum accuracy of 94.18%. The proposed method attains a maximum accuracy of 98.4% with ResNet-101 for activity detection, an improvement of about 4 percentage points over the 94.18% accuracy reported by Al-Majdi et al.
Table 10. Comparison of results on SisFall dataset.
This research demonstrates the efficacy of using scalogram images generated from wearable sensor data for human activity recognition through deep learning models. We compare the performance with 1D sensor-based approaches, and the proposed model outperforms the state-of-the-art models considered. In addition, we evaluate and compare several deep learning models. To obtain these results, we kept consistent default model parameters so that the comparison is fair, as this study does not focus on tuning each model for maximum performance.

6. Conclusions

In this study, a Human Activity Recognition (HAR) model is introduced that transforms raw accelerometer signal data into informative scalogram images using the Continuous Wavelet Transform (CWT) with an appropriate mother wavelet function. By converting these images to grayscale and merging data across the three axes, the approach captures discriminative temporal and frequency patterns that are crucial for activity recognition. The visual interpretability of scalogram images, combined with deep learning models for feature extraction, yields a deep feature set for activity recognition. The proposed feature extraction model, coupled with a classifier, offers an efficient HAR solution. The approach is validated on two benchmark datasets, and the results demonstrate robust classification performance that surpasses existing state-of-the-art methods, yielding accuracies of 98.4% and 98.1% on the SisFall and PAMAP2 datasets, respectively. Overall, this approach offers a versatile and accurate HAR solution with potential implications for a range of practical applications, including healthcare monitoring, and provides a basis for future work in this domain.

Author Contributions

Conceptualization, N.A., M.R.I. and Y.W.; Methodology, N.A., M.O.A.N., M.R.I. and Y.W.; Software, M.O.A.N. and R.K.; Validation, R.K.; Formal analysis, N.A., M.O.A.N. and M.R.I.; Investigation, N.A.; Resources, R.K.; Data curation, M.O.A.N. and R.K.; Writing—original draft, N.A. and M.O.A.N.; Writing—review & editing, M.R.I. and Y.W.; Visualization, N.A. and R.K.; Supervision, M.R.I. and Y.W.; Project administration, M.R.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research paper is partially funded by the Institute of Energy, Environment, Research and Development (IEERD), University of Asia Pacific (UAP), Dhaka, Bangladesh.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lara, O.D.; Labrador, M.A. A Survey on Human Activity Recognition using Wearable Sensors. IEEE Commun. Surv. Tutor. 2013, 15, 1192–1209. [Google Scholar] [CrossRef]
  2. Mekruksavanich, S.; Jantawong, P.; Hnoohom, N.; Jitpattanakul, A. Wearable Fall Detection Based on Motion Signals Using Hybrid Deep Residual Neural Network. In Multi-Disciplinary Trends in Artificial Intelligence; Surinta, O., Kam Fung Yuen, K., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 216–224. [Google Scholar]
  3. Arshad, M.H.; Bilal, M.; Gani, A. Human Activity Recognition: Review, Taxonomy and Open Challenges. Sensors 2022, 22, 6463. [Google Scholar] [CrossRef] [PubMed]
  4. Ahmed, N.; Rafiq, J.I.; Islam, M.R. Enhanced Human Activity Recognition Based on Smartphone Sensor Data Using Hybrid Feature Selection Model. Sensors 2020, 20, 317. [Google Scholar] [CrossRef]
  5. Ahmed Bhuiyan, R.; Ahmed, N.; Amiruzzaman, M.; Islam, M.R. A Robust Feature Extraction Model for Human Activity Characterization Using 3-Axis Accelerometer and Gyroscope Data. Sensors 2020, 20, 6990. [Google Scholar] [CrossRef]
  6. Ahmed, N.; Kabir, R.; Rahman, A.; Momin, A.; Islam, M.R. Smartphone Sensor Based Physical Activity Identification by Using Hardware-Efficient Support Vector Machines for Multiclass Classification. In Proceedings of the 2019 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 3–6 October 2019; pp. 224–227. [Google Scholar] [CrossRef]
  7. Muaaz, M.; Chelli, A.; Abdelgawwad, A.A.; Mallofré, A.C.; Pätzold, M. WiWeHAR: Multimodal Human Activity Recognition Using Wi-Fi and Wearable Sensing Modalities. IEEE Access 2020, 8, 164453–164470. [Google Scholar] [CrossRef]
  8. Chen, J.; Sun, Y.; Sun, S. Improving Human Activity Recognition Performance by Data Fusion and Feature Engineering. Sensors 2021, 21, 692. [Google Scholar] [CrossRef] [PubMed]
  9. Gomaa, W. Statistical and Time Series Analysis of Accelerometer Signals for Human Activity Recognition. In Proceedings of the 2019 14th International Conference on Computer Engineering and Systems (ICCES), Cairo, Egypt, 17 December 2019; pp. 351–356. [Google Scholar] [CrossRef]
  10. Ye, J.; Qi, G.J.; Zhuang, N.; Hu, H.; Hua, K.A. Learning Compact Features for Human Activity Recognition Via Probabilistic First-Take-All. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 126–139. [Google Scholar] [CrossRef]
  11. Yang, X.; Cao, R.; Zhou, M.; Xie, L. Temporal-Frequency Attention-Based Human Activity Recognition Using Commercial WiFi Devices. IEEE Access 2020, 8, 137758–137769. [Google Scholar] [CrossRef]
  12. Miao, G. Signal Processing in Digital Communications; Artech House: Norwood, MA, USA, 2006. [Google Scholar]
  13. Allen, J.B. The Short-Time Fourier Transform in Signal Analysis. IEEE Proc. 1977, 65, 1558–1563. [Google Scholar] [CrossRef]
  14. Wigner, E.P. On the Quantum Correction for Thermodynamic Equilibrium. Phys. Rev. 1948, 73, 1002–1009. [Google Scholar] [CrossRef]
  15. Baraniuk, R.G. Time-Frequency Signal Analysis using a Distribution-Theoretic Approach. IEEE Trans. Signal Process. 1996, 44, 2808–2820. [Google Scholar]
  16. Merry, R.J.E. Wavelet Theory and Applications: A Literature Study; Technische Universiteit Eindhoven: Eindhoven, The Netherlands, 2005. [Google Scholar]
  17. Ronao, C.A.; Cho, S.B. Deep learning for sensor-based human activity recognition: A survey. Pattern Recognit. Lett. 2016, 85, 1–11. [Google Scholar] [CrossRef]
  18. Meyer, Y. Wavelets and Operators; Cambridge University Press: Cambridge, UK, 1992. [Google Scholar]
  19. Diker, A.; Cömert, Z.; Avcı, E.; Toğaçar, M.; Ergen, B. A Novel Application based on Spectrogram and Convolutional Neural Network for ECG Classification. In Proceedings of the 2019 1st International Informatics and Software Engineering Conference (UBMYK), Ankara, Turkey, 6–7 November 2019; pp. 1–6. [Google Scholar] [CrossRef]
  20. Türk, Ö.; Özerdem, M.S. Epilepsy Detection by Using Scalogram Based Convolutional Neural Network from EEG Signals. Brain Sci. 2019, 9, 115. [Google Scholar] [CrossRef] [PubMed]
  21. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
  22. Lu, L.; Zhang, C.; Cao, K.; Deng, T.; Yang, Q. A Multichannel CNN-GRU Model for Human Activity Recognition. IEEE Access 2022, 10, 66797–66810. [Google Scholar] [CrossRef]
  23. Zhang, H.; Parker, L.E. CoDe4D: Color-Depth Local Spatio-Temporal Features for Human Activity Recognition From RGB-D Videos. IEEE Trans. Circuits Syst. Video Technol. 2016, 26, 541–555. [Google Scholar] [CrossRef]
  24. Hirooka, K.; Hasan, M.A.M.; Shin, J.; Srizon, A.Y. Ensembled Transfer Learning Based Multichannel Attention Networks for Human Activity Recognition in Still Images. IEEE Access 2022, 10, 47051–47062. [Google Scholar] [CrossRef]
  25. Mliki, H.; Bouhlel, F.; Hammami, M. Human activity recognition from UAV-captured video sequences. Pattern Recognit. 2020, 100, 107140. [Google Scholar] [CrossRef]
  26. Deotale, D.; Verma, M.; Suresh, P. GSAS: Enhancing efficiency of human activity recognition using GRU based Sub-activity stitching. Mater. Today Proc. 2022, 58, 562–568. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Man Po, L.; Liu, M.; Ur Rehman, Y.A.; Ou, W.; Zhao, Y. Data-level information enhancement: Motion-patch-based Siamese Convolutional Neural Networks for human activity recognition in videos. Expert Syst. Appl. 2020, 147, 113203. [Google Scholar] [CrossRef]
  28. Lillo, I.; Niebles, J.C.; Soto, A. Sparse Composition of Body Poses and Atomic Actions for Human Activity Recognition in RGB-D Videos. Image Vis. Comput. 2017, 59, 63–75. [Google Scholar] [CrossRef]
  29. Attal, F.; Mohammed, S.; Dedabrishvili, M.; Chamroukhi, F.; Oukhellou, L.; Amirat, Y. Physical Human Activity Recognition Using Wearable Sensors. Sensors 2015, 15, 31314–31338. [Google Scholar] [CrossRef] [PubMed]
  30. Inoue, M.; Inoue, S.; Nishida, T. Deep Recurrent Neural Network for Mobile Human Activity Recognition with High Throughput. Artif. Life Robot. 2018, 23, 173–185. [Google Scholar] [CrossRef]
  31. Singh, M.S.; Pondenkandath, V.; Zhou, B.; Lukowicz, P.; Liwicki, M. Transforming Sensor Data to the Image Domain for Deep Learning—An Application to Footstep Detection. arXiv 2017, arXiv:1701.01077. [Google Scholar] [CrossRef]
  32. Gholamrezaii, M.; Almodarresi, S. Human Activity Recognition Using 2D Convolutional Neural Networks. In Proceedings of the 2019 27th Iranian Conference on Electrical Engineering (ICEE), Yazd, Iran, 30 April–2 May 2019. [Google Scholar] [CrossRef]
  33. Leixian, S.; Zhang, Q.; Cao, G.; Xu, H. Fall Detection System Based on Deep Learning and Image Processing in Cloud Environment. In Intelligent Technologies and Robotics (R0); Springer: Berlin/Heidelberg, Germany, 2019; pp. 590–598. [Google Scholar] [CrossRef]
  34. Park, J.; Lim, W.S.; Kim, D.W.; Lee, J. Multitemporal Sampling Module for Real-Time Human Activity Recognition. IEEE Access 2022, 10, 54507–54515. [Google Scholar] [CrossRef]
  35. Zebhi, S. Human Activity Recognition Using Wearable Sensors Based on Image Classification. IEEE Sens. J. 2022, 22, 12117–12126. [Google Scholar] [CrossRef]
  36. Mekruksavanich, S.; Jitpattanakul, A.; Sitthithakerngkiet, K.; Youplao, P.; Yupapin, P. ResNet-SE: Channel Attention-Based Deep Residual Network for Complex Activity Recognition Using Wrist-Worn Wearable Sensors. IEEE Access 2022, 10, 51142–51154. [Google Scholar] [CrossRef]
  37. Dib, W.; Ghanem, K.; Ababou, A.; Eskofier, B.M. Human Activity Recognition Based on the Fading Characteristics of the On-Body Channel. IEEE Sens. J. 2022, 228, 8094–8103. [Google Scholar] [CrossRef]
  38. Trabelsi, I.; Françoise, J.; Bellik, Y. Sensor-based Activity Recognition using Deep Learning: A Comparative Study. In Proceedings of the 8th International Conference on Movement and Computing (MOCO ’22), Chicago, IL, USA, 22–24 June 2022; pp. 1–8. [Google Scholar] [CrossRef]
  39. Nadia, A.; Lyazid, S.; Okba, K.; Abdelghani, C. A CNN-MLP Deep Model for Sensor-based Human Activity Recognition. In Proceedings of the 2023 15th International Conference on Innovations in Information Technology (IIT), Al Ain, United Arab Emirates, 14–15 November 2023; pp. 121–126. [Google Scholar] [CrossRef]
  40. Noori, F.M.; Riegler, M.; Uddin, M.Z.; Torresen, J. Human Activity Recognition from Multiple Sensors Data Using Multi-fusion Representations and CNNs. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–19. [Google Scholar] [CrossRef]
  41. Zebin, T.; Scully, P.; Ozanyan, K. Human Activity Recognition with Inertial Sensors Using a Deep Learning Approach. In Proceedings of the 2016 IEEE SENSORS, Orlando, FL, USA, 30 October–3 November 2016; pp. 1–3. [Google Scholar] [CrossRef]
  42. Ogbuabor, G.; Labs, R. Human Activity Recognition for Healthcare Using Smartphones. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing, Macao, China, 26–28 February 2018; pp. 41–46. [Google Scholar] [CrossRef]
  43. Kuncan, F.; Kaya, Y.; Tekin, R.; Kuncan, M. A new approach for physical human activity recognition based on co-occurrence matrices. J. Supercomput. 2022, 78, 1048–1070. [Google Scholar] [CrossRef]
  44. Silik, A.; Noori, M.; Altabey, W.; Ghiasi, R.; Wu, Z. Comparative Analysis of Wavelet Transform for Time-Frequency Analysis and Transient Localization in Structural Health Monitoring. Struct. Durab. Health Monit. 2021, 15, 1–22. [Google Scholar] [CrossRef]
  45. Gholizad, A.; Safari, H. Damage identification of structures using experimental modal analysis and continuous wavelet transform. J. Numer. Methods Civ. Eng. 2017, 2, 61–71. [Google Scholar] [CrossRef]
  46. Morlet, J.; Arens, G.; Fourgeau, E.; Giard, D. Wavelet transform or the continuous wavelet transform? IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 237–244. [Google Scholar]
  47. Jansen, E.; Blom, J.A.P.; Van der Hulst, A.C. Seven heuristic wavelet families. IEEE Trans. Image Process. 1999, 8, 415–431. [Google Scholar]
  48. Morlet, J. Sampling in time-frequency space using the Wigner-Ville transform. IEEE Trans. Inf. Theory 1982, 28, 221–232. [Google Scholar]
  49. Mallat, S. A Wavelet Tour of Signal Processing: The Sparse Way; The Academy Press: Lagos, Nigeria, 1999; Volume 1, pp. 1–9. [Google Scholar]
  50. Stokes, M.; Srinivasan, S.; Manjunath, R. A standard default color space for the internet—sRGB. In Sixth Color Imaging Conference: Color Science, Systems and Applications; Society for Imaging Science and Technology: Springfield, VA, USA, 1996; pp. 136–143. [Google Scholar]
  51. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
  52. Ciregan, D.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3642–3649. [Google Scholar] [CrossRef]
  53. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  54. O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458. [Google Scholar]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  56. Koonce, B.; Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 125–144. [Google Scholar]
  57. Sucerquia, A.; López, J.D.; Vargas-Bonilla, J.F. SisFall: A Fall and Movement Dataset. Sensors 2017, 17, 198. [Google Scholar] [CrossRef]
  58. Reiss, A. PAMAP2 Physical Activity Monitoring. UCI Machine Learning Repository. 2012. [CrossRef]
  59. Wai Keng, N.; Leong, M.; Hee, L.; Abdelrhman, A. Wavelet Analysis: Mother Wavelet Selection Methods. Appl. Mech. Mater. 2013, 393, 953–958. [Google Scholar] [CrossRef]
  60. Upadhya, M.; Singh, A.K.; Thakur, P.; Nagata, E.A.; Ferreira, D.D. Mother wavelet selection method for voltage sag characterization and detection. Electr. Power Syst. Res. 2022, 211, 108246. [Google Scholar] [CrossRef]
  61. Polikar, R. The Wavelet Tutorial. Available online: https://users.rowan.edu/~polikar/WTtutorial.html (accessed on 23 June 2024).
  62. Zeng, M.; Gao, H.; Yu, T.; Mengshoel, O.; Langseth, H.; Lane, I.; Liu, X. Understanding and improving recurrent networks for human activity recognition by continuous attention. In Proceedings of the 2018 ACM International Symposium on Wearable Computers, New York, NY, USA, 8–10 October 2018; pp. 56–63. [Google Scholar]
  63. Xi, R.; Hou, M.; Fu, M.; Qu, H.; Liu, D. Deep dilated convolution on multimodality time series for human activity recognition. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8. [Google Scholar]
  64. Qian, H.; Pan, S.; Da, B.; Miao, C. A novel distribution-embedded neural network for sensor-based activity recognition. Sensors 2019, 19, 5498. [Google Scholar] [CrossRef]
  65. Sanchez Guinea, A.; Sarabchian, M.; Mühlhäuser, M. Improving Wearable-Based Activity Recognition Using Image Representations. Sensors 2022, 22, 1840. [Google Scholar] [CrossRef] [PubMed]
  66. Syed, A.S.; Sierra-Sosa, D.; Kumar, A.; Elmaghraby, A. A Hierarchical Approach to Activity Recognition and Fall Detection Using Wavelets and Adaptive Pooling. Sensors 2021, 21, 6653. [Google Scholar] [CrossRef]
  67. Al-Majdi, K.; AL-Musawi, R.; Ali, A.; Abudlmoniem, S.; Mezaal, D. Real-time classification of various types of falls and activities of daily livings based on CNN LSTM network. Period. Eng. Nat. Sci. (PEN) 2021, 9, 958–969. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
