1. Introduction
Falls pose a significant risk to the elderly, leading to hospitalizations, increased morbidity, and substantial healthcare expenses. According to the World Health Organization (WHO), falls are the second leading cause of accidental or unintentional injury deaths worldwide, with adults over 65 being the most vulnerable group [1]. The statistics are alarming: more than 37.3 million falls requiring medical attention occur annually, and this number is expected to rise with the global increase in the elderly population [2]. Early recognition and timely intervention are crucial in mitigating the consequences of falls, prompting extensive research and development efforts in fall-detection systems [3].
The research community has explored various approaches to automate fall detection, aiming to provide real-time assistance and enable prompt medical interventions. Wearable sensors, such as accelerometers and gyroscopes, have been extensively utilized [4]. These sensors capture movement patterns and acceleration changes associated with falls. Radar-based systems have also been employed to detect falls using Doppler-based signal-processing techniques [5]. These systems offer privacy preservation advantages since they do not rely on visual information. However, challenges such as the bulkiness of wearable devices, the need for regular recharging, and limitations in distinguishing falls from other activities must be addressed for widespread adoption [6].
Video-based fall-detection algorithms have gained significant attention due to their potential for real-time assistance and comprehensive monitoring capabilities. Using video datasets allows one to capture detailed spatiotemporal information and enables the development of advanced machine learning models [7]. Recent research has shown promising results in using video-based approaches, including deep-learning-based methods [8] and ensemble learning techniques [9], to improve accuracy and robustness in fall detection. However, researchers must address several challenges to ensure video-based fall-detection systems’ widespread acceptance and effectiveness.
Privacy preservation is critical when working with video data as they often contain sensitive information about individuals [10]. Data protection regulations, such as the General Data Protection Regulation (GDPR), impose strict requirements for collecting, storing, and processing personal data [11]. Ensuring privacy while maintaining the effectiveness of fall-detection algorithms is challenging. Integrating privacy-preserving techniques, such as video obfuscation or blurring, becomes essential to balance privacy and fall-detection accuracy [12].
Another drawback of traditional machine learning models is their lack of transparency, operating as “black boxes” where the decision-making process is not easily interpretable [13]. This lack of interpretability can be a significant challenge, particularly in medical applications such as fall detection, where healthcare professionals need to understand the reasoning behind the detection outcomes to trust the system [14]. Healthcare professionals must have insight into how the algorithm arrives at its decisions, as this helps them validate and understand the results. The explainability of fall-detection algorithms is therefore essential for building trust, ensuring patient safety, and facilitating effective collaboration between the algorithm and healthcare providers.
Existing fall-detection solutions face several challenges. Wearable sensors, while direct and personal, often cause discomfort that leads to non-compliance, have limited battery life that requires frequent recharging, and struggle to differentiate accurately between actual falls and daily activities such as sitting. Although video-based systems offer comprehensive monitoring, they raise significant privacy concerns and are subject to environmental factors such as poor lighting and obstructions, which can hinder their effectiveness. Additionally, these systems and other technologies, such as radar or acoustic sensors, can be costly and complex, posing barriers to widespread adoption. These limitations underscore the need for more adaptable, efficient, and privacy-conscious fall-detection systems.
In addressing the challenges associated with video-based fall detection, our proposed approach integrates advanced machine learning techniques adapted to the specific needs of medical healthcare. This comprehensive architecture comprises several modular components: a Gaussian module for video obfuscation, the OpenPose library for pose estimation, a short-time Fourier transform (STFT) module for capturing motion-induced frames and generating temporal dynamics spectrograms, a classification module utilizing convolutional and dense layers, and a gradient-weighted class activation mapping (GradCAM) module for providing visual explanations of decision-making processes.
The application of STFT in fall detection, particularly within video datasets, is relatively novel despite its extensive use in audio and speech analysis [15,16]. STFT’s ability to analyze the frequency content of signals across short time intervals allows for the detection of temporal changes, offering valuable insights into the dynamics and movement patterns associated with falls. Integrating STFT therefore enhances the understanding of fall events and serves as a useful tool for healthcare professionals, helping them identify and analyze the patterns and behaviors that lead to falls and improving patient care.
Furthermore, our system incorporates the GradCAM technique, renowned for its explainability capabilities within computer vision. GradCAM effectively highlights the areas within an input image that significantly influence the decisions made by the neural network. By deploying GradCAM, we provide visual explanations that enhance the transparency of the fall-detection process, thereby allowing healthcare professionals to validate and trust the system’s accuracy and reliability in detecting fall events.
By integrating STFT and GradCAM, our approach merges the benefits of capturing detailed temporal dynamics and visualizing influential key points during falls. This dual integration facilitates a deeper understanding of the dynamics surrounding fall incidents and enables the precise analysis of crucial moments leading to falls. Additionally, the visual explanations provided by GradCAM significantly enhance the interpretability of the model’s decisions, fostering trust and collaboration between healthcare providers and the fall-detection system.
Our system’s modular design allows customized configurations to meet diverse operational needs. For instance, in environments where privacy is not a primary concern, the Gaussian module can be deactivated to provide unobstructed video analysis. Conversely, in settings requiring continuous and comprehensive monitoring, the STFT module can be bypassed to allow uninterrupted data analysis. This flexibility ensures the system can be optimized for specific scenarios, enhancing its applicability and effectiveness.
Key contributions of our approach include:
STFT-Based Motion Analysis: We introduce an innovative STFT application for fall detection that selectively processes only frames exhibiting significant motion. This targeted approach reduces the incidence of false alarms, a common challenge in continuous monitoring scenarios, thereby enhancing system reliability and efficiency. The system optimizes processing power and storage by focusing computational resources on moments of potential falls, making it ideal for extended use in healthcare facilities.
Explanatory Visualizations with GradCAM: We employ GradCAM to enhance the transparency and interpretability of our machine learning model. This technique provides visual explanations by highlighting the critical regions within the data that influence the detection outcomes. Such visual insights are invaluable for medical professionals as they provide a clear basis for understanding and trusting the model’s decisions, thereby fostering a deeper integration of AI tools in routine clinical practices.
Modular System Flexibility: Our system’s architecture is designed for modular flexibility, allowing for customization according to specific operational requirements and privacy concerns. Facilities can deactivate the Gaussian module where privacy is less of a concern, offering unfiltered visual monitoring. Alternatively, in environments where constant, comprehensive data collection is required, the STFT module can be bypassed to maintain continuous monitoring without preselecting motion-induced frames. This adaptability ensures that the system can be tailored to the specific needs of different healthcare environments, enhancing its practicality and applicability across various clinical and care settings.
These contributions ensure that our fall-detection system meets the requirements of healthcare applications regarding accuracy, efficiency, and explainability.
The rest of the paper is organized as follows: Section 2 reviews related work on automated fall-detection algorithms. Section 3 presents the preliminaries needed for a basic understanding of the libraries and models used. The proposed architecture is explained in Section 4. The performance analysis and experimental results are demonstrated and discussed in Section 5. Finally, Section 6 concludes the work and discusses future directions of the proposed study.
4. Proposed Architecture
As illustrated in Figure 2, the proposed architecture consists of five main modules, each playing a crucial role in the fall-detection system. First, the Gaussian module protects individual privacy by applying a video obfuscation technique to the video input. Next, the OpenPose module utilizes the OpenPose network to extract key points from the video. The STFT module serves a dual purpose: it optimizes computational resources and provides contextual information about fall events. This contextual information helps in understanding the sequence of events and detecting any unusual motions or patterns preceding a sudden fall. The extracted frames are then fed into the classification module, which learns and identifies key point patterns associated with fall and non-fall movements. Lastly, the GradCAM module uses the convolution layers to generate a heat-map visualization of the sequence. This visualization aids in understanding the reasoning behind the fall-detection outcomes by highlighting the temporal variations in the key points that contribute to the classification decisions. Further details on each module are discussed in the subsequent sections.
4.1. Gaussian Module
In the Gaussian module, we implemented the Gaussian blur discussed in detail in Section 3.1, using a large kernel size of 21 × 21, which results in more significant filtering. The kernel size was set through experimentation so that the filtering does not significantly impact the ability of the OpenPose module to extract the key points. Once the video is obfuscated, it is sent to the OpenPose module for key point extraction.
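For illustration, this obfuscation step can be reproduced with OpenCV’s Gaussian blur; the 21 × 21 kernel matches the size reported above, while the video file name and frame loop are placeholder assumptions rather than our exact implementation:

```python
import cv2

def obfuscate_frame(frame, kernel_size=21):
    """Apply a heavy Gaussian blur for privacy preservation.
    A 21 x 21 kernel matches the size used in the Gaussian module; sigma = 0 lets
    OpenCV derive the standard deviation from the kernel size."""
    return cv2.GaussianBlur(frame, (kernel_size, kernel_size), 0)

# Example: blur every frame before passing it to the OpenPose module.
cap = cv2.VideoCapture("input_video.mp4")   # hypothetical file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    blurred = obfuscate_frame(frame)
    # blurred is forwarded to pose estimation
cap.release()
```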
4.2. OpenPose Module
The OpenPose module integrates the OpenPose library, a real-time multiperson key point detection tool. It extracts 25 key points with x and y coordinates. For this study, we utilize only a subset of four key points: the neck, left shoulder, right shoulder, and hip. This selection primarily serves two purposes.
Firstly, these key points demonstrate minimal data loss across various datasets, ensuring robustness in data processing. An example from the MCFD dataset illustrates this selection, as shown in Figure 3, where these key points show significantly fewer missing values than the others. Similar results were observed in the other datasets. This preservation makes the subset a reliable and robust choice for training the network and improving model performance.
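To illustrate how such a subset can be identified, the sketch below estimates the fraction of frames in which each OpenPose key point is missing; the array layout and the synthetic data are assumptions made for demonstration only:

```python
import numpy as np

def missing_fraction(keypoints):
    """keypoints: array of shape (num_frames, 25, 2) with undetected joints stored as NaN.
    Returns, for each of the 25 OpenPose key points, the fraction of frames in which it is missing."""
    missing = np.isnan(keypoints).any(axis=2)    # (num_frames, 25) boolean mask
    return missing.mean(axis=0)

# Synthetic stand-in for one video: 500 frames, 10% of detections dropped at random.
kp = np.random.rand(500, 25, 2)
kp[np.random.rand(500, 25) < 0.1] = np.nan
print(missing_fraction(kp))                      # lower fractions indicate more reliable joints
```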
Secondly, by focusing on this subset, the network achieves improved cost-effectiveness and precision due to a substantial reduction in the calculations required. Following the selection of this subset, the velocities of these four key points are computed along the X and Y axes using the formula below:

$$v_i^{k} = \frac{p_i^{k} - p_{i-j}^{k}}{j},$$

where $v_i^{k}$ represents the velocity of key point $k$ at frame $i$, $p_i^{k}$ is its coordinate along the X or Y axis at frame $i$, and $j$ is the number of frames over which the velocity is calculated.
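A minimal sketch of this velocity computation, assuming the selected key point coordinates are stored as a NumPy array, is shown below:

```python
import numpy as np

def keypoint_velocities(positions, j=1):
    """positions: array of shape (num_frames, 4, 2) holding the x/y coordinates of the
    neck, left shoulder, right shoulder, and hip.  Returns the finite-difference velocity
    v_i^k = (p_i^k - p_{i-j}^k) / j along both axes, with shape (num_frames - j, 4, 2)."""
    return (positions[j:] - positions[:-j]) / float(j)

# Usage with synthetic coordinates standing in for OpenPose output:
pos = np.cumsum(np.random.randn(300, 4, 2), axis=0)
vel = keypoint_velocities(pos, j=1)
```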
Reducing the number of key points from twenty-five to four significantly decreases the amount of data that need to be processed. This streamlined approach enhances the computational efficiency and reduces the processing time required for key point extraction and subsequent analysis. By focusing on the most relevant key points, the system can quickly and accurately detect falls without being bogged down by unnecessary calculations associated with less critical key points.
This optimization is particularly advantageous in real-time applications where rapid response times are essential. For instance, in a healthcare setting where real-time monitoring is crucial, the ability to quickly process and analyze video data can make a significant difference in the timely detection of falls and the provision of immediate assistance. Moreover, the reduced computational load translates to lower power consumption and resource usage, making the system not just efficient but also practical and scalable for widespread deployment.
4.3. STFT Module
The STFT module is a fundamental part of our fall-detection system, providing the capability to distinguish between fall and non-fall patterns through careful frame analysis. Taking the key point velocities computed by the OpenPose module, the STFT module transforms these time-based measurements into the frequency domain. This transformation is key in revealing the differences between the transient, abrupt movements characteristic of falls and the more sustained, regular motions associated with normal activities.
The process begins with the segmentation of video data into overlapping frames. This step is vital to ensure that the continuity of motion is preserved, which helps accurately capture events that may lead to falls without missing any sudden movements due to temporal gaps. Following segmentation, each frame is treated with a window function, typically a Hamming window, to reduce spectral leakage and enhance the Fourier transform’s resolution. This windowing is crucial as it helps maintain the integrity of the signal edges, thereby ensuring that the frequency analysis is accurate and reliable.
Subsequently, the STFT is applied to each windowed frame, converting the velocity signals from the time domain to the frequency domain. This conversion allows us to construct a spectrogram that visualizes the intensity and frequency of the motion data over time. In the spectrogram, visible high-energy patterns often correspond to vigorous activities, which may indicate a fall or other large movements. Such a visual representation is helpful as it allows for the easy identification of high-motion activities such as falls by highlighting sudden increases in frequency content. Conversely, low-motion activities such as sleeping or sitting manifest differently in the spectrogram, as they do not produce such patterns. This visualization helps differentiate falls from both low-motion and other high-motion activities, thus enhancing the fall-detection system’s reliability and accuracy.
Furthermore, the STFT module’s adaptability demonstrates its robustness. By optimizing parameters such as the size and overlap of the window function in the STFT, the module’s sensitivity and specificity can be finely tuned to respond to different types of movements. This adaptability is a key factor in ensuring the module performs accurately across a variety of real-world scenarios, where the nature of movements can vary significantly, providing a sense of reassurance about its performance.
The STFT module strengthens the technical foundation of our fall-detection system and enhances its interpretability and utility. Capturing detailed motion patterns and transforming them into an analyzable format provides critical insights that are used for robust fall detection. When significant motion is detected, the module extracts a sequence of 50 frames that encapsulate the event, which are then forwarded to the classification module to determine whether a fall has occurred. This sequence of operations underscores the module’s integral role in the overall effectiveness and reliability of our fall-detection mechanism.
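The sketch below illustrates this motion-triggered extraction; the frame rate and energy threshold are assumed placeholder values, since the text specifies only that a 50-frame sequence around significant motion is forwarded to the classification module:

```python
import numpy as np
from scipy.signal import stft

def extract_motion_window(velocity, fs=30, seq_len=50, energy_thresh=1.0):
    """velocity: 1-D velocity signal of one key point/axis sampled at the video frame rate fs.
    Returns the indices of a 50-frame window centred on the strongest motion, or None
    if no time bin exceeds the (assumed) energy threshold."""
    f, t, Z = stft(velocity, fs=fs, window="hann", nperseg=16)
    energy = (np.abs(Z) ** 2).sum(axis=0)          # spectral power per time bin
    if energy.max() < energy_thresh:
        return None                                # no significant motion detected
    peak_frame = int(t[np.argmax(energy)] * fs)    # frame index of the strongest motion
    start = max(0, peak_frame - seq_len // 2)
    return slice(start, start + seq_len)

# Example with a synthetic signal containing one burst of motion:
sig = np.zeros(300)
sig[140:160] = 5.0 * np.random.randn(20)
print(extract_motion_window(sig))
```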
4.4. Classification Module
After extracting the sequence of frames exhibiting substantial motion from the input videos, the corresponding sequence of selected key points is fed into the classification module, which aims to discern the occurrence of a fall event. As depicted in Figure 4, the classification module encompasses several key components: a convolution layer, a MaxPool layer, a Flatten layer, and dense units. The computational pipeline begins with the convolution layer, which captures essential spatiotemporal features from the input key points. Subsequently, the MaxPool layer reduces the dimensionality of the extracted features while preserving crucial information. The Flatten layer then transforms the output of the MaxPool layer into a one-dimensional vector representation. Finally, this compressed and structured representation is fed into a fully connected layer comprising dense units to detect and identify instances of fall events. The network structure and parameters of the classification module are presented in Table 1.
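A minimal Keras sketch of this pipeline is given below; the filter count, kernel size, and dense-layer width are illustrative placeholders rather than the exact values listed in Table 1:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(seq_len=50, num_features=8):
    """Conv -> MaxPool -> Flatten -> Dense classifier over 50-frame key point sequences.
    num_features = 4 key points x 2 velocity components (x and y)."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, num_features)),
        layers.Conv1D(32, kernel_size=3, activation="relu"),   # placeholder filter count
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),                   # placeholder width
        layers.Dense(1, activation="sigmoid"),                 # fall vs. non-fall
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_classifier()
model.summary()
```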
4.5. GradCAM Module
By employing the GradCAM module discussed in Section 3.5, we highlight the regions within the key point sequence that contribute most significantly to the classification decision, enhancing the understanding and trustworthiness of our model’s predictions. GradCAM enables the identification of the temporal variations in the sequence most indicative of fall events. The resulting GradCAM heat map provides visual cues indicating the areas of the sequence with significant variations between fall and non-fall events. By visualizing these salient regions, medical professionals gain insight into the decision-making process of our model, empowering them to validate the accuracy and reliability of its fall-event-detection capabilities.
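A minimal sketch of how such a temporal heat map can be obtained for a 1D convolutional classifier is shown below; the TensorFlow implementation and the convolutional layer name are assumptions rather than our exact code:

```python
import numpy as np
import tensorflow as tf

def grad_cam_1d(model, sequence, conv_layer_name="conv1d"):
    """Weights each temporal position of the last convolutional feature map by the
    gradient of the fall score, yielding a normalized heat map over the 50-frame sequence."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(sequence[np.newaxis, ...])
        score = preds[:, 0]                               # predicted fall probability
    grads = tape.gradient(score, conv_out)                # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=1)               # average gradients over time
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                                 # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()    # normalized temporal heat map
```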
5. Experimental Results
5.1. Dataset Description
This study employs multiple datasets to enhance the robustness and applicability of the proposed fall-detection system. Below are descriptions of the datasets used:
NTU RGB+D Dataset (NTU): This dataset [45] is renowned for its extensive collection of human activities captured through high-resolution videos. We selected a subset of activities specifically relevant to medical conditions, focusing on actions critical for assessing health-related incidents. This subset includes sneezing/coughing, staggering, falling, headache, chest pain, back pain, neck pain, nausea/vomiting, and fanning oneself, providing a rich resource for analyzing fall-related events.
Multiple Cameras Fall Dataset (MCFD): Auvinet et al. [46] developed the MCFD, which consists of 192 video recordings captured from 24 different scenarios using eight synchronized cameras with a resolution of 720 × 480. This dataset uniquely includes a wide variety of fall scenarios (22) and activities of daily living (2), such as moving boxes, dressing, and room cleaning, recorded from multiple viewpoints. This diversity aids in simulating more realistic fall situations.
UR Fall Dataset (URFD): The Interdisciplinary Centre for Computational Modelling at the University of Rzeszow [47] created the UR fall-detection dataset (URFD), which includes 70 videos: 30 depicting falls and 40 showing non-fall activities such as walking, sitting, and squatting. The videos capture performers exhibiting a range of fall-related behaviors, including backward leaning and sudden descents to the ground, and are recorded in RGB format at a resolution of 640 × 480.
Each dataset provides unique insights and challenges, offering a comprehensive platform for testing and improving the proposed fall-detection algorithms.
5.2. Evaluation Approach and Metrics
We assess the effectiveness of our classification model through a rigorous evaluation methodology that incorporates k-fold cross-validation and various metrics, including the confusion matrix and classification report. The k-fold cross-validation technique is a crucial part of our methodology, as it ensures a robust evaluation of the model’s performance. It divides the dataset into k partitions, or folds, enabling multiple rounds of training and testing on distinct data subsets. We conduct five-fold cross-validation (k = 5) to assess the model’s performance. Each fold serves as the testing set once, while the remaining folds constitute the training set; we iterate this process five times so that each fold is used as the testing set exactly once. Within each iteration, the model is trained on the training set, which encompasses 80 percent of the available data, and is then evaluated on the corresponding testing set to assess its performance on previously unseen data.
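A sketch of this protocol using scikit-learn is shown below; the file names are hypothetical, the stratified split is an assumption (the text specifies only k-fold cross-validation with k = 5), and build_classifier refers to the illustrative model sketched in Section 4.4:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.load("sequences.npy")   # hypothetical: (num_sequences, 50, num_features)
y = np.load("labels.npy")      # hypothetical: 1 = fall, 0 = non-fall

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = build_classifier()                     # see the Section 4.4 sketch
    model.fit(X[train_idx], y[train_idx], epochs=20, batch_size=32, verbose=0)  # placeholder settings
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)
print("mean accuracy over 5 folds:", np.mean(scores))
```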
We employ metrics such as accuracy, F1 score, sensitivity, and specificity to assess the model’s performance thoroughly.
The accuracy metric evaluates the model’s prediction accuracy, estimating its overall performance, and is computed as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where

TP (True Positive) is when the predicted and actual outputs are both true.
FP (False Positive) is when the predicted output is true but the actual output is false.
TN (True Negative) is when the predicted output is false and the actual output is also false.
FN (False Negative) is when the predicted output is false but the actual output is true.

The F1 score is a statistical measure that combines precision and recall in a balanced manner, calculated as the harmonic mean of the two metrics. It offers a comprehensive assessment of the model’s performance by simultaneously considering precision and recall, which proves advantageous in evaluating classification models, and is given by

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{and} \quad \text{Recall} = \frac{TP}{TP + FN}.$$

Sensitivity, also known as recall, measures the proportion of actual positives correctly identified by the model and is computed as

$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$

Specificity measures the proportion of actual negatives correctly identified by the model and is computed as

$$\text{Specificity} = \frac{TN}{TN + FP}.$$
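For reference, all four metrics can be computed directly from the confusion-matrix counts, as in the short sketch below (the counts in the example are made up):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, F1 score, sensitivity, and specificity from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "f1": f1, "sensitivity": recall, "specificity": specificity}

print(classification_metrics(tp=48, fp=1, tn=52, fn=2))
```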
5.3. Performance Analysis for Multiple Cameras Fall Dataset (MCFD)
5.3.1. Result from Gaussian Module
We employed Gaussian blur in this module with a kernel size chosen through experimentation to effectively obfuscate the individual’s identity. The results are demonstrated in Figure 5: the top row displays frames without blurring, while the bottom row shows frames with blurring. This approach ensures that the video remains sufficiently blurred to the naked eye, making deblurring exceedingly challenging even with recent advancements in deblurring techniques. Using a large kernel size results in a significant loss of information, posing a substantial challenge for any regenerative deblurring technique.
5.3.2. Result from OpenPose Module
The effectiveness of the OpenPose module in capturing and analyzing body poses, even in the presence of blurring, is demonstrated in Figure 6. These results showcase the key points extracted from subjects in various videos, providing valuable insights into body movements and postures. Such accurate extraction enables further analysis and the subsequent prediction of falls. Although the OpenPose module successfully extracted most of the key points from the frames, there were instances where it failed to detect the person or generated key points randomly within the frames. To address this issue, we applied data preprocessing techniques and employed linear interpolation to manage these occurrences.
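A minimal sketch of the linear-interpolation step, assuming missing detections are encoded as NaN, is shown below:

```python
import numpy as np
import pandas as pd

def interpolate_keypoints(coords):
    """coords: (num_frames, 2) x/y track of one key point, with NaN where OpenPose failed
    to detect the person.  Interior gaps are filled by linear interpolation; gaps at the
    very start or end of a track would need separate handling (e.g., forward/backward fill)."""
    df = pd.DataFrame(coords, columns=["x", "y"])
    return df.interpolate(method="linear").to_numpy()

# Example: a track with two dropped detections.
track = np.array([[10.0, 5.0], [np.nan, np.nan], [12.0, 7.0], [np.nan, np.nan], [14.0, 9.0]])
print(interpolate_keypoints(track))
```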
5.3.3. Result from STFT Module
The selection of STFT parameters such as window type, segment size, overlap ratio, FFT length, scaling, and mode was crucial for accurately capturing and analyzing significant movements indicative of falls. We employed a Hann window for its smooth tapering, which minimizes spectral leakage. A segment size of 16 and an FFT length of 256 were chosen to balance time and frequency resolutions effectively. The overlap ratio was 93.75%, ensuring motion continuity is preserved without sacrificing computational efficiency. Density scaling was applied to normalize the spectrogram by the signal’s power, assisting in the consistent interpretation of spectral density. Meanwhile, the PSD mode focused on the power distribution within the frequency spectrum.
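For illustration, these settings map directly onto SciPy’s spectrogram function; the frame rate and the synthetic velocity signal below are assumptions made for demonstration:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 30.0                              # assumed video frame rate (frames per second)
velocity = np.random.randn(500)        # placeholder for one key point velocity signal

# Hann window, segment size 16, 93.75% overlap (noverlap = 15), FFT length 256,
# density scaling, and PSD mode, as reported above.
f, t, Sxx = spectrogram(velocity, fs=fs, window="hann",
                        nperseg=16, noverlap=15, nfft=256,
                        scaling="density", mode="psd")
# Sxx has shape (len(f), len(t)); abrupt, high-energy columns mark candidate fall motion.
```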
These parameters were not arbitrarily chosen but were determined through an extensive empirical approach, including trials with various settings. This rigorous evaluation allowed us to identify a parameter set that optimally balances sensitivity to motion with computational efficiency. Initially, the effectiveness of these parameters was confirmed through the manual verification of 112 sequences derived from high-motion events captured across two camera angles. This initial test ensured the STFT settings were accurately capturing significant motion events.
Following the successful manual verification, the STFT module was applied to a broader dataset, extracting high-motion sequences from 448 sequences recorded across eight different cameras. This extensive application demonstrates the module’s capability to consistently identify and analyze significant motion across diverse settings, as shown in Figure 7.
To ensure the statistical reliability of our findings from the STFT module, we conducted a power analysis using G*Power, as shown in Figure 8. Based on an effect size of 0.5, an α error probability of 0.05, and a desired power of 0.8, the analysis determined that a total sample size of 102 sequences would be necessary for robust results. With our initial manual testing covering 112 sequences and the STFT module effectively analyzing 448 sequences, our study significantly exceeds the recommended sample size.
5.3.4. Result from Classification Module
The classification reports in Table 2 and Table 3 offer detailed insights into the performance of the classification module, enabling a thorough assessment of the system’s accuracy.
We observed that using a subset of key points consisting of the neck, left shoulder, right shoulder, and hip yielded significantly better results than using all the key points. The subset achieved an accuracy of 98%, outperforming the 91% accuracy obtained when using all the key points. This finding suggests two things: first, that the other key points contain a large amount of missing data, and second, that the selected subset contains key anatomical landmarks that are particularly informative for fall detection. Focusing on these key points captures the essential body movements and orientations that are strongly indicative of falls.
However, it is crucial to note that there were some misclassifications within the detection system, particularly in scenarios where falls did not conform to typical patterns, such as when an individual fell onto a sofa instead of the floor. These instances, which often involve softer, more gradual descents that may not generate the distinct motion signatures expected from a fall, were sometimes incorrectly labeled as non-falls. Such nuanced scenarios underscore the complexity of accurately detecting falls across diverse real-world environments. The variability in how falls occur is affected by factors such as the individual’s interaction with surrounding furniture and the nature of the fall itself, which presents substantial challenges in automating reliable fall detection.
The environment in which a fall occurs plays a significant role in detection accuracy. For instance, objects like sofas can cushion a fall, significantly altering the fall’s dynamics and the associated sensory data captured by the system. This can obscure critical visual and motion cues necessary for accurate classification. Additionally, the effectiveness of fall detection can be influenced by the camera placement. The current system utilizes eight different cameras positioned around the room to capture multiple angles of potential falls. However, optimal placement and improved sensitivity may enhance the system’s ability to capture subtle movements associated with atypical falls, such as those partially obstructed by furniture.
5.3.5. Result from GradCAM Module
The GradCAM module is pivotal in enhancing the interpretability of our fall-detection model. By generating heatmaps highlighting the temporal variance of key points within sequences where a fall occurs, GradCAM offers visual insights into the specific features and areas that the model considers most relevant for making its decisions. This visualization not only helps identify the critical areas of interest for fall identification but also clarifies the decision-making process of the model, which is particularly useful in complex, real-world scenarios where multiple factors influence the outcome.
The heatmaps produced by GradCAM effectively illustrate how certain patterns of movement and specific key points are emphasized during fall events compared to non-fall activities, as seen in Figure 9. For instance, during a fall, significant key points such as the neck, shoulders, and hips might show synchronized and rapid movements, which are captured and emphasized in the heatmap. These patterns are critical for distinguishing falls from other activities like sitting or bending, where movements are more gradual and less coordinated.
Moreover, the ability of GradCAM to visually demonstrate these differences enhances trust in the model’s predictive capabilities. Medical practitioners and caregivers can use these insights to understand the model’s reasoning better, potentially leading to more informed decisions about patient safety measures and interventions. For example, knowing which movements or positions most commonly lead to falls can inform targeted exercises or changes in the living environment to prevent such incidents.
From a practical standpoint, the interpretability provided by GradCAM serves as a vital link between the raw data processed by the model and the human understanding of that data. This connection is particularly crucial in healthcare settings, where the ability to explain AI systems is not just a convenience but a necessity for user acceptance and trust. By offering clear, intuitive visual explanations, GradCAM demystifies the model’s operations, making its insights more accessible and actionable for healthcare professionals.
The interpretability provided by GradCAM is not only beneficial for understanding the model’s decisions but also for iterative model improvement. Developers and clinicians can scrutinize cases where the model’s output deviates from expected results, pinpointing potential areas for enhancement. For example, if the GradCAM heatmap consistently highlights incorrect regions during fall events, this could signal a need to adjust the model’s focus or augment the training dataset to better reflect the intricacy of various fall types.
In conclusion, integrating GradCAM into our fall-detection system enhances the system’s transparency and effectiveness. The clear visual feedback provided by GradCAM not only validates the model’s decisions but also fosters continuous improvement and cultivates user trust. This makes GradCAM an invaluable asset in the development of dependable and explainable AI solutions for healthcare.
5.4. Impact of Gaussian Blur on Fall-Detection Performance
In our fall-detection system, Gaussian blur is employed to ensure privacy protection and test the OpenPose module’s resilience in key point detection under varying levels of image obfuscation. We conducted detailed experiments with different Gaussian blur kernel sizes to find the optimal balance between these requirements—privacy and detection accuracy.
The following table demonstrates how the performance of the fall-detection system varies with the degree of Gaussian blur applied:
As the data in Table 4 illustrates, smaller kernel sizes like 5 × 5 offer the highest performance in terms of precision, recall, and overall accuracy. With an increase in kernel size, there is a notable trend towards reduced precision and accuracy, though recall remains relatively high. This trend is attributed to the blurring effect obscuring key points, making it challenging for the OpenPose module to detect and classify key points accurately. Particularly with a kernel size of 31 × 31, there is a noticeable drop in performance metrics, indicating significant difficulty in key point detection due to excessive blurring.
The kernel size of 21 × 21 represents a compromise, providing adequate privacy while maintaining reasonable accuracy in fall detection. However, as the kernel size increases to 31 × 31, the decline in performance underscores the critical trade-off faced. While stronger blurring enhances privacy, it significantly impairs the system’s ability to effectively detect and analyze fall incidents.
This analysis clearly demonstrates the need to carefully select the degree of Gaussian blur. It ensures that while the privacy of the individuals in the video footage is protected, the fall-detection capabilities of the system are not unduly compromised. The findings highlight the delicate balance between obscuring sensitive information and retaining sufficient image clarity for accurate fall detection, which is crucial for practical applications in environments where privacy and security are paramount.
5.5. Performance Analysis for Other Datasets
We evaluated the performance of our fall-detection framework on several additional datasets, each offering unique challenges and providing valuable insights. Below is a summary of the classification performance for each dataset.
5.5.1. UR Fall Dataset (URFD)
The UR fall dataset (URFD) included 30 fall event videos captured by two cameras, resulting in 60 instances of fall events. In addition, the dataset comprised 40 videos depicting other activities. We utilized the STFT to extract instances with significant motion for both fall and non-fall events. For non-fall activities that did not exhibit significant motion, we randomly selected 50 frames for classification to ensure robustness in varied scenarios. Notably, the Gaussian blur module was not employed in this evaluation to maintain consistency with state-of-the-art algorithms for a direct comparison.
The results demonstrate that the framework effectively distinguishes between fall and non-fall events with high accuracy, as seen in Table 5, supporting its potential utility in real-world scenarios where precise and reliable fall detection is crucial.
5.5.2. NTU RGB+D Dataset (NTU)
The NTU RGB+D dataset includes 946 videos labeled as falling under medical conditions. For this dataset, we applied STFT to extract 50 frames, specifically at moments when a fall occurred. We randomly extracted frames from videos without significant motion or fall events to simulate the lack of subtle movement typically associated with non-fall scenarios. This approach aimed to challenge the model’s ability to discern true falls from low-activity states.
The retraining of our model on this dataset excluded the use of the Gaussian module, focusing solely on the capabilities of STFT and the classification framework to evaluate its performance against sophisticated activities and subtle motions.
The classification report in Table 6 indicates a slight misclassification rate in the fall category, primarily due to incomplete fall actions in some videos, which pose challenges in detecting definitive fall patterns. Nonetheless, the high precision and recall rates affirm the model’s robustness and capability to handle complex scenarios in medical monitoring applications.
5.6. Comparison with Other State-of-the-Art Methods
To validate our proposed fall-detection system, we conducted extensive comparisons against several state-of-the-art methods across three distinct datasets: MCFD, URFD, and NTU, as detailed in Table 7. Our system consistently achieved higher sensitivity and specificity across all tested scenarios, demonstrating its robustness and accuracy in detecting fall events.
In the MCFD dataset, our method outperformed traditional techniques such as PCANet-SVM [48], HLC-SVM [49], and a conventional CNN approach [50]. Moreover, it excelled over the DSM [51] and OpenPose-SVM [52]; the latter, while not explicitly labeled as “real-time” in its documentation, has been demonstrated to possess characteristics suitable for real-time applications due to its computational efficiency. Our system achieved a perfect sensitivity of 1.00 and a specificity of 0.963, accurately detecting all fall events while maintaining a high True Negative rate.
For the URFD dataset, our system’s performance notably excelled compared to methods like GLR-FD [53], Dense-OF-FD [54], and I3D-FC-NN [56]. We also included newer methods such as the Grassmann Manifold approach [57] and HCAE [58], which are recognized for their robust performance and potential for real-time processing. Our approach achieved a sensitivity of 1.00 and a specificity of 0.975, proving its efficacy in accurately distinguishing between fall and non-fall events, a critical factor in reducing false alarms and ensuring timely medical responses.
Lastly, in the NTU dataset, our proposed method showcased an exemplary specificity of 1.00 and a sensitivity of 0.980. This performance underscores our system’s ability to handle varied and complex scenarios effectively, proving its suitability for real-world applications. Our comparative analysis highlights our method’s state-of-the-art performance and applicability in real-time settings.
5.7. System Configuration and Real-Time Performance Analysis
We conducted fall-detection experiments on a high-performance Linux platform, utilizing an NVIDIA DGX Server Version 4.6.0 equipped with a GNU Linux 4.15.0-121-generic x86 operating system (Dell Inc., Round Rock, TX, USA). The deep learning models, including the CNN for fall detection, leveraged an NVIDIA Tesla V100 SXM3 GPU (Santa Clara, CA, USA) with 32 GB of memory. This setup ensures substantial computational efficiency, which is critical for real-time processing.
The real-time efficiency of our system is primarily attributed to the CNN classification module, which is optimized to utilize the robust capabilities of the NVIDIA Tesla V100 GPU. This configuration allows for the rapid processing of video frames. Each frame undergoes preprocessing, pose estimation via the OpenPose module, and STFT analysis within approximately 50 ms. The CNN module processes each frame in about 16.7 ms. Consequently, the entire sequence of 50 frames is analyzed in about 833 ms, demonstrating the system’s ability to operate effectively under real-time constraints. This processing speed is crucial for scenarios requiring a timely fall-incident response, enabling prompt detection and intervention.
To highlight our system’s real-time capabilities, we compared its frame-processing speed with other contemporary methods, as summarized in Table 8. This comparative analysis underscores our system’s suitability for real-time applications by demonstrating its processing efficiency relative to other methods in the field.
Our proposed method processes 60 frames per second on the GPU, providing a competitive processing rate in fall detection. This capability demonstrates our system’s efficiency and applicability in scenarios where quick response and low latency are critical, such as monitoring elderly individuals to prevent fall-related injuries. This processing speed supports the real-time capability of our system, positioning it as a reliable solution in fall detection for urgent care scenarios.
6. Conclusions and Future Work
This paper presented a comprehensive and modular fall-detection framework designed to enhance the accuracy, efficiency, and explainability of monitoring systems for elderly care. At the core of our approach is the novel integration of short-time Fourier transform (STFT) for dynamic frame extraction, significantly reducing false alarms by focusing on frames exhibiting substantial motion. This targeted frame extraction is particularly beneficial in environments where continuous monitoring is essential yet computational efficiency is also a requirement. The framework includes a lightweight 1D convolutional neural network (CNN) optimized for low computational demand while maintaining high accuracy, making the system suitable for real-time applications in resource-constrained settings. Moreover, integrating gradient-weighted class activation mapping (GradCAM) provides valuable insights into the model’s decision-making process, enhancing transparency and offering trustworthy feedback to caregivers and medical professionals.
Despite these advancements, the system faces limitations, particularly in environments with complex dynamics, such as varying lighting conditions or physical obstructions that can obscure fall events. To address these challenges, future work will explore the integration of multimodal data inputs, such as combining visual data with other sensors like infrared or depth cameras. This approach aims to integrate more complex deep learning models to enhance detection robustness across diverse operational settings. Additionally, future studies will expand the dataset to include a broader spectrum of fall-related scenarios and diverse demographics to test the system’s efficacy more comprehensively. We also plan to refine the GradCAM module to deliver more detailed visual explanations, facilitating a deeper understanding of the model’s predictive behaviors.
Overall, the proposed modular fall-detection framework represents a significant step forward in applying advanced machine learning techniques to elderly care. By enhancing the system’s accuracy, reliability, and user trust, we aim to contribute to safer living environments for the elderly, ultimately reducing the incidence and impact of falls worldwide.