Article

Intelligent Fusion: A Resilient Anomaly Detection Framework for IoMT Health Devices

1 Department of Computer Science, German University of Technology in Oman, P.O. Box 1816, Muscat P.C 130, Oman
2 Department of Information Systems, Sultan Qaboos University, Muscat P.C 123, Oman
* Authors to whom correspondence should be addressed.
Information 2026, 17(2), 117; https://doi.org/10.3390/info17020117
Submission received: 5 December 2025 / Revised: 21 January 2026 / Accepted: 23 January 2026 / Published: 26 January 2026
(This article belongs to the Special Issue Intrusion Detection Systems in IoT Networks)

Abstract

Modern healthcare systems increasingly depend on wearable Internet of Medical Things (IoMT) devices for the continuous monitoring of patients’ physiological parameters. It remains challenging to differentiate between genuine physiological anomalies, sensor faults, and malicious cyber interference. In this work, we propose a hybrid fusion framework designed to attribute the most plausible source of an anomaly, thereby supporting more reliable clinical decisions. The proposed framework is developed and evaluated using two complementary datasets: CICIoMT2024 for modelling security threats and a large-scale intensive care cohort from MIMIC-IV for analysing key vital signs and bedside interventions. The core of the system combines a supervised XGBoost classifier for attack detection with an unsupervised LSTM autoencoder for identifying physiological and technical deviations. To improve clinical realism and avoid artefacts introduced by quantised or placeholder measurements, the physiological module incorporates quality-aware preprocessing and missingness indicators. The fusion decision policy is calibrated under prudent, safety-oriented constraints to limit false escalation. Rather than relying on fixed fusion weights, we train a lightweight fusion classifier that combines complementary evidence from the security and clinical modules, and we select class-specific probability thresholds on a dedicated calibration split. The security module achieves high cross-validated performance, while the clinical model captures abnormal physiological patterns at scale, including deviations consistent with both acute deterioration and data-quality faults. Explainability is provided through SHAP analysis for the security module and reconstruction-error attribution for physiological anomalies. 
The integrated fusion framework achieves a final accuracy of 99.76% under prudent calibration and a Matthews Correlation Coefficient (MCC) of 0.995, with an average end-to-end inference latency of 84.69 ms (p95 upper bound of 107.30 ms), supporting near real-time execution in edge-oriented settings. While performance is strong, clinical severity labels are operationalised through rule-based proxies, and cross-domain fusion relies on harmonised alignment assumptions. These aspects should be further evaluated using realistic fault traces and prospective IoMT data. Despite these limitations, the proposed framework offers a practical and explainable approach for IoMT-based patient monitoring.

1. Introduction

Wearable medical devices that connect to networks are increasingly changing the way healthcare is delivered. Instead of relying on a limited number of isolated measurements collected during clinic visits, patients can now be monitored continuously and in real time within the Internet of Medical Things (IoMT) ecosystem [1]. These platforms generate large streams of longitudinal physiological signals, such as heart rate (HR), blood pressure (BP), and oxygen saturation (SpO2), which support earlier detection of clinical deterioration and enable more proactive and personalised care [2]. Deploying such devices as nodes in resource-constrained Wireless Sensor Networks (WSNs) can enhance chronic disease management, support surgical recovery, and help control healthcare costs by enabling timely interventions [3].
The clinical usefulness of these continuous data, however, depends on their reliability, which is challenged by three broad classes of anomalies. First, genuine physiological changes may indicate a worsening of the patient’s condition. Second, data integrity can be compromised by technical issues such as sensor drift, disconnections, signal loss, or motion artefacts, all of which may lead to inaccurate measurements [4]. Third, the networked nature of IoMT deployments exposes them to adversarial interference. Cyber-attacks like Denial of Service (DoS), Man-in-the-Middle (MitM), data injection, and replay attacks can manipulate or disrupt data flows. This disruption may lead to missed critical events, false alarms, and the possible disclosure of sensitive patient data. In this respect, and beyond technical robustness, IoMT systems must comply with privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR), which influence design choices and deployment strategies. The security of these heterogeneous and often insecure-by-design environments remains a major open challenge for research and deployment [5].
The current academic literature addresses these challenges predominantly in isolated, domain-specific silos. A substantial part of the literature is focused on Network Intrusion Detection Systems (NIDS), by applying several machine learning (ML) models to datasets such as CIC-IDS2017 for identifying malicious traffic patterns [6]. These systems, even if effective for their specific task, are generally agnostic to the physiological context and to the operational state of the sensors. A network-level anomaly detector, as an example, cannot distinguish between a data drop caused by a DDoS attack and one that is caused, instead, by a sensor battery failure. On the other hand, another line of research focuses on the detection of physiological anomalies, often using deep learning (DL) models, including Long Short-Term Memory (LSTM) and autoencoders (AE) on clinical datasets such as MIMIC to predict patient deterioration [7]. These clinical models, however, can be easily misled by the same technical defects and adversarial manipulations that they are not designed to recognise. A sudden fall of the heart rate signal, for example, can be interpreted as a critical cardiac event, when in fact it is a sensor malfunction. Therefore, although a growing body of work has begun to explore data fusion approaches for addressing this issue [8,9], the development of a unified framework capable of a holistic evaluation of the IoMT ecosystem by contextually differentiating between these distinct anomaly classes continues to be a main research challenge [10].
To address this challenge, we propose and evaluate a hybrid fusion framework for robust, context-aware anomaly detection. Our framework’s novelty lies in its data-driven fusion architecture, which combines a high-performance supervised ensemble for security threats with an unsupervised DL model for physiological and technical anomalies. By combining the outputs of these two expert modules into a calibrated fusion decision, the framework can go beyond the simple binary flagging of anomalies and provide a contextualised assessment of system state. This joint modelling of security events, sensor behaviour, and physiological signals allows us to distinguish more clearly between cyber-attacks, sensor malfunctions, and genuine clinical signals, which is a key requirement for building resilient IoMT systems.
Our study offers several empirically validated contributions:
  • We propose a unified IoMT anomaly detection framework that distinguishes between security incidents, sensor malfunctions, and physiological deterioration through context-aware fusion;
  • We design a hybrid architecture combining an XGBoost security classifier and an LSTM autoencoder for physiological and technical deviations, integrated via a calibrated decision layer producing Stable, High-Risk, and Critical alerts;
  • We evaluate the framework on CICIoMT2024 and MIMIC-IV, including fault-resilience testing under controlled sensor corruption to assess robustness under operational stress;
  • We demonstrate practical feasibility by reporting inference latency and by providing explainability analyses based on SHAP and reconstruction error profiles, which are critical factors for clinical adoption [11].
The rest of the paper is organised as follows. In Section 2, we provide an overview of related work on network and physiological anomaly detection and highlight the main gaps in the current literature. Section 3 explains the methodology and system architecture, including the choice of algorithms and fusion strategies. Section 4 describes the experimental setup, while Section 5 presents and discusses our results, with particular attention to practical implications and performance trade-offs. Finally, Section 6 concludes by summarising the key findings, discussing the limitations, and outlining future research directions for anomaly detection in IoMT systems.

2. Related Work

Ensuring reliability in Internet of Medical Things (IoMT) systems requires addressing anomalies that come from multiple and intersecting domains. Although a substantial body of work exists, most contributions in the literature still consider network security, physiological signal analysis, and sensor integrity as separate and isolated issues. This section identifies the state of the art in each of these areas and highlights the challenges targeted by our fusion framework. We derived the state-of-the-art corpus from a targeted search across major engineering and biomedical databases, and we report the query groups and screening criteria in Appendix A.

2.1. Anomaly Detection in IoMT Networks

Securing IoMT networks against malicious interference is a fundamental research area, with a strong focus on the application of ML models on network traffic data [12]. Early intrusion detection in IoMT networks relied mainly on signature-based IDS. While effective against known threats, these systems cannot, in principle, detect novel or zero-day attacks, which is a critical limitation in a rapidly evolving threat landscape [13]. For this reason, modelling normal behaviour and detecting deviations has become a dominant paradigm in intrusion detection [14].
Many studies have conducted comparative analyses of ML algorithms on benchmark datasets such as CIC-IDS2017 and UNSW-NB15, consistently showing that tree-based ensemble models, particularly XGBoost, demonstrate strong performance on flow-based structured data [15]. Their capacity to capture complex and non-linear interactions among statistical features is often superior to other classical models, and even to some DL models [16]. Our work builds on this foundation and extends it by combining an ensemble security model with a sequential unsupervised model within a unified fusion setting on modern IoMT and clinical datasets. This body of work confirms the viability of using a highly accurate XGBoost model as the security expert inside our framework. However, a crucial limitation of NIDS-focused approaches is their limited operational awareness: they are designed to detect the "what" (e.g., the attack) but not the "so what", that is, the potential consequences for physiological data streams.
This limitation is a well-known challenge in the literature on Cyber–Physical Systems (CPS) and IoT, where a cyber-attack is not an isolated event but a potential threat to physical processes or, in our case, to the safety of the patient. Accordingly, security must go beyond intrusion identification to interpret alerts in the context of the controlled physical process. When monitoring relies on highly stationary time series, as is common for many physiological and industrial signals, traditional ML models can rival and, in some cases, outperform DL models. Detector design should therefore account for stationarity when trading off accuracy, complexity, and interpretability [17]. Complementary work in CPS and Operational Technology (OT) security argues for process-aware monitoring that correlates network events with the physical state of the controlled process. For instance, a systematic review in Industrial Control Systems (ICS) catalogues methods that fuse network anomalies with physics-based invariants or plant-process models [18]. In healthcare, recent IoMT surveys similarly call for evaluation protocols sensitive to device behaviour and data context, not only to aggregate classification scores, with calibration practices that control false escalation under realistic alert budgets. This emphasis is echoed in IoMT-specific datasets and reviews [19] and aligns with our approach, which trains and evaluates a detector on physiological time series, introduced in Section 2.2. These practices are also consistent with a threat-modelling perspective for IoT context-sharing platforms, which links detection outcomes to the surrounding cyber–physical context [20].

2.2. Anomaly Detection in Physiological Signals

A parallel and equally critical field of research concerns the analysis of time-series data from wearable and clinical sensors to detect physiological anomalies [21]. The main challenge here is to distinguish early signs of patient deterioration from benign physiological variability [22]. The availability of large clinical datasets has been fundamental to progress [23].
DL models, particularly Recurrent Neural Networks (RNN) such as LSTM and AE, have now become the state-of-the-art for this task because of their inherent capability to model complex temporal dependencies [24]. These models can learn a personalised baseline of a patient’s normal physiological state and can detect subtle, multivariate deviations that may precede a major adverse event. Our work is inspired directly by this approach, and it combines signal processing with DL techniques on the MIMIC-IV dataset with the goal of detecting clinically relevant anomalies. Using an unsupervised LSTM autoencoder, we adopt a similar philosophy for detecting generic anomalies, without the need for labelled examples for every possible physiological failure mode. The limitation of these models, however, is their implicit assumption regarding data integrity. They are not inherently designed for distinguishing between a true clinical anomaly and a sensor signal corrupted by either a technical fault or a malicious attack [25].
A persistent practical issue is how models trained on large single-centre datasets behave when moved to a new ward or hospital. Even when trained on MIMIC-IV, models that look strong internally can lose recall or calibration in prospective or external cohorts, so thresholds and alert budgets should be tuned for the local case-mix rather than reused blindly. Performance drift over time also remains a risk and needs explicit post-deployment monitoring [26]. Signal integrity from wearables and bedside sensors is another bottleneck for unsupervised detectors. For wrist photoplethysmography (PPG), motion, posture, and sensor placement can dominate the variance. Consequently, practical pipelines apply signal-quality indices and quality gates before inference, discarding windows with poor signal-to-noise ratio (SNR) or morphology [27]. Because LSTM autoencoders rely on reconstruction error under a stable baseline, evaluation should report precision and recall under operational alert budgets and use temporally aware cross-validation, as emphasised by recent surveys. Lightweight, time-series-specific explanation methods can also help distinguish benign variability from clinically meaningful deviations by localising the subsequences that drive high anomaly scores. These practices informed our pipeline design choices while keeping the overall modelling approach unchanged [28].
From a modelling perspective, LSTM autoencoders are a natural choice for this task. They learn a compact representation of normal temporal dynamics and then use the reconstruction error to flag deviations, which matches our goal of detecting subtle multivariate anomalies without requiring exhaustive labelling. The LSTM units, through their gating mechanisms, capture both slowly varying trends and short-lived fluctuations in physiological signals. At the same time, the encoder–decoder structure preserves the temporal order within each window, so the detector is sensitive to changes in the temporal pattern rather than to isolated outliers. This behaviour is especially relevant for Intensive Care Unit (ICU)-grade time series such as those in MIMIC-IV, where adverse events are rare and class imbalance is substantial: training mainly on normal physiology produces patient-specific baselines, and segments with high reconstruction error are highlighted as inconsistent with these baselines, as observed in our experiments. Recent studies on electrocardiogram (ECG) anomaly detection report that LSTM-based autoencoders are effective precisely for this reason, modelling normal beat-to-beat structure and surfacing anomalies through elevated reconstruction loss [29].

2.3. Sensor Fusion and Fault Tolerance

Despite extensive work on anomaly detection within individual network and physiological domains, the fusion of these data streams at their intersection remains a formidable challenge. Although some pioneering studies have started to propose integrated frameworks, many existing approaches still struggle to contextually resolve the ambiguity between different sources of anomaly. A system that triggers a clinical alert based on data maliciously manipulated by a cyber-attack is not merely unreliable; it is actively dangerous. Similarly, a security alert that ignores the physiological context could misinterpret a benign sensor malfunction as a malicious action, leading to unnecessary and costly interventions. This structural limitation represents a significant obstacle to the development of resilient and trustworthy IoMT systems [30]. In practice, many IoMT deployments already rely on commercial device-management stacks, network security monitoring, and clinical early warning systems. These solutions often operate in silos; for example, they detect network anomalies without linking them to patient-level context, or they trigger clinical alerts without considering cyber manipulation and sensor integrity. This gap motivates fusion approaches that support root-cause triage and safer escalation policies in real monitoring pipelines.
The concept of multi-sensor fusion provides a theoretical basis for reducing this ambiguity. Fusion strategies are well established and can be categorised by the level at which they combine information: data-level, feature-level, and decision-level fusion [31]. Although simpler decision-level methods such as weighted voting offer a practical starting point, they frequently lack a formal mechanism for handling the ambiguity and conflict inherent in real-world data [32]. More advanced theoretical frameworks, such as the Dempster–Shafer theory of evidence (DST), are powerful for this purpose [33]. DST gives a mathematical basis for reasoning under uncertainty: it can combine "beliefs" from heterogeneous sources and explicitly quantify the uncertainty of the final decision. Its effectiveness has been demonstrated in several complex domains, including medical diagnosis and multi-sensor fusion, making it a strong theoretical candidate for our problem. However, the application of these formal theories to the specific, tripartite IoMT problem, that is, distinguishing among clinical, technical, and security anomalies, remains a nascent area of research.
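As a purely didactic illustration of how DST combines beliefs from heterogeneous sources, the following sketch applies Dempster's rule of combination to two hypothetical mass functions over a tiny frame of discernment. The masses and the frame are invented for the example; our framework does not itself use DST, but a trained fusion classifier.

```python
from itertools import product

def dempster(m1: dict, m2: dict) -> dict:
    """Combine two mass functions with Dempster's rule.

    Focal sets are frozensets; mass assigned to pairs whose
    intersection is empty is treated as conflict and renormalised out.
    """
    combined, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# Hypothetical evidence over the frame {attack, normal}
A, N = frozenset({"attack"}), frozenset({"normal"})
Theta = A | N                          # total ignorance
m_net = {A: 0.6, Theta: 0.4}           # network module: partly uncertain
m_phys = {A: 0.5, N: 0.2, Theta: 0.3}  # physiological module

fused = dempster(m_net, m_phys)
print(round(fused[A], 3))  # 0.773
```

Note how the combined belief in "attack" (0.68/0.88 ≈ 0.773) exceeds either source's individual mass, because the conflicting mass (0.12) is renormalised away; this explicit accounting of conflict is what makes DST attractive for heterogeneous fusion.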
Furthermore, a critical aspect that is often overlooked in high-performing academic models is fault tolerance. Many studies evaluate models on clean, curated data, which does not reflect operational realities in IoMT deployments, where sensors are subject to noise, drift, and failure. A truly robust framework must not only detect external threats and internal physiological events but also be self-aware of the quality of its own data inputs. As noted in comprehensive surveys on Wireless Body Area Networks (WBAN), research on fault detection and recovery is still emerging and requires further validation [34]. Resilience, including the ability to recover quickly and to diagnose faults, is a critical requirement for modern IoMT networks. Our research contributes directly to this area, not only by proposing a fusion framework but also by quantitatively evaluating its resilience against synthetically generated fault scenarios.
In summary, while the individual domains of network anomaly detection and physiological anomaly detection are relatively well-established, their integrated application remains an evolving and complex research challenge. Only a limited number of studies have proposed and rigorously evaluated unified frameworks that can simultaneously distinguish security threats, physiological anomalies, and technical faults. This gap confirms the need for integrated, context-aware and fault-tolerant fusion systems. In this work, we address this need by designing and validating a comprehensive framework that tackles all three aspects within a single architecture. A structured overview of the reviewed streams, including typical algorithms and datasets, is provided in Table 1.

3. The Proposed Anomaly Detection and Fusion Framework

This section presents the architecture and methodology of the proposed framework. The system is designed for robust, real-time anomaly detection in IoMT environments through the integration of two specialised modules: a network security module and a physiological-signal analysis module. Their outputs are combined in a final decision-fusion layer that produces a single, contextualised assessment of the system’s overall operational state.

3.1. System Architecture

The framework uses a modular architecture for parallel processing, as shown in Figure 1. The workflow begins with two independent data streams: the IoMT network traffic and the signals from wearable sensors. Each stream is processed by its dedicated module for anomaly detection. The Security Anomaly Detection Module processes network data to identify malicious interference, while the Physiological Anomaly Detection Module analyses time-series vital signs to detect clinical deterioration or technical sensor faults. The outputs of these two "expert" modules, an attack probability score and a physiological anomaly score, are then provided as input to the fusion layer. This final component aggregates evidence from the two domains, modulated by a real-time sensor health assessment, and aims to produce a holistic and interpretable alert about the general status of both the system and the patient.
This modular architecture offers several advantages: it allows for the independent development, optimisation, and validation of each expert model using the most suitable algorithms for its specific data modality, a crucial consideration given the heterogeneity of network and physiological data, and it enhances the scalability and maintainability of the system, as individual components can be updated or replaced without impacting the entire framework.

3.2. Types of Anomalies and Operational Definitions

We target anomalies in IoMT monitoring that can originate from three distinct sources, which often generate similar symptoms in the observed data and therefore introduce ambiguity during decision-making. Security anomalies stem from malicious or suspicious network activity affecting IoMT devices or communications, and they primarily appear as abnormal flow-level patterns (e.g., scanning, flooding, spoofing, malformed traffic). Physiological anomalies reflect clinically meaningful deviations in vital-sign dynamics, such as sustained instability or abnormal multivariate trends that depart from the learned baseline of normal temporal behaviour. Technical sensor anomalies arise from the sensing and acquisition layer, including dropouts, flat-lines, saturation, drift, and artefacts introduced by disconnections or imputation, and they can mimic clinical deterioration without representing true physiological change. In practice, physiological and technical anomalies are not treated as separate supervised classes but are jointly captured by the unsupervised physiological module, while their differentiation is handled at fusion time through sensor health assessment and contextual evidence. This taxonomy informs the fusion logic described in Section 3.5, where we combine domain evidence with sensor-health assessment to produce a single contextualised decision mapped to three operating states: Stable, High-Risk, and Critical.

3.3. Security Anomaly Detection Module

This module has the task of classifying network traffic from the CICIoMT2024 dataset. It follows a rigorous pipeline of feature engineering and supervised classification to achieve high accuracy.

3.3.1. Feature Engineering and Preprocessing

The initial feature set provided by the dataset was refined to include only behavioural characteristics of the network flows. Non-behavioural attributes that can cause data leakage, such as static protocol-type flags (e.g., TCP, UDP), were removed. This is a critical step to ensure that the model learns generalisable patterns of malicious behaviour rather than superficial identifiers.
Specifically, we removed protocol identifier features that describe the communication technology rather than traffic behaviour. These include the Protocol Type field and the binary protocol indicators (HTTP, HTTPS, DNS, Telnet, SMTP, SSH, IRC, TCP, UDP, DHCP, ARP, ICMP, IGMP, IPv, LLC). The remaining feature set contains 29 behavioural characteristics that describe timing, rate, size dispersion, and header and flag dynamics. The full list of removed attributes and the retained behavioural features is reported in Table 2.
The rationale is that, in scenario-based intrusion datasets, protocol identifiers can act as shortcuts, for example, when attack families appear predominantly under specific protocols. By restricting the input space to behavioural flow statistics and header and flag dynamics, the classifier is forced to rely on rate, timing, size dispersion, and flag-related signatures that are more likely to transfer across devices and protocols. No additional automated feature selection was applied beyond this deterministic removal step to preserve interpretability and keep the pipeline fully reproducible.
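The deterministic removal step described above can be sketched as follows. The column names in PROTOCOL_FLAGS follow the list given in the text, but the exact CICIoMT2024 column spellings may differ, so this is an illustrative sketch rather than a drop-in script.

```python
import pandas as pd

# Protocol-identifier columns to remove (names follow the text; the
# actual dataset spelling may differ).
PROTOCOL_FLAGS = [
    "Protocol Type", "HTTP", "HTTPS", "DNS", "Telnet", "SMTP", "SSH",
    "IRC", "TCP", "UDP", "DHCP", "ARP", "ICMP", "IGMP", "IPv", "LLC",
]

def drop_protocol_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    """Remove protocol-identifier columns, keeping behavioural features."""
    present = [c for c in PROTOCOL_FLAGS if c in df.columns]
    return df.drop(columns=present)

# Toy flow table: one behavioural feature and two protocol flags
flows = pd.DataFrame({"Rate": [10.0, 2.5], "TCP": [1, 0], "UDP": [0, 1]})
behavioural = drop_protocol_identifiers(flows)
print(list(behavioural.columns))  # ['Rate']
```

Because the step is a fixed column list rather than a learned selector, it is deterministic and fully reproducible, which matches the stated design goal.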
To manage the high dimensionality of the target space (51 original attack classes) and the related class imbalance, we aggregated the labels into six principal categories based on their operational intent (Normal, DDoS, DoS, Recon, Spoofing, Malformed), as shown in Table 3.
This class aggregation serves two purposes. First, it merges granular attack types with similar underlying behaviours (e.g., all DDoS variants), yielding classes that are more populated and statistically meaningful for the learning component of the model. Second, it simplifies the classification task from a highly specific 51-class problem to a more general, operationally relevant 6-class problem, which is also more robust to small variations in attack execution.
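A minimal sketch of the label aggregation follows. The granular label names used here are illustrative stand-ins; the exact CICIoMT2024 label strings (and the full 51-to-6 mapping in Table 3) may differ.

```python
# Illustrative mapping from granular labels to the six operational
# categories; granular names are examples, not the dataset's spelling.
CATEGORY_MAP = {
    "Benign": "Normal",
    "TCP_IP-DDoS-SYN": "DDoS",
    "TCP_IP-DDoS-ICMP": "DDoS",
    "TCP_IP-DoS-UDP": "DoS",
    "Recon-Port_Scan": "Recon",
    "ARP_Spoofing": "Spoofing",
    "MQTT-Malformed_Data": "Malformed",
}

def aggregate_label(granular: str) -> str:
    """Map a granular attack label to its six-way operational category."""
    return CATEGORY_MAP[granular]

print(aggregate_label("TCP_IP-DDoS-SYN"))  # DDoS
print(aggregate_label("Benign"))           # Normal
```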

3.3.2. Supervised Classification Model

For the task of classifying network threats, we opted for a supervised learning approach. Although unsupervised models such as Isolation Forest can detect deviations from normal traffic, they are inherently limited to binary classification (anomalous vs. normal). They fail to distinguish between different attack categories (e.g., DDoS vs. Recon), which is essential for targeted security responses. Other unsupervised models, such as LSTM-based and other DL approaches we tested, tend to increase processing time and resource usage and do not appear well suited to this type of dataset. Since the CICIoMT2024 dataset provides trustworthy attack labels, a supervised approach is better suited to this specific problem. The core of this module is an XGBoost (Extreme Gradient Boosting) classifier, chosen for its well-documented state-of-the-art performance and high efficiency on structured, tabular data, for the reasons explained in Section 3.3.1. Additionally, the robustness of this model derives from its regularised learning objective, which balances predictive accuracy against model complexity, thereby preventing overfitting. The objective function it minimises is formally defined as follows:
L(ϕ) = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)
where l(y_i, ŷ_i) is the training loss function, which measures the difference between the true label y_i and the prediction ŷ_i, and Ω(f_k) is the regularisation term, which penalises the complexity of each tree f_k in the ensemble. The output of this module is a probability vector over the six classes, from which we derive a scalar security anomaly score, defined as one minus the predicted probability of the Normal class, as detailed in Section 3.5.
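The derivation of the scalar score from the class-probability vector can be sketched as follows. The probability values below are invented for illustration; in the actual pipeline they would come from the trained XGBoost classifier's predict_proba output.

```python
import numpy as np

def security_anomaly_score(class_proba: np.ndarray,
                           normal_idx: int = 0) -> np.ndarray:
    """Scalar per-flow security score: 1 - P(Normal).

    class_proba has shape (n_flows, 6), rows summing to 1; normal_idx
    assumes the Normal class occupies column 0 (an illustrative choice).
    """
    return 1.0 - class_proba[:, normal_idx]

# Hypothetical rows over (Normal, DDoS, DoS, Recon, Spoofing, Malformed)
proba = np.array([
    [0.95, 0.01, 0.01, 0.01, 0.01, 0.01],  # benign-looking flow
    [0.05, 0.80, 0.05, 0.04, 0.03, 0.03],  # likely DDoS
])
print(security_anomaly_score(proba))  # approximately [0.05, 0.95]
```

Collapsing the six-way posterior into one scalar keeps the fine-grained class information available for response selection while giving the fusion layer a single, comparable measure of security evidence.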

3.4. Physiological Anomaly Detection Module

This module performs unsupervised anomaly detection on physiological multivariate time series extracted from the MIMIC-IV dataset. Its objective is to identify clinically meaningful deviations from normal physiology while remaining sensitive to technical artefacts such as transient signal dropouts. In addition to controlled fault injection experiments, we introduce a three-level clinical severity proxy (Stable, High, Critical) aligned at the window level to ICU stays and hospital admissions, enabling a clinically grounded validation of whether anomaly scores and reconstruction error profiles behave consistently with real deterioration patterns. This proxy is used exclusively for evaluation and validation purposes, and it does not supervise the unsupervised model training.

3.4.1. Time-Series Data Preparation

The raw event data, provided in a "long" format, were first transformed into a "wide-format" multivariate time series with a uniform frequency of 15 min for each patient. This resampling step is essential for creating a consistent temporal structure suitable for sequential models. Missing values, which realistically represent transient faults or sensor disconnections, were imputed using a forward-fill strategy to preserve the temporal continuity of the patient's state.
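A minimal pandas sketch of this long-to-wide transformation, with 15 min resampling and forward-fill, is shown below on a toy event table. The column names are illustrative and do not follow the exact MIMIC-IV schema.

```python
import pandas as pd

# Toy "long" event table: one vital-sign channel, irregular timestamps
events = pd.DataFrame({
    "charttime": pd.to_datetime(
        ["2130-01-01 00:05", "2130-01-01 00:20", "2130-01-01 00:50"]),
    "vital": ["HR", "HR", "HR"],
    "value": [82.0, 85.0, 90.0],
})

wide = (events
        .pivot_table(index="charttime", columns="vital", values="value")
        .resample("15min").mean()   # uniform 15-minute grid
        .ffill())                   # forward-fill transient gaps
print(wide["HR"].tolist())  # [82.0, 85.0, 85.0, 90.0]
```

The empty 00:30–00:45 bin is filled with the last observed value (85.0), which is exactly the forward-fill behaviour that preserves temporal continuity across short sensor gaps; in the real pipeline this would be done per patient, e.g., within a groupby over stay identifiers.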

3.4.2. Unsupervised Anomaly Detection Model

We used an LSTM autoencoder, a DL model well suited to learning the temporal dependencies within normal physiological signals. The model consists of an encoder, which compresses the input sequence X into a low-dimensional latent representation z, and a decoder, which attempts to reconstruct the original sequence as X̂ from z. The model is trained exclusively on data assumed to be normal.
An anomaly is flagged when the model fails to accurately reconstruct a sequence it has not seen before. This failure is measured by the reconstruction error, which serves as our anomaly score. The error for a sequence is defined as the Mean Absolute Error (MAE):
\mathrm{Error}(X) = \frac{1}{T F} \sum_{t=1}^{T} \sum_{f=1}^{F} \left| X_{tf} - \hat{X}_{tf} \right|
where T is the number of timesteps and F is the number of features.
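The MAE-based anomaly score above reduces to a single reduction over the time and feature axes; a minimal NumPy sketch for a batch of sequences:

```python
import numpy as np

def reconstruction_mae(X: np.ndarray, X_hat: np.ndarray) -> np.ndarray:
    """Mean absolute reconstruction error per sequence.

    X and X_hat have shape (n_sequences, T, F); the result has shape
    (n_sequences,), i.e. one anomaly score per input window."""
    return np.mean(np.abs(X - X_hat), axis=(1, 2))
```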
A high reconstruction error indicates that the input sequence does not conform to the normal patterns the model has learned, and is therefore anomalous. The capability of the model to detect these deviations is shown in Figure 2.
For clarity, Figure 2 is a mechanism illustration based on a controlled fault injection on the SpO2 channel. In this example, a short interval is deliberately replaced with an implausible constant plateau to emulate common IoMT sensing artefacts, such as probe detachment, temporary signal freeze, or short communication interruptions. The plotted values are standardised for modelling (rather than expressed in clinical units); therefore, they should not be interpreted as actual SpO2 percentages. The key point is that the autoencoder reconstructs the expected normal dynamics, so the injected artefact yields a sustained reconstruction mismatch, which is captured by a higher reconstruction error and used as the anomaly score.
This unsupervised approach is powerful because it requires no prior knowledge or labelled examples of every possible type of physiological or technical anomaly, which makes it robust to novel failure events. By learning the “essence” of normal behaviour, the model can detect any pattern that deviates from this learned baseline.

3.5. The Fusion Layer and Decision Logic

The fusion layer constitutes the decision-making core of the framework, integrating the outputs of the two expert modules into a final, context-aware alert. This layer addresses the central challenge of contextual differentiation by synthesising evidence from both the security and the physiological domains. The system is organised into the sub-modules described in the following sections.

3.5.1. Sensor Health Scoring

Before the model outputs are fused, a preliminary data-level fusion and quality assessment are performed. Each incoming physiological signal is assigned a dynamic “health score” Hc ∈ [0, 1]. The sensor health score is first computed at the channel level and then aggregated into a window-level score used for fusion. This composite metric is derived from several data-quality indicators, including the status of each data point (e.g., original vs. imputed), its adherence to plausible physiological ranges, and the local stability of the signal (e.g., the variance over a short time window). This mechanism lets the framework quantitatively assess the reliability of its inputs, a crucial step for robust decision-making in real-world environments, where sensor data are often imperfect.
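The exact weighting of the quality indicators is not prescribed here; as an illustration, a simple unweighted average of the three indicators named above (originality, range plausibility, local stability) yields a score in [0, 1], with the window-level score taken as the mean over channels:

```python
import numpy as np

def channel_health(values, imputed_mask, plaus_range, var_scale=1.0):
    """Composite health score in [0, 1] for one channel over a window.

    Combines three data-quality indicators: the fraction of original
    (non-imputed) points, the fraction within the plausible physiological
    range, and a stability term that shrinks as local variance grows.
    The equal weighting and the var_scale constant are illustrative choices."""
    lo, hi = plaus_range
    originality = 1.0 - float(np.mean(imputed_mask))
    plausibility = float(np.mean((values >= lo) & (values <= hi)))
    stability = 1.0 / (1.0 + float(np.var(values)) / var_scale)
    return (originality + plausibility + stability) / 3.0

def window_health(channel_scores):
    """Aggregate per-channel scores into the window-level score used for fusion."""
    return float(np.mean(channel_scores))
```

A clean heart-rate window (all original points, values in range, low variance) scores close to 1, while a window dominated by imputed or implausible values is pushed towards 0.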

3.5.2. Decision Fusion and Final Alerting

The principal logic for fusion is based on an intermediate continuous risk representation produced by the expert models. The security anomaly score, denoted as Ss, is defined as 1−P(Normal), where P(Normal) is the probability that the XGBoost model assigns to the “Normal” class. The clinical anomaly score, denoted as Sc, is given by the mean absolute reconstruction error of the LSTM autoencoder and is normalised to the range [0, 1].
Prior to fusion, Sc is modulated by the average sensor health score Hc described in Section 3.5.1. This produces a physiology-aware anomaly score Sc* = Sc × Hc, which down-weights anomalies detected on signals that are already marked as low quality.
In the final implementation, we do not rely on fixed fusion weights. Instead, we learn the fusion rule from data by training a multiclass fusion model on a development split. The fusion input combines complementary evidence from the two expert modules, namely the security anomaly score, the global clinical reconstruction error, the per-channel reconstruction error profile, and the output probabilities of a lightweight clinical student model trained on vital sign summaries and reconstruction profiles. The fusion model is trained on the training partition only, using class balancing to mitigate class imbalance. The lightweight clinical student model is a shallow supervised model trained on summary statistics of vital signs and reconstruction profiles, used only to provide complementary calibrated probabilities to the fusion layer.
To keep the final alert policy transparent and reproducible, we then select two probability thresholds on a separate calibration partition, one for the Critical class and one for the High-Risk class. We choose the threshold pair that maximises macro F1 on the calibration data, under the rule that Critical is assigned first when its probability exceeds the critical threshold, and High-Risk is assigned otherwise when its probability exceeds the high-risk threshold. The selected thresholds are then frozen and applied once on the held-out evaluation partition, yielding the final system state classification as Stable, High-Risk, or Critical.
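A minimal sketch of this two-threshold selection procedure, assuming the fusion model outputs class probabilities ordered as [Stable, High-Risk, Critical] (an assumption of this illustration, not a detail fixed by the implementation):

```python
from itertools import product

import numpy as np
from sklearn.metrics import f1_score

def apply_thresholds(proba, t_crit, t_high):
    """Map class probabilities to labels 0=Stable, 1=High-Risk, 2=Critical.

    Critical takes precedence when its probability exceeds t_crit;
    otherwise High-Risk is assigned when its probability exceeds t_high."""
    pred = np.zeros(len(proba), dtype=int)
    pred[proba[:, 1] >= t_high] = 1
    pred[proba[:, 2] >= t_crit] = 2  # Critical overrides High-Risk
    return pred

def select_thresholds(proba_cal, y_cal, grid=np.linspace(0.1, 0.9, 17)):
    """Grid-search the threshold pair maximising macro F1 on calibration data."""
    best_tc, best_th, best_f1 = 0.5, 0.5, -1.0
    for t_c, t_h in product(grid, grid):
        f1 = f1_score(y_cal, apply_thresholds(proba_cal, t_c, t_h),
                      average="macro", zero_division=0)
        if f1 > best_f1:
            best_tc, best_th, best_f1 = t_c, t_h, f1
    return best_tc, best_th, best_f1
```

Once selected on the calibration split, the pair (t_crit, t_high) is frozen and applied unchanged to the held-out evaluation partition.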
To address the challenge of operational blindness, the two expert modules (the supervised NIDS for security threats and the LSTM autoencoder for physiological and technical anomalies) contribute their scores to a shared fusion representation, which is computed in the fusion layer and then mapped to the three alert categories. The conceptual architecture of this framework, including the two expert modules and the fusion layer that implements the learned fusion and calibration logic, is illustrated in Figure 3.
The workflow clarifies how the data from network and physiological sources is processed by the specialised expert models (XGBoost NIDS and LSTM autoencoder). Their respective anomaly scores, together with the sensor-health index, are combined in the fusion layer through the risk score computation and calibrated thresholds, which produce the final context-aware alert (Stable, High-Risk, or Critical).

4. Experimental Setup

This section describes the experimental design established for the proposed fusion framework. We provide a detailed description of the selected datasets and the rationale behind them, the implementation of their corresponding preprocessing pipelines, the specifics of the ML model architectures and training, and the formal definitions of the multi-faceted metrics used to evaluate the overall performance of the framework. We selected these datasets to reflect the two dominant evidence streams available in realistic IoMT deployments, namely network telemetry for security monitoring and longitudinal physiological signals for clinical and technical monitoring. CICIoMT2024 was chosen as a representative IoMT network dataset because it contains modern traffic patterns and diverse attack families, enabling a multi-class evaluation of the security module under realistic adversarial conditions. MIMIC-IV was selected as the clinical backbone because it provides large-scale, high-quality ICU time series with heterogeneous patient trajectories and routinely charted vital signs, allowing us to validate physiological anomaly detection and robustness to measurement irregularities. Using both datasets enables an end-to-end evaluation of the fusion logic in a setting that mirrors real IoMT operation, where security threats and clinical instability can coexist and must be disentangled. The methodology was designed with great emphasis on reproducibility and scientific validity.

4.1. Security Anomaly Dataset: CICIoMT2024

For security threat modelling, we employed the CICIoMT2024 dataset [35], a state-of-the-art, high-fidelity resource generated within a realistic IoMT testbed comprising smart home devices, wearable technologies, and medical sensors. Its main advantage is that it includes a diverse range of contemporary cyber-attacks, which is highly relevant to our study; the dataset has already been used in recent work on ensemble-based IoMT intrusion detection [36,37]. The raw dataset, composed of multiple CSV files for each attack scenario, was first merged into a unified corpus. A key challenge is the large number of fine-grained attack labels (51 in total). To create a more robust and generalisable classification task, and to mitigate the effects of extreme class imbalance, we aggregated these labels into six logically consistent principal categories based on their operational category (Normal, DDoS, DoS, Recon, Spoofing, Malformed). A key feature-engineering step removed non-behavioural attributes, such as static protocol flags, so that the model learns generalisable traffic patterns such as rate, duration, and packet statistics instead of superficial shortcuts. As a result, the final feature set consists of 29 behavioural characteristics. In addition, we performed a lightweight data-integrity audit on a subset of the CICIoMT2024 dataset to reduce the risk that hidden errors in the input samples artificially inflate performance. Specifically, we verified schema consistency after merging the scenario files, checked for duplicate records and degenerate columns, and ensured that label aggregation into six categories was deterministic and reproducible across runs. We also validated that the final tabular matrix matches the expected dimensionality of 29 behavioural features and that feature types remain numerically stable for downstream learning.
These checks are not intended to remove attacks, but rather to prevent artefacts such as duplicated rows, inconsistent encodings, or silent schema drift from becoming unintended shortcuts for the classifier.
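The deterministic label aggregation can be expressed as a fixed prefix-to-category mapping; the fine-grained label strings below are illustrative stand-ins, not the exact 51 CICIoMT2024 labels:

```python
# Illustrative prefix -> category mapping; order matters ("DDoS" before "DoS").
CATEGORY_PREFIXES = [
    ("DDoS", "DDoS"),
    ("DoS", "DoS"),
    ("Recon", "Recon"),
    ("Spoofing", "Spoofing"),
    ("MQTT-Malformed", "Malformed"),
    ("Benign", "Normal"),
]

def aggregate_label(fine_label: str) -> str:
    """Deterministically map a fine-grained attack label to one of the six
    operational categories; unmapped labels fail loudly rather than silently."""
    for prefix, category in CATEGORY_PREFIXES:
        if fine_label.startswith(prefix):
            return category
    raise ValueError(f"unmapped fine-grained label: {fine_label}")
```

Failing loudly on unmapped labels is what makes the aggregation reproducible across runs: any schema drift in the merged CSV files surfaces immediately instead of leaking into the training set.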

4.2. Physiological and Technical Anomaly Dataset: MIMIC-IV

To model physiological signals and technical sensor faults, we used the MIMIC-IV dataset (v3.1) [38]. As a large, de-identified clinical database, it provides ICU-grade physiological monitoring at scale and serves as a realistic proxy for multi-modal vital-sign streams in clinical environments. The raw data, originally in a long event format, were transformed into a wide multivariate time series. To handle irregular measurement times, each patient’s series was resampled at a 15-min interval. The resulting gaps, representing transient sensor faults or disconnections, were then imputed. While simpler methods such as mean imputation exist, they cannot capture the temporal nature of the data. The forward-fill (ffill) strategy was chosen for this reason, as it realistically imitates the behaviour of a clinical monitor, which holds the last known valid reading until a new measurement is made, thereby preserving the temporal continuity of the patient’s state. Because MIMIC-IV is large, heterogeneous, and collected under real clinical constraints, we also introduced explicit quality checks to minimise the risk of subset-level artefacts biasing the anomaly detection results. First, we restricted extraction to a fixed mapping of seven routinely charted ICU signals and enforced timestamp consistency before resampling to the 15-min grid. Second, prior to sequence construction, we discarded time bins that are fully missing across all selected channels to prevent degenerate windows from dominating the reconstruction objective. Finally, we conducted sanity checks on feature distributions to detect discretisation or placeholder-like effects that may arise from default charting practices, which can otherwise behave like unintended labels during learning. In addition, we stored sequence-level metadata to guarantee that rule-based labels and learned residual features are computed on correctly aligned windows.
Vital sign features were also standardised to zero mean and unit variance, a common preprocessing step in ML pipelines, ensuring that no single feature scale dominates the learning process [39]. We modelled the physiological stream using MIMIC-IV ICU charted vital signs re-sampled into fixed 15-min bins. We extracted seven routinely available signals, namely heart rate, SpO2, respiratory rate, arterial blood pressure (systolic, diastolic, mean), and body temperature, and converted the long event format into a wide multivariate time series. We then built sliding windows of 24 steps (6 h at 15-min resolution) with a stride of 1, applied within-subject forward-fill for short gaps, removed windows with fully missing channels, and capped sequences per subject to limit the dominance of very long ICU stays. We performed leakage-safe splitting at the subject level (train, calibration, evaluation) and fitted StandardScaler on training subjects only. This yields 5,812,990 sequences of shape (24, 7) for the physiological model and a sequence-aligned rule table of 1,439,168 windows (de-duplicated by seq_id) for clinical proxy supervision in fusion.
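The windowing step can be sketched as follows for a single subject's resampled series; the fully-missing-channel filter and the per-subject cap mirror the choices described above:

```python
import numpy as np

def make_windows(series, length=24, stride=1, max_per_subject=None):
    """Build sliding windows over one subject's (T, F) series.

    Windows in which any channel is entirely missing are dropped, and
    the number of windows per subject can be capped so that very long
    ICU stays do not dominate the training set."""
    windows = []
    for start in range(0, series.shape[0] - length + 1, stride):
        w = series[start:start + length]
        if np.isnan(w).all(axis=0).any():  # some channel fully missing
            continue
        windows.append(w)
        if max_per_subject is not None and len(windows) >= max_per_subject:
            break
    if windows:
        return np.stack(windows)
    return np.empty((0, length, series.shape[1]))
```

Subject-level splitting then partitions subjects (not windows) into train, calibration, and evaluation sets before the StandardScaler is fitted on training subjects only, which prevents leakage across splits.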
To produce an interpretable anomaly signature, we trained a lightweight LSTM autoencoder on training sequences only and summarised deviations through residual features defined as mean absolute reconstruction error, computed both globally per window and per channel. In parallel, we derived a three-level clinical severity proxy (Stable, High, and Critical) from window-local rules based on lactate, vasopressor exposure, mechanical ventilation, and vital extrema, enabling consistent supervision and evaluation at the sequence level. The exact clinical rules used to construct this three-level severity proxy are reported in Table 4, together with the operational interpretation of each level.
This three-level severity proxy is used exclusively for validation and fusion supervision, and it does not supervise the training of the unsupervised physiological model.

4.3. Model Implementation Details

The models were implemented in Python 3.11 using standard open-source libraries in Google Colab Pro, on an instance with an NVIDIA L4 GPU and more than 50 GB of RAM. To strengthen reproducibility, we report the full set of model hyperparameters and training settings in Table 5, together with the key pipeline knobs used for sequence construction (window length, stride, and subject-level splitting). Importantly, we did not perform large-scale hyperparameter optimisation, because our goal was to preserve a deployment-oriented configuration and avoid overfitting to a specific subset. Instead, we used stable, literature-aligned configurations and limited tuning to clinically meaningful operating choices, namely threshold and calibration selection on a dedicated calibration split.
A key aspect of the implementation was the efficient allocation of computational resources. Although the GPU was used for the intensive model-training phases, several important tasks were intentionally executed in CPU mode, such as data loading, preprocessing, and feature engineering, managed by Pandas and NumPy, libraries that are typically CPU-bound. Similarly, the training of the simple Decision Tree for the fidelity test described in Section 4.4 was also carried out on CPU resources, using multi-core processing where applicable. This hybrid CPU/GPU approach was chosen to reflect a realistic deployment scenario and to align with energy-efficiency principles by reserving high-consumption GPU resources for the tasks requiring massive parallelisation.
  • XGBoost: The security module uses XGBClassifier from the XGBoost library (v2.0). The model was configured for multi-class classification and trained on the GPU (device=“cuda”) to significantly accelerate processing of the large CICIoMT2024 dataset.
  • LSTM Autoencoder: The physiological anomaly detector was implemented in TensorFlow (v2.16) using the Keras API. The architecture is a symmetric encoder–decoder structure with LSTM layers of 64 units. The model was trained to minimise the Mean Absolute Error (MAE) loss using the Adam optimiser, with GPU acceleration, as is typical for DL models. Training was performed on Google Colab Pro, a cloud-based computational platform provided by Google LLC (Mountain View, CA, USA), with an NVIDIA L4 GPU provided through the Colab infrastructure and manufactured by NVIDIA Corporation (Santa Clara, CA, USA).

4.4. Evaluation Metrics

The comprehensive evaluation of our proposed anomaly detection framework covers performance, resilience, efficiency, and explainability metrics, as detailed below:
  • Detection Metrics: The initial performance of the model was evaluated with a suite of standard metrics: accuracy, precision, recall, F1-score, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). For the overall evaluation of the integrated fusion framework, we also report the Matthews Correlation Coefficient (MCC) and Cohen’s Kappa, which are effective for multi-class classification with imbalanced data [40]. Furthermore, we assessed the statistical significance of these overall metrics by calculating 95% bootstrap confidence intervals. All metrics were computed using the metrics module of the scikit-learn library.
  • Fault Resilience: The robustness of the framework was tested by measuring the increase in the reconstruction error of the LSTM autoencoder in response to sensor faults injected synthetically at various severity levels (10% to 90%).
  • Time Efficiency: To assess the feasibility of edge deployment, we measured the average inference latency of each expert model in milliseconds, using Python’s time.perf_counter and averaging over 100 prediction cycles.
  • Explainability (Fidelity): The fidelity of the security model’s explanations was measured quantitatively. A simple “proxy” model (a DecisionTreeClassifier) was trained using only the top-10 features identified by the complex XGBoost model. A fidelity score was then obtained by comparing the accuracy of the proxy model to that of the original model.
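The latency measurement in the Time Efficiency bullet can be sketched as a thin wrapper around time.perf_counter; predict_fn stands in for either expert model's single-sample inference call:

```python
import time

import numpy as np

def latency_profile(predict_fn, sample, n_runs=100):
    """Return (mean, p95) single-sample inference latency in milliseconds,
    averaged over n_runs prediction cycles."""
    times_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        predict_fn(sample)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return float(np.mean(times_ms)), float(np.percentile(times_ms, 95))
```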

5. Results and Discussion

This section presents the empirical evaluation of the framework, structured into three parts. First, we assess the performance and explainability of the security anomaly detection module. Second, we evaluate the physiological anomaly detection module, with a particular focus on the resilience to sensor faults.
Finally, we provide a holistic validation of the integrated fusion framework, assess the associated privacy implications, and evaluate its real-time efficiency and decision-making performance across selected critical operational scenarios.

5.1. Performance of the Security Anomaly Detection Module

The security anomaly detection module, implemented with an XGBoost classifier, achieved consistently very high performance in identifying network threats in the CICIoMT2024 dataset. On the six-class aggregated task, the model reached an overall accuracy of 99.91% on the hold-out test set. The corresponding class-wise precision, recall and F1-scores for this module are reported in Table 6. More specifically, we report here the detailed composition of the processed evaluation split to make the security results fully reproducible. Our CICIoMT2024 evaluation partition contains 2,148,250 flows after label aggregation into six operational categories and after removing non-behavioural attributes, resulting in 29 behavioural features. The class supports in this split are DDoS (1,494,156), DoS (558,803), Recon (31,118), Spoofing (4814), Malformed (1539), and Normal (57,820). These values correspond to the “Support” column in Table 6 and define the empirical base rates underpinning the reported precision, recall, and F1-scores.
The results show that the model not only achieves high overall accuracy but also demonstrates a very strong performance on a per-class basis, with a macro average F1-score of 0.95. This is also supported by an excellent macro-average AUC-ROC score of 0.99 (Figure 4), which indicates high separability across all the classes. This further suggests that the model is unlikely to rely primarily on memorisation but is capturing potentially stable and generalisable traffic patterns.
Beyond reporting high classification performance, these results have direct implications for improving the security of IoMT networks. First, reliable multi-class detection enables earlier recognition of anomalous traffic patterns, which can reduce attacker dwell time and limit the window in which data streams are disrupted or manipulated. Second, separating attack families such as DDoS, DoS, reconnaissance, spoofing, and malformed traffic provides actionable information for incident response, because different classes map to different mitigation strategies, for example, rate limiting and filtering for flooding behaviour, segmentation and access control review for reconnaissance, and integrity checks for spoofing and injection patterns. Third, the strong performance on the Normal class reduces false alarms, which is important in clinical settings where alarm fatigue can cause operators to ignore genuine incidents. Finally, when combined with the physiological module, network anomalies can be interpreted in context, helping operators distinguish between a loss of telemetry caused by an attack and a loss caused by benign technical faults, which supports safer decisions such as whether to escalate cybersecurity response actions or to prioritise device troubleshooting.
This is in alignment with the wider literature on network intrusion detection, where XGBoost is consistently noted for its robust performance and reliable feature attribution on tabular traffic datasets [41,42]. To verify that the model’s decisions rest on sound logic, an explainability analysis was performed using SHAP, as illustrated in Figure 5a–c. The analysis showed that the classifier correctly uses behavioural features such as Inter-Arrival Time (IAT) and Rate to distinguish between attack types, providing additional evidence that the model is learning network behaviour rather than simply memorising samples.
For DDoS and DoS traffic, very short IATs combined with high packet rates generate the strongest positive SHAP contributions, pushing the classifier towards the attack class. The relationship is essentially monotonic: as IAT decreases and packet rate increases, the estimated risk rises, which is consistent with bursts of very frequent packets. Normal traffic shows the opposite pattern, with longer IATs and lower rates contributing negatively and pulling predictions back to the benign class.
For Malformed and Spoofing attacks, the most influential signals are irregular header flag configurations, such as atypical values of ack_flag_number and related flag inconsistencies, together with unusual packet sizes. These characteristics produce positive SHAP contributions because the model interprets them as evidence of modified or manipulated packets. In many cases, it is the combination of abnormal flag patterns with non-standard payload size and packet length that ultimately shifts the prediction towards the attack class, indicating that the model exploits interactions between features rather than relying on a single cue.
Across the different panels, the same groups of features appear consistently at the top, while identifiers that are prone to leakage do not dominate, which again argues against overfitting. Local explanations on representative flows match domain intuition (e.g., rate-driven floods vs. header-driven evasion). Overall, the model uses distinct, meaningful feature sets per attack family, making its decisions more transparent and easier to interpret in practice.
However, for critical applications in healthcare, accuracy alone is not enough. To validate the model’s trustworthiness, we also performed a Robust Explainability Fidelity Test. The methodology is as follows:
  • The top 10 most influential features were taken from the complex XGBoost model.
  • A simpler, inherently interpretable DecisionTreeClassifier was trained by using only this reduced set of 10 features.
  • The performance of the simple model (99.85% 5-fold CV accuracy) was compared with that of the original complex model (99.91% 5-fold CV accuracy).
This comparison yields a fidelity score of 0.9994 (Figure 6). This near-1.0 score suggests that the predictions of the complex model are largely driven by a stable and internally consistent set of behavioural features.
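The fidelity test can be reproduced end-to-end with scikit-learn alone; in this self-contained sketch, a GradientBoostingClassifier on synthetic data stands in for the XGBoost model on CICIoMT2024 (an assumption of the illustration), so the absolute accuracies differ from those reported above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 29-feature traffic matrix (assumption).
X, y = make_classification(n_samples=600, n_features=29, n_informative=10,
                           random_state=0)

# 1. Fit the complex model and extract its top-10 features.
complex_model = GradientBoostingClassifier(random_state=0).fit(X, y)
top10 = np.argsort(complex_model.feature_importances_)[-10:]

# 2. Evaluate an interpretable proxy trained on the reduced feature set.
acc_complex = cross_val_score(complex_model, X, y, cv=5).mean()
acc_simple = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X[:, top10], y, cv=5).mean()

# 3. Fidelity = proxy accuracy relative to complex-model accuracy.
fidelity = acc_simple / acc_complex
```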

5.2. Performance of the Physiological Anomaly Detection Module

The physiological anomaly detection module, based on an unsupervised LSTM autoencoder, was evaluated on the extracted MIMIC-IV cohort using two complementary validation tracks. First, we performed controlled fault injection experiments to quantify sensitivity to technical artefacts such as flat-lined segments and transient dropouts, where the reconstruction error serves as an interpretable and time-aligned anomaly signal. Second, to assess clinical relevance beyond synthetic faults, we evaluated the module against a clinically grounded severity proxy derived from real ICU interventions and physiologic extremes, as previously seen in Table 4. This proxy enables an outcome-oriented sanity check, namely whether elevated reconstruction error and channel-level residual profiles tend to co-occur with windows that reflect clinically plausible instability, rather than only with injected artefacts. As shown in Figure 7, the model’s reconstruction of a sequence with an injected fault (a flat-lined signal) deviates significantly from the original signal, resulting in a high, detectable reconstruction error.
While the model reconstructs normal segments with high fidelity, it continues to infer nominal dynamics during the fault and therefore fails to reproduce the flat, constant segment. This mismatch produces a local and sustained peak in the reconstruction error that persists for the entire duration of the fault. This peak is clearly separated from the error values observed during healthy operation, so a simple threshold can reliably detect the anomaly. Furthermore, the temporal alignment between the error peak and the faulty interval shows that the detector is sensitive to structural changes in the waveform rather than to slow amplitude drift or random noise. After the fault ends, the error quickly returns to baseline, which supports high specificity and a low false-alarm rate. Taken together, these results indicate that reconstruction error is a practical and interpretable signal for real-time fault detection in sequential data [43].
A feature-level analysis of the reconstruction error allows for the identification of the root cause of the anomaly. Figure 8 shows that for a given anomalous sequence, the model attributes most of the reconstruction error to the specific sensor that was synthetically corrupted (in this case, Temperature_C), demonstrating the model’s ability to pinpoint the source of a technical fault.
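This channel-level attribution is a per-feature decomposition of the same MAE used as the global anomaly score; a minimal sketch for one window, with illustrative channel names:

```python
import numpy as np

def channel_attribution(X, X_hat, channels):
    """Per-channel mean absolute reconstruction error for one (T, F) window.

    Returns the error per channel and the channel with the largest
    residual, i.e. the most likely source of the fault."""
    per_channel = np.mean(np.abs(X - X_hat), axis=0)
    errors = dict(zip(channels, per_channel.tolist()))
    culprit = channels[int(np.argmax(per_channel))]
    return errors, culprit
```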

5.3. Integrated Fusion Framework Evaluation

The final set of experiments was designed to assess the integrated fusion framework across the key dimensions of resilience, efficiency, and its final decision-making performance. The objective here is to validate the behaviour of the whole system.

5.3.1. Fault Resilience

To assess robustness to real-world data-quality issues, we performed a fault-resilience test. The results in Figure 9 show a non-monotonic relationship between the percentage of failed sensors and the reconstruction error of the physiological model. As expected, the error increases markedly with partial failures, reflecting the model projecting a corrupted signal back onto its learned manifold of normal data; this is consistent with the system’s high sensitivity to data corruption. More interestingly, the error decreases at extreme fault levels (>50%). This nuanced result is consistent with reconstruction-based detectors, where extreme corruption can reduce input variance and therefore reduce reconstruction mismatch in standardised space [44]. The decrease is attributed to the loss of complexity of the input signal: a massively corrupted signal becomes a new, simpler pattern that the autoencoder can reconstruct with a lower error. This may reduce sensitivity at extreme corruption levels and therefore motivates complementary fault indicators and persistence checks.
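The fault-injection protocol behind this test can be sketched as follows: a chosen fraction of channels is frozen to a constant plateau (here, the window's first value, an illustrative choice) before the window is passed through the autoencoder:

```python
import numpy as np

def inject_flatline(window, fault_fraction, rng=None):
    """Flat-line a random fraction of channels in a (T, F) window,
    emulating frozen or disconnected sensors at a given severity level."""
    rng = np.random.default_rng(0) if rng is None else rng
    faulty = window.copy()
    n_fail = max(1, int(round(fault_fraction * window.shape[1])))
    channels = rng.choice(window.shape[1], size=n_fail, replace=False)
    faulty[:, channels] = faulty[0, channels]  # hold the first reading constant
    return faulty, channels
```

Sweeping fault_fraction from 0.1 to 0.9 and recording the reconstruction error of each corrupted window reproduces the severity curve summarised in Figure 9.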

5.3.2. Privacy Concerns

From a deployment perspective, IoMT systems inevitably handle sensitive health data and must comply with regulatory frameworks such as HIPAA in the United States and the GDPR in Europe. To make the privacy claim actionable, we translate the deployment concept into a concrete privacy-by-design implementation plan. First, data minimisation is enforced by processing raw physiological signals locally within the clinical environment, so that only low-dimensional indicators and alerts leave the site. Second, confidentiality is protected through encryption in transit (TLS) and encryption at rest (AES 256), with managed key storage, periodic key rotation, and restricted key access. Third, access to derived features and alerts is limited through role-based access control, least-privilege service accounts, and audit logging of data access and model outputs. Finally, pseudonymisation is applied to identifiers, and retention policies are defined to ensure that derived artefacts are retained only as long as necessary for clinical monitoring and safety review. Table 7 summarises the proposed privacy by design controls and their expected operational impact.
Beyond infrastructure controls, it is important to consider privacy leakage through model outputs. Even when raw signals remain on site, an adversary with access to repeated outputs may attempt to infer sensitive physiological information or membership in a cohort. We therefore define a conservative threat model where an attacker can observe alert streams and risk scores over time. As part of future deployment validation, we may perform a privacy-risk assessment that includes membership inference style tests against the fusion model outputs, attribute inference tests for sensitive clinical states using only released outputs, and stress tests that evaluate whether reconstruction artefacts could be exploited if reconstructions were exposed. In the current design, we reduce risk by limiting what is exported outside the site, namely discrete alert levels and coarse risk indicators rather than full reconstructions or raw time series. We also recommend operational mitigations, such as rate limiting, output aggregation over time windows, and periodic privacy auditing, to ensure that utility does not come at the cost of unintended disclosure.
Although a full legal analysis is beyond the scope of this work, our design choices are compatible with privacy-preserving deployment patterns. In particular, the proposed framework can be executed on edge gateways so that raw physiological signals remain within the clinical network, and only aggregated alerts or de-identified indicators are transmitted to central servers. Additional mechanisms, such as access control, encryption in transit and at rest, and periodic auditing of model outputs for bias or leakage, are required in practice to ensure compliance. Future extensions of this work may also consider integrating formal privacy safeguards and data-minimisation strategies directly into the pipeline, aligning anomaly detection with privacy-by-design principles.

5.3.3. Time Efficiency for Edge Deployment

To validate the framework’s suitability for resource-constrained edge devices, we measured the average inference latency. As detailed in Table 8, the total time for one single prediction cycle, which includes both the security and physiological module, was 84.69 ms on average, with a p95 upper bound of 107.30 ms. This low latency falls well within typical real-time monitoring requirements, confirming that the framework is computationally efficient and practical for edge deployment.
However, these timings were obtained in a controlled execution setting, and they should be interpreted as indicative of computational feasibility rather than as a deployment benchmark. On low-power edge hardware, latency may increase due to reduced CPU throughput, memory bandwidth limits, and background system load. For this reason, full on-device validation remains part of our future work, including systematic profiling of the latency distribution, CPU and memory utilisation, and energy consumption.
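As a concrete illustration of how such latency statistics can be collected, the following minimal Python sketch measures mean and p95 per-cycle latency for an arbitrary prediction function. The `profile_latency` helper and the toy matrix-product workload are illustrative assumptions, not the instrumentation used in our experiments.

```python
import time

import numpy as np


def profile_latency(predict_fn, inputs, warmup=10, runs=200):
    """Measure per-prediction latency (ms) with mean and p95 summaries.

    `predict_fn` stands in for one full fusion cycle (security +
    physiological modules); it is a placeholder, not the paper's API.
    """
    for x in inputs[:warmup]:            # warm-up to exclude one-off costs
        predict_fn(x)
    timings = []
    for _ in range(runs):
        x = inputs[np.random.randint(len(inputs))]
        t0 = time.perf_counter()
        predict_fn(x)
        timings.append((time.perf_counter() - t0) * 1000.0)  # seconds -> ms
    return {"mean_ms": float(np.mean(timings)),
            "p95_ms": float(np.percentile(timings, 95))}


# Toy stand-in for a prediction cycle: a small matrix product over one window.
rng = np.random.default_rng(0)
windows = [rng.normal(size=(60, 6)) for _ in range(50)]
weights = rng.normal(size=(6, 3))
stats = profile_latency(lambda w: w @ weights, windows)
print(stats)
```

On real edge hardware, the same helper could wrap the actual fusion pipeline to obtain the latency distribution reported in Table 8.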

5.3.4. Holistic Performance of Fusion Logic

The definitive evaluation of the framework was conducted by separating the data into training, calibration, and evaluation partitions. Rather than using fixed fusion weights, we trained a dedicated fusion classifier on the training partition to combine the security anomaly score and the physiology-aware clinical anomaly evidence, including reconstruction error summaries and the student model probabilities. We then selected the Critical and High-Risk probability thresholds on the calibration partition by maximising macro F1, and we finally reported the performance on the held-out evaluation partition. This procedure provides an adaptive fusion rule driven by validation evidence, while keeping the final alert mapping explicit through calibrated thresholds.
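The calibration procedure above can be sketched as follows. The snippet trains a lightweight fusion classifier on synthetic stand-ins for the fusion inputs, selects Critical and High-Risk probability thresholds on a calibration split by maximising macro F1, and reports performance on a held-out evaluation split. The synthetic data, the logistic regression fusion model, and the `decide` mapping (in which Critical takes precedence) are illustrative assumptions, not the exact features or model of the framework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Synthetic stand-ins for the fusion inputs (security score,
# reconstruction-error summary, student-model probability).
n = 3000
y = rng.integers(0, 3, size=n)            # 0 Stable, 1 High-Risk, 2 Critical
X = rng.normal(size=(n, 3)) + y[:, None] * 0.9

# Partition into training, calibration, and evaluation splits.
tr, ca, ev = np.split(rng.permutation(n), [int(0.6 * n), int(0.8 * n)])

fusion = LogisticRegression(max_iter=500).fit(X[tr], y[tr])


def decide(proba, t_crit, t_high):
    """Map class probabilities to alerts using class-specific thresholds."""
    out = np.zeros(len(proba), dtype=int)
    out[proba[:, 1] >= t_high] = 1
    out[proba[:, 2] >= t_crit] = 2        # Critical takes precedence
    return out


# Select thresholds on the calibration split by maximising macro F1.
p_ca = fusion.predict_proba(X[ca])
best = max(((f1_score(y[ca], decide(p_ca, tc, th), average="macro"), tc, th)
            for tc in np.linspace(0.2, 0.8, 13)
            for th in np.linspace(0.2, 0.8, 13)))
macro_f1_cal, t_crit, t_high = best

# Report on the held-out evaluation split with the frozen thresholds.
p_ev = fusion.predict_proba(X[ev])
macro_f1_eval = f1_score(y[ev], decide(p_ev, t_crit, t_high), average="macro")
print(round(macro_f1_cal, 3), round(macro_f1_eval, 3))
```

The key point is that the thresholds are frozen after calibration, so the evaluation split never influences the alert mapping.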
Table 9 reports a comparative summary of alternative learned fusion variants and calibrated threshold configurations on the evaluation partition, highlighting the robustness of the fusion logic across operating points. The configuration retained for the remainder of the analysis balances macro F1 performance with safety-oriented constraints on Critical and High-Risk alerts.
The detailed per-class performance of the selected fusion configuration is reported separately in Table 10. These metrics provide a fine-grained view of how the framework behaves across the three operating states, Stable, High-Risk, and Critical, and complement the global summary statistics reported above. In particular, the Critical state achieves high precision and recall, indicating that severe conditions are both reliably identified and rarely over-triggered. This behaviour is essential in clinical monitoring, where missed critical events and unnecessary escalations both carry significant risk.
While Table 10 provides a per-class quantitative summary, the structure of residual errors is exposed more clearly by the confusion matrix and error rates. The confusion matrix in Figure 10 provides visual confirmation of the intelligent, “fail-safe” behaviour of the system: the great majority of classification errors occur between adjacent classes (for example, downgrading a Critical event to High-Risk, or vice versa). Crucially, there are almost no instances of the most dangerous errors, such as misclassifying a Critical or High-Risk event as Stable. This behaviour, detailed further by the non-zero FPR and FNR values in Table 11, is essential for a system intended for clinical deployment and confirms the validity of our approach. To understand the practical implications of these errors, we further analysed misclassifications by anomaly source. We grouped evaluation samples into three interpretable categories according to which subsystem produced the dominant evidence: network-driven anomalies (security score dominant), physiology-driven anomalies (physiological score dominant), and sensor-fault-dominated anomalies (low sensor health with elevated reconstruction error). This stratification allows us to distinguish clinically meaningful under-escalation from benign ambiguity between adjacent risk levels. In practice, a Critical event predicted as High-Risk represents delayed escalation rather than complete failure, whereas a Critical event predicted as Stable would be clinically unsafe. Our analysis confirms that misclassifications concentrate between adjacent states and are predominantly explained by borderline evidence in which only one modality is strongly abnormal, while the most safety-critical confusions remain rare (Table 12).
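The stratification of errors by dominant evidence source can be sketched as follows. The evidence scores, the threshold in the `dominant_source` rule, and the synthetic predictions are illustrative assumptions rather than the exact criteria used in our analysis.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
y_true = rng.integers(0, 3, size=n)        # 0 Stable, 1 High-Risk, 2 Critical
# Synthetic predictions whose errors are confined to adjacent classes.
y_pred = np.clip(y_true + rng.choice([-1, 0, 0, 0, 1], size=n), 0, 2)

# Per-sample evidence summaries (illustrative values, not study data).
sec_score = rng.random(n)                  # security anomaly score
phys_score = rng.random(n)                 # physiological anomaly score
sensor_health = rng.random(n)              # 0 = faulty, 1 = healthy


def dominant_source(sec, phys, health, health_thr=0.2):
    """Label each sample by the subsystem providing the dominant evidence."""
    if health < health_thr and phys > 0.5:
        return "sensor-fault"              # low health, high reconstruction error
    return "network" if sec >= phys else "physiology"


groups = np.array([dominant_source(s, p, h)
                   for s, p, h in zip(sec_score, phys_score, sensor_health)])

errors = y_true != y_pred
adjacent = np.abs(y_true - y_pred) == 1    # e.g. Critical -> High-Risk
unsafe = (y_true == 2) & (y_pred == 0)     # Critical predicted as Stable

for g in ("network", "physiology", "sensor-fault"):
    m = errors & (groups == g)
    print(g, int(m.sum()), "errors,", int((m & adjacent).sum()), "adjacent")
print("clinically unsafe confusions:", int(unsafe.sum()))
```

By construction, this toy example reproduces the qualitative pattern of Table 12: errors concentrate between adjacent states, and Critical-to-Stable confusions are absent.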
The very low false positive and false negative rates observed for the High-Risk and Critical classes should be interpreted in light of three factors: the large evaluation sample size, the ordinal structure of the risk states, and the use of calibrated, safety-oriented decision thresholds that prioritise Critical detection. Importantly, most residual errors correspond to adjacent-class confusions rather than clinically unsafe misclassifications.
To validate the core fusion logic, a final test was executed across four simulated operational scenarios. The methodology for this final test had two key refinements to ensure the results were robust and meaningful:
  • Soft Risk Calibration: Instead of using raw output scores, a calibrated risk score (risk_cal) was calculated by normalising the raw scores to their 10th and 90th percentiles. This reduces sensitivity to outliers and produces a more stable risk distribution.
  • Intelligent Sample Selection: For every scenario, a representative data sample was selected deterministically: one whose calibrated risk score was typical for its category, rather than an arbitrary or extreme example.
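The two refinements above can be sketched as follows. The `soft_calibrate` helper, the synthetic score distribution, and the median-based representative selection are illustrative assumptions consistent with the described procedure.

```python
import numpy as np


def soft_calibrate(raw_scores, low_pct=10, high_pct=90):
    """Normalise raw risk scores against their 10th/90th percentiles.

    Values are clipped to [0, 1], which dampens the influence of
    outliers and yields a more stable calibrated risk distribution.
    """
    lo, hi = np.percentile(raw_scores, [low_pct, high_pct])
    return np.clip((raw_scores - lo) / (hi - lo), 0.0, 1.0)


rng = np.random.default_rng(1)
raw = np.concatenate([rng.normal(0.3, 0.05, 500),   # typical raw scores
                      rng.normal(5.0, 1.0, 5)])     # a few extreme outliers
risk_cal = soft_calibrate(raw)

# Representative sample: the one whose calibrated risk is closest
# to the median of its category (here, the whole toy batch).
rep = int(np.argmin(np.abs(risk_cal - np.median(risk_cal))))
print(float(risk_cal.min()), float(risk_cal.max()), rep)
```

Because the normalisation range is anchored to the 10th and 90th percentiles rather than the minimum and maximum, the handful of extreme outliers saturates at 1.0 instead of compressing the rest of the distribution.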
To give a more granular and intuitive view of the behaviour validated by the statistical results above, a final demonstration was conducted across four key operational scenarios. This test deterministically selects a representative example for each category to ensure stable and reproducible results. For clarity of presentation, the underlying raw risk scores were rescaled using a supervised mapping: scores from known Stable events were mapped to the range [0.0, 0.5], whereas scores from High-Risk and Critical events were mapped to the range [0.5, 1.0]. The final alert level shown is derived directly from the ground-truth classification of each scenario, providing a correct and unambiguous illustration. All four scenarios are presented in Table 13.
The framework’s decision-making process, as illustrated by the representative examples above, demonstrates a high degree of logical coherence and contextual awareness.
  • Scenario 1 (Normal operation): The “Normal” scenario correctly assigns a System Stable alert with a corresponding low risk score, thereby confirming the system’s baseline stability.
  • Scenarios 2 and 3 (Single-Domain Threats): These results illustrate the balanced logic of the framework. Both a network security attack (Scenario 2) and a significant physiological anomaly (Scenario 3) are correctly identified as High-Risk Detected. Both rescaled scores fall appropriately within the high-risk band (≥0.5), indicating that the system treats threats to data integrity and to patient physiology with comparable priority.
  • Scenario 4 (Converged Cyber–Physical Threat): The framework correctly identifies the simultaneous occurrence of a security attack and a physiological anomaly as the highest threat condition and assigns a Critical alert with a correspondingly high risk score. This supports the main objective of the system: successful detection and flagging of the most serious scenario.
From an interpretability perspective, the fusion decision remains traceable. Although the final alert is produced by a learned fusion classifier rather than a fixed weighted sum, the fusion inputs are explicit and clinically meaningful, and standard explainability tools can be applied directly to the fusion model. In practice, SHAP can attribute each decision to the security score, the global reconstruction error, the per-channel reconstruction profile, and the student probabilities, allowing operators to see whether a High-Risk or Critical alert is primarily driven by network evidence, physiological deviation, sensor quality degradation, or a combination of factors. This preserves actionable transparency while avoiding the rigidity of constant fusion weights, and it supports safer alert calibration to reduce false positives and limit clinician fatigue in real settings. Overall, the empirical results validate the integrated framework across key objectives, including accuracy, interpretability, fault resilience, efficiency for real-time use, and context-aware decision making.
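The attribution described above can be illustrated with a dependency-light stand-in. The sketch below trains a toy fusion classifier on synthetic inputs named after the fusion features and ranks them by permutation importance; in the framework itself, SHAP values are computed on the real fusion model, and the feature names, data, and model here are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)

# Fusion inputs named after those in the text; values are illustrative only.
names = ["security_score", "recon_error_global", "recon_error_spo2",
         "student_prob_high", "student_prob_critical"]
n = 2000
X = rng.random((n, len(names)))
# Toy label: alerts driven mainly by the security score and global error.
y = ((0.6 * X[:, 0] + 0.4 * X[:, 1]
      + 0.05 * rng.normal(size=n)) > 0.5).astype(int)

fusion = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global attribution: which fusion input drives the alerts?
imp = permutation_importance(fusion, X, y, n_repeats=5, random_state=0)
ranked = sorted(zip(names, imp.importances_mean), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name:22s} {score:.3f}")
```

The same traceability argument applies per decision when SHAP is used instead: each alert decomposes into contributions from the named fusion inputs, which an operator can inspect directly.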

5.3.5. Deployment Considerations and Practical Limitations

We designed the framework with deployment constraints in mind, but several practical limitations remain. Real IoMT pipelines face missingness, quantisation, device heterogeneity, and distribution shift across wards and vendors, which can affect calibration and alert burden. A further challenge in long-term deployments is data drift, where the statistical properties of inputs and outcomes change over time. On the clinical side, drift can arise from changes in patient case mix, new clinical protocols, device replacement, sensor recalibration, and evolving documentation practices, which may shift the distribution of vital signs and missingness patterns. On the security side, drift is expected as new attack families emerge, adversaries change tactics, and network configurations evolve, which can alter traffic behaviour even under benign conditions. For this reason, sustainable deployment requires continuous monitoring of input distributions and alert rates, together with periodic recalibration of decision thresholds and planned model updates, as part of a continual learning and monitoring strategy. In practical terms, drift monitoring can rely on lightweight statistics and distance measures on the feature space, combined with tracking of confidence and disagreement between modules, to flag when the operating conditions deviate from those observed during development. Integration also requires interoperability with clinical systems and security tooling, and governance constraints may limit access to raw device telemetry and provenance metadata. In addition, privacy risk should be monitored over time, because changes in access patterns, alert frequency, and output granularity can increase the risk of unintended disclosure even when the underlying models remain unchanged. Finally, the framework provides decision support rather than diagnosis, and it requires site-specific thresholding and post-deployment monitoring to remain safe under conditions of distributional drift. 
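One lightweight drift statistic suitable for such monitoring is the Population Stability Index (PSI) over a feature distribution. The sketch below is illustrative: the `population_stability_index` helper, the heart-rate-like feature, and the rule-of-thumb thresholds are assumptions, not components of the deployed framework.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between a reference window and the current operating window.

    A common rule of thumb treats PSI < 0.1 as stable, 0.1-0.25 as
    moderate drift, and > 0.25 as a signal to recalibrate or retrain.
    """
    edges = np.percentile(reference, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # cover out-of-range values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)   # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(5)
baseline = rng.normal(80, 10, 5000)   # e.g. heart-rate feature at development time
stable = rng.normal(80, 10, 5000)     # same operating conditions
shifted = rng.normal(92, 14, 5000)    # changed case mix or replaced device

psi_stable = population_stability_index(baseline, stable)
psi_shifted = population_stability_index(baseline, shifted)
print(round(psi_stable, 4), round(psi_shifted, 4))
```

Tracking such statistics per feature, alongside alert rates and inter-module disagreement, provides an inexpensive trigger for the threshold recalibration and model updates discussed above.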
Quantifying long-term stability through drift simulations, using time-ordered evaluation on the physiological streams and incremental introduction of unseen attack types on the security stream, may help measure degradation and recovery after recalibration or retraining.

6. Conclusions and Future Work

The growing use of wearable IoMT devices in modern healthcare offers new opportunities for continuous patient monitoring but also presents significant challenges to system reliability and patient safety. This work addresses the need for an anomaly detection framework that can clearly distinguish between physiological emergencies, sensor malfunctions, and cybersecurity incidents. To this end, we have designed and evaluated a hybrid multi-sensor fusion framework that combines supervised learning, unsupervised models, and DL components within a single architecture.
Our experiments show that the framework is both effective and robust. On the CICIoMT2024 dataset, the XGBoost security module reached a cross-validated accuracy of 99.91% on the multi-class task, showing that it can reliably detect network threats. On the MIMIC-IV dataset, the LSTM autoencoder module correctly identified both patterns consistent with clinical deterioration and the synthetically injected sensor faults, thereby fulfilling one of the primary objectives of this study. The integrated system also met real-time constraints, with an average inference latency of 84.69 ms, which is compatible with edge deployment in IoMT scenarios. In addition, interpretability analyses based on SHAP values and reconstruction error profiles showed that the model decisions depend on features that are clinically and technically meaningful, rather than on arbitrary artefacts.
The main contribution of this study is a unified framework that produces contextualised, explainable, and reliable alerts, which represents a concrete step toward more trustworthy IoMT systems. By supporting risk stratification across clinical, technical, and security dimensions, the proposed system provides a solid basis for the next generation of intelligent patient monitoring solutions.
Despite the encouraging results, certain limitations must be acknowledged. The clinical validation was conducted using MIMIC-IV in a controlled experimental setting, with a fixed set of routinely charted signals, a patient-based split, and fixed-length time windows, which may not fully reflect the diversity of monitoring practices, devices, and patient pathways observed in routine care. While we introduce a clinically grounded severity proxy to support large-scale validation on ICU data, this rule-based labelling remains an approximation, and it does not replace clinician adjudication or curated outcomes. Moreover, the sensor faults analysed in this study were generated synthetically in software. While this design allowed controlled experiments across different fault severities, it does not yet replace validation under real hardware failures, communication interruptions, or practical artefacts, such as gradual sensor drift and brief periods of missing data, which are common in wearable monitoring. Future work will therefore extend the evaluation by incorporating fault patterns observed in real IoMT devices and by testing the framework with real hardware in the loop to confirm the robustness and generalisability of the framework under operational conditions.
Future research should also advance in three key directions. First, deploying the proposed framework on resource-constrained edge devices (e.g., Raspberry Pi) would allow for validation under real-world operational constraints. This evaluation would quantify not only latency changes but also energy consumption and thermal behaviour during sustained operation, as well as the impact of network fluctuations, including variable latency, intermittent connectivity, and packet loss, by measuring how alert timing and reliability change under realistic communication conditions. Second, implementing adaptive learning mechanisms where clinician-validated anomalies contribute to periodic model retraining could enhance the system’s long-term accuracy and relevance, especially under data drift. This includes scheduled retraining and threshold recalibration, alongside conservative update policies, such as shadow testing of new models prior to deployment and rollback procedures when alert burden or error patterns deteriorate. Future work may include a time-ordered evaluation protocol on MIMIC-IV and a rolling-window assessment, reporting performance and alert burden over time, and comparing mitigation strategies such as threshold recalibration, scheduled retraining, and conservative update policies, including shadow testing and rollback. Lastly, exploring more advanced fusion strategies, such as lightweight neural networks operating at the fusion layer, could improve the system’s capacity to capture and model complex interdependencies among diverse risk factors more effectively.

Author Contributions

R.W.A. and F.P.: conceptualised the study and designed the methodology. F.P.: performed the experiments and collected the data. R.W.A. and S.A.: conducted formal analysis and data curation. F.P.: prepared the first draft of the manuscript. R.W.A. and N.H.J.: reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This research did not involve human participants, personal data, or any procedures requiring informed consent.

Data Availability Statement

For more details about the experimental settings, data preprocessing procedures, model implementations, and motivations of this work, please refer to the following GitHub repository: https://github.com/syriuslab/fusion_framework/releases/tag/v1.0.4, accessed on 4 December 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ABP: Arterial Blood Pressure
AE: Autoencoder
AUC: Area Under the Curve
AUROC: Area Under the Receiver Operating Characteristic Curve
AUPRC: Area Under the Precision–Recall Curve
CPU: Central Processing Unit
CSV: Comma-Separated Values
DDoS: Distributed Denial of Service
DoS: Denial of Service
EHR: Electronic Health Record
ffill: Forward-fill imputation
GPU: Graphics Processing Unit
HR: Heart Rate
ICU: Intensive Care Unit
IoMT: Internet of Medical Things
IoT: Internet of Things
LSTM: Long Short-Term Memory
MAE: Mean Absolute Error
MAP: Mean Arterial Pressure
MIMIC-IV: Medical Information Mart for Intensive Care IV
ML: Machine Learning
PR: Precision–Recall
RAM: Random Access Memory
ROC: Receiver Operating Characteristic
RR: Respiratory Rate
SpO2: Peripheral oxygen saturation
TEMP: Temperature
XGB: Extreme Gradient Boosting (XGBoost)

Appendix A. Literature Search Strategy

We queried the following sources to cover both engineering venues and clinical informatics outlets: IEEE Xplore, ACM Digital Library, Scopus, Web of Science, and PubMed; Google Scholar was used as an additional source to capture preprints and cross-disciplinary venues. We prioritised studies published from 2018 onwards to reflect the rapid evolution of IoMT threat surfaces and deep time-series models, while retaining a small set of foundational works when they introduced widely adopted threat models, fusion taxonomies, or evaluation practices that remain current.
We combined keywords along three axes:
(i) IoMT security and intrusion detection (e.g., “IoMT”, “medical IoT”, “wearable”, “intrusion detection”, “cyber attack”, “data injection”, “spoofing”), (ii) physiological time-series anomaly detection (e.g., “ICU”, “EHR”, “MIMIC”, “vital signs”, “time-series anomaly”, “autoencoder”, “LSTM”, “reconstruction error”), and (iii) fusion and cyber–physical monitoring (e.g., “multimodal fusion”, “decision-level fusion”, “risk scoring”, “calibration”, “cyber–physical”).
Examples of query forms we used include:
(“IoMT” OR “medical IoT” OR “wearable”) AND (“intrusion detection” OR “anomaly detection”) AND (“XGBoost” OR “tree-based” OR “explainable”).
(“MIMIC” OR “ICU” OR “EHR”) AND (“LSTM autoencoder” OR “autoencoder” OR “reconstruction error”) AND (“anomaly detection” OR “fault”).
(“cyber–physical” OR “multimodal”) AND (“fusion” OR “risk score” OR “calibration”) AND (“healthcare” OR “IoMT”).
Finally, we screened records by title and abstract, followed by full-text screening when needed, retaining studies that reported (i) a clear evaluation protocol, (ii) a public or well-described dataset, and (iii) quantitative metrics that enabled comparison. We excluded conceptual-only architectures without experiments and papers that did not provide enough methodological detail to support reproducibility.

Figure 1. High-level architecture of the proposed fusion framework, where IoMT network traffic and physiological signals are analysed by specialised modules and combined in the fusion layer to generate context-aware alerts.
Figure 2. Anomaly detection mechanism of the LSTM autoencoder using controlled fault injection on the SpO2 channel. A short segment of the input is intentionally replaced with an implausible constant plateau to mimic sensor-related artefacts. Signals are shown on a standardised scale for modelling, not in clinical units.
Figure 3. Conceptual decision flow of the fusion framework, including key results, showing how expert scores are combined into a risk score and mapped to Stable, High-Risk, or Critical operating states.
Figure 4. Multi-class ROC curves for the XGBoost security model on the CICIoMT2024 grouped labels, showing high separability between Normal and attack classes with a macro AUC-ROC of 0.99.
Figure 5. SHAP summary plots for the XGBoost security model, highlighting which traffic features most strongly contribute to distinguishing normal flows from different attack categories. (a) SHAP value for DDoS and DoS; (b) SHAP value for Malformed and Normal; (c) SHAP value for Recon and Spoofing.
Figure 6. Proxy decision tree trained on the top ten features, used to measure the fidelity and approximate interpretability of the XGBoost-based security classifier.
Figure 7. Example of a synthetic sensor fault injected in the physiological signal, showing the corresponding spike in reconstruction error and the resulting change in inferred risk.
Figure 8. Per-channel reconstruction error attribution for a representative faulty window.
Figure 9. Behaviour of the physiological module under increasing levels of synthetic signal corruption, illustrating the non-linear relationship between fault severity and reconstruction error.
Figure 10. Confusion matrix of the integrated fusion framework, showing that most misclassifications occur between adjacent risk classes, while Critical events are rarely predicted as Stable.
Table 1. Structured summary of related work on IoMT anomaly detection and cyber–physical fusion, reporting the main methodological streams, representative techniques, and commonly used datasets, also highlighting the research gap addressed by the proposed Intelligent Fusion framework.
| Work Stream | Representative References | Typical Techniques and Algorithms | Typical Datasets and Benchmarks Used in That Stream | Gap That Motivates Intelligent Fusion |
|---|---|---|---|---|
| IoMT network intrusion and anomaly detection | [12,13,14,15,16,17] | Signature IDS → ML anomaly detection, tree-based ensembles (often boosted trees), CPS-inspired IDS | Flow-based NIDS corpora and IoT/IoMT traffic benchmarks (often CIC-like families), plus IoMT-specific testbeds | Usually treats security in isolation, without physiological context or sensor-quality reasoning |
| Dependable and explainable IDS in IoMT | [6,11] | Ensemble learning with XAI (e.g., SHAP), post hoc interpretability | IoMT intrusion datasets and intrusion benchmarks used for explainability studies | Interpretation is addressed, but cross-domain root-cause separation (security vs. clinical vs. fault) is not central |
| Physiological anomaly detection in clinical time series | [22,23,24,28] | Deep time-series anomaly detection, autoencoders, reconstruction-error pipelines | ICU/EHR resources, prominently MIMIC-family databases | Often under-models artefacts and measurement idiosyncrasies that appear in real monitoring |
| Deep LSTM autoencoder patterns for physiological signals | [29] | LSTM autoencoders, reconstruction error for anomaly scoring | Annotated physiological repositories (e.g., ECG-centric corpora in that literature) | Typically not connected to adversarial interference, and rarely assessed under explicit sensor-fault stress |
| Sensor integrity, fault detection, and recovery in wearables/WBAN | [27,34] | Fault detection and recovery schemes, signal-quality assessment, robustness heuristics | WBAN/wearable recordings and quality-focused datasets | Surveys highlight that validation is still emerging, and fault tolerance is under-tested at scale |
| IoMT and wearable data fusion frameworks | [8,9,10,30] | Multi-modal fusion (incl. fuzzy frameworks), feature/decision fusion strategies | Wearable multi-sensor streams, ECG fusion evaluations | Fusion is studied, but not explicitly framed as a unified tripartite diagnosis (security, physiology, sensor faults) under calibrated escalation policies |
| General multi-sensor fusion theory and evidence-based fusion | [31,32,33] | Classical multi-sensor fusion, evidence theory (e.g., Dempster–Shafer) | Domain-agnostic, used as a methodological foundation | Provides rationale, but needs operationalisation with IoMT-specific anomaly semantics |
Table 2. Removed protocol identifier attributes and retained behavioural features for the CICIoMT2024 security module.
| Group | Attributes |
|---|---|
| Removed protocol identifiers | Protocol Type, HTTP, HTTPS, DNS, Telnet, SMTP, SSH, IRC, TCP, UDP, DHCP, ARP, ICMP, IGMP, IPv, LLC |
| Retained behavioural features | Header Length, Duration, Rate, Srate, Tot sum, Tot size, Min, Max, AVG, Std, IAT, Number, Radius, Magnitude, Variance, Covariance, Weight, Fin flag number, Syn flag number, Rst flag number, Psh flag number, Ack flag number, Ece flag number, Cwr flag number, Fin count, Syn count, Ack count, Rst count |
Table 3. Aggregation of attack classes.
| Aggregated Category | Example Original Classes Included |
|---|---|
| Normal | Benign |
| DDoS | TCP_IP-DDoS-SYN1, MQTT-DDoS-Connect_Flood, etc. |
| DoS | TCP_IP-DoS-SYN1, MQTT-DoS-Connect_Flood, etc. |
| Recon | Recon-OS_Scan, Recon-Port_Scan, Recon-Ping_Sweep |
| Spoofing | ARP_Spoofing |
| Malformed | MQTT-Malformed_Data |
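The aggregation in Table 3 amounts to a simple label-mapping step before training. A minimal sketch is shown below; the mapping covers only the example classes listed in the table (the full CICIoMT2024 label set is larger), and the names `AGGREGATION` and `aggregate_label` are illustrative.

```python
# Partial mapping from fine-grained CICIoMT2024 labels to the six grouped
# classes of Table 3; only the example classes from the table are listed.
AGGREGATION = {
    "Benign": "Normal",
    "TCP_IP-DDoS-SYN1": "DDoS",
    "MQTT-DDoS-Connect_Flood": "DDoS",
    "TCP_IP-DoS-SYN1": "DoS",
    "MQTT-DoS-Connect_Flood": "DoS",
    "Recon-OS_Scan": "Recon",
    "Recon-Port_Scan": "Recon",
    "Recon-Ping_Sweep": "Recon",
    "ARP_Spoofing": "Spoofing",
    "MQTT-Malformed_Data": "Malformed",
}

def aggregate_label(original: str) -> str:
    """Map a fine-grained attack label to its grouped category."""
    return AGGREGATION[original]

print(aggregate_label("Recon-Port_Scan"))  # Recon
```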
Table 4. Clinical severity proxy used to validate clinically meaningful physiological anomalies on MIMIC-IV windows.
| Severity Level | Rule Definition (Window Level) | Clinical Interpretation |
|---|---|---|
| Critical | Any of vasopressor use, mechanical ventilation, lactate ≥ 4.0 mmol/L, or mean arterial pressure < 60 mmHg | Proxy for shock, severe respiratory failure, or metabolic stress requiring urgent clinical attention |
| High | If not Critical and any of heart rate > 130 bpm, respiratory rate > 30 bpm, SpO2 < 90%, temperature > 39 °C, or temperature < 35 °C | Proxy for clinically relevant instability or acute deviation from expected physiologic ranges |
| Stable | Otherwise | Proxy for absence of severe instability signals within the window |
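The rules in Table 4 translate directly into a short decision function. The sketch below follows the table's thresholds exactly; the dictionary keys (`hr`, `map`, etc.) are hypothetical names for per-window summary values, and missing measurements are treated as non-triggering.

```python
import operator as op

def window_severity(w: dict) -> str:
    """Window-level severity proxy following Table 4.

    `w` maps illustrative keys (hr, rr, spo2, temp, map, lactate,
    vasopressor, ventilated) to per-window summary values; a missing
    key never triggers a rule."""
    def check(key, rel, thr):
        v = w.get(key)
        return v is not None and rel(v, thr)

    # Critical: shock / respiratory failure / metabolic stress proxies
    if (w.get("vasopressor") or w.get("ventilated")
            or check("lactate", op.ge, 4.0) or check("map", op.lt, 60.0)):
        return "Critical"
    # High: clinically relevant instability, checked only if not Critical
    if (check("hr", op.gt, 130) or check("rr", op.gt, 30)
            or check("spo2", op.lt, 90.0)
            or check("temp", op.gt, 39.0) or check("temp", op.lt, 35.0)):
        return "High"
    return "Stable"

print(window_severity({"hr": 140}))               # High
print(window_severity({"map": 55.0, "hr": 140}))  # Critical
print(window_severity({"hr": 80, "spo2": 97.0}))  # Stable
```

Note that the Critical rules are evaluated first, so a window that satisfies both rule sets is always labelled Critical, mirroring the "If not Critical" condition in the table.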
Table 5. Key hyperparameters and experimental settings used in the proposed pipeline.
| Component | Setting | Value | Notes |
|---|---|---|---|
| MIMIC-IV binning | Resampling interval | 15 min | Fixed grid for irregular charting |
| Selected vitals | Number of channels | 7 | HR, SpO2, RR, ABP (Sys/Dia/Mean), Temperature |
| Sequence construction | Window length | 24 steps | Sliding windows |
| Sequence construction | Stride | 1 | Overlapping windows |
| Split protocol | Unit of split | Subject-level | Leakage-safe partitioning |
| Scaling | Standardisation | Train-only StandardScaler | Prevents information leakage |
| Physiological module | Model | LSTM autoencoder | Reconstruction-based anomaly scoring |
| Physiological module | Loss/optimiser | Reconstruction loss / Adam | Early stopping used (patience-based) |
| Security module | Model | Gradient-boosted trees (XGBoost) | Multiclass classifier on CICIoMT2024 |
| Tuning policy | Hyperparameter optimisation | Not used at scale | Stable configurations, deployment-oriented |
| Tuning policy | Calibration/thresholds | Calibration split | Threshold selection as operating point |
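The sequence-construction settings in Table 5 (window length 24, stride 1 on a 15-min grid) can be realised with a simple sliding-window routine. This is a sketch under those settings, not the authors' code; `make_windows` and the array shapes are illustrative.

```python
import numpy as np

def make_windows(series, length=24, stride=1):
    """Build overlapping sliding windows from a (T, C) multichannel array,
    matching the Table 5 settings (24 steps, stride 1).
    Returns an array of shape (num_windows, length, C)."""
    T = series.shape[0]
    starts = range(0, T - length + 1, stride)
    return np.stack([series[s:s + length] for s in starts])

# 7 vital-sign channels on a 15-min grid; 48 steps = 12 hours of monitoring
x = np.zeros((48, 7))
windows = make_windows(x)
print(windows.shape)  # (25, 24, 7)
```

With stride 1, consecutive windows overlap in all but one time step, which is what gives the physiological module a dense, per-step anomaly score; note that the subject-level split in Table 5 must be applied before windowing so that overlapping windows from one patient never span the train/evaluation boundary.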
Table 6. Classification report for the XGBoost security model.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| DDoS | 0.9998 | 0.9999 | 0.9998 | 1,494,156 |
| DoS | 0.9998 | 0.9994 | 0.9996 | 558,803 |
| Malformed | 0.9259 | 0.8200 | 0.8697 | 1539 |
| Normal | 0.9791 | 0.9950 | 0.9870 | 57,820 |
| Recon | 0.9917 | 0.9768 | 0.9842 | 31,118 |
| Spoofing | 0.9013 | 0.8465 | 0.8731 | 4814 |
| Macro Avg | 0.9663 | 0.9396 | 0.9522 | 2,148,250 |
| Weighted Avg | 0.9988 | 0.9988 | 0.9988 | 2,148,250 |

Mean accuracy: 99.91%.
Table 7. Privacy-by-design technical controls and expected impact on confidentiality, integrity, and compliance.
| Control Area | Technical Measure | What Is Protected | Practical Note |
|---|---|---|---|
| Data minimisation | Local processing of raw signals; transmit only alerts and aggregated indicators | Reduces exposure of raw health data | Supports GDPR data minimisation and purpose limitation |
| Encryption in transit | TLS for network communication | Prevents interception during transfer | Applies to device-to-gateway and gateway-to-server links |
| Encryption at rest | AES-256 for stored artefacts; managed keys with rotation | Protects stored derived data and logs | Limits impact of storage compromise |
| Access control | Role-based access control, least privilege, service-account separation | Limits unauthorised access | Supports auditability and operational governance |
| Audit logging | Logging of access and alert-generation events | Enables accountability and incident response | Supports security monitoring under HIPAA safeguards |
| Pseudonymisation and retention | Remove direct identifiers; defined retention windows | Limits re-identification risk | Aligns with GDPR storage limitation |
Table 8. Inference latency results.
| Model Component | Average Inference Latency (ms) |
|---|---|
| Security model | 8.77 |
| Physiological model | 74.75 |
| Fusion | 1.16 |
| Total | 84.69 |
| p95 upper bound | ≈107.30 |
Table 9. Comparison of learned fusion variants on the evaluation partition, with thresholds selected on the calibration partition.
| Fusion Variant | Threshold Critical | Threshold High-Risk | Accuracy | Macro F1 | Balanced Accuracy | MCC |
|---|---|---|---|---|---|---|
| Fusion classifier | 0.82 | 0.15 | 0.9985 | 0.9818 | 0.9866 | 0.9970 |
| Fusion classifier robustness variant (balanced fusion, patched student) | 0.86 | 0.05 | 0.9982 | 0.9839 | 0.9957 | 0.9964 |
Table 10. Classification report for integrated fusion framework.
| Class | Precision | Recall | F1-Score | FPR |
|---|---|---|---|---|
| Stable | ≈0.91 | ≈0.94 | ≈0.92 | 0.002 |
| High-Risk | ≈0.998 | ≈0.995 | ≈0.996 | 0.028 |
| Critical | ≈0.94 | ≈0.971 | ≈0.955 | 0.003 |
Table 11. FPR and FNR for the fusion framework.
| Class | False Positive Rate (FPR) | False Negative Rate (FNR) |
|---|---|---|
| Stable | <0.2% | ≈3.0% |
| High-Risk | <0.2% | <0.1% |
| Critical | <0.2% | <0.1% |
Table 12. Misclassification patterns stratified by anomaly source on the evaluation split.
| Dominant Anomaly Source | Typical Error Pattern | Clinical Interpretation | Practical Mitigation |
|---|---|---|---|
| Network-driven anomalies | High-Risk predicted as Critical, or Critical predicted as High-Risk | Borderline network evidence, ambiguity in severity mapping | Calibrate Critical threshold with safety constraint; monitor alert burden |
| Physiology-driven anomalies | High-Risk predicted as Critical, or Critical predicted as High-Risk | Transition windows around deterioration or recovery | Use window-level smoothing and review high-residual channels |
| Sensor-fault-dominated anomalies | Stable predicted as High-Risk | Technical artefacts that mimic instability | Incorporate sensor-health weighting and require persistence before escalation |
Table 13. Calibrated risk scoring for multiple scenarios.
| Scenario | Security Status | Physiological/Technical Status | Calculated Risk Score | Final Framework Decision |
|---|---|---|---|---|
| 1 | Normal | Normal | 0.194 | System Stable |
| 2 | Attack Detected | Normal | 0.880 | High-Risk Detected |
| 3 | Normal | Anomaly Detected | 0.605 | High-Risk Detected |
| 4 | Attack Detected | Anomaly Detected | 0.948 | Critical Alert |
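The last stage of the pipeline maps calibrated class probabilities to an operating state using the class-specific thresholds of Table 9. The sketch below illustrates one plausible mapping under the assumption that Critical evidence is checked first (safety-oriented escalation); the probability values and the function name `fuse_decision` are illustrative, not taken from the paper.

```python
def fuse_decision(p_critical, p_high, thr_critical=0.82, thr_high=0.15):
    """Map calibrated class probabilities to an operating state using the
    class-specific thresholds reported in Table 9. Critical is evaluated
    first so high-severity evidence always dominates."""
    if p_critical >= thr_critical:
        return "Critical Alert"
    if p_high >= thr_high:
        return "High-Risk Detected"
    return "System Stable"

# Scenarios loosely mirroring Table 13 (probabilities purely illustrative)
print(fuse_decision(p_critical=0.05, p_high=0.10))  # System Stable
print(fuse_decision(p_critical=0.40, p_high=0.55))  # High-Risk Detected
print(fuse_decision(p_critical=0.90, p_high=0.08))  # Critical Alert
```

Because the High-Risk threshold (0.15) is much lower than the Critical one (0.82), the policy errs toward flagging ambiguous windows as High-Risk rather than leaving them Stable, consistent with the prudent escalation constraints described in the abstract.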
Pastore, F.; Anwar, R.W.; Jabeur, N.H.; Ali, S. Intelligent Fusion: A Resilient Anomaly Detection Framework for IoMT Health Devices. Information 2026, 17, 117. https://doi.org/10.3390/info17020117
