IoTBystander: A Non-Intrusive Dual-Channel-Based Smart Home Security Monitoring Framework

Chi, Haotian; Ma, Qi; Wang, Yuwei; Yang, Jing; Geng, Haijun

doi:10.3390/app15094795

by Haotian Chi *, Qi Ma *, Yuwei Wang, Jing Yang and Haijun Geng

School of Automation and Software Engineering, Shanxi University, Taiyuan 030031, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 4795; https://doi.org/10.3390/app15094795

Submission received: 19 March 2025 / Revised: 22 April 2025 / Accepted: 23 April 2025 / Published: 25 April 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:

The increasing prevalence of IoT technology in smart homes has significantly enhanced convenience but also introduced new security and safety challenges. Traditional security solutions, reliant on sequences of IoT-generated event data (e.g., notifications of device status changes and sensor readings), are vulnerable to cyberattacks, such as message forgery and interception and delaying attacks, and fail to monitor non-smart devices. Moreover, fragmented smart home ecosystems require vendor cooperation or system modifications for comprehensive monitoring, limiting the practicality of the existing approaches. To address these issues, we propose IoTBystander, a non-intrusive dual-channel smart home security monitoring framework that utilizes two ubiquitous platform-agnostic signals, i.e., audio and network, to monitor user and device activities. We introduce a novel dual-channel aggregation mechanism that integrates insights from both channels and cross-verifies the integrity of monitoring results. This approach expands the monitoring scope to include non-smart devices and provides richer context for anomaly detection, failure diagnosis, and configuration debugging. Empirical evaluations on a real-world testbed with nine smart and eleven non-smart devices demonstrate the high accuracy of IoTBystander in event recognition: 92.86% for recognizing events of smart devices, 95.09% for non-smart devices, and 94.27% for all devices. A case study on five anomaly scenarios further shows significant improvements in anomaly detection performance by combining the strengths of both channels.

Keywords:

IoT security; multi-platform smart homes; non-intrusive monitoring

1. Introduction

With the continuous progress in Internet of Things (IoT) technologies and the growing consumer demand for higher living standards, the smart home sector is experiencing substantial growth. According to Grand View, the market is projected to expand at a compound annual growth rate (CAGR) of 27.07% between 2023 and 2030, with an estimated market value of USD 537.01 billion by 2030 [1]. The penetration of smart homes in households is expected to increase from 18.9% in 2024 to 33.2% in 2028 [2]. Smart homes are residences equipped with smart IoT devices that are interconnected through platforms designed to enhance comfort, convenience, energy efficiency, and more by integrating internet connectivity, automation, and remote control capabilities. Homeowners can remotely monitor and control their devices through mobile applications and voice assistants, or configure automation rules that allow devices to function automatically based on predefined conditions (i.e., home automation). However, despite these advantages, the intricate interconnections and interactions among IoT devices, platforms, and applications also broaden the attack surface, making the smart home more susceptible to a variety of security and safety vulnerabilities [3,4,5,6,7].

Traditional security monitoring solutions [4,5,7,8,9,10,11] typically rely on event sequences generated by IoT devices, such as device status changes or sensor readings, to detect anomalies. These systems often analyze predefined event patterns to identify deviations that may indicate security threats, such as unauthorized access, device malfunctions, or cyberattacks. While effective in controlled environments, these approaches suffer from several critical limitations. One major drawback is their susceptibility to attacks that tamper with event data, such as message forging, interception, or delay, which can compromise the reliability of the system [12,13,14]. Furthermore, many traditional methods are inherently dependent on the correct reporting of events by the devices themselves. As a result, the integrity of security monitoring is directly tied to the accuracy of device-generated data, leaving the system vulnerable to failures, misconfigurations, and attacks that manipulate event notifications.

Additionally, these conventional solutions often struggle to account for non-smart devices (commonly referred to as “dumb devices”) within the smart home ecosystem. Since these devices do not generate event notifications in the same manner as IoT devices, they are largely ignored by traditional security mechanisms. This oversight limits the monitoring scope, leaving potential vulnerabilities undetected. Moreover, the growing trend of fragmented smart home ecosystems, where devices from multiple manufacturers are integrated into a variety of platforms, creates further challenges. Each platform typically supports only a subset of devices, and, without universal standards, it becomes increasingly difficult to achieve comprehensive security monitoring [15]. In many cases, such systems require vendor cooperation or even modifications to the devices or platforms, which is often unfeasible or impractical, especially in real-world scenarios. As a result, these existing solutions face significant scalability and interoperability challenges, further complicating their deployment and reducing their overall effectiveness in heterogeneous multi-platform environments.

To address the challenges posed by fragmented smart home ecosystems, we propose a platform-agnostic approach to achieve non-intrusive monitoring of physical activities. In this paper, we leverage two ubiquitous in situ physical channels, audio and network traffic, to monitor activities within smart homes. Each channel offers distinct advantages and limitations (see Section 7.2 for further details). To provide a more comprehensive and reliable view, we adopt a multi-modal perception strategy. Specifically, we introduce IoTBystander, a non-intrusive dual-channel security monitoring framework designed for standalone deployment. This design ensures that IoTBystander can be deployed across diverse smart home environments without the need for vendor cooperation or system modifications, making it a versatile and universally applicable solution. IoTBystander integrates effective event recognition methods from both audio and network traffic channels. Additionally, it features a novel dual-channel aggregation mechanism that cross-verifies event integrity and combines insights from both channels to improve monitoring performance. Unlike existing monitoring systems that rely solely on event messages from smart IoT devices, IoTBystander can also monitor the activities of dumb devices, thus expanding the scope of monitoring. This broader coverage provides a richer context for subsequent analyses, including configuration debugging, anomaly detection, and failure diagnosis.

To evaluate IoTBystander, we established a real-world testbed comprising nine smart devices and eleven dumb devices, conducting a series of comprehensive experiments. IoTBystander achieved an overall accuracy of 92.86% for recognizing events of smart devices, 95.09% for events of dumb devices, and 94.27% for all devices, demonstrating its effectiveness. A statistical reliability analysis confirms that IoTBystander produces consistent and reliable results across different scenarios. Additionally, computational efficiency tests show that IoTBystander performs well on both PCs and resource-constrained platforms like the Raspberry Pi. Furthermore, a case study involving five different scenarios illustrates the advantages of IoTBystander, highlighting its enhanced performance in real-world settings.

The contributions of this work are summarized as follows:

We identify three key limitations in the existing smart home security monitoring approaches. To address these, we propose IoTBystander, a more realistic and robust monitoring framework that utilizes two ubiquitous physical channels for smart home monitoring. Specifically, IoTBystander (1) ensures accurate activity detection even in the presence of failures, misoperations, or attacks; (2) monitors both smart and non-smart devices; and (3) operates without requiring modifications to IoT devices or platforms.
We present effective methods for dual-channel monitoring, taking into account the scarcity and heterogeneity of data in specific deployments. We extend the existing audio-based and traffic-based event recognition techniques, proposing an efficient pipeline to jointly monitor smart home activities. Additionally, we introduce a novel dual-channel aggregation mechanism that cross-verifies the integrity of IoT events and integrates insights from both audio and network traffic. This approach offers a more robust and universally applicable solution for IoT security monitoring, overcoming the limitations of traditional event-based systems.
A real-world testbed is built to evaluate IoTBystander. The results show that IoTBystander achieves an overall accuracy of 94.27% when monitoring a combination of smart and dumb devices. A statistical reliability analysis confirms consistent performance, while computational efficiency tests demonstrate effective operation on both PCs and resource-constrained platforms such as the Raspberry Pi. A case study with five scenarios further highlights the advantages of IoTBystander, showcasing its superior performance in real-world environments.

The remainder of this paper is organized as follows. The related works are discussed in Section 2. We introduce the background of smart homes in Section 3. Then, we present the design overview of IoTBystander in Section 4. Section 5 and Section 6 provide the technical details of three key modules of IoTBystander. After that, Section 7 evaluates the performance. The limitations and ethical considerations are discussed in Section 8. Finally, Section 9 concludes the paper.

2. Related Work

2.1. Security Monitoring in IoT

The increasing complexity and interconnectivity of Internet of Things (IoT) devices in smart homes have significantly broadened the scope of security challenges. Security monitoring, which involves the continuous observation and collection of system data, has become a critical focus in the field of IoT security. The goal of security monitoring is to produce reliable, comprehensive, and real-time data that can serve as input for subsequent decision-making tasks, such as anomaly detection, failure diagnosis, or intrusion detection.

Several studies have explored different monitoring approaches that focus on capturing device activities, environmental changes, or communication patterns within smart homes. These approaches lay the foundation for building robust security systems that can later interpret the data for decision-making tasks. For example, Zhu et al. made notable contributions to IoT security monitoring by analyzing traffic patterns, application security, device behaviors, and privacy aspects within smart home ecosystems [6,11,16,17]. Similarly, Zhang et al. identified vulnerabilities in IoT protocols, revealing security risks such as critical flaws in state transitions and weaknesses in protocols like MQTT [18,19]. These studies emphasize the importance of continuous and accurate monitoring to ensure that security systems can later perform decision-making tasks effectively.

Rule-based monitoring approaches focus on verifying the compliance of smart home operations against predefined security rules using deterministic models or formal verification techniques. These methods rely on systematic observation of events and activities to check whether they align with expected behaviors. Celik et al. introduced Soteria [20] and IoTGuard [21], which apply static and dynamic model checking techniques to verify smart home operations, ensuring that all activities adhere to predefined security properties. Zhang et al. modeled automation applications as deterministic finite automata (DFAs), comparing event sequences derived from IoT traffic to detect rule violations [11]. Similarly, Chi et al. proposed HomeGuard, a system that uses Satisfiability Modulo Theories (SMT) solving to identify conflicts in automation rules during monitoring [22]. These rule-based systems focus on monitoring the flow of events and ensuring compliance with security standards, making them suitable for providing data input for higher-level decision-making approaches.

Learning-based monitoring approaches aim to monitor system behavior by learning from historical data. These approaches involve the use of probabilistic models or machine learning techniques to capture patterns in device activity, network traffic, or sensor readings. For instance, Kapitanova et al. developed the SMART system, which monitors user activities by training classifiers on sensor data to detect deviations from expected behavior [23]. Choi et al. proposed DICE, a framework for analyzing state transitions and contextual information to monitor system behavior [24]. Other research, such as the work by Hela et al. [25], uses learning techniques to monitor causal relationships between system events. These learning-based monitoring approaches rely on training models to recognize and track patterns in real time, which can then serve as input for decision-making tasks such as identifying vulnerabilities or triggering responses to potential threats.

Table 1 provides a summary of the aforementioned approaches. Despite the effectiveness of these approaches, they face common challenges. A key limitation of both rule-based and learning-based monitoring systems is their reliance on accurate event data. These systems are vulnerable to event-targeted attacks, such as data tampering, message interception, or event reordering, which can compromise the integrity of the monitoring data [14,18,19]. Moreover, due to the heterogeneous nature of smart home environments, where devices from different manufacturers are integrated into various platforms, achieving comprehensive monitoring often requires modifications to the underlying system architecture or access to proprietary APIs. These requirements can be impractical in real-world deployments, where access to certain system configurations or event logs may be restricted.

2.2. Side-Channel-Based Monitoring

Side-channel-based monitoring does not require modifications or cooperation from IoT stakeholders, making it a promising approach to mitigate the time-to-market challenges associated with deploying security solutions. Through a literature survey, we identified four categories of side channels that hold significant potential for smart home monitoring: encrypted network traffic, audio, vision, and system load. In addition, we surveyed existing hybrid approaches that aggregate multi-source or multi-modal data for monitoring.

Traffic-Based Monitoring. Traffic anomaly detection is a significant research direction as it encompasses monitoring, collection, evaluation, and interpretation of data transmitted through network communications. Its development is broadly categorized into two stages: rule-based methods [26,27] and machine- or deep-learning-based techniques [28]. For example, Snort initially extracted association rules for different types of anomalies and then conducted pattern matching between incoming traffic and this ruleset to identify anomalous behaviors [26]. Although rule-based traffic analysis methods provide simplicity and interpretability, they face limitations when encountering unknown or dynamically evolving traffic patterns. Consequently, network traffic analysis methods that utilize machine learning techniques have gained considerable attention and are widely applied in tasks such as traffic classification and anomaly detection, employing approaches such as decision trees, support vector machines (SVMs), and random forests [29]. Moreover, traffic analysis methods that leverage deep learning approaches, such as CNNs and RNNs, have demonstrated substantial improvements in the accuracy of traffic classification and anomaly detection in the context of large-scale and high-dimensional network traffic [30]. For example, Vinayakumar et al. proposed a CNN-based framework for anomaly detection, which also demonstrated that both a CNN and its variant architectures exhibit superior detection performance compared to classical machine learning classifiers on the KDD Cup dataset [31]. Ahsan et al. proposed a network traffic anomaly detection approach that integrates the spatial feature extraction capabilities of CNNs with the temporal sequence modeling strengths of LSTM, thus achieving improved detection performance compared to the previously discussed methods [32]. Chen et al. exploited side-channel information leaks to infer sensitive and detailed user information by analyzing the network traffic of web applications secured by HTTPS and WPA/WPA2 Wi-Fi [33].

Audio-Based Monitoring. In light of the demonstrated effectiveness of CNN-based models in image classification, Hershey et al. conducted an exploration of their applicability in audio recognition. This investigation involved the evaluation of five architectures: deep neural networks (DNNs), AlexNet [34], VGG [35], Inception [36], and ResNet [37] on benchmark audio datasets, demonstrating promising results [38]. Subsequently, this research has promoted the widespread adoption of CNN architectures for event recognition utilizing audio channel information [39,40,41]. For instance, Cakir et al. proposed the RCNN model, which integrates a CNN to extract high-level features from local spectral information with an RNN to capture long-term temporal dynamics in audio [41]. Laput et al. introduced Ubicoustics, a plug-and-play model developed for environmental activity recognition, which is built based on the pre-trained Vggish model [42].

Vision-Based Monitoring. Migue et al. employed a combination of techniques, including background subtraction algorithms (to delineate subject outlines), Kalman filters (to mitigate noise and address data imprecision), and optical flow (to track stationary subjects within the scene), to process video data. The processed data are then input into a K-Nearest Neighbor (KNN) classification algorithm to identify the fall states of elderly people [43]. Kim et al. proposed to utilize PrimeSense 3D sensors to extract features of skeletal joints for training a hidden Markov model that aims to recognize the daily activities of elderly individuals in indoor environments [44]. Fu et al. proposed IoTSentry, which employs a Siamese deep neural network to extract high-level semantic features from streaming video data [45]. This approach facilitates the detection of variations in the appearance of IoT devices upon the occurrence of events, establishing a benchmark for verifying IoT events through a video channel.

Load-Based Monitoring. Gupta et al. proposed ElectriSense, which identifies and classifies electrical events associated with the operation of electronic devices utilizing switch-mode power supplies (SMPSs). This model leverages the electromagnetic interference (EMI) generated during SMPS operation, which exhibits a highly repeatable frequency domain signature, to effectively differentiate between various device types [46]. Mari et al. introduced a non-intrusive monitoring method to assess the power status of loads through sweep frequency response analysis (SFRA) utilizing support vector machines (SVMs) [47]. Ramadan et al. improved the accuracy of device identification in non-intrusive load monitoring by combining artificial neural networks with particle swarm optimization techniques [48].

Multi-Channel Approaches. Recent studies have proposed various techniques for fusing different data sources or modalities to enhance performance in anomaly detection and other security tasks in smart homes. For example, Li et al. introduced a human activity recognition framework that integrates multiple environmental sensor data, showcasing its applicability to context-aware monitoring in smart environments [49]. By combining data from multiple types of sensors (such as motion detectors, temperature sensors, and cameras), their approach improves the accuracy of recognizing human activities, which can be critical for applications such as intrusion detection, fall detection, and energy management. Similarly, Guarino et al. proposed a two-level fusion framework for cyber–physical anomaly detection where they fused sensor data with network traffic information to enhance the detection of anomalies in cyber–physical systems [50]. This multi-level fusion approach is designed to address the complexities of real-time detection by considering both the input of the physical sensor and the data flowing through the system network, improving the overall reliability and robustness of the anomaly detection process. Leal-Junior et al. explored the use of heterogeneous optical sensors in smart home environments, applying artificial intelligence techniques to combine data from optical sensors and machine learning models for remote monitoring and anomaly detection [51]. Their work emphasizes the potential of using diverse sensor modalities to capture a wide range of events, enhancing the ability to detect abnormal behaviors and improving the overall security posture of smart homes. In contrast to these approaches, which mainly employ data fusion, our work focuses on leveraging a decision fusion mechanism to create a highly scalable and flexible framework. 
While data fusion combines raw sensor data or signals to generate a more comprehensive input for further analysis, decision fusion aggregates the outputs of multiple independent monitoring channels to make more accurate and reliable security decisions. This approach allows our framework to handle a wide range of smart and non-smart devices, such as IoT appliances, traditional sensors, and non-digital devices, by cross-verifying the outputs from each channel in a coherent and structured way. By merging decisions from heterogeneous channels (e.g., audio and network traffic), we enhance the reliability and robustness of the monitoring system. In addition, this design facilitates the easy integration of new monitoring channels and techniques, ensuring that the system can adapt and scale as users incorporate more devices into their smart homes over time. For example, users can gradually add cameras, power meters, or environmental sensors to extend the monitoring coverage, thus enhancing the system’s overall performance without requiring significant changes to the underlying framework. This flexibility in upgrading the monitoring system over time makes our approach highly adaptable to the evolving needs of smart home environments.

Table 2 summarizes the reviewed studies. Each of these side channels offers unique advantages and challenges. Encrypted network traffic provides a balance between security and privacy, while audio and vision offer rich contextual information at the cost of potential privacy concerns. System load monitoring is less intrusive but cannot work for smart devices that do not cause significant power changes. Therefore, our core idea is to create a more comprehensive and robust monitoring system that takes advantage of the strengths while mitigating the limitations of multiple channels.

3. Background: Emerging Multi-Heterogeneous-Platform Smart Homes

The architecture of the heterogeneous multi-platform smart home system is illustrated in Figure 1. Within this system, various entities, including smart home IoT devices, hubs, platforms, and user terminals, interact with each other both in the physical environment and on the network to collectively fulfill the functions of the smart home.

IoT Devices and Hubs. Smart devices in smart homes are classified into sensors and actuators. Sensors utilize specific sensing technologies to measure environmental attributes (e.g., temperature) and report the corresponding values. Actuators, such as smart door locks, receive and execute control commands, subsequently reporting their updated status. The new sensor measurements and actuator status are encapsulated in event notification messages (referred to as events) and transmitted to the smart home platform. An IoT hub device serves as a bridge to facilitate Internet access for smart devices using non-IP protocols. In scenarios where multiple smart home platforms are deployed, multiple hubs are installed to interconnect devices from different manufacturers.

Platforms. Smart home platforms are generally categorized into cloud-based platforms and local platforms depending on their location of deployment. Cloud-based platforms are typically provided and maintained by smart device vendors or smart home system providers. Platforms offered by smart device vendors (i.e., vendor clouds) are often designed to be compatible only with devices produced by the same manufacturer. In contrast, the platforms provided by smart home system providers (i.e., integration platforms) are designed to promote greater interoperability, allowing the integration of devices from multiple brands and types. To achieve this, these platforms not only connect directly with devices or hubs but can also interface with platforms provided by device vendors. With user authorization, they can access devices connected to platforms from various manufacturers. Local platforms offer similar functionalities to cloud-based platforms, but they differ in that they are typically hosted on IoT devices, desktops or laptops, or local servers within the home network rather than in the cloud.

Mobile Device. A platform typically provides a mobile companion app that enables users to remotely monitor and control devices. When a user issues a remote control command, the app sends the command to the platform, which subsequently forwards it to the corresponding physical device. In addition to the companion app, smartphone operating system providers offer voice assistant software (e.g., Google Assistant or Siri) that allows users to control devices through voice commands.

Environment. The environment of a smart home system refers to the collection of measurable environmental variables within the physical space that houses the smart devices. These environmental variables include factors such as temperature, humidity, light intensity, motion, sound, etc.

Interactions in Smart Homes. The interactions among entities in smart homes are illustrated in Figure 2. Sensors monitor and measure variables in the physical environment, while actuators influence the physical environment through their functions. When an actuator executes a command (for example, a smart light bulb executes the turn-on command), it can directly change the reading of its corresponding sensor (such as the status of the bulb changing from off to on) and indirectly affect related sensors by altering the environment. In the running example, the light bulb increases the ambient brightness and thus affects the readings of the brightness sensor.

Modern smart homes support multiple control methods, including remote, voice, physical, and automated control. Remote control is typically performed through a mobile companion app that enables the configuration and management of smart devices. Voice control allows users to issue commands via smart speakers (e.g., Amazon Echo Dot) or smartphone voice assistants (e.g., Google Assistant). These commands are interpreted by the platform and sent to the relevant devices for execution. Physical control involves direct interaction with devices, such as pressing buttons or using touchscreens. Automation in smart homes offers a personalized and intelligent approach to responsive control. Users configure automation rules by installing applications on the platform, with each application capable of defining one or more rules. These rules follow a “trigger–condition–action” programming paradigm.
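To make the "trigger–condition–action" paradigm concrete, the following is a minimal sketch of how such a rule could be represented and evaluated. The `Rule` class, the event names, and the `illuminance_lux` attribute are hypothetical illustrations and are not taken from any specific platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical home state: attribute name -> current value.
State = Dict[str, object]


@dataclass
class Rule:
    trigger: str                         # event name that activates the rule
    condition: Callable[[State], bool]   # predicate checked on the current state
    action: str                          # command issued when the condition holds


def evaluate(rule: Rule, event: str, state: State) -> Optional[str]:
    """Return the action to run, or None if the rule does not fire."""
    if event == rule.trigger and rule.condition(state):
        return rule.action
    return None


# Example rule: when motion is detected and the room is dark, turn on the light.
rule = Rule(
    trigger="motion.active",
    condition=lambda s: s["illuminance_lux"] < 50,
    action="light.turn_on",
)
```

With this rule, a motion event in a dark room yields the `light.turn_on` command, whereas the same event in a bright room, or an unrelated event, yields no action.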

4. Design Overview of IoTBystander

In this section, we provide an overview of the proposed framework, IoTBystander. The notations used throughout this paper are summarized in Table 3. As shown in Figure 3, IoTBystander follows a modular design, comprising three key modules: audio-based event recognition (AER), traffic-based event recognition (TER), and dual-channel aggregation (DCA).

The module AER identifies physical events related to both smart and dumb devices through the audio channel, which contains sounds made by these events. Audio signals are captured by microphones and transformed into spectrograms (i.e., image-style representations) via an Audio2Image component. The spectrograms are then taken as input by a Recognizer model to identify physical events. The recognition result is encoded as an event sequence $ES_{AER}$.

Similarly, the TER module recognizes the activities of smart devices (which generate event notifications) from the traffic channel, where the network traffic generated during these activities resides. Due to the prevalence of Wi-Fi in home area networks, this paper focuses on Wi-Fi traffic. We hook the access point in the home to obtain all the encrypted packets going between IoT devices/hubs and platforms. The captured packets are converted into Feature tuples for further processing. As network packets carrying the same type of IoT event notification have highly fixed patterns, denoted as signatures, events are recognized from ongoing packets through a signature matching process and are denoted as an event sequence $ES_{TER}$.
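The signature matching idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that a feature tuple is a (direction, packet length) pair and that a signature is a fixed sequence of such tuples; the signature values and event labels below are hypothetical, and the actual features used by TER may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical feature tuple: (direction, packet length in bytes).
Feature = Tuple[str, int]

# Hypothetical signatures: event label -> fixed sequence of feature tuples.
SIGNATURES: Dict[str, List[Feature]] = {
    "plug:on": [("up", 235), ("down", 107)],
    "plug:off": [("up", 236), ("down", 107)],
}


def match_events(packets: List[Tuple[float, str, int]]) -> List[Tuple[float, str]]:
    """Scan (timestamp, direction, length) packets and emit (timestamp, event)."""
    events = []
    i = 0
    while i < len(packets):
        for label, sig in SIGNATURES.items():
            # Compare the next len(sig) packets against the signature.
            window = [(d, n) for _, d, n in packets[i:i + len(sig)]]
            if window == sig:
                events.append((packets[i][0], label))
                i += len(sig) - 1  # skip the packets consumed by this match
                break
        i += 1
    return events


# Example trace: one unrelated packet, then an "on" and an "off" notification.
trace = [
    (0.0, "up", 60),
    (1.0, "up", 235), (1.1, "down", 107),
    (5.0, "up", 236), (5.2, "down", 107),
]
es_ter = match_events(trace)
```

Exact matching on packet lengths is only viable because encrypted IoT notifications tend to have stable sizes; a practical matcher would likely tolerate small length variations and interleaved background packets.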

The DCA module takes $ES_{AER}$ and $ES_{TER}$ as input and outputs a final event sequence $ES_{aggr}$ after a pipeline of three components: event alignment, event verification, and event fusion. Due to the distinct underlying mechanisms of AER and TER, $ES_{AER}$ and $ES_{TER}$ denote the recognized physical events in different modalities (e.g., in terms of event types, time boundaries, etc.) and cannot be aggregated directly. Event alignment handles this issue and converts $ES_{AER}$ and $ES_{TER}$ into a unified format. After that, event verification handles the erroneous recognition results in $ES_{AER}$ and $ES_{TER}$ through cross-verification. Finally, event fusion fuses the monitoring results from both channels and outputs $ES_{aggr}$, which can be used by decision models (e.g., anomaly detection) to further detect security/safety issues, e.g., anomalies, attacks, and failures.
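As a rough illustration of the align, verify, and fuse flow, the sketch below pairs events from the two channels when they refer to the same device and state within a time tolerance, and tags each fused event with the channel(s) that observed it. The `Event` fields, the 2 s tolerance, and the source tags are assumptions made for this example; the actual DCA design is described in Section 6.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    t: float      # event time in seconds (assumed unified format)
    device: str
    state: str


def aggregate(es_aer: List[Event], es_ter: List[Event],
              tol: float = 2.0) -> List[Tuple[Event, str]]:
    """Fuse two event sequences; tag each event as 'both', 'audio', or 'traffic'."""
    fused = []
    used = set()  # indices of traffic events already paired with an audio event
    for a in es_aer:
        j = next((k for k, n in enumerate(es_ter)
                  if k not in used
                  and (n.device, n.state) == (a.device, a.state)
                  and abs(n.t - a.t) <= tol), None)
        if j is None:
            fused.append((a, "audio"))    # heard but not seen on the network
        else:
            used.add(j)
            fused.append((a, "both"))     # cross-verified by both channels
    # Traffic events with no audible counterpart.
    fused += [(n, "traffic") for k, n in enumerate(es_ter) if k not in used]
    fused.sort(key=lambda pair: pair[0].t)
    return fused


es_aer = [Event(10.0, "kettle", "on"), Event(20.0, "plug", "on")]
es_ter = [Event(20.8, "plug", "on"), Event(30.0, "lock", "unlocked")]
es_aggr = aggregate(es_aer, es_ter)
```

In this example, the kettle (a dumb device) appears only in the audio channel, the plug event is confirmed by both channels, and the lock event appears only in traffic. A downstream decision model could treat single-channel events of devices expected to appear in both channels as candidates for further inspection.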

In the following sections, we present the design details of the three key modules of IoTBystander: AER, TER, and DCA in Section 5.1, Section 5.2, and Section 6, respectively.

5. Event Recognition

In this section, we present the technical details of IoTBystander in recognizing events from audio and network traffic, respectively.

5.1. Audio-Based Event Recognition (AER)

We developed an AER model consisting of two modules, Audio2Image and Event Recognizer, to identify physical events in smart home environments. The detailed framework is illustrated in Figure 4.

The Audio2Image module converts audio signals into spectrogram representations. We begin by segmenting each WAV-format audio signal into 960 ms frames to capture local temporal features while minimizing the effects of dynamic variations. Each frame is sampled at 48 kHz with a bit depth of 32 bits, followed by the application of a 25 ms Hann window with a 10 ms stride. The Short-Time Fourier Transform (STFT) is then applied, resulting in a linear spectrogram of size $96 \times 257$. While this transformation captures frequency components across both high and low frequencies, the uniform resolution does not effectively represent low-frequency details, which are critical for audio recognition tasks. To address this, we use a 64-band Mel filter bank to divide the linear spectrogram into 64 frequency bands, which improves resolution for low frequencies and filters out high-frequency noise. This reduces dimensionality, producing a Mel spectrogram of size $96 \times 64$. Finally, we apply logarithmic smoothing to further compress energy fluctuations, enhancing feature extraction and mitigating audio noise interference, ultimately producing a log-Mel spectrogram with improved robustness for recognition.
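The Audio2Image steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that the audio has already been resampled to 16 kHz and that a 512-point FFT is used, since that is the combination under which a 25 ms window yields 257 frequency bins. The Mel range of 125 to 7500 Hz, the log offset of 0.01, the zero-padding of the segment tail, and all function names are likewise assumptions made for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16_000      # assumption: audio resampled to 16 kHz before the STFT
WIN = 400                 # 25 ms Hann window at 16 kHz
HOP = 160                 # 10 ms stride at 16 kHz
N_FFT = 512               # assumption: 512-point FFT -> 512 // 2 + 1 = 257 bins
N_BINS = N_FFT // 2 + 1
N_FRAMES = 96             # one 960 ms segment -> 96 strides of 10 ms
N_MELS = 64
HANN = [0.5 - 0.5 * math.cos(2 * math.pi * n / (WIN - 1)) for n in range(WIN)]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(f_min=125.0, f_max=7500.0):
    """Return N_MELS triangular filters, each holding N_BINS weights."""
    bin_hz = [k * SAMPLE_RATE / N_FFT for k in range(N_BINS)]
    lo_mel, hi_mel = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi_mel - lo_mel) / (N_MELS + 1)
    edges = [mel_to_hz(lo_mel + i * step) for i in range(N_MELS + 2)]
    bank = []
    for m in range(N_MELS):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def magnitude_spectrum(frame):
    """Hann-window one 25 ms frame and return its 257 DFT magnitudes."""
    x = [s * w for s, w in zip(frame, HANN)]
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N_FFT)
                    for n in range(WIN)))
            for k in range(N_BINS)]


def log_mel(segment, n_frames=N_FRAMES):
    """Return n_frames rows of N_MELS log-Mel energies for one audio segment."""
    needed = WIN + (n_frames - 1) * HOP          # samples covering all frames
    padded = list(segment[:needed]) + [0.0] * max(0, needed - len(segment))
    bank = mel_filter_bank()
    rows = []
    for i in range(n_frames):
        spectrum = magnitude_spectrum(padded[i * HOP:i * HOP + WIN])
        rows.append([math.log(sum(w * s for w, s in zip(filt, spectrum)) + 0.01)
                     for filt in bank])
    return rows
```

The direct DFT is used only to keep the sketch dependency-free; a practical pipeline would rely on an FFT routine from a numerical library. Calling `log_mel` on a 960 ms segment returns 96 rows of 64 values, matching the $96 \times 64$ log-Mel spectrogram described above.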

The Event Recognizer module is designed to identify events from audio data, represented by log-Mel spectrograms. It consists of two general components, a feature extractor and a classifier, which can be implemented using various neural network models. In this work, we utilize the pre-trained VGGish model for feature extraction and a support vector machine with a Radial Basis Function kernel (SVM-RBF) for classification. This structure was chosen because of its ability to achieve high recognition performance with a relatively small amount of labeled audio data, which is particularly beneficial in situations where time and expertise for extensive labeling are limited. The VGGish architecture includes convolution and pooling operations with kernels of size 3 × 3 and 2 × 2, respectively, and a stride of 1, as detailed in Table 4. After extracting features from the spectrogram, we apply principal component analysis (PCA) and whitening using pre-trained parameters (pca_matrix and pca_means), yielding a more efficient and sparse feature representation. This results in a 128-dimensional vector for each short-term frame. While both SVM-RBF and Multi-Layer Perceptron (MLP) models have demonstrated favorable results in previous studies [12], we selected SVM-RBF for the classification task in this work. Notably, the framework is highly flexible, and alternative models can be employed for both the feature extraction and classification components in different IoT applications or settings.

5.2. Traffic-Based Event Recognition (TER)

The network traffic patterns of event notification messages are generally fixed, which makes the offline signature extraction and online signature matching mechanism widely used for identifying IoT device events from payload-encrypted traffic [52,53,54,55,56,57]. However, approaches that rely on protocol-specific features to construct traffic signatures often lack generalizability across diverse IoT environments as IoT devices employ a wide range of communication protocols. In this context, packet-level signatures, represented as Directional Frame Length (DFL) sequences—i.e., sequences consisting of the packet direction and length—have proven to be effective [58]. In this work, we adapt this signature construction method for fingerprinting IoT device events.

Signature Extraction. Event signatures are learned in an offline manner. Each event is manually triggered by operating the device, while a traffic sniffer component captures the traffic traces associated with the device’s communication of the triggered event to the platform. By integrating the traffic sniffer atop the home router, we can capture wireless traffic from IoT devices after decrypting the 802.11 traffic, thus gaining access to packet-level details such as MAC addresses, IP addresses, timestamps, and packet lengths. In our implementation, we use TShark [59] to capture traffic packets and store the 5-tuple information (as defined in Equation (1)) in .tsv-formatted files for subsequent analysis. To extract event-specific traffic signatures, only packets within the communication session between the device and the platform are considered. To this end, MAC and IP addresses are employed to filter out noisy packets.

$$\mathit{feature\_vec} = [\mathit{Packet\_length},\ \mathit{MAC\text{-}src},\ \mathit{MAC\text{-}dst},\ \mathit{IP\text{-}src},\ \mathit{IP\text{-}dst}]$$

(1)

In our setup, IoT devices are connected to a Raspberry Pi, which functions as the home router, as illustrated in Figure 5. To ensure the accuracy and generalizability of the extracted signatures, each event is manually triggered 50 times. When an event is triggered, up to 30 related packets are recorded, capturing their 5-tuple information. To exclude irrelevant packets, we use the MAC and IP addresses, and we label the direction of each packet as C (from the device to the platform) or S (from the platform to the device). The DFL sequences, which are essential for identifying an event, are then extracted as the event’s signature. The extracted signatures are provided in Table 5.
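The filtering and direction-labeling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Packet` field names, the `to_dfl` helper, and the MAC and IP addresses in the example are all hypothetical, and the way a packet is attributed to the device-platform session (device MAC plus platform IP) is one plausible reading of the text.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """The 5-tuple of Equation (1); field names are illustrative."""
    length: int
    mac_src: str
    mac_dst: str
    ip_src: str
    ip_dst: str


def to_dfl(packets, device_mac, platform_ips):
    """Keep only device<->platform packets and label each as C-<len> or S-<len>."""
    sequence = []
    for p in packets:
        if p.mac_src == device_mac and p.ip_dst in platform_ips:
            sequence.append(f"C-{p.length}")      # device -> platform
        elif p.mac_dst == device_mac and p.ip_src in platform_ips:
            sequence.append(f"S-{p.length}")      # platform -> device
        # any other packet is treated as noise and dropped
    return sequence


# Hypothetical addresses (documentation ranges), not taken from the testbed.
PLUG_MAC, ROUTER_MAC = "02:00:00:00:00:01", "02:00:00:00:00:ff"
PLUG_IP, CLOUD_IP, OTHER_IP = "192.0.2.10", "198.51.100.7", "203.0.113.9"

trace = [
    Packet(223, ROUTER_MAC, PLUG_MAC, CLOUD_IP, PLUG_IP),
    Packet(98, "02:00:00:00:00:02", ROUTER_MAC, "192.0.2.11", OTHER_IP),  # noise
    Packet(207, PLUG_MAC, ROUTER_MAC, PLUG_IP, CLOUD_IP),
    Packet(54, ROUTER_MAC, PLUG_MAC, CLOUD_IP, PLUG_IP),
]
dfl = to_dfl(trace, PLUG_MAC, {CLOUD_IP})
```

With this hypothetical trace, `dfl` is the sequence `["S-223", "C-207", "S-54"]`, written in the same direction-length notation used for the signatures in Table 5.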

We observe that the signature (denoted as $S_1$) extracted using the approach from [58], which only incorporates packets sent immediately when an event is triggered, has limitations in certain scenarios. For instance, the switch-on and switch-off events of a Xiaomi smart plug generate the identical pattern “S-223 C-207 S-54” in $S_1$, making it impossible to distinguish between them. To overcome this limitation, we enhance the signature construction by including additional packets that typically appear a few seconds after the immediate packets. These additional packets form a lagging DFL sequence, denoted as $S_2$. Consequently, when $S_2$ is required, the original signature $S_1$ is expanded to $S_1 \dots S_2$. For example, additional sequences such as “C-223 S-143 C-54” and “C-111 S-111 C-54” are included in the extracted signature to differentiate between the switch-on and switch-off events of the Xiaomi smart plug (see Table 5).

Signature Matching. The signatures of IoT events are learned offline and stored in a global database. During the initialization phase, IoTBystander loads these signatures from the global store with a one-time effort. Subsequently, it can recognize events in continuously captured traffic by matching the traffic traces against the stored signatures, similar to existing approaches [11,53,54,55,58]. It is important to note that, while the signature design utilized in this work is effective for smart home events, IoTBystander is designed with flexibility in mind, allowing for the substitution of more suitable signatures for different IoT applications.
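A minimal sketch of matching a DFL stream against stored signatures of the form $S_1 \dots S_2$ is given below. It is not the authors' matcher: the function names are hypothetical, the time bound on the lagging sequence ("a few seconds") is not enforced, and the event labels `plug-event-a` and `plug-event-b` are placeholders because the assignment of each lagging sequence to switch-on or switch-off is given in Table 5 rather than in the text. The packets `C-90` and `C-66` in the example stream are invented noise.

```python
def find_subsequence(stream, pattern, start=0):
    """Return the index just past the first contiguous match of pattern, or -1."""
    n = len(pattern)
    for i in range(start, len(stream) - n + 1):
        if stream[i:i + n] == pattern:
            return i + n
    return -1


def match_events(stream, signatures):
    """Return names of events whose S1 (and S2, when defined) occur in the stream."""
    matched = []
    for name, (s1, s2) in signatures.items():
        end = find_subsequence(stream, s1)
        if end == -1:
            continue                       # immediate sequence S1 not present
        if s2 is None or find_subsequence(stream, s2, end) != -1:
            matched.append(name)           # S2 must appear after S1 ends
    return matched


# Placeholder labels: which lagging sequence means on or off is listed in Table 5.
SIGNATURES = {
    "plug-event-a": (["S-223", "C-207", "S-54"], ["C-223", "S-143", "C-54"]),
    "plug-event-b": (["S-223", "C-207", "S-54"], ["C-111", "S-111", "C-54"]),
}

stream = ["C-90", "S-223", "C-207", "S-54", "C-66", "C-111", "S-111", "C-54"]
events = match_events(stream, SIGNATURES)
```

Here both signatures share the same $S_1$, and only the lagging sequence decides the outcome, so `events` contains `plug-event-b` alone.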

6. Dual-Channel Aggregation

Due to the heterogeneity of the channels, recognized events in smart homes may vary in categories, coverage, and precision, often resulting in overlaps, as illustrated in Figure 6. To enable comprehensive and accurate monitoring, this section introduces a dual-channel aggregation mechanism that merges event sequences from audio and traffic channels, addressing the heterogeneity and improving the overall accuracy of event recognition. We provide a detailed explanation of the decision fusion mechanism employed by IoTBystander, which is essential for enhancing its scalability and flexibility. By aggregating outputs from multiple independent monitoring channels, IoTBystander leverages complementary insights from diverse data sources, improving the system’s reliability and robustness and ensuring more consistent and precise event detection.

6.1. Event Alignment

To aggregate the outputs of the AER and TER modules, the event sequences from both channels, $ES_{AER}$ and $ES_{TER}$, are re-aligned based on their respective event timestamps. However, two challenges must be addressed before a straightforward aggregation can be performed.

First, AER labels each audio spectrogram with an event class, while TER only marks single event notification messages in the surrounding traffic. As a result, the outputs of AER and TER represent events in different modalities. For example, when a smart kettle turns on and heats water, AER outputs successive kettle-heating events, each denoting that “the kettle is heating water” within a time window of 0.96 s (i.e., the width of an audio spectrogram). In contrast, TER only identifies events when the smart kettle sends event notification messages to inform the system of its state changes, i.e., the kettle-on and kettle-off events. To make the outputs of AER and TER mutually interpretable, we build an event namespace of comprehensive and mutually independent physical events and use the onset and offset to mark each event, i.e., $\langle t_{onset}, t_{offset}, \mathit{event\text{-}name} \rangle$. Thus, events detected from both channels can be mapped to a unified event notation.
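The conversion to the unified notation can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, and mapping the kettle-on/kettle-off notification pair to a single kettle-heating interval is an assumption about how the event namespace is applied.

```python
WINDOW = 0.96  # seconds covered by one audio spectrogram


def aer_to_intervals(labels, start_time=0.0):
    """Merge runs of identical consecutive window labels into (onset, offset, name)."""
    intervals = []
    for i, name in enumerate(labels):
        onset = start_time + i * WINDOW
        if name is None:                   # no event recognized in this window
            continue
        last = intervals[-1] if intervals else None
        if last and last[2] == name and abs(last[1] - onset) < 1e-9:
            intervals[-1] = (last[0], onset + WINDOW, name)   # extend the run
        else:
            intervals.append((onset, onset + WINDOW, name))
    return intervals


def ter_to_intervals(notifications, start_event, stop_event, name):
    """Pair each start notification with the next stop notification."""
    intervals, onset = [], None
    for timestamp, event in notifications:
        if event == start_event and onset is None:
            onset = timestamp
        elif event == stop_event and onset is not None:
            intervals.append((onset, timestamp, name))
            onset = None
    return intervals


aer = aer_to_intervals([None, "kettle-heating", "kettle-heating",
                        "kettle-heating", None, "door-ring"])
ter = ter_to_intervals([(10.0, "kettle-on"), (95.0, "kettle-off")],
                       "kettle-on", "kettle-off", "kettle-heating")
```

Three consecutive kettle-heating windows collapse into one interval of about 2.88 s, and the two TER notifications become a single interval with the same event name, so both channels now emit tuples of the same shape.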

Second, the underlying physical properties of audio and network traffic lead to time discrepancies in the event boundaries detected by AER and TER for the same physical event. Consider the smart kettle heating scenario. A smart kettle usually starts to make sound dozens of seconds after being physically turned on but reports a kettle-on event notification message immediately. Similarly, the sound of the kettle also lags dozens of seconds behind its kettle-off event notification. Thus, thresholding the time discrepancy is required to determine whether the outputs of AER and TER refer to the same physical event. To this end, a classic confidence interval method [60] is adopted to determine the maximum allowable time discrepancy (abbr. MATD). Specifically, we collect a number of smart devices and learn the maximum allowable time discrepancies offline by physically operating these devices, capturing the corresponding digital events reported by them, and computing the time differences. For each event, we measure this time difference $\Delta t_i$, $i = 1, 2, \dots, 50$, 50 times and obtain the MATD with Equation (2):

$$MATD = \max_{i = 1, 2, \dots, 50} \Delta t_i + 2.576 \times \sigma$$

(2)

where $\sigma$ is the standard deviation of $\Delta t_i$. When an event detected by AER lags less than the MATD behind the one detected by TER, the two are considered the same event; otherwise, they are considered two different instances of the same event type.
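Equation (2) and the resulting same-event test can be sketched as follows. The function names and the three lag values are hypothetical (the paper uses 50 measurements per event). Using the sample standard deviation and accepting only non-negative lags are assumptions, since the text does not specify either.

```python
import statistics

Z_MULTIPLIER = 2.576  # multiplier used in Equation (2)


def matd(deltas):
    """Equation (2): largest observed lag plus 2.576 standard deviations."""
    # Assumption: sample standard deviation; the text does not say which one.
    return max(deltas) + Z_MULTIPLIER * statistics.stdev(deltas)


def same_physical_event(t_ter, t_aer, threshold):
    """True if the AER detection lags the TER detection by less than the threshold."""
    lag = t_aer - t_ter
    return 0 <= lag < threshold            # assumption: AER never leads TER


# Hypothetical lags in seconds between a notification and the audible onset.
kettle_lags = [20.0, 22.0, 24.0]
kettle_matd = matd(kettle_lags)            # 24.0 + 2.576 * 2.0
```

With these illustrative lags the threshold is 29.152 s, so an audio detection 25 s after the kettle-on notification is attributed to the same physical event, while one 40 s later is treated as a separate instance.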

By handling the above problems, the two event sequences output by AER and TER are semantically and temporally aligned, which could be further used for event verification and fusion.

6.2. Event Verification and Fusion

Note that TER cannot monitor dumb devices as they do not send event notification messages in the digital space. The absence of a certain event e in one channel therefore does not rule out its presence in the other; on the contrary, merging the events recognized from both channels can yield a more comprehensive monitoring result if the recognition accuracy of each single channel is high enough. The two aligned output event sequences $ES_{AER}$ and $ES_{TER}$ are aggregated into $ES_{res}$ through a joint verification and fusion process.

We build two prior-knowledge sets ($ks_{AER}$ and $ks_{TER}$) for the audio and traffic channels, respectively, through an offline learning process: for each channel, we collect all the recognizable events and the corresponding recognition precision for each event and obtain a prior-knowledge set $ks_{(\cdot)} = \{(e_{(\cdot), 1}, p_{(\cdot), 1}), (e_{(\cdot), 2}, p_{(\cdot), 2}), \dots, (e_{(\cdot), m}, p_{(\cdot), m})\}$, where $p_{(\cdot), i}$ denotes the precision for recognizing the event $e_{(\cdot), i}$ by $(\cdot)$ (AER or TER). Then, the two event sequences are traversed and merged following the algorithm below.

If an instance of event e appears in both $ES_{AER}$ and $ES_{TER}$ with overlapping time intervals, it is verified as correctly recognized. We then merge the two intervals of e in $ES_{AER}$ and $ES_{TER}$ by taking the wider boundary on each side (i.e., the earlier onset and the later offset) and mark the merged interval as event e.
If e appears within a time interval in one sequence (say $ES_{AER}$) but is not present in the other sequence (say $ES_{TER}$) within the same time interval, we first check whether e is in $ks_{TER}$. If so, the comparison $ks_{AER}(e) > ks_{TER}(e)$ is evaluated: if it holds, e is deemed “correctly recognized” and is added to the merged result; otherwise, e is treated as a false positive and is not merged. If e is not in $ks_{TER}$, e is added to $ES_{res}$.
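The two rules above can be sketched as follows. This is a minimal illustration that assumes both sequences are already aligned into (onset, offset, name) tuples. The function names, the one-to-one pairing of overlapping instances, and all precision values and events in the example are hypothetical.

```python
def overlaps(a, b):
    """Same event name and intersecting time spans; tuples are (onset, offset, name)."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def keep_unmatched(name, ks_seen, ks_other):
    """Rule 2: keep a single-channel event unless the other channel is more precise."""
    if name not in ks_other:
        return True                        # the other channel cannot recognize it
    return ks_seen.get(name, 0.0) > ks_other[name]


def fuse(es_aer, es_ter, ks_aer, ks_ter):
    """Merge two aligned event sequences into one sorted sequence."""
    merged, used_ter = [], set()
    for a in es_aer:
        partner = next((j for j, t in enumerate(es_ter)
                        if j not in used_ter and overlaps(a, t)), None)
        if partner is not None:            # rule 1: verified by both channels
            t = es_ter[partner]
            used_ter.add(partner)
            merged.append((min(a[0], t[0]), max(a[1], t[1]), a[2]))
        elif keep_unmatched(a[2], ks_aer, ks_ter):
            merged.append(a)
    for j, t in enumerate(es_ter):         # TER events with no AER counterpart
        if j not in used_ter and keep_unmatched(t[2], ks_ter, ks_aer):
            merged.append(t)
    return sorted(merged)


# Hypothetical prior-knowledge sets (event -> precision) and aligned sequences.
ks_aer = {"kettle-heating": 0.97, "mouse-click": 0.76, "gas-detected": 0.97}
ks_ter = {"kettle-heating": 1.00, "gas-detected": 1.00, "plug-on": 1.00}
es_aer = [(12.0, 90.0, "kettle-heating"), (100.0, 100.96, "mouse-click"),
          (200.0, 203.0, "gas-detected")]
es_ter = [(10.0, 85.0, "kettle-heating"), (150.0, 150.0, "plug-on")]
es_res = fuse(es_aer, es_ter, ks_aer, ks_ter)
```

In this example the kettle event is verified by both channels and widened to (10.0, 90.0), the mouse click and the plug event are kept because the other channel cannot recognize them, and the audio-only gas detection is dropped because the traffic channel, which is assumed here to be more precise for that event, did not report it.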

The above algorithm, on the one hand, combines the perception views of both channels and, on the other, excludes recognized events with low confidence. The resultant event sequence $ES_{res}$ is output for further analysis.

With the above dual-channel aggregation mechanism, IoTBystander is able to efficiently manage a wide variety of devices, including smart IoT appliances, traditional sensors, and non-smart devices. A key strength of our system is its ability to easily integrate additional monitoring channels as smart home ecosystems evolve. Users can seamlessly add new devices, such as cameras, power meters, or environmental sensors, without the need for significant changes to the underlying framework. This adaptability ensures that IoTBystander remains a highly scalable solution, capable of meeting the dynamic needs of the expanding smart home environments.

7. Experimental Results

In this section, we first present the setup and evaluation metrics of the experiment in Section 7.1. We test the effectiveness of each single channel: traffic-based (abbr. TER) and audio-based event recognition (abbr. AER), and discuss the results in Section 7.2.1 and Section 7.2.2, respectively. After that, we show the effectiveness of dual-channel monitoring in Section 7.2.3. Finally, we show the computational efficiency of IoTBystander in Section 7.3.

7.1. Testbed Setup and Evaluation Metrics

We establish a real-world testbed that includes nine types of smart devices, as shown in Table 6 and Figure 7, and eleven types of dumb devices: microwave, range hood, clock, mouse, keyboard, vacuum cleaner, faucet, toilet, washing machine, doorbell, and gas stove.

Sensor selection. To ensure the consistency and reliability of sensors, we chose standard off-the-shelf sensors that have been widely used in similar research contexts. Specifically, we used the Yue Changsheng USB Mini microphone and the built-in Raspberry Pi wireless network adapter to capture audio signals and wireless traffic, respectively. They were tested to be compatible with the Raspberry Pi, which serves as the main host device of IoTBystander. We manually tested all the sensors before using them and periodically observed the output during the run time to verify that the sensors were in good condition.

Data generation and collection. By manually operating these devices to trigger physical events that make sounds, we create a dataset HomeSound-13, which contains WAV-format audio recordings of 13 distinct event types (see Table 7); each WAV file is labeled with the name of the corresponding event type. This dataset is used to assess the effectiveness of the proposed AER. In addition, we manually trigger physical events by operating all the smart devices and simultaneously capture ambient traffic when these smart devices report the corresponding event notifications digitally. We record the timing for physically triggering the events and label the traffic packets whose timestamps are within the duration of these events. The collected traffic dataset HomeTraffic-14 is mainly used to learn the signatures of smart device events offline. To evaluate the performance of TER, we choose to physically trigger events and observe the output of TER in real time.

To assess the effectiveness of IoTBystander, we use the following metrics:

$\mathit{Precision}_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}$, which indicates the proportion of samples predicted as $\mathit{class}_{i}$ that are indeed $\mathit{class}_{i}$.
$\mathit{Recall}_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}$, which denotes the ratio of instances that are actually $\mathit{class}_{i}$ and are accurately predicted as $\mathit{class}_{i}$.
$\mathit{F1\text{-}score}_{i} = \frac{2 \times \mathit{Precision}_{i} \times \mathit{Recall}_{i}}{\mathit{Precision}_{i} + \mathit{Recall}_{i}}$, representing the harmonic mean of $\mathit{Precision}_{i}$ and $\mathit{Recall}_{i}$, and serving as a comprehensive metric for evaluating the performance of the multi-class model.
$\mathit{Accuracy} = \frac{\sum_{i} TP_{i} + \sum_{i} TN_{i}}{\text{Total Samples}}$, which reflects the ratio of correctly predicted instances across all classes to the total number of samples.
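The per-class metrics above can be computed from paired ground-truth and predicted labels as in the following sketch. The function names and the toy labels are hypothetical, and accuracy is computed here as the fraction of correctly predicted samples, following the verbal definition given above.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1                  # predicted class gains a false positive
            fn[truth] += 1                 # true class gains a false negative
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = (precision, recall, f1)
    return report


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only.
y_true = ["door-ring", "door-ring", "door-ring", "mouse-click", "mouse-click"]
y_pred = ["door-ring", "door-ring", "mouse-click", "mouse-click", "mouse-click"]
report = per_class_metrics(y_true, y_pred)
```

In this toy example one door-ring sample is mistaken for a mouse click, which lowers the recall of door-ring and the precision of mouse-click to 2/3 while both F1-scores and the overall accuracy come out at 0.8.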

7.2. Effectiveness of IoTBystander

The effectiveness of IoTBystander is evaluated based on the recognition performance (precision, recall, and F1-score) of traffic (i.e., TER), audio (i.e., AER), and dual channels (i.e., DCA). The results are collectively shown in Table 8.

7.2.1. Performance of Traffic-Based Monitoring

As shown in the TER columns in Table 8, the traffic-based event recognition (TER) component shows a satisfactory result. For most events, including “plug on/off (HP/XP)”, “camera on/off (XC)”, “gas-alarm-stop (XG)”, and “gas-detected (XG)”, TER achieved high precision, recall, and F1-scores of 1.00 across the board. This indicates that TER is very effective in recognizing these events due to the distinct traffic patterns that these devices produce while reporting these events. TER struggles with the “switch on/off (GP)” event, achieving only 0.50 in precision, recall, and F1-score. This lower performance suggests that TER had difficulty differentiating the on and off states of the Gosund smart plug due to indistinct traffic signatures. In practice, this challenge can be solved by a one-time human intervention: the plug always switches between on and off alternately so that a confirmation of one event can make the following inferable. Overall, the traffic-based monitoring achieves an accuracy of 92.86% on all smart device events.

Although effective, traffic-based monitoring has an obvious shortcoming: it cannot capture activities of non-smart devices or human actions as no network traffic is generated. This limitation can be addressed by incorporating additional channels, such as audio.

7.2.2. Performance of Audio-Based Monitoring

The audio-based event recognition (AER) component effectively recognizes most events produced by dumb devices that generate unique audio signals, as shown in the AER columns in Table 8. For example, it achieves very high scores for events like “clock-alarm”, “door-ring”, “kettle-heating”, and “range-hood-on”, with F1-scores around 0.95 to 1.00. This performance reflects the ability of the audio channel to accurately identify these events based on distinct acoustic signatures. However, AER’s performance slightly dips for events with more subtle or shorter sound patterns, such as “mouse-click”, whose precision, recall, and F1-score are 0.76, 0.97, and 0.85, respectively. Our analysis of the results indicated that the main cause of low precision and high recall could be that some short, abrupt sounds in the environment, such as collision or tapping sounds, were mistakenly identified as “mouse-click” due to their similarities in acoustic characteristics. In addition, AER has very high performance for smart device events like “gas-detected (XG)” and “gas-alarm-stop (XG)”, with all metrics 0.97. The overall accuracy of audio-based monitoring on audible events is 96%. The audio channel cannot detect inaudible events such as “plug on/off (for HP/XP)”, “switch on/off (GP)”, and “camera on/off (XC)”, which is the Achilles heel of audio-based monitoring.

7.2.3. Performance of Dual-Channel Monitoring

The dual-channel monitoring approach, which combines TER and AER through the dual-channel aggregation component, demonstrates consistently high performance across all types of events, including those with which TER or AER alone struggle (as shown in the DCA columns in Table 8). For events generated by dumb devices, the dual-channel approach maintains high scores similar to those of AER alone, benefiting from AER’s strength in recognizing sounds while compensating for the scenarios where TER’s traffic analysis alone was insufficient. Taking advantage of both channels, IoTBystander consistently delivers superior recognition performance for events from both smart and dumb devices, obtaining a broader scope of monitoring and providing context-richer input for security-oriented decision-making. Higher precision and recall demonstrate that IoTBystander reduces false positives or missed detections.

We conducted a comparative experiment to evaluate IoTBystander against existing methods. Since prior methods do not utilize both audio and traffic channels for event recognition, we compared IoTBystander with two representative approaches: IoTAudMon [12], which relies solely on audio, and PingPong [58], which relies on network traffic. The comparative results, shown in Figure 8, highlight several key findings. IoTBystander achieves comparable performance to PingPong in monitoring smart devices, with precision, recall, and F1-scores of 0.9375, 0.9375, and 0.9375, respectively. Similarly, it performs on par with IoTAudMon in monitoring dumb devices, with precision of 0.955, recall of 0.9583, and F1-score of 0.965. However, IoTBystander consistently outperforms both methods in scenarios involving both smart and dumb devices. Specifically, it achieves a significant improvement in precision (0.948 vs. 0.6215 vs. 0.375), recall (0.9499 vs. 0.6234 vs. 0.375), and F1-score (0.954 vs. 0.6275 vs. 0.375) when monitoring a mixed environment. These results clearly demonstrate the advantages of combining both audio and traffic channels. By integrating these two sources of data, IoTBystander not only enhances overall event detection accuracy but also shows greater robustness and adaptability in multi-device environments, where traditional single-channel methods struggle.

7.2.4. Statistical Reliability Analysis

To evaluate the statistical reliability of the results in Table 8, we calculate the standard deviation (SD) and 95% confidence intervals (CIs) for the performance metrics (precision, recall, and F1-score) of each method (TER, AER, and DCA) across different device types (smart devices, dumb devices, and all devices). Table 9 shows the results.

The standard deviation reflects the variability in monitoring performance across different events. For smart devices, TER and DCA exhibit a higher standard deviation (0.1816), indicating more variability, while methods applied to dumb devices, such as AER (0.0371), show much smaller standard deviations, highlighting greater consistency for non-smart devices. The 95% confidence interval for smart devices (TER, AER, and DCA) is [0.8239, 1.0333], indicating stable performance. In contrast, for dumb devices, AER and DCA have narrower intervals ([0.9115, 0.9919] and [0.9257, 0.9727], respectively), suggesting highly consistent results. These findings demonstrate that AER provides more stable performance for non-smart devices compared to TER, which applies primarily to smart devices.

The results for the overall device category, which combines smart and dumb devices, show a wider range of variability. For example, TER applied to all devices has a confidence interval for precision of [0.3022, 0.6978], indicating greater variability in results across a mixed set of devices. This is further reflected in the larger standard deviations (e.g., TER: 0.4899). However, when comparing across methods, DCA provides the most stable results, with smaller confidence intervals (e.g., [0.883, 1.016] for F1-score) and lower standard deviations, particularly in the F1-score metric, which is critical for evaluating overall performance in real-world scenarios. This validates the superiority of our dual-channel approach.

7.2.5. Generalizability on Standard Datasets

To further assess the generalizability of IoTBystander, we test IoTBystander on additional datasets. Given the absence of datasets that contain both types of data for smart home scenarios, we selected the ESC-16 audio dataset and the PingPong dataset for network traffic. The ESC-16 dataset is a subset of the ESC-50 dataset [61], containing 5-second audio recordings of 16 relevant environmental sound classes, which were chosen to fit smart home applications. The PingPong dataset [58], on the other hand, comprises Wi-Fi traffic traces from thirteen different events involving seven IoT devices. Both datasets were divided into training, validation, and test sets in a 70%, 15%, 15% split, respectively.

The results from these experiments, presented in Table 10, highlight the robustness of IoTBystander across varied datasets. On the ESC-16 dataset, the system achieved an accuracy of 93.42% with a 5.18% false positive rate (FPR), demonstrating its effective recognition of environmental sounds relevant to smart home applications. On the PingPong dataset, which deals with IoT traffic, it achieved a higher accuracy of 97.71% with no false positives, indicating the model’s precision in identifying IoT events based on network traffic signatures. These results show that IoTBystander’s dual-channel approach remains effective even when applied to external datasets, further demonstrating its generalizability and confirming its capability to adapt to various smart home environments without overfitting to the original testbed.

7.3. Efficiency of IoTBystander

We note that the computation overhead of the DCA module is negligible compared to that of AER and TER, which involve complex operations such as data processing and neural network forward propagation. Therefore, to evaluate the efficiency of IoTBystander, we test the computational overhead of AER and TER in terms of execution time. Because the two modules rely on different underlying recognition methods, we compute their average execution times in different manners: we measure the average time for AER to handle each audio spectrogram from end to end and for TER to handle each network packet. Moreover, we perform this efficiency evaluation on two different computing platforms: a laptop (Intel i7-13700H 2.4 GHz CPU and 8 GB memory) and a Raspberry Pi 4 Model B (AArch64 Cortex-A72 1.8 GHz CPU and 4 GB memory). The results shown in Table 11 demonstrate that IoTBystander is very efficient. On the laptop, the AER module identifies the event type for each 0.96 s audio frame within 0.0152 s, while the TER module recognizes the event type for each traffic packet in 0.04 s. On the Raspberry Pi, these processing times increase to 0.0289 s for AER and 0.08 s for TER, respectively. That is to say, it is practical to deploy IoTBystander on common home devices such as PCs, tablets, and gateway devices.
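The reported timings can be put in perspective with a short calculation. The following sketch derives, from the figures quoted above, the fraction of real time spent on audio recognition and the packet rate that TER could sustain. The function names are hypothetical, and the packet-rate figure assumes strictly sequential, single-threaded processing, which the text does not state.

```python
AUDIO_FRAME_S = 0.96  # duration of audio covered by one spectrogram

# Per-item execution times in seconds, as reported in the paragraph above.
TIMINGS = {
    "laptop": {"aer": 0.0152, "ter": 0.04},
    "raspberry_pi": {"aer": 0.0289, "ter": 0.08},
}


def real_time_factor(aer_seconds):
    """Processing time divided by audio duration; below 1 keeps up with real time."""
    return aer_seconds / AUDIO_FRAME_S


def max_packet_rate(ter_seconds):
    """Packets per second sustainable if packets are handled one after another."""
    return 1.0 / ter_seconds


summary = {
    platform: (real_time_factor(t["aer"]), max_packet_rate(t["ter"]))
    for platform, t in TIMINGS.items()
}
```

Under these assumptions, AER uses roughly 1.6% of real time on the laptop and roughly 3.0% on the Raspberry Pi, and TER could process about 25 and 12.5 packets per second, respectively.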

7.4. Case Study: Impact on Anomaly Detection

To demonstrate the effectiveness of IoTBystander in improving security monitoring, we conduct a case study using our testbed, outlined in Table 12. We simulate five different anomaly scenarios, such as device component malfunction, cyberattacks, and the lack of monitoring on non-smart devices. The results show that, without IoTBystander, most situations where event integrity is compromised remain unresolved. In contrast, IoTBystander effectively handles all scenarios by leveraging dual-channel fusion, cross-verifying event integrity across both audio and network traffic.

For example, in the case of the smart faucet (scenarios four and five), all methods detect that the faucet is left on for an extended period. However, IoTBystander provides a more context-aware decision by factoring in the presence of non-smart devices. If the homeowner is cooking nearby, the “faucet-on” event is deemed normal, whereas, if the homeowner is in another room working, it may indicate a potential water leak risk. This highlights how IoTBystander enhances the monitoring by incorporating both smart and non-smart devices, offering a more comprehensive view.

Although anomaly detection is used here as a representative task, IoTBystander is designed to serve as a versatile foundation for a variety of security applications, including attack detection, failure diagnosis, configuration debugging, and data provenance. By integrating insights from both audio and network traffic, IoTBystander improves the reliability of security monitoring, ensuring that event data are consistently verified and contextually informed for a broad range of security tasks.

8. Limitations and Ethical Considerations

Although IoTBystander shows significant improvements in smart home security monitoring, it is important to consider the ethical and practical implications of deploying such a system. This section highlights several key threats to validity and ethical concerns that arise due to the non-intrusive nature of the dual-channel monitoring system and discusses actionable solutions to mitigate these risks.

8.1. Precise Data Collection and Labeling

A limitation of IoTBystander is that its performance on event recognition depends on the quality of labeled data, which currently requires manual supervision. Inaccurate or incomplete labeling can negatively affect system performance, leading to misclassifications. To address this, we emphasize the importance of automating the labeling process and reducing the need for large training datasets through knowledge transfer.

In real-world smart home deployments, many IoT platforms provide APIs that stream ongoing IoT events and maintain system logs of historical events. These can be leveraged to automate the labeling of audio and network data. Events reported by IoT devices to platforms can be used to label the corresponding audio and network packets collected in real time. This approach can significantly streamline data labeling and improve accuracy.

For cases where such APIs are unavailable, or for non-smart devices, we propose using knowledge transfer techniques. Extracting traffic signatures for specific IoT events can be a one-time effort. Signatures for the same event type remain consistent across different deployments, making it feasible to create a global traffic signature database for various device models and event types offline. Similarly, audio recognition can be effectively improved through transfer learning, which reduces the manual effort to label audio data. By utilizing a pre-train + fine-tune strategy, we observe that users only need to manually label approximately two minutes of audio data per event type. This approach allows users to perform the initial labeling with minimal effort in a user-friendly manner.

8.2. Privacy Concerns

The use of IoT systems for monitoring smart homes inherently raises concerns about privacy. IoTBystander captures audio and network traffic to monitor events that occur within a smart home environment. Although these data are crucial for detecting anomalies and ensuring security, they can inadvertently record sensitive information about household members and their activities. This could include conversations, personal habits, or private interactions, leading to potential surveillance issues.

To address these concerns, we propose a series of mitigation strategies for future work. Data minimization will be prioritized to ensure that only the data required for anomaly detection are captured. By employing advanced filtering and preprocessing techniques, the system could be designed to focus on identifying relevant event-related signals while minimizing the risk of capturing sensitive content. In addition, encryption will be applied to all captured data, ensuring that data transmitted from the system are protected and inaccessible to unauthorized third parties. Finally, techniques such as local processing of sensitive data could be explored to ensure that personal information remains within the user’s local devices and is not exported to external servers unless explicitly permitted.

8.3. Consent and Transparency

A major ethical challenge in the deployment of IoTBystander is ensuring that homeowners and household members are fully informed of the system’s capabilities and the type of data it collects. There is a risk that users may not be fully aware that their behavior, conversations, and movements could be monitored in their own homes. This lack of informed consent could lead to ethical issues, especially if the system is deployed by third parties such as landlords or employers rather than the residents themselves.

To mitigate these issues, we propose the implementation of clear transparency protocols and consent management mechanisms. This includes providing users with an easy-to-understand summary of how the system works, what data it collects, and how it will be used. Users should be able to opt in or out of data collection at any time, and the system should offer clear options for deactivating or limiting data collection. Furthermore, clear data usage agreements should be established, outlining the ethical boundaries of data collection and usage. Such policies will ensure that IoTBystander is used only for its intended purposes, thus protecting user privacy and autonomy.

8.4. Bias in Detection

As with any machine-learning-based system, there is the potential for bias in detection if the training data are not representative of the full diversity of devices and activities in real-world smart home environments. If the model is trained predominantly on data from a specific type of device or user behavior, it may perform poorly for underrepresented devices or activities. For example, events from less common devices and events that occur sporadically may be misclassified.

To minimize the risk of biased detection, we will ensure that the datasets used to train IoTBystander are as diverse and representative as possible. This includes incorporating data from a wide range of smart and non-smart devices, as well as ensuring that both typical and rare user behaviors are represented. In addition, the system can be periodically evaluated and updated with new data to capture emerging trends and scenarios. We also suggest integrating continuous learning mechanisms, through which the system can learn from user feedback on detection accuracy and adjust over time to improve performance across all events.

8.5. Impact on Trust in IoT Systems

The implementation of IoTBystander could have a significant impact on user trust in IoT systems. If users perceive that their behaviors are constantly being monitored, even for security purposes, it could create feelings of unease or distrust toward IoT technologies. This is especially true if users are not fully aware of how their data are being used or if they fear that the system could be misused, for example, by landlords, employers, or other third parties.

To address these concerns and maintain trust in IoT systems, we emphasize the responsible and ethical use of IoTBystander. The system will be designed with strong privacy protections and transparency features, as described above. Clear user agreements and opt-out mechanisms will give users control over what data are collected and how they are used. In addition, the system will focus solely on security-related events and will not use the data for non-security purposes, such as profiling or monitoring daily habits. These steps are intended to preserve user autonomy and trust, so that the system is perceived as a tool for enhancing security rather than an intrusive surveillance mechanism.

8.6. Accountability for Errors

One of the most critical ethical considerations in security systems like IoTBystander is the issue of accountability for errors. In scenarios where the system fails to detect a security threat or mistakenly classifies a harmless event as an anomaly, the consequences could be significant, potentially leading to property damage, false alarms, or missed security breaches. This is particularly concerning for high-stakes events, such as gas leaks or intrusions, where a false positive or negative could result in harm or significant loss.

To address accountability, a redundancy mechanism is incorporated (see Section 6.2): data from the audio and network traffic channels are cross-verified, so that errors in one channel are less likely to lead to incorrect conclusions. Several complementary measures can further mitigate this issue. First, we will ensure that the system undergoes extensive validation and testing on diverse real-world data to assess its accuracy and reliability across a wide range of events. Second, mechanisms for users to report false positives and false negatives would allow continuous learning and improvement based on real-world feedback. Finally, the deployment of IoTBystander in high-risk scenarios (e.g., gas leak detection) will be coupled with manual oversight or additional safeguards to ensure that critical decisions are not based solely on the automated system.
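
The cross-verification idea can be sketched as follows. This is a simplified illustration and not the algorithm of Section 6.2: the `Event` type, the `cross_verify` function, the set of events expected on both channels, and the 5 s alignment window are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Event:
    name: str         # event label, e.g. "gas-detected"
    timestamp: float  # seconds since the start of monitoring


def cross_verify(
    audio: List[Event],
    traffic: List[Event],
    dual_events: Set[str],
    window_s: float = 5.0,  # hypothetical alignment tolerance
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Pair same-named events seen by both channels within window_s seconds.

    Returns (verified event names, [(event name, channel that saw it alone)]).
    Only events listed in dual_events are expected on both channels.
    """
    verified: List[str] = []
    suspicious: List[Tuple[str, str]] = []
    pending = [t for t in traffic if t.name in dual_events]
    for a in audio:
        if a.name not in dual_events:
            continue  # audio-only by nature (e.g. a non-smart device)
        match = next(
            (t for t in pending
             if t.name == a.name and abs(t.timestamp - a.timestamp) <= window_s),
            None,
        )
        if match is None:
            suspicious.append((a.name, "audio-only"))
        else:
            verified.append(a.name)
            pending.remove(match)
    # Traffic events that no audio event confirmed.
    suspicious.extend((t.name, "traffic-only") for t in pending)
    return verified, suspicious
```

Under these assumptions, a gas alarm reported over the network but never heard (as in the broken-speaker scenario of Table 12) would surface as `traffic-only`, while an alarm that is heard but whose network event is intercepted would surface as `audio-only`.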

9. Conclusions

In this research, we explored the potential of leveraging two side channels, audio and network traffic, to develop IoTBystander, a non-intrusive monitoring framework that improves the comprehensiveness and resilience of smart home security monitoring. Our experimental evaluations on a real-world testbed show that IoTBystander recognizes smart home events with high accuracy (92.86% for smart devices, 95.09% for non-smart devices, and 94.27% for all devices) without modifying the existing infrastructure. These results indicate that the framework offers a versatile and robust solution for smart home monitoring.

However, like any system, IoTBystander has limitations. One of the key challenges is the reliance on labeled data for training, which necessitates manual supervision. The quality of the labeled data directly impacts the system performance, and incomplete or inaccurate labeling can result in misclassifications. Moreover, this dependency on labeled datasets presents a significant barrier to real-world deployment, where such datasets may not always be readily available. To mitigate this, future research could explore automating the labeling process or employing semi-supervised learning techniques to reduce manual effort and improve the scalability of the approach.

While IoTBystander offers significant improvements in monitoring, it is essential to acknowledge that each channel has inherent strengths and weaknesses. For example, the audio channel may miss events that do not produce audible sounds, while network traffic analysis cannot capture events from non-smart devices. Therefore, expanding the system to integrate additional side channels is critical to achieving a more holistic view of smart home activities. Our future work will focus on identifying viable side channels and designing effective collaborative monitoring strategies to further enhance the robustness and comprehensiveness of the IoTBystander framework.

Author Contributions

Conceptualization, H.C.; Formal analysis, H.C.; Funding acquisition, H.C., Y.W. and J.Y.; Investigation, H.C., J.Y. and H.G.; Methodology, H.C. and Q.M.; Project administration, H.C., Y.W. and H.G.; Software, Q.M.; Supervision, H.G.; Validation, H.C., Y.W. and J.Y.; Visualization, Q.M.; Writing—original draft, H.C. and Q.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 62302282 and 62472267, the Shanxi Province Science Foundation under Grant Nos. 202203021222005, 202203021222010 and 202403021212171, and the Research Project Supported by Shanxi Scholarship Council of China under Grant Nos. 2024-018 and 2024-019.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the risks of sensitive information leakage.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Smart Home Market Size, Share & Trends Analysis Report. Available online: https://www.grandviewresearch.com/industry-analysis/smart-homes-industry (accessed on 20 April 2025).
2. Smart Home-Worldwide. Available online: https://www.statista.com/outlook/cmo/smart-home/worldwide (accessed on 20 April 2025).
3. Alam, M.M.; Rahman, A.M.; Wang, W. IoTHaven: An Online Defense System to Mitigate Remote Injection Attacks in Trigger-action IoT Platforms. In Proceedings of the IEEE 30th International Symposium on Local and Metropolitan Area Networks (LANMAN), Boston, MA, USA, 10–11 July 2024; pp. 15–20.
4. Xiao, J.; Xu, Z.; Zou, Q.; Li, Q.; Zhao, D.; Fang, D.; Li, R.; Tang, W.; Li, K.; Zuo, X.; et al. Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Barcelona, Spain, 25–29 August 2024; pp. 3551–3562.
5. Li, R.; Li, Q.; Huang, Y.; Zou, Q.; Zhao, D.; Zhang, Z.; Jiang, Y.; Zhu, F.; Vasilakos, A.V. SeIoT: Detecting Anomalous Semantics in Smart Homes via Knowledge Graph. IEEE Trans. Inf. Forensics Secur. 2024, 19, 7005–7018.
6. Li, Y.; Zhang, Y.; Zhu, H.; Du, S. Toward automatically generating privacy policy for smart home apps. In Proceedings of the IEEE Conference on Computer Communications Workshops, Vancouver, BC, Canada, 10–13 May 2021; pp. 1–7.
7. Fu, C.; Zeng, Q.; Du, X. HAWatcher: Semantics-Aware anomaly detection for appified smart homes. In Proceedings of the 30th USENIX Security Symposium (USENIX Security), Vancouver, BC, Canada, 11–13 August 2021; pp. 4223–4240.
8. Yu, Y.; Xu, Y.; Huang, K.; Liu, J. TAPFixer: Automatic Detection and Repair of Home Automation Vulnerabilities based on Negated-property Reasoning. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security), Philadelphia, PA, USA, 14–16 August 2024; pp. 4945–4962.
9. Xing, Y.; Hu, L.; Du, X.; Shen, Z.; Hu, J.; Wang, F. CCDF-TAP: A Context-Aware Conflict Detection Framework for IoT Trigger-Action Programming With Graph Neural Network. IEEE Internet Things J. 2024, 11, 31534–31544.
10. Merlino, V.; Allegra, D. Energy-based approach for attack detection in IoT devices: A survey. Internet of Things 2024, 27, 101306.
11. Zhang, W.; Meng, Y.; Liu, Y.; Zhang, X.; Zhang, Y.; Zhu, H. Homonit: Monitoring smart home apps from encrypted traffic. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Toronto, ON, Canada, 15–19 October 2018; pp. 1074–1088.
12. Chi, H.; Ma, Q.; Zhang, Y.; Fu, C.; Wang, Y.; Geng, H.; Du, X. Audio-Assisted Smart Home Security Monitoring with Few Samples. In Proceedings of the GLOBECOM 2024-2024 IEEE Global Communications Conference, Cape Town, South Africa, 8–12 December 2024; pp. 2413–2418.
13. Fu, C.; Zeng, Q.; Chi, H.; Du, X.; Valluru, S.L. IoT Phantom-delay attacks: Demystifying and Exploiting IoT Timeout Behaviors. In Proceedings of the 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Baltimore, MD, USA, 27–30 June 2022.
14. Chi, H.; Fu, C.; Zeng, Q.; Du, X. Delay wreaks havoc on your smart home: Delay-based automation interference attacks. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; pp. 285–302.
15. Chi, H.; Zeng, Q.; Du, X. Detecting and Handling IoT Interaction Threats in Multi-Platform Multi-Control-Channel Smart Homes. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 1559–1576.
16. Zhang, L.; Meng, Y.; Yu, J.; Xiang, C.; Falk, B.; Zhu, H. Voiceprint mimicry attack towards speaker verification system in smart home. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM), Toronto, ON, Canada, 6–9 July 2020; pp. 377–386.
17. Li, J.; Meng, Y.; Zhou, L.; Zhu, H. Securing app behaviors in smart home: A human-app interaction perspective. In Proceedings of the IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS), Hong Kong, 2–4 December 2020; pp. 308–315.
18. Zhou, W.; Jia, Y.; Yao, Y.; Zhu, L.; Guan, L.; Mao, Y.; Liu, P.; Zhang, Y. Discovering and understanding the security hazards in the interactions between IoT devices, mobile apps, and clouds on smart home platforms. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, USA, 14–16 August 2019; pp. 1133–1150.
19. Jia, Y.; Xing, L.; Mao, Y.; Zhao, D.; Wang, X.; Zhao, S.; Zhang, Y. Burglars’ iot paradise: Understanding and mitigating security risks of general messaging protocols on iot clouds. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 18–21 May 2020; pp. 465–481.
20. Celik, Z.B.; McDaniel, P.; Tan, G. Soteria: Automated IoT safety and security analysis. In Proceedings of the USENIX Annual Technical Conference (ATC), Boston, MA, USA, 11–13 July 2018; pp. 147–158.
21. Celik, Z.B.; Tan, G.; McDaniel, P.D. IoTGuard: Dynamic enforcement of security and safety policy in commodity IoT. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, San Diego, CA, USA, 24–27 February 2019.
22. Chi, H.; Zeng, Q.; Du, X.; Yu, J. Cross-app interference threats in smart homes: Categorization, detection and handling. In Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Valencia, Spain, 29 June–2 July 2020; pp. 411–423.
23. Kapitanova, K.; Hoque, E.; Stankovic, J.A.; Whitehouse, K.; Son, S.H. Being smart about failures: Assessing repairs in smart homes. In Proceedings of the ACM Conference on Ubiquitous Computing (UbiComp), Pittsburgh, PA, USA, 5–8 September 2012; pp. 51–60.
24. Choi, J.; Jeoung, H.; Kim, J.; Ko, Y.; Jung, W.; Kim, H.; Kim, J. Detecting and identifying faulty IoT devices in smart home with context extraction. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Luxembourg, 25–28 June 2018; pp. 610–621.
25. Hela, S.; Amel, B.; Badran, R. Early anomaly detection in smart home: A causal association rule-based approach. Artif. Intell. Med. 2018, 91, 57–71.
26. Roesch, M. Snort: Lightweight intrusion detection for networks. In Proceedings of the LISA Conference, Seattle, WA, USA, 7–12 November 1999.
27. Suricata: An Open Source Threat Detection Engine. Available online: https://suricata.io/ (accessed on 20 April 2025).
28. Jia, W.; Shukla, R.M.; Sengupta, S. Anomaly detection using supervised learning and multiple statistical methods. In Proceedings of the 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1291–1297.
29. Wang, S.; Balarezo, J.F.; Kandeepan, S.; Al-Hourani, A.; Chavez, K.G.; Rubinstein, B. Machine learning in network anomaly detection: A survey. IEEE Access 2021, 9, 152379–152396.
30. Chalapathy, R.; Chawla, S. Deep learning for anomaly detection: A survey. arXiv 2019, arXiv:1901.03407.
31. Nguyen, T.T.; Armitage, G. A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surv. Tutorials 2008, 10, 56–76.
32. Ahsan, M.; Nygard, K.E. Convolutional Neural Networks with LSTM for Intrusion Detection. In Proceedings of the 35th International Conference on Computers and Their Applications (CATA), San Francisco, CA, USA, 23–25 March 2020; Volume 69, pp. 69–79.
33. Chen, S.; Wang, R.; Wang, X.; Zhang, K. Side-channel leaks in web applications: A reality today, a challenge tomorrow. In Proceedings of the IEEE Symposium on Security and Privacy (SP), Oakland, CA, USA, 16–19 May 2010; pp. 191–206.
34. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
36. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
38. Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
39. Elizalde, B.; Badlani, R.; Shah, A.; Kumar, A.; Raj, B. NELS–Never-Ending Learner of Sounds. arXiv 2018, arXiv:1801.05544.
40. McLoughlin, I.; Zhang, H.; Xie, Z.; Song, Y.; Xiao, W. Robust sound event classification using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 540–552.
41. Cakir, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303.
42. Laput, G.; Ahuja, K.; Goel, M.; Harrison, C. Ubicoustics: Plug-and-play acoustic activity recognition. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (UIST), Berlin, Germany, 14 October 2018; pp. 213–224.
43. De Miguel, K.; Brunete, A.; Hernando, M.; Gambao, E. Home camera-based fall detection system for the elderly. Sensors 2017, 17, 2864.
44. Kim, K.; Jalal, A.; Mahmood, M. Vision-based human activity recognition system using depth silhouettes: A smart home system for monitoring the residents. J. Electr. Eng. Technol. 2019, 14, 2567–2573.
45. Fu, C.; Du, X.; Zeng, Q.; Zhao, Z.; Zuo, F.; Di, J. Seeing Is Believing: Extracting Semantic Information from Video for Verifying IoT Events. In Proceedings of the 17th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), Seoul, Republic of Korea, 27–29 May 2024; pp. 101–112.
46. Gupta, S.; Reynolds, M.S.; Patel, S.N. ElectriSense: Single-point sensing using EMI for electrical event detection and classification in the home. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing (UbiComp), Copenhagen, Denmark, 26–29 September 2010; pp. 139–148.
47. Mari, S.; Bucci, G.; Ciancetta, F.; Fiorucci, E.; Fioravanti, A. A New NILM System Based on the SFRA Technique and Machine Learning. Sensors 2023, 23, 5226.
48. Ramadan, R.; Huang, Q.; Zalhaf, A.S.; Bamisile, O.; Li, J.; Mansour, D.E.A.; Lin, X.; Yehia, D.M. Energy Management in Residential Microgrid Based on Non-Intrusive Load Monitoring and Internet of Things. Smart Cities 2024, 7, 1907–1935.
49. Li, G.; Yang, Z.; Su, S.; Li, Y.; Wang, Y. Human activity recognition based on multienvironment sensor data. Inf. Fusion 2023, 77, 58–72.
50. Guarino, F.; Vitale, F.; Flammini, F.; Faramondi, L.; Mazzocca, N.; Setola, R. A Two-Level Fusion Framework for Cyber-Physical Anomaly Detection. IEEE Trans. Ind.-Cyber-Phys. Syst. 2023, 7, 456–467.
51. Leal-Junior, L.; Avellar, W.; Blanc, A.; Frizera, A.; Marques, C. Opto-Electronic Smart Home: Heterogeneous Optical Sensors Approaches and Artificial Intelligence for Novel Paradigms in Remote Monitoring. IEEE Internet Things J. 2024, 11, 342–355.
52. Copos, B.; Levitt, K.; Bishop, M.; Rowe, J. Is anybody home? Inferring activity from smart home network traffic. In Proceedings of the IEEE Security and Privacy Workshops (SPW), San Jose, CA, USA, 22–26 May 2016; pp. 245–251.
53. Acar, A.; Fereidooni, H.; Abera, T.; Sikder, A.K.; Miettinen, M.; Aksu, H.; Conti, M.; Sadeghi, A.R.; Uluagac, S. Peek-a-boo: I see your smart home activities, even encrypted! In Proceedings of the 13th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), Linz, Austria, 8–10 July 2020; pp. 207–218.
54. Apthorpe, N.; Reisman, D.; Feamster, N. Closing the blinds: Four strategies for protecting smart home privacy from network observers. arXiv 2017, arXiv:1705.06809.
55. Apthorpe, N.; Huang, D.Y.; Reisman, D.; Narayanan, A.; Feamster, N. Keeping the smart home private with smart (er) iot traffic shaping. arXiv 2018, arXiv:1812.00955.
56. Sivanathan, A.; Gharakheili, H.H.; Loi, F.; Radford, A.; Wijenayake, C.; Vishwanath, A.; Sivaraman, V. Classifying IoT devices in smart environments using network traffic characteristics. IEEE Trans. Mob. Comput. 2018, 18, 1745–1759.
57. Sivanathan, A.; Sherratt, D.; Gharakheili, H.H.; Radford, A.; Wijenayake, C.; Vishwanath, A.; Sivaraman, V. Characterizing and classifying IoT traffic in smart cities and campuses. In Proceedings of the IEEE Conference on Computer Communications Workshops, Atlanta, GA, USA, 1–4 May 2017; pp. 559–564.
58. Trimananda, R.; Varmarken, J.; Markopoulou, A.; Demsky, B. Packet-level signatures for smart home devices. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, San Diego, CA, USA, 23–26 February 2020.
59. Tshark Reference Documentation. 2025. Available online: https://www.wireshark.org/docs/man-pages/tshark.html (accessed on 20 April 2025).
60. Newbold, P.; Carroll, W.L.; Thorne, B. Statistics for Business and Economics, 9th ed.; Pearson: London, UK, 2013.
61. ESC-50 Dataset. 2024. Available online: https://gitcode.com/karolpiczak/ESC-50 (accessed on 20 April 2025).

Figure 1. Emerging smart home architecture.

Figure 2. Interactions among entities in smart homes.

Figure 3. System overview of IoTBystander. The ellipsis (“…”) indicates that the event sequence continues with additional events following the same pattern.

Figure 4. Pipeline of audio-based physical event recognition.

Figure 5. Testbed setup for the traffic-based recognition.

Figure 6. Perceptible events via different channels.

Figure 7. Some of the smart devices used in our testbed.

Figure 8. Results of comparison with related approaches: PingPong [58] and IoTAudMon [12].

Table 1. Summary of security monitoring in IoT.

	Methodology	Data Type	Application	Multi-Platform System Considered?
[7]	Rule Hypothesis Generation	IoT Events	Anomaly Detection	No
[11]	DFA Modeling	IoT Traffic	Malicious App Detection	No
[20]	Model Checking	Source Code	Security Analysis	No
[21]	Dynamic Policy Enforcement	Source Code	Security Enforcement	No
[22]	Satisfiability Modulo Theories	Source Code	Inter-App Interaction Detection	No
[23]	Classification	IoT Event	Sensor Failure Detection	No
[24]	State Transition Analysis	IoT Event	Faulty Device Detection	No
[25]	Causal Association Rules	IoT Event	Anomaly Detection	No
Ours	Machine Learning+Rule Matching	Audio+Traffic	Event Monitoring	Yes

Table 2. Summary of side-channel-based monitoring.

	Side Channel	Methodology	Application
[26]	Traffic	Association Rules Matching	Traffic Anomaly Detection
[31]	Traffic	CNN-based Framework	Traffic Anomaly Detection
[32]	Traffic	CNN + LSTM	Traffic Anomaly Detection
[33]	Traffic	Sensitive Information Inference	HTTPS and WPA/WPA2 Traffic Analysis
[38]	Audio	CNN-based Architectures	Audio Event Recognition/Detection
[41]	Audio	RCNN(RNN + CNN)	Audio Event Recognition
[42]	Audio	Vggish Model	Environmental Activity Recognition
[43]	Vision	KNN	Fall Detection in Elderly Individuals
[44]	Vision	Hidden Markov Model	Daily Activity Recognition for Elderly
[45]	Vision	Siamese Neural Network	IoT Device Anomaly Detection
[46]	Load	EMI Signature	Device Identification
[47]	Load	SVM	Power Status Monitoring (ON/OFF)
[48]	Load	Neural Networks + PSO	Device Identification/Monitoring
[52]	Traffic	Statistical Inference	Inferring user presence and activity
[53]	Traffic	Machine Learning	Identifying fine-grained user activities
[54]	Traffic	Empirical Analysis	Protecting User Privacy
[49]	Multiple Sensors	Multi-Sensor Fusion	Human Activity Recognition
[50]	Sensors+Traffic	Low/High-level Fusion Framework	Detection of Cyber–physical Anomaly
[51]	Optical Sensors	AI-driven Heterogeneous Sensing	Remote Monitoring in Smart Homes
Ours	Audio+Traffic	Machine Learning+Rule Matching	Event Monitoring

Table 3. Notations.

Symbol	Description
AER	Audio-based Event Recognition
TER	Traffic-based Event Recognition
DCA	Dual-Channel Aggregation
$E S_{A E R}$	Event sequence obtained based on AER
$E S_{T E R}$	Event sequence obtained based on TER
$E S_{r e s}$	Event sequence obtained by aggregating $E S_{A E R}$ and $E S_{T E R}$
$k s_{A E R}$	prior-knowledge set for audio channel
$k s_{T E R}$	prior-knowledge set for traffic channel

Table 4. Architecture of the neural network in the Feature Extractor.

Component	Channel Change
Conv2d	1 → 64
Max_Pool2d	64 → 64
Conv2d	64 → 128
Max_Pool2d	128 → 128
Conv2d	128 → 256
Conv2d	128 → 256
Max_Pool2d	256 → 256
Conv2d	256 → 512
Conv2d	256 → 512
Max_Pool2d	512 → 512
Component	Dimension Change
Linear	12,288 → 4096
Linear	4096 → 4096
Linear	4096 → 128

Table 5. Traffic signatures for different events.

Device	Event	Signature
huawei smart plug	on	C-322 S-139 C-54
huawei smart plug	off	C-198 S-169 C-54
xiaomi smart plug	on	S-223 C-207 S-54 … C-223 S-143 C-54
xiaomi smart plug	off	S-223 C-207 S-54 … C-111 S-111 C-54
gosund smart plug	on	S-202 C-170 S-186 C-106
gosund smart plug	off	S-202 C-170 S-186 C-106
xiaomi camera	on	S-219 C-192 S-110 C-52 … S-[161-170] C-46
xiaomi camera	off	S-219 C-192 S-110 C-52 … S-46 C-46
xiaomi gas sensor	stop	C-111 S-111 C-54
xiaomi gas sensor	alarm	S-223 C-207 S-54 C-223 S-143 C-54
xiaomi door sensor	on	C-179 S-116 C-54 … C-179 S-116 C-54
xiaomi door sensor	off	C-179 S-116 C-54 … C-91 S-91 C-54
xiaomi smoke sensor	stop	C-91 S-91 C-54
xiaomi smoke sensor	alarm	C-282 S-116 C-54

Table 6. Smart devices in our testbed. Each smart device supports one or more communication protocols.

Device Model (`Abbreviation`)	Wi-Fi	Bluetooth	Zigbee
Xiaomi Smart Plug (`XP`)	√
Xiaomi Smart Camera (`XC`)	√
Xiaomi Smart Kettle (`XK`)	√
Gosund Smart Plug (`GP`)	√
Huawei Chint Plug (`HP`)	√
Xiaomi Smoke/Fire Alarm Detector (`XD`)		√
Xiaomi Door and Window Sensor (`XS`)		√
Xiaomi Gas Detector (`XG`)	√	√
Xiaomi Smart Multi-Mode Gateway (`XM`)	√	√	√

Table 7. Detailed information of the dataset HomeSound-13.

Event Type	Total Length (s)	Event Type	Total Length (s)
clock-alarm	200	door-ring	497
vacuum-cleaner	200	washing-on	200
faucet-on	224	gas-stove-on	363
kettle-heating	451	keyboard-typing	200
microwave-heating	295	mouse-clicking	200
range-hood-on	430	gas-alarm	200
toilet-flushing	200

Table 8. Experimental results of IoTBystander. See Table 6 for the abbreviations of device models. “–” denotes that the channel cannot recognize the event.

	Event Type (Device Model)	Precision (TER)	Precision (AER)	Precision (DCA)	Recall (TER)	Recall (AER)	Recall (DCA)	F1-Score (TER)	F1-Score (AER)	F1-Score (DCA)
Smart Device	plug on/off (`HP/XP`)	1.00	–	1.00	1.00	–	1.00	1.00	–	1.00
	switch on/off (`GP`)	0.50	–	0.50	0.50	–	0.50	0.50	–	0.50
	camera on/off (`XC`)	1.00	–	1.00	1.00	–	1.00	1.00	–	1.00
	door on/off (`XS`)	1.00	–	1.00	1.00	–	1.00	1.00	–	1.00
	smoke-stop (`XD`)	1.00	–	1.00	1.00	–	1.00	1.00	–	1.00
	smoke-detected (`XD`)	1.00	–	1.00	1.00	–	1.00	1.00	–	1.00
	gas-alarm-stop (`XG`)	1.00	0.97	1.00	1.00	0.97	1.00	1.00	0.97	1.00
	gas-detected (`XG`)	1.00	0.97	1.00	1.00	0.97	1.00	1.00	0.97	1.00
Dumb Device	clock-alarm	–	0.95	0.95	–	0.91	0.91	–	0.93	0.93
	door-ring	–	1.00	1.00	–	0.99	0.99	–	1.00	1.00
	faucet-on	–	0.95	0.95	–	0.90	0.90	–	0.92	0.92
	gas-stove-on	–	0.99	0.99	–	0.97	0.97	–	0.98	0.98
	kettle-heating	–	0.98	0.98	–	0.99	0.99	–	0.98	0.98
	keyboard-typing	–	0.92	0.92	–	0.94	0.94	–	0.93	0.93
	microwave-heating	–	0.98	0.98	–	0.98	0.98	–	0.98	0.98
	mouse-click	–	0.76	0.76	–	0.97	0.97	–	0.85	0.85
	range-hood-on	–	0.99	0.99	–	0.99	0.99	–	0.99	0.99
	toilet-flush	–	0.95	0.95	–	0.89	0.89	–	0.92	0.92
	vacuum-cleaner	–	1.00	1.00	–	0.95	0.95	–	0.97	0.97
	washing-on	–	0.94	0.94	–	0.91	0.91	–	0.93	0.93

Table 9. Statistical stability of the performance results in Table 8.

Type	Metric	Precision (TER)	Precision (AER)	Precision (DCA)	Recall (TER)	Recall (AER)	Recall (DCA)	F1-Score (TER)	F1-Score (AER)	F1-Score (DCA)
Smart Devices	SD	0.1816	–	0.1816	0.1816	–	0.1816	0.1816	–	0.1816
Smart Devices	CI	[0.8239, 1.0333]	–	[0.8239, 1.0333]	[0.8239, 1.0333]	–	[0.8239, 1.0333]	[0.8239, 1.0333]	–	[0.8239, 1.0333]
Dumb Devices	SD	–	0.0633	0.0633	–	0.0371	0.0371	–	0.0413	0.0413
Dumb Devices	CI	–	[0.9115, 0.9919]	[0.9115, 0.9919]	–	[0.9257, 0.9727]	[0.9257, 0.9727]	–	[0.9580, 1.0086]	[0.9580, 1.0086]
All Devices	SD	0.4899	0.4042	0.2915	0.4899	0.4816	0.3339	0.4899	0.4846	0.137
All Devices	CI	[0.3022, 0.6978]	[0.4135, 0.7795]	[0.8058, 1.0632]	[0.3022, 0.6978]	[0.3586, 0.7321]	[0.718, 0.978]	[0.3022, 0.6978]	[0.3166, 0.7080]	[0.883, 1.016]
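
As a worked check of how such stability figures can be obtained, the sketch below computes a sample standard deviation (SD) and a t-based confidence interval (CI). It rests on assumptions that the table does not state: that the Smart Devices row is computed over the 14 device-event pairs listed in Table 5, with the two Gosund plug events at 0.50 and the remaining twelve at 1.00, and that the interval is a 95% Student-t interval. The helper name `sd_and_ci` and the hard-coded critical value are illustrative.

```python
import math
import statistics
from typing import List, Tuple

# Two-sided 95% Student-t critical value for 13 degrees of freedom (n = 14).
T_CRIT_95_DF13 = 2.160


def sd_and_ci(values: List[float], t_crit: float) -> Tuple[float, Tuple[float, float]]:
    """Return the sample SD and the interval mean +/- t_crit * SD / sqrt(n)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    margin = t_crit * sd / math.sqrt(len(values))
    return sd, (mean - margin, mean + margin)


# Assumed per-event TER precision values: 12 events at 1.00, 2 events at 0.50.
ter_precision = [1.00] * 12 + [0.50] * 2
sd, (low, high) = sd_and_ci(ter_precision, T_CRIT_95_DF13)
print(f"SD={sd:.4f}, CI=[{low:.4f}, {high:.4f}]")
```

Under this assumption the SD rounds to 0.1816, the value reported for smart devices, and the interval agrees with the reported [0.8239, 1.0333] to within 0.001. The upper bound exceeds 1 because a t-interval is not constrained to the valid range of a precision value.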

Table 10. Performance of IoTBystander on public datasets.

Dataset	Accuracy	FPR
ESC-16	93.42%	5.18%
PingPong	97.71%	0

Table 11. Computation overhead of IoTBystander.

Module	Laptop	Raspberry Pi
AER	0.0152 s/spectrogram	0.0289 s/spectrogram
TER	0.04 s/packet	0.08 s/packet

Table 12. Comparison of different monitoring approaches in anomaly-detection scenarios.

Anomaly Scenario	Simulation Method	Without IoTBystander	With IoTBystander (Audio)	With IoTBystander (Traffic)	With IoTBystander (Dual)
`XG` cannot make audible alarm due to malfunction (e.g., speaker broken)	Removing the speaker	×	×	×	√
gas-detected events from `XG` intercepted.	ARP spoofing attack	×	√	×	√
The dumb faucet stays on for 5 min.	The dumb device does not generate events.	×	√	×	√
The homeowner seems to have forgotten to turn off the smart faucet while being deeply immersed in work.	No device reports the working-related events.	√	√	×	√
The smart faucet staying on is fine since the homeowner is cooking nearby.	No device reports the cooking-related events.	×	√	×	√

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
