Double Deep Q-Network Next-Generation Cyber-Physical Systems: A Reinforcement Learning-Enabled Anomaly Detection Framework for Next-Generation Cyber-Physical Systems

Zhang, Yinjun; Jamjoom, Mona; Ullah, Zahid

doi:10.3390/electronics12173632

by Yinjun Zhang ¹, Mona Jamjoom ²,* and Zahid Ullah ³

¹ School of Mechanical and Electrical Engineering, Guangxi Science and Technology Normal University, Liuzhou 545004, China
² Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh 11564, Saudi Arabia
³ Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Author to whom correspondence should be addressed.

Electronics 2023, 12(17), 3632; https://doi.org/10.3390/electronics12173632

Submission received: 19 June 2023 / Revised: 2 August 2023 / Accepted: 23 August 2023 / Published: 28 August 2023

(This article belongs to the Section Networks)

Abstract:

In this work, we consider the problem of anomaly detection in next-generation cyber-physical systems (NG-CPS). We use a double deep Q-network (DDQN)-enabled framework in which an agent is trained to detect anomalies, i.e., traffic that does not match the behavior of the legitimate traffic at the end side. The proposed paradigm recognizes both known and unknown anomalies by directly interacting with a simulation environment. It progressively extends its notion of an anomaly to new, previously unrecognized classes by proactively exploring probable anomalies in unlabeled data. The method achieves this by jointly optimizing the use of a limited amount of labeled anomaly data (exploitation) and the identification of infrequent, unlabeled anomalies (exploration). In our analysis, the proposed model achieved significant results in average and greedy anomaly catching when evaluated against the comparative models.

Keywords:

anomaly detection; next-generation-CPS; deep Q-learning; reinforcement learning; intrusion detection systems; neural networks; outlier detection

1. Introduction

Anomaly detection is of great significance in numerous domains and applications, ranging from general to specific areas [1]. These include cybersecurity (intrusion detection systems, firewalls, and intrusion detection and prevention systems), healthcare (early disease identification systems), finance (fraudulent transaction detection systems), and networking (fake packet detection). From these domains, we evaluated different datasets, each containing unique features and instances. Our evaluation focused in particular on characteristics related to non-anomaly, anomaly, and partial-anomaly scenarios. Anomalies arise from a wide array of sources, which leads to dissimilar categories with unique characteristics [2]. Indeed, the literature has demonstrated various kinds of anomalies whose inherent behaviors are completely different [3]. Moreover, anomalies by their nature emerge infrequently and unpredictably in a dataset, which makes it extremely challenging to gather labeled training data that encompass all possible anomaly categories. This is why, as the literature has confirmed, fully supervised machine learning algorithms are unable to fully counter anomaly-based security threats, and why unsupervised machine learning algorithms have dominated the field for many years [4]. However, in many applications, there exists a small group of known anomalies for crucial classes, which sometimes creates problems in matching and understanding [5]. Although they are limited in number, they provide invaluable prior information that can significantly enhance accuracy compared to unsupervised techniques. Therefore, the task herein is to make effective use of these scarce labeled anomalies to achieve strong detection results.

In several real-world applications, we deal with extensive unlabeled data that might contain a variety of similar or dissimilar anomalies. It is therefore essential to utilize such unlabeled data to detect both known and unknown anomalies. With these considerations in mind, the authors of [6] focused on anomaly detection with partially labeled data. In their model, most of the data are considered normal, along with a small number of labeled data that represent all potential anomaly classes. In [7], the authors discuss unsupervised anomaly detection techniques that can usually catch a wide range of anomalies. However, because these techniques lack information about, and therefore understanding of, actual anomalies, they are sometimes ineffective and yield false positives.

In [8,9], the authors discussed the most relevant studies on semi-supervised approaches that, in general, use labeled data for anomaly detection in different applications. The authors noted that while the trained model generally performs well in detecting anomalies, it sometimes struggles to identify new ones during the testing phase. Although these studies are well balanced as a source of basic knowledge, further investigation is needed to design reliable anomaly detection models for future networks. Furthermore, existing unsupervised methods could be useful to some extent for determining “pseudo anomalies” from unlabeled data in future networks, but they will not fulfill the needs of upcoming technologies. The authors also mentioned that pseudo-anomaly techniques, together with labeled-anomaly techniques, can learn more generalized abnormalities using semi-supervised models. However, they do not have the capability to maintain accuracy.

To address this problem, in this work, we proposed a double deep Q-network-enabled anomaly detection framework that works adaptively to catch both known and unknown anomalies in raw data at the client side of NG-CPS. In this paradigm, an agent is empowered through neural networks to detect anomalies by utilizing labeled data to enhance detection precision. Furthermore, the agent was trained in such a manner that it only allows the legitimate traffic observed during the training phase and blocks all other traffic. The agent accomplishes this by proactively interacting with a virtual environment formed from both labeled and unlabeled data. Given that, in many real-world anomaly detection scenarios, there is no sequence of decisions to be made (like in the case of tabular data), and thus, no interactive environment is available for anomaly detection. In contrast, our framework allows an agent to continuously interact with the environment and effectively use the small batch of labeled anomalous instances while purposefully exploring vast amounts of unlabeled data for potential anomalies of new types. Moreover, we used a reward function that utilizes guidance from both labeled and unlabeled anomalies to ensure accuracy while achieving the desired task. This aims to establish a balance between exploration (seeking new types of anomalies) and exploitation (leveraging known anomalies).

The main contributions of this work are summarized below:

In this work, we aimed to address the anomaly detection issue with the help of partially labeled anomalous data by considering both familiar and unfamiliar anomalies.
For this, we used the DDQN paradigm, where an agent was enabled to adaptively and interactively detect known and unknown abnormalities in the network. Moreover, we ensured that the proposed framework only allows defined traffic for further processing in the network, which will not only improve congestion statistics but also help to improve the operational accuracy of NG-CPS.
Finally, we extensively evaluated the suggested model on well-known datasets to imitate the scenarios of the known anomaly, non-anomaly, and partial anomaly. The result statistics demonstrated that our model is quite accurate while classifying anomalies in complex and unique datasets. Based on the result statistics, we trust that the suggested model could be very useful for real-world NG-CPS.

The rest of the paper is organized as follows: In Section 1, we familiarize the readers with NG-CPS anomaly detection. Section 2 contains the related work. In Section 3, we formulate the problem statement, which is followed by an evaluation of the proposed model in Section 4. Section 5 contains the experimental results, while Section 6 concludes and summarizes the paper. For a visual representation, we include Figure 1.

2. Related Work

In this section, we discuss the existing approaches used to detect anomalies in next-generation networks. The concept of anomaly detection in NG-CPS originates from the broader problem of identifying outliers, which is important to address in both current and future networks. In present networks, data mostly originate statically rather than dynamically. Here, “statically” refers to data that are generated or collected at a fixed rate by a machine or sensor over its operational time. Conversely, future networks are expected to collect data “dynamically”, which means that the data will vary over time, often in response to user actions or system events. Detecting anomalies is therefore comparatively easy in present networks, as discussed in [2], but in NG-CPS networks it raises several challenges for the research community.

In [10], the authors presented intelligent video anomaly detection and classification using the faster region-based convolutional neural network (RCNN) and deep Q-learning model. Given that, the authors enabled the RCNN as a baseline model to detect anomalies, while the DQL algorithm was used to classify these anomalies. Although this framework effectively detects anomalies in resource-limited networks, it tends to create issues related to energy consumption, complexity, and memory in these networks. Duan et al. [11] proposed a sequential anomaly detection paradigm utilizing Q-learning algorithms. In [12], the authors proposed a novel reinforcement-learning-enabled adversary anomaly detection technique for modern networks. This model was designed by taking into account the intrusion detection system (IDS) paradigm to address the security concerns of these networks. Given that, it has been observed that the suggested approach is only applicable for the centralized setup, which minimizes its use in real deployment.

In [13], the authors proposed a DQN-enabled paradigm to localize anomalies in videos by encouraging an agent to understand how to catch and identify abnormalities in video streams. This idea was taken from multiple instance learning (MIL) techniques by considering the achievement of a unique objective. In [14], the authors proposed a semi-supervised learning algorithm for anomaly detection and data segmentation. In this model, the authors used a classifier that takes the data sample as an input in the context of loss profile via an autoencoder, which generates a sequence of anomalous and non-anomalous data during training. De La Bourdonnaye et al. [15] proposed a binocular fixation technique for anomaly detection utilizing the informative reward rather than prior information. For this, the authors used a convolutional autoencoder with a weakly supervised learning algorithm. In [16], the authors discussed the application of reinforcement learning algorithms in anomaly detection. Furthermore, they underlined how the environment’s behavior can help in the learning and understanding process of anomalies. In [16], the authors proposed an SARSA-based reinforcement learning algorithm in coordination with a deep neural network (DNN) to detect malicious anomalies in an IDS. The basic objective of this framework was to improve the detection precision of trendy and complex attacks in the network environment.

In [17], the authors reviewed existing state-of-the-art anomaly detection techniques that have been used in cloud base network infrastructure. Moreover, the authors highlighted the advantages and disadvantages of each considered paradigm. Based on their comprehensive evaluations, the authors suggested possible research directions in the context of underlining the open challenges. In [18], the authors discussed the role of multi-agent reinforcement learning in intrusion detection systems by taking into account several use cases of anomaly detection in a network. In [19], the authors described the generic navigation algorithm that has been used for data collection sensors in drones to guide them in a specific direction. The main objective of this work was the consideration of hazardous and critical situations where accurate problem location can help to prevent devastation or improve a situation rapidly. For that, the authors used the proximal policy optimization DRL algorithm in coordination with incremental curriculum learning followed by long short-term memory neural networks. In [20], the authors discussed the application of deep learning, reinforcement learning, deep reinforcement learning, and other advanced algorithms in the biological domain. Given that, the authors underlined how important a role these algorithms are playing now in this domain and how much they could be productive in the future. But, they did not discuss the open problems followed by future research insights, which somehow kept this work in a fuzzy state.

In [21], the authors proposed a general reinforcement learning-enabled security framework for software-defined networks (SDNs). In this paradigm, the authors targeted network threats with the help of a neural-fitted Q-learning agent to ensure the security of these networks. In [22], the authors presented a detailed tutorial on the application of deep learning algorithms by taking into account anomaly detection techniques. Moreover, the authors discussed how these algorithms work in different scenarios and how they effectively depict anomalies, which was followed by different model suitability use cases. In [23], the authors comprehensively discussed the applications of deep learning and reinforcement learning in the economic sector by providing an in-depth insight into existing state-of-the-art models. Furthermore, they also investigated and highlighted the complexities, robustness, performance, accuracy, computational tasks, risk constraints, profitability, etc., of these models.

3. Preliminaries and Problem Formulation

In this section, we talked about the required preliminaries by outlining the issue of active detection using a meta-policy. Given that, we delved into the fundamentals of the Markov decision process (MDP) and double deep reinforcement learning (DDRL) while designing the proposed model. Following this, we used a straightforward approach to train the meta-policy using DDRL and discussed its potential advantages and disadvantages. The symbols and notation utilized in this paper are summarized in Table 1.

3.1. Problem Formulation

In this part, we formulate the anomaly detection problem over a set of entities $X = \{x_{1}, x_{2}, x_{3}, \dots, x_{n}\}$, where $X \in \mathbb{R}^{n \times d}$. Here, n denotes the total number of entities, while d denotes the number of traffic features. Each entity $x_{i}$ is a d-dimensional vector $\{x_{i,1}, x_{i,2}, x_{i,3}, \dots, x_{i,d}\}$, where each feature $x_{i,j}$ takes a real value. Let $y \in \mathbb{R}^{n}$ be the label vector corresponding to the n entities in a dataset, where $y_{i} \in \{-1, 1\}$ (or true/false). Herein, −1 (false) signifies an anomalous instance, while 1 (true) signifies a normal instance. We need to partition the instances into two groups, the normal set $A = \{x_{1}, x_{2}, x_{3}, \dots, x_{a}\}$ and the abnormal set $N = \{y_{1}, y_{2}, y_{3}, \dots, y_{b}\}$, where a and b represent the number of normal and abnormal samples, respectively.

Traditional unsupervised anomaly detection techniques assign anomaly scores, denoted by $c \in \mathbb{R}^{n}$, to the entities in X. This entails learning a function f that maps X to c, under the hypothesis that an instance with a higher score is more likely to be anomalous than one with a lower score. Once these anomaly scores are obtained, an anomaly ranking can be established in which anomalous entities are expected to rank higher than normal ones. However, this ranking is often imperfect, because with a large number of instances it becomes difficult to separate the higher ranks from the lower ones. As a result, some instances that receive lower rankings may turn out to be anomalous. Therefore, in practical applications, a human analyst is needed to scrutinize these irregularities.

With this notation, we formally state the anomaly detection problem with a meta-policy as follows. Given a dataset X, the meta-policy chooses an entity $x_{i}$ from X for inquiry at each step. Subsequently, a human evaluator assigns a label to $x_{i}$ to verify whether $x_{i}$ is a non-anomaly, anomaly, or partial anomaly. To make this precise, we introduce a state vector $y \in \mathbb{R}^{n}$, where each entry $y_{i}$ corresponds to one entity in the dataset. Each $y_{i}$ takes one of the values in the set $\{0, -1, 1\}$. Herein, −1 indicates that the entity is confirmed to be an anomaly after an inquiry, while 1 signifies that the entity is identified as normal after an inquiry. The value 0 represents an entity that is identified as a partial anomaly.

Initially, y is composed of zeros for all entities, since no entity has been selected for an inquiry at the outset. At each step, the state of the chosen entity is updated to 1 or −1 according to the human feedback. Considering the traffic scenario use cases, our objective is to train a policy that determines which entity is inquired at each step. The policy thus defines a map $c: X \times y \Rightarrow \{1, 2, 3, \dots, n\}$, and it is trained to maximize the number of genuine anomalies identified among the chosen entities until the inquiry budget is exhausted.

Markov Decision Process

The Markov decision process (MDP) is a well-recognized framework for sequential decision making. It is defined by the quintuple $M = (S, A, P_{T}, R, \gamma)$, where each element is defined in Table 1. The state transition function is $P_{T}: S \times A \times S \Rightarrow \mathbb{R}^{+}$, where $\mathbb{R}^{+}$ symbolizes the set of positive real numbers. The immediate reward function is $R: S \Rightarrow \mathbb{R}$, and the discount factor $\gamma \in (0, 1)$ establishes a balance between short- and long-term rewards. At each time step t, the agent carries out an action $a_{t} \in A$ in the current state $s_{t}$. The agent then moves to the next state $s_{t+1}$ and receives the reward $r_{t} = R(s_{t+1})$. The basic idea behind this process is to find a policy $\pi$ that maximizes the expected discounted cumulative reward $\mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^{t} r_{t}]$. DRL, which includes DQN and DDQN, is a subfield of machine learning comprising algorithms designed specifically for solving MDPs with deep neural networks. Currently used DRL algorithms often learn a state-value function $V(s_{t}) = \mathbb{E}_{a_{t}, s_{t+1}, s_{t+2}, \dots}[\sum_{l=0}^{\infty} \gamma^{l} R(s_{t+l})]$ or an action-value function $Q(s_{t}, a_{t}) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}[\sum_{l=0}^{\infty} \gamma^{l} R(s_{t+l})]$, with $s_{t} \in S$ and $a_{t} \in A$, utilizing DQN and DDQN. The principal objective of these functions is to identify the next best action with maximum reward.
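As a small numeric illustration of the discounted cumulative reward $\sum_{t} \gamma^{t} r_{t}$ used in the MDP objective, the following Python sketch evaluates it for a finite reward sequence. The reward values and the function name are hypothetical and chosen only for illustration; γ = 0.7 matches the value reported in the experiment settings.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Walk backwards so each step is a single multiply-add (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-step rewards, not taken from the paper's experiments.
rewards = [1.0, 0.0, -0.1, 1.0]
print(discounted_return(rewards, gamma=0.7))
```

A smaller γ shrinks the contribution of later rewards, which is the short-term versus long-term balance described above.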

4. Proposed Model

In this section, we talked about the implementation of the proposed model known as the double deep Q-learning partially labeled anomaly detection technique (DDQLPADT). All submodules utilized in this paradigm are comprehensively explained in the subsequent sections.

4.1. DDQN-Enabled Anomaly Detection Agent with Agent A

In the suggested paradigm, Agent A first learns the optimal anomaly-detection-focused action-value function, also known as the Q-value function [24]. Thereafter, it uses the learned value function as an approximation function for anomaly detection in the traffic.

$$Q^{*}(S, O, A) = \max_{\pi} \mathbb{E}\left[R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \gamma^{3} R_{t+3} + \gamma^{4} R_{t+4} + \gamma^{5} R_{t+5} + \dots \mid S_{t} \in S, A_{t} \in A, \pi\right],$$

(1)

Equation (1) represents the maximum expected return starting from a particular observation O of a state S and executing an action A from the action set $\{a_{0}, a_{1}\}$, while thereafter adhering to a behavioral policy $\pi = P(a \mid s)$. Moreover, the reward obtained at time step t is denoted by $r_{t}$ and is discounted by a factor of $\gamma$ at each subsequent time step. We applied the DDQN framework to approximate the optimal action-value function $Q^{*}(s, a)$ by a network $Q(s, a; \theta)$ with parameters $\theta$. The parameters $\theta$ are learned incrementally to reduce the following loss:

$$L_{j}(\theta_{j}) = \mathbb{E}_{(S, A, O, R, S') \sim U(E_{l})}\left[\left(R + \gamma \max_{A'} Q(S', A'; \theta_{j}^{-}) - Q(S, A; \theta_{j})\right)^{2}\right]$$

(2)

In Equation (2), $E_{l}$ signifies the collection of the agent's learning experiences, each of which is stored as $E_{l_{t}} = (S_{t}, O, A_{t}, R_{t}, S_{t+1})$. The loss function is evaluated on minibatch samples that are selected uniformly at random from the stored experiences. The symbol $\theta_{j}$ in Equation (2) denotes the parameters of the Q-network at the jth iteration. Likewise, $\theta_{j}^{-}$ denotes the parameters of the target network, which are updated with $\theta_{j}$ at every Kth step.
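The loss in Equation (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the two Q-networks are replaced by hypothetical toy functions that return `[Q(s, a_0), Q(s, a_1)]`, states are scalars, and the observation O is omitted from the experience tuples. The first function computes the target as written in Equation (2). The second shows the standard double Q-learning target, in which the online network selects the next action and the target network evaluates it; the text does not spell out which form DDQLPADT uses, so it is included only for comparison.

```python
def dqn_targets(batch, q_target, gamma):
    """Targets R + gamma * max_a' Q(S', a'; theta^-), as in Equation (2)."""
    return [r + gamma * max(q_target(s_next)) for _, _, r, s_next in batch]


def double_dqn_targets(batch, q_online, q_target, gamma):
    """Double Q-learning targets: the online network selects a',
    the target network evaluates it."""
    targets = []
    for _, _, r, s_next in batch:
        online_values = q_online(s_next)
        best = max(range(len(online_values)), key=online_values.__getitem__)
        targets.append(r + gamma * q_target(s_next)[best])
    return targets


def td_loss(batch, q_online, targets):
    """Mean squared error between the targets and Q(S, A; theta)."""
    errors = [(y - q_online(s)[a]) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Hypothetical stand-ins for the two networks: state -> [Q(s, a_0), Q(s, a_1)].
def q_online(s):
    return [0.2 * s, 1.0 - 0.1 * s]


def q_target(s):
    return [0.5, 0.1 * s]


# Minibatch of (S, A, R, S') tuples with scalar toy states.
batch = [(0, 1, 1.0, 2), (2, 0, -1.0, 4)]
plain = dqn_targets(batch, q_target, gamma=0.7)
double = double_dqn_targets(batch, q_online, q_target, gamma=0.7)
print(plain, double, td_loss(batch, q_online, plain))
```

In a full training loop, the target parameters would additionally be overwritten with the online parameters every K steps, as described above.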

4.2. Learning Phase

In this section, we discuss the transferable features, represented by G, obtained from the preceding section, along with the action space A = {1, 2, 3, ..., n}. Here, the size of the action space is proportional to the dataset size. To facilitate the training of a transferable policy, we used a data stream. Specifically, we utilize the transferable features of a training dataset, G-train, and its associated labels, y-train. In each episode, we randomly shuffle G-train and y-train to generate a permutation, denoted as G-train′ and y-train′. Rather than providing the policy with all the data at once, we sequentially input one entity at a time. In this setting, S, A, O, and R follow the Markov decision process (MDP) in the following way.

(1): State S: The state is the transferable feature vector of the currently observed entity, $G_{\text{train}_{i}} \in \mathbb{R}^{l}$, where i signifies the index of the entity.
(2): Action A: The possible actions are 1 and 0, where Action 1 implies that the current entity should be included (probed), while Action 0 indicates that the current entity should be disregarded.
(3): Reward R: When the policy probes an entity, it is assigned a positive reward of 1 if the entity is truly anomalous; otherwise, it is assigned a minor negative reward of −0.1. If the policy overlooks an entity, no reward is allocated. The reward function therefore plays a pivotal role in the learning process by outlining the expected behaviors.

Taken together, the MDP outlines an active and efficient learning process in the traffic environment, where the policy is intuitively incentivized to take Action 1 if the queried entity is anomalous and Action 0 if it is normal. Consequently, the policy is trained to detect more anomalies within a specified budget. However, it is important to note that a meta-policy trained on traffic streams could be suboptimal when deployed in a batch setting, given the differing objectives of the two MDPs. Regardless, we have observed that these potential issues are substantially outweighed by the advantages of traffic streaming, which dramatically reduces the size of the state and action spaces for both training and policy deployment.
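The streaming MDP described in this section can be sketched as follows. This is a minimal illustration, assuming labels follow the convention of Section 3.1 (−1 for an anomaly, 1 for normal). The threshold policy, the toy features, and the names `stream_reward` and `run_episode` are hypothetical and stand in for the trained meta-policy and real traffic data.

```python
import random


def stream_reward(action, is_anomaly):
    """+1 for probing an anomaly, -0.1 for probing a normal entity,
    0 when the entity is skipped (Action 0)."""
    if action == 0:
        return 0.0
    return 1.0 if is_anomaly else -0.1


def run_episode(features, labels, policy, budget, seed=0):
    """Feed entities one at a time in shuffled order until the budget is spent."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)  # one permutation per episode
    total, found, queries = 0.0, 0, 0
    for i in order:
        if queries >= budget:
            break
        action = policy(features[i])
        total += stream_reward(action, labels[i] == -1)
        if action == 1:
            queries += 1
            found += int(labels[i] == -1)
    return total, found


# Hypothetical toy data and a hypothetical threshold policy.
features = [[0.1], [0.2], [5.0], [0.15], [4.2]]
labels = [1, 1, -1, 1, -1]
policy = lambda x: 1 if x[0] > 1.0 else 0
print(run_episode(features, labels, policy, budget=2))
```

Because the policy only ever sees one entity at a time, its input size is independent of the dataset size, which is the reduction in state and action space mentioned above.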

4.3. Proximity-Based Observation and Sampling

In this section, we discuss proximity-based observations and how they work in the proposed model. In the considered environment $E_{n}$, the sampling function $\delta$ is composed of two functions, $\delta_{a}$ and $\delta_{u}$. These two functions are used to maintain a balance between exploitation and exploration while evaluating a dataset D. The function $\delta_{a}$ samples the next state $s_{t+1}$ uniformly at random from the labeled anomaly data $D_{a}$, i.e., $s_{t+1} \sim U(D_{a})$. In contrast, the function $\delta_{u}$ selects $s_{t+1}$ from the unlabeled data $D_{u}$ based on its proximity to the current observation. To facilitate an efficient and effective exploration of $D_{u}$, $\delta_{u}$ is defined as follows:

$$\delta_{u}(S_{t+1} \mid S_{t}, O, A_{t}; \theta_{eb}) = \begin{cases} \arg\max_{s \in S} d(S_{t}, O, s; \theta_{eb}), & \text{if } A_{t} = a_{0} \\ \arg\min_{s \in S} d(S_{t}, O, s; \theta_{eb}), & \text{if } A_{t} = a_{1} \end{cases}$$

(3)

In Equation (3), S is a random subsample of $D_{u}$, and $\theta_{eb}$ denotes the parameters of the feature embedding function $\psi(\cdot; \theta_{eb})$, which is derived from the last hidden layer of the DQN. The function d returns the Euclidean distance between $\psi(s_{t}; \theta_{eb})$ and $\psi(s; \theta_{eb})$, which captures the distance perceived by the agent within its representation space.

Specifically, $\delta_{u}$ returns the nearest neighbor of $s_{t}$ when the agent judges the current observation O of $s_{t}$ to be an anomaly and takes action $a_{1}$. This allows the agent to explore observations that are similar to the suspected anomaly in the unlabeled data $D_{u}$. When the agent judges $s_{t}$ to be a normal observation and takes action $a_{0}$, $\delta_{u}$ returns the farthest neighbor of $s_{t}$, which leads the agent to explore observations that are dissimilar to it. Both strategies encourage active exploration of anomalies in the large $D_{u}$.

Furthermore, $\theta_{eb}$ in DDQLPADT is a subset of the DDQN parameters $\theta$. For efficiency, we approximated the nearest and farthest neighbors using the subsample S instead of the full $D_{u}$. Empirical evidence suggests that this approximation is an accurate substitute for evaluating $\delta_{u}$ on the full $D_{u}$.

During the interaction between the agent and the environment, $\delta_{a}$ and $\delta_{u}$ work in coordination: $\delta_{a}$ is used with probability p, and $\delta_{u}$ is used with probability $1 - p$. In this way, the agent can effectively exploit the labeled observations while still exploring the environment (the large dataset).
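The sampling scheme of this section, including Equation (3), can be sketched as below. This is an illustration under assumptions: `embed` is a hypothetical stand-in for the embedding function ψ (here any callable that returns a coordinate sequence), `subsample_size` is an assumed parameter for the size of the random subsample S, and states are plain tuples. If the current state is itself part of the subsample, it is trivially its own nearest neighbor, so a practical implementation would likely exclude it.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sample_next_state(s_t, action, d_a, d_u, embed, p, rng, subsample_size=1000):
    """Pick the next state with delta_a (probability p) or delta_u (1 - p)."""
    if rng.random() < p:
        # delta_a: uniform draw from the labeled anomalies D_a.
        return rng.choice(d_a)
    # delta_u: search a random subsample S of D_u instead of all of D_u.
    subsample = rng.sample(d_u, min(subsample_size, len(d_u)))
    z_t = embed(s_t)

    def dist(s):
        return euclidean(z_t, embed(s))

    # a_1 (flagged as anomalous) -> nearest neighbor; a_0 -> farthest neighbor.
    return min(subsample, key=dist) if action == 1 else max(subsample, key=dist)


# Hypothetical one-dimensional data with an identity embedding.
rng = random.Random(0)
d_a = [(9.0,)]
d_u = [(0.0,), (1.0,), (10.0,)]
print(sample_next_state((0.9,), 1, d_a, d_u, lambda s: s, p=0.5, rng=rng))
```

Setting p closer to 1 favors exploitation of the labeled anomalies, while p closer to 0 favors exploration of the unlabeled data.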

4.4. Integration of Rewards: External and Intrinsic

In this section, we discuss the external and intrinsic reward functions for the agent, starting with the external reward $R_{e_{t}}$. This reward signal is contingent upon the agent's effectiveness in identifying established anomalies within the labeled anomaly data.

$$R_{e_{t}} = h(S_{t}, A_{t}) = \begin{cases} 1 & \text{if } A_{t} = a_{1} \text{ and } S_{t} \in D_{a}, \\ 0 & \text{if } A_{t} = a_{0} \text{ and } S_{t} \in D_{u}, \\ -1 & \text{otherwise}. \end{cases}$$

(4)

From Equation (4), we can see that the agent obtains a positive reward $R_{e_{t}}$ only when it correctly categorizes a known anomaly as “anomalous”. The agent does not receive any reward for correctly identifying normal observations, while it incurs a penalty in the form of a negative reward for either false-negative or false-positive detections. Consequently, $R_{e_{t}}$ explicitly motivates the agent to utilize the labeled data $D_{a}$ to its fullest extent. To optimize the return, the agent is motivated to engage in interactive learning of the known anomalies, with the objective of achieving a high true-positive detection rate while minimizing both false-negative and false-positive detections. This method allows our model to outperform existing semi-supervised methodologies used for this task in the literature. Our proposed DDQLPADT paradigm is shown in Figure 2.
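Equation (4) translates directly into a small function. The sketch below assumes that states are hashable and that $D_{a}$ and $D_{u}$ are represented as Python sets; the function name and the string identifiers are hypothetical.

```python
def external_reward(action, state, labeled_anomalies, unlabeled):
    """Equation (4): +1 for flagging a labeled anomaly (a_1, S_t in D_a),
    0 for passing an unlabeled observation (a_0, S_t in D_u), -1 otherwise."""
    if action == 1 and state in labeled_anomalies:
        return 1
    if action == 0 and state in unlabeled:
        return 0
    return -1


# Hypothetical identifiers standing in for observations.
d_a = {"known_attack"}
d_u = {"flow_1", "flow_2"}
for action, state in [(1, "known_attack"), (0, "flow_1"), (0, "known_attack"), (1, "flow_2")]:
    print(action, state, external_reward(action, state, d_a, d_u))
```

The last two cases are the penalized ones: missing a labeled anomaly and flagging an unlabeled observation both fall into the "otherwise" branch.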

4.5. Theoretical Evaluation of the Proposed Model

During the training phase, Agent A in our model is trained continuously to reduce the loss function. After training, the Q-network $Q(S, A, O; \theta^{*})$ is parameterized by the learned parameters $\theta^{*}$. In the inference phase, $Q(S', O, A; \theta^{*})$ outputs the estimated value of taking action $A_{0}$ or $A_{1}$ given the observation of a new state (data) $S'$. Since $A_{1}$ corresponds to the agent tagging $S'$ as “anomalous”, $Q(S', O, A_{1}; \theta^{*})$ can be used as an anomaly score. This scoring concept can be further expounded as follows. If $\pi$ is a policy derived from the Q-values, then the expected return of taking action $A_{1}$ given the observation O of $S'$ under policy $\pi$ is denoted by $q_{\pi}(S', A_{1})$:

$$q_{\pi}(S', A_{1}) = \mathbb{E}_{\pi}\left[\sum_{n=0}^{\infty} \gamma^{n} R_{t+n+1} \mid S', O, A_{1}\right].$$

(5)

Let us consider $S^{i}$ as a labeled anomaly, $S^{j}$ as an unlabeled anomaly, and $S^{k}$ as an unlabeled normal observation. Then the relationship $h(S^{i}, O, A_{1}) > h(S^{j}, O, A_{1}) > h(S^{k}, O, A_{1})$ holds. Next, it has been observed that $f(S^{i}; \theta_{en}) \approx f(S^{j}; \theta_{en}) > f(S^{k}; \theta_{en})$, assuming that the function f accurately identifies the abnormality in the traffic during the observations. Since $R_{t}$ is the sum of the outputs of h and f, this leads to $q_{\pi}(S^{i}, O, A_{1}) > q_{\pi}(S^{j}, O, A_{1}) > q_{\pi}(S^{k}, O, A_{1})$ under a specific policy $\pi$.

Hence, once the agent accurately approximates the Q-values after a sufficient number of training iterations, the estimated returns satisfy $Q(S^{i}, O, A_{1}; \theta^{*}) > Q(S^{j}, O, A_{1}; \theta^{*}) > Q(S^{k}, O, A_{1}; \theta^{*})$. This indicates that observations with a large $Q(S', O, A_{1}; \theta^{*})$ are the anomalies we are interested in.
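Using $Q(S', O, A_{1}; \theta^{*})$ as an anomaly score amounts to ranking observations by the second output of the trained Q-network. The following sketch assumes a hypothetical callable `q_values` that returns `[Q(s, A_0), Q(s, A_1)]` for a scalar toy observation; it is not the trained network from the paper.

```python
def rank_by_anomaly_score(observations, q_values):
    """Sort observations by Q(s, A_1; theta*), most anomalous first."""
    scored = [(q_values(s)[1], s) for s in observations]
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Hypothetical stand-in for the trained Q-network on scalar observations.
def q_values(s):
    return [1.0 - s, s]


print(rank_by_anomaly_score([0.2, 0.9, 0.5], q_values))
```

Under the ordering derived above, labeled and unlabeled anomalies would appear near the top of such a ranking and normal observations near the bottom.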

5. Experiment Settings and Results Analysis

In this section, we assessed the proposed model through simulations to answer the considered questions. Given that, we focused on the undermentioned scenarios. To demonstrate the universality of our proposed DDQLPADT in the context of the ratio-enabled anomaly detection approach, we utilized distinct datasets of varying sizes, features, and dimensions obtained from the ODDS2 database. Moreover, we summarized the characteristics of these datasets comprehensively in Table 2. For the assessment metric, we employed the anomaly discovery curve methodology, as illustrated in [17]. This approach essentially plots the quantity of identified anomalies against the number of queries. An ideal outcome is represented by a line exhibiting a slope of 1, suggesting that all queries are anomalous. Conversely, the worst-case scenario is depicted by a line possessing a slope of 0, indicating all queries are normal.
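The anomaly discovery curve described above can be computed from the sequence of labels revealed by the queries. The sketch below assumes the label convention of Section 3.1 (−1 for an anomaly, 1 for normal); the function name and the query sequences are hypothetical.

```python
def discovery_curve(queried_labels):
    """Cumulative number of anomalies found after each query.
    Labels follow Section 3.1: -1 is an anomaly, 1 is normal."""
    curve, found = [], 0
    for label in queried_labels:
        found += 1 if label == -1 else 0
        curve.append(found)
    return curve


print(discovery_curve([-1, 1, -1, -1, 1]))  # mixed queries
print(discovery_curve([-1, -1, -1]))        # ideal case: slope 1
print(discovery_curve([1, 1, 1]))           # worst case: slope 0
```

Plotting each returned list against the query index gives the curve; a steeper curve means that fewer queries were spent on normal entities.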

For comparison, we selected several state-of-the-art techniques and an unsupervised baseline: active anomaly detection, feedback-guided isolation forest, semi-supervised detection of outliers (SSDO), and an unsupervised detector. In active anomaly detection, we evaluated node re-weighting, while in the feedback-guided isolation forest, we focused on the online optimization of anomaly detection. Likewise, we used SSDO to detect anomalies at the operating point. In addition, we ran the unsupervised baseline to validate the evaluation results.

To train the model, we used the DDQN framework with the DQL algorithm. Rollout steps (T) were set to 256, and the entropy coefficient was set to 0.001. The learning rate was set at 0.0001, with a value function coefficient of 0.5, λ = 0.98, and a clip range of 0.25. We used γ = 0.7 as a hyper-parameter to balance the long-term and short-term rewards. The meta-policy underwent training with the designated datasets. To evaluate these datasets, the procedure was reversed. The meta-policy was trained for 500 timesteps with consistent hyper-parameters across all datasets, and the episode length was set to 5000.
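To make the role of the discount factor concrete, the sketch below computes the standard Double DQN bootstrap target with γ = 0.7. The text does not spell out the update rule at this point, so this is the textbook form (the online network selects the next action, the target network evaluates it) and not necessarily the authors' exact implementation. The function name `double_dqn_targets` and the toy batch are hypothetical.

```python
import numpy as np

GAMMA = 0.7  # discount factor reported in the text


def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=GAMMA):
    """Compute y = r + gamma * (1 - done) * Q_target(s', argmax_a Q_online(s', a)).

    rewards: shape (batch,)
    next_q_online, next_q_target: shape (batch, n_actions), Q-values at the next state
    dones: shape (batch,), 1 if the episode ended at this transition, else 0
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_online = np.asarray(next_q_online, dtype=float)
    next_q_target = np.asarray(next_q_target, dtype=float)
    dones = np.asarray(dones, dtype=float)

    # Action selection uses the online network ...
    best_actions = np.argmax(next_q_online, axis=1)
    # ... while action evaluation uses the target network.
    next_values = next_q_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values


# Toy batch of two transitions; the second one is terminal.
rewards = [1.0, 0.0]
next_q_online = [[0.2, 0.5], [0.9, 0.1]]
next_q_target = [[0.3, 0.4], [0.6, 0.8]]
dones = [0, 1]
targets = double_dqn_targets(rewards, next_q_online, next_q_target, dones)
print(targets)
```

With γ = 0.7, a reward received n steps ahead is weighted by 0.7^n, so rewards more than a handful of steps away contribute little to the target, which is the short-term versus long-term balance the text refers to.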

5.1. Use Case Study Scenario: “WBC”

In this section, we discuss the WBC dataset, how it was evaluated and visualized with the help of DDQN, and how the model could be effective for anomaly detection, especially when deployed in NG-CPS networks. As Table 2 shows, WBC is a small dataset (278 points, 30 dimensions). We used the pre-trained policy to check how effectively it classifies anomalous, partially anomalous, and non-anomalous data. For this, we performed comprehensive simulations and obtained the results in Figure 3. In Figure 3, the clean data are categorized as non-anomalies, the anomalies are marked with black stars, and the partial anomalies are shown as light pink circles. To summarize, the results illustrate how the proposed model could classify data in real applications and how it could improve network performance by blocking unnecessary data at the edge of the network.

5.2. Use Case Study Scenario: “Arrhythmia”

In this section, we evaluated the results for the “Arrhythmia” dataset in terms of the considered categories: anomaly, non-anomaly, and partial-anomaly data. We observed that the proposed framework classifies raw data well and that the agent allows only clean data into the network. To support this claim, we performed simulations with the “Arrhythmia” dataset and captured the results shown in Figure 4.

5.3. Use Case Study Scenarios: “Cardio and Breast” Datasets

In this subsection, we further evaluated the proposed model to test its effectiveness across different scenarios. We evaluated the Cardio and Breast datasets, each containing a different number of instances, with a particular focus on the non-anomaly, anomaly, and partial-anomaly categories. During simulation, the model performed robustly and consistently identified anomalies with a high degree of accuracy. This indicates that the proposed model is capable not only of identifying anomalies but also of adapting to varying data distributions, a quality that is vital for real-world applications such as NG-CPS. These results reinforce the model’s proficiency in detecting anomalies and demonstrate its potential for the many applications where such a framework could be useful. In future work, we aim to improve the model’s suitability for real-time applications by further enhancing its robustness in detecting anomalies across a wider range of datasets and domains. The results obtained during the analysis are shown in Figure 5 and Figure 6.

5.4. Use Case Study Scenarios: “Shuttle and Speech” Datasets

To further analyze the proposed model’s significance, we extended our evaluation to two more datasets, Shuttle and Speech. Here, we are interested in both complex and simple datasets. We therefore considered the Shuttle dataset because of its high dimensionality and the Speech dataset because of its distinctive temporal dynamics, which together offered the opportunity to examine the model’s performance across diverse data environments. On the Shuttle dataset, our proposed approach detected anomalies proficiently, even in high-dimensional data. The model thus demonstrated its robustness and adaptability by accurately identifying and classifying outlier features in such a complex dataset.

Next, we evaluated the Speech dataset and observed that our model adapted readily to new data and managed its temporal characteristics effectively. Although anomalies within speech signals are often subtle and context-dependent, which makes them challenging to detect, our model displayed a notable capability to recognize such anomalies, which further attests to its robustness. The results for both datasets are shown in Figure 7 and Figure 8.

5.5. Use Case Study Scenarios: “Ionosphere and Pendigits” Datasets

In this section, we went one step further by considering the Ionosphere and Pendigits datasets. We chose them for their distinctive characteristics, which provided additional insight into the model’s performance and adaptability in diverse anomaly detection scenarios. The Ionosphere dataset consists of radar data, whose complex structure and noise introduce a different set of anomalies compared to traditional datasets. We therefore used it as a use case to check the adaptability of our proposed model to different circumstances. Similarly, we used the Pendigits dataset, which represents handwritten digit recognition in the context of anomalies, to verify how effectively our model classifies different patterns as anomalies. The findings indicate that the proposed model can handle the complexities of the Ionosphere dataset and accurately detect anomalous radar readings. Likewise, on the Pendigits dataset, our model performed well in identifying anomalous digits. This shows that our model can operate effectively in an image-based anomaly detection scenario. The results captured during the analysis are shown in Figure 9 and Figure 10.

5.6. Summary of Discussion

In this section, we discussed the results of the proposed model by evaluating its performance in different scenarios. We evaluated the model against different datasets, each with unique properties in terms of type and complexity. During the analysis of the anomaly detections, we noticed that the proposed model is very effective at classifying raw data. Furthermore, we observed that the data classified as non-anomalous are clean and limited in volume in all use cases, which should support reliable, delay-sensitive, and cost-effective transmission in the network. Because only non-anomalous data are forwarded for transmission in the network, we are confident that this model could prove beneficial in future NG-CPS networks.

6. Conclusions

In this paper, we proposed an intelligent anomaly detection technique for next-generation cyber-physical systems (NG-CPS) based on the DDQN paradigm. We evaluated different datasets for anomaly detection with three categories: anomaly, non-anomaly, and partial anomaly. During the analysis, we noted that the proposed framework is consistent and reliable in identifying anomalies in the considered datasets. Furthermore, it can handle complex and unique data environments while managing anomalies. We also observed that it allows only non-anomalous data for further processing. As a result, communication metrics of the NG-CPS network, such as latency, reliability of transmission, and communication cost, could be significantly improved. With these considerations in mind, we are confident that the proposed model holds considerable potential for future NG-CPS networks and could be useful in many applications, such as early disease identification in healthcare, fraudulent transaction detection in finance, and fake packet detection in networks.

Author Contributions

Y.Z. and Z.U. contributed to the simulation segment followed by the initial draft, while M.J. worked on the proofreading to finalize the work. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project (PNURSP2023R104), Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Carletti, M.; Terzi, M.; Susto, G.A. Interpretable Anomaly Detection with DIFFI: Depth-based feature importance of Isolation Forest. Eng. Appl. Artif. Intell. 2023, 119, 105730.
2. Yan, S.; Shao, H.; Xiao, Y.; Liu, B.; Wan, J. Hybrid robust convolutional autoencoder for unsupervised anomaly detection of machine tools under noises. Robot. Comput. Integr. Manuf. 2023, 79, 102441.
3. Adil, M.; Jan, M.A.; Mastorakis, S.; Song, H.; Jadoon, M.M.; Abbas, S.; Farouk, A. Hash-MAC-DSDV: Mutual Authentication for Intelligent IoT-Based Cyber–Physical Systems. IEEE Internet Things J. 2021, 9, 22173–22183.
4. Xu, H.; Pang, G.; Wang, Y.; Wang, Y. Deep isolation forest for anomaly detection. IEEE Trans. Knowl. Data Eng. 2023.
5. Dorigo, T.; Fumanelli, M.; Maccani, C.; Mojsovska, M.; Strong, G.C.; Scarpa, B. RanBox: Anomaly detection in the copula space. J. High Energy Phys. 2023, 2023, 8.
6. Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; Wang, C. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8032–8041.
7. Sun, S.; Gong, X. Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22846–22856.
8. Cevikalp, H.; Uzun, B.; Salk, Y.; Saribas, H.; Köpüklü, O. From anomaly detection to open set recognition: Bridging the gap. Pattern Recognit. 2023, 138, 109385.
9. Adil, M.; Song, H.; Khan, M.K.; Farouk, A.; Jin, Z. 5G/6G-Enabled Metaverse Technologies: Taxonomy, Applications, and Open Security Challenges with Future Research Directions. arXiv 2023, arXiv:2305.16473.
10. Mansour, R.F.; Escorcia-Gutierrez, J.; Gamarra, M.; Villanueva, J.A.; Leal, N. Intelligent video anomaly detection and classification using faster RCNN with deep reinforcement learning model. Image Vis. Comput. 2021, 112, 104229.
11. Duan, X.; Ying, S.; Yuan, W.; Cheng, H.; Yin, X. QLLog: A log anomaly detection method based on Q-learning algorithm. Inf. Process. Manag. 2021, 58, 102540.
12. Ma, X.; Shi, W. Aesmote: Adversarial reinforcement learning with smote for anomaly detection. IEEE Trans. Netw. Sci. Eng. 2020, 8, 943–956.
13. Aberkane, S.; Elarbi, M. Deep reinforcement learning for real-world anomaly detection in surveillance videos. In Proceedings of the 2019 6th International Conference on Image and Signal Processing and Their Applications (ISPA), Mostaganem, Algeria, 24–25 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5.
14. Chu, W.H.; Kitani, K.M. Neural batch sampling with reinforcement learning for semi-supervised anomaly detection. In Proceedings of the Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 751–766.
15. de La Bourdonnaye, F.; Teuliere, C.; Chateau, T.; Triesch, J. Learning of binocular fixations using anomaly detection with deep reinforcement learning. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 760–767.
16. Caminero, G.; Lopez-Martin, M.; Carro, B. Adversarial environment reinforcement learning algorithm for intrusion detection. Comput. Netw. 2019, 159, 96–109.
17. Erhan, L.; Ndubuaku, M.; Di Mauro, M.; Song, W.; Chen, M.; Fortino, G.; Bagdasar, O.; Liotta, A. Smart anomaly detection in sensor systems: A multi-perspective review. Inf. Fusion 2021, 67, 64–79.
18. Servin, A.; Kudenko, D. Multi-agent reinforcement learning for intrusion detection: A case study and evaluation. In Proceedings of the Multiagent System Technologies: 6th German Conference, MATES 2008, Kaiserslautern, Germany, 23–26 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 159–170.
19. Hodge, V.J.; Hawkins, R.; Alexander, R. Deep reinforcement learning for drone navigation using sensor data. Neural Comput. Appl. 2021, 33, 2015–2033.
20. Mahmud, M.; Kaiser, M.S.; Hussain, A.; Vassanelli, S. Applications of deep learning and reinforcement learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2063–2079.
21. Akbari, I.; Tahoun, E.; Salahuddin, M.A.; Limam, N.; Boutaba, R. ATMoS: Autonomous threat mitigation in SDN using reinforcement learning. In Proceedings of the NOMS 2020 – 2020 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 20–24 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–9.
22. Pang, G.; Cao, L.; Aggarwal, C. Deep learning for anomaly detection: Challenges, methods, and opportunities. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Online, 8–12 March 2021; pp. 1127–1130.
23. Mosavi, A.; Faghan, Y.; Ghamisi, P.; Duan, P.; Ardabili, S.F.; Salwana, E.; Band, S.S. Comprehensive review of deep reinforcement learning methods and applications in economics. Mathematics 2020, 8, 1640.
24. Prathiba, S.B.; Raja, G.; Anbalagan, S.; Arikumar, K.S.; Gurumoorthy, S.; Dev, K. A Hybrid Deep Sensor Anomaly Detection for Autonomous Vehicles in 6G-V2X Environment. IEEE Trans. Netw. Sci. Eng. 2022, 10, 1246–1255.

Figure 1. Flowchart of the paper [5,6,7,8,9,10,11,12,13,14,15,16,17,18].

Figure 2. Visual representation of our proposed DDQN paradigm.

Figure 3. Statistical analysis of the results for the WBC dataset.

Figure 4. Statistical analysis of the results for the Arrhythmia dataset.

Figure 5. Statistical analysis of the results for the Cardio dataset.

Figure 6. Statistical analysis of the results for the Breast dataset.

Figure 7. Statistical analysis of the results for the Shuttle dataset.

Figure 8. Statistical analysis of the results for the Speech dataset.

Figure 9. Statistical analysis of the results for the Ionosphere dataset.

Figure 10. Statistical analysis of the results for the Pendigits dataset.

Table 1. Symbols and notations with their full descriptions.

Full Description	Symbol/Notation	Full Description	Symbol/Notation
Action Space	A	Reward Function	R
Discount Factor	γ	State of Markov Decision Processes	S
Anomaly Score	$c \in \mathbb{R}^{n}$	State Vector (y)	$y^{i} \in \mathbb{R}^{n}$
Number of Instances	n	Feature Dimension	d
Observations	O	Transition Probability	$P_{t}$
State-Value Function	$V (s_{t})$	State-Action-Value Function	$Q (s_{t}, a_{t})$

Table 2. Statistics of the datasets used in the evaluation.

Dataset	Points	Dimensions	Anomalies (%)
Arrhythmia	452	274	16.0
Cardio	1831	21	11.2
Breast	683	10	33
Mammography	11,183	7	3.4
Pendigits	6870	16	4.2
Shuttle	49,097	9	5.0
Satellite	6435	36	34.0
Speech	3686	400	2.3
Vertebral	240	7	13.1
Ionosphere	351	33	36
WBC	278	30	7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
