Article

Detection of Cyber-Attacks in a Discrete Event System Based on Deep Learning

1
Institute of Systems Engineering, Macau University of Science and Technology, Macao SAR, China
2
Groupe de Recherche en Electrotechnique et Automatique du Havre, Université Le Havre Normandie, 76600 Le Havre, France
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2635; https://doi.org/10.3390/math12172635
Submission received: 31 July 2024 / Revised: 18 August 2024 / Accepted: 21 August 2024 / Published: 25 August 2024
(This article belongs to the Special Issue Discrete Event Dynamic Systems and Applications)

Abstract:
This paper addresses the problem of cyber-attack detection in a discrete event system by proposing a novel model. The model utilizes graph convolutional networks to extract spatial features from event sequences. Subsequently, it employs gated recurrent units to re-extract spatio-temporal features from these spatial features. The obtained spatio-temporal features are then fed into an attention model. This approach enables the model to learn the importance of different event sequences, ensuring that it is sufficiently general for identifying cyber-attacks, obviating the need to specify attack types. Compared with traditional methods that rely on synchronous product computations to synthesize diagnosers, our deep learning-based model circumvents state explosion problems. Our method facilitates real-time and efficient cyber-attack detection, eliminating the necessity to specifically identify system states or distinguish attack types, thereby significantly simplifying the diagnostic process. Additionally, we set an adjustable probability threshold to determine whether an event sequence has been compromised, allowing for customization to meet diverse requirements. Experimental results demonstrate that the proposed method performs well in cyber-attack detection, achieving over 99.9% accuracy at a 1% threshold and a weighted F1-score of 0.8126, validating its superior performance.

1. Introduction

As a contemporary technology-intensive system with comprehensive computing and physical capabilities, a cyber–physical system (CPS) can coordinate computational and physical resources through three cutting-edge technologies: computing, communication, and control. CPSs have attracted much research interest due to their wide range of applications, such as autonomous driving, smart grids, nuclear power plants, and chemical processes. By nature, a CPS (usually operating in a networked environment) is vulnerable to malicious attacks from intruders who may seek to extract critical information or interfere with the system to cause havoc, in many cases resulting in great financial losses or even threats to human life. The detection of cyber-attacks under a CPS architecture is therefore of particular importance. However, detecting and locating a cyber-attack swiftly and precisely is challenging due to the inherently intertwined nature of digital and physical components. Cyber-attack detection has received increasing attention in discrete event systems [1,2,3].
The research in [2,4,5] proposes methods to compute the probability of a reachable state by using a labeled continuous-time Markov model, thereby developing a tool for the quantitative assessment of attack risk. Attack risk is considered from the attacker's point of view in [6], where the synthesis of system attack strategies modeled by a probabilistic automaton is addressed; specifically, this method can quantify an attack strategy based on the probability of successfully reaching an unsafe state. In [7], a cyber-attack detection model is established by varying the input and output behavior of a plant through a permutation matrix that alters the transmitted signals to disrupt the attack, and by comparing the observed behavior with the expected behavior of the system. Our study likewise compares observed behavior with expected behavior, but we distinguish the proposed approach by employing probabilistic methods to detect deviations at each step. Furthermore, some approaches require an attack dictionary to be constructed in advance, transforming detection into a classical state estimation problem or a fault diagnosis problem [8]. However, attacks not included in the attack dictionary cannot be detected, which presents a significant limitation of these approaches.
The problem of state estimation in a timed probabilistic setting is studied in [2], where a procedure is designed for computing an observer for the state probabilities of labeled continuous-time Markov models, treated as functions of time. It requires explicit information, including the system states, the events feasible at each state, and the times of the events triggered at each state. In our cyber-attack detection model, a timed probabilistic setting is also used, but not in continuous time, to distinguish the sequence of event occurrences; the probability of an event occurrence is computed rather than the probability of the state in which the event occurs. In [9], an observer is introduced to detect cyber-attacks in a controller by using a hidden Markov model that determines the likelihood of cyber-attacks in the controller. In this paper, we extend the detection algorithm by introducing deep learning methods, offering an alternative approach to attack detection.
In the context of discrete event systems (DESs), it is often crucial to evaluate the stealthiness of attacks and determine whether attack actions result in abnormal event sequences. The works presented in [10,11,12,13,14] address intrusion detection and mitigation under four primary attack types: actuator enablement attacks, actuator disablement attacks, sensor erasure attacks, and sensor insertion attacks. Traditional methods typically require the construction of a diagnoser based on attack dictionaries. These diagnosers distinguish attack types and estimate system states for precise cyber-attack detection by synthesizing the diagnoser’s automaton through synchronous product computations. These methods often struggle with state explosion problems, particularly in large-scale systems, which complicates real-time attack detection.
To address these issues, this research employs deep learning techniques [15] to develop a cyber-attack detection model through simulation and data-driven training, effectively addressing the challenges posed by large-scale DES applications. By integrating graph convolutional networks (GCNs) [16,17] with gated recurrent units (GRUs) [18,19,20], the proposed approach eliminates the dependency on traditional synchronous product computations for diagnoser synthesis, thereby avoiding the state explosion issue. Our model is capable of capturing spatio-temporal features from event sequences and leverages an attention model to enhance the learning of key information, resulting in improved accuracy and efficiency in attack detection.
Once trained, the proposed cyber-attack detection model requires only the input of an event sequence to efficiently assess whether the system has been compromised and to pinpoint the attacked event. Additionally, the developed model does not necessitate state estimation or differentiation between actuator and sensor attacks, significantly simplifying the design of cybersecurity strategies and supervisory control policies. Unlike the extensive existing research on cybersecurity within DES frameworks, this study introduces a novel approach. It focuses on extracting spatio-temporal features from event sequences to evaluate potential cyber-attacks and detect anomalous events, offering a fresh perspective and methodology for cyber-attack detection. Therefore, this research not only enhances the efficiency of attack detection but also introduces an expandable and efficient technological route for safety studies of DESs. This innovative application of deep learning addresses the traditional challenges encountered in DES cybersecurity studies.
This research is motivated by the prediction problem in intelligent transportation systems, which also exhibit spatio-temporal dependence, as described in [21,22]. Cyber-attack detection within the framework of DESs is a challenging yet intriguing topic in the control and automation community. In this research, we explore the spatial features of system behaviors, which are crucial for detecting cyber-attacks targeting a controller, as demonstrated in [9]. Considering that our model is based on an automaton, and that graph neural networks are capable of performing tasks such as node classification [23], node prediction [24,25], link prediction [26,27], and clustering at the graph level, we employ graph neural network methods to detect and localize cyber-attacks. Furthermore, as real-world systems under DESs can be abstracted into numerous nodes and edges, graph convolution is naturally employed for feature extraction. This graph convolution process applies filters to perform a weighted and normalized aggregation of node and edge features.
The paper is organized as follows. Section 2 introduces the preliminaries and Section 3 describes the methodology of the cyber-attack detection models. Section 4 is an experimental study, where a cyber-attack detection model is obtained for quantitative analysis by training a sequence of events generated by a virtually constructed deterministic automaton model. Section 5 concludes this paper.

2. Preliminaries

This section defines a probabilistic discrete event system (PDES) generating event sequences that serve as the input to a cyber-attack detection model. The cyber-attack detection model consists of a graph convolutional network model, a gated recurrent unit model, an attention model, and a focal loss function. This section focuses on the notion of a probabilistic automaton that models a probabilistic DES.
Definition 1. 
A deterministic finite probabilistic automaton (DFPA) is a six-tuple
G = (Q, Σ, δ, q_0, Q_m, ρ),
where Q is a finite set of states, Σ is a finite set of events, δ : Q × Σ → Q is a partial transition function, q_0 ∈ Q is an initial state, Q_m ⊆ Q is a set of marker states, and ρ : Q × Σ → [0, 1] is a state-wise event probability distribution. Given a state q ∈ Q and an event e ∈ Σ, ρ(q, e) indicates the firing probability of e from q, such that ρ(q, e) = 0 if δ(q, e) is not defined and ρ(q, e) > 0 if δ(q, e) is defined, denoted by δ(q, e)!. The set of events enabled at a state q ∈ Q is defined as En(q) = {e ∈ Σ | ρ(q, e) > 0}, with ∑_{e∈En(q)} ρ(q, e) ≤ 1.
Let Σ* denote the set of all finite strings s over Σ, including the empty string ε. The transition function can be extended to δ : Q × Σ* → Q [28] by defining δ(q, ε) = q and δ(q, sσ) = δ(δ(q, s), σ) for q ∈ Q, s ∈ Σ*, and σ ∈ Σ, where δ(q, s) and δ(δ(q, s), σ) are both defined. Given a state q ∈ Q and a string s ∈ Σ*, write δ(q, s)! if δ is defined for the string s at the state q.
A DFPA G is non-terminating if ∑_{σ∈Σ} ρ(q, σ) = 1 for all reachable states q ∈ Q, while G is said to be terminating if there exists at least one state q such that ∑_{σ∈Σ} ρ(q, σ) < 1 [28]. Upon entering a state q, the probability that the system terminates at that state is 1 − ∑_{σ∈Σ} ρ(q, σ). In this paper, only non-terminating DFPAs are considered. The language generated by a DFPA G = (Q, Σ, δ, q_0, Q_m, ρ) is defined as L(G) = {s ∈ Σ* | δ(q_0, s)!} [28], its marked language is defined as L_m(G) = {s ∈ Σ* | δ(q_0, s) ∈ Q_m}, and the probabilistic language generated by G is defined as
L_ρ(G)(ε) = 1,
L_ρ(G)(sσ) = L_ρ(G)(s) × ρ(δ(q_0, s), σ) if δ(q_0, s)!, and 0 otherwise.
Informally, L_ρ(G)(s) is the probability of the string s being generated by G. We have L_ρ(G)(s) > 0 iff s ∈ L(G).
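As a concrete illustration, L_ρ(G)(s) can be computed by walking the automaton and multiplying the event probabilities along the way. The sketch below is a minimal, hypothetical encoding of only the fragment of Figure 1 used in Example 1 (the c/d transitions between q_0 and q_2, and event a at q_0; taking q_1 as the target of a is an assumption for illustration), with the dictionaries delta and rho standing in for δ and ρ.

```python
# Minimal sketch of a DFPA fragment (assumed from Example 1 / Figure 1).
# delta encodes the partial transition function, rho the event probabilities.
delta = {("q0", "c"): "q2", ("q2", "d"): "q0", ("q0", "a"): "q1"}  # q1 target assumed
rho = {("q0", "c"): 0.4, ("q2", "d"): 0.2, ("q0", "a"): 0.6}

def L_rho(s, q0="q0"):
    """Probability L_rho(G)(s) of generating the string s from the initial state."""
    prob, q = 1.0, q0
    for e in s:
        if (q, e) not in delta:   # delta(q, e) undefined => probability 0
            return 0.0
        prob *= rho[(q, e)]
        q = delta[(q, e)]
    return prob

print(L_rho("cdcda"))  # 0.4 * 0.2 * 0.4 * 0.2 * 0.6 = 0.00384
```

Strings that leave the defined transition structure receive probability 0, matching the "otherwise" branch of the definition above.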
Example 1. 
As shown in Figure 1, G = (Q, Σ, δ, q_0, Q_m, ρ) is a DFPA with Σ = {a, b, c, d} and Q_m = {q_1}, where q_1 is double-circled. The probability distribution ρ of the events in G is represented by the number immediately following each event. For instance, the edge a : 0.6 exiting from state q_0 indicates that the probability that event a occurs at state q_0 is 0.6, i.e., for all s ∈ Σ*, if there exists a state q ∈ Q such that δ(q, s) = q_0, then ρ(δ(q, s), a) = 0.6 holds. For example, given a string sσ = cdcda with σ = a, the probability of generating sσ from q_0 is L_ρ(G)(sσ) = ρ(q_0, c) × ρ(q_2, d) × ρ(q_0, c) × ρ(q_2, d) × ρ(q_0, a) = 0.4 × 0.2 × 0.4 × 0.2 × 0.6 = 0.00384.
In addition, the event set is divided into two disjoint subsets based on the sensor deployment of a system, i.e., Σ = Σ_o ∪ Σ_uo, where Σ_o is the set of observable events and Σ_uo is the set of unobservable events. This event partition leads to the definition of a natural projection function P_o : Σ* → Σ_o*, defined as follows. Given s ∈ Σ* and σ ∈ Σ,
P_o(ε) = ε, and P_o(sσ) = P_o(s)σ if σ ∈ Σ_o, and P_o(sσ) = P_o(s) if σ ∈ Σ_uo.
Example 2. 
Consider again the DFPA G in Figure 1. Suppose that Σ_o = {a, b, d} and Σ_uo = {c}. If s = c, we have P_o(s) = ε; if s = cdcda, it follows that P_o(s) = εdεda = dda.
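The natural projection simply erases unobservable events while preserving order, which can be sketched in a few lines (the observable set below is taken from Example 2; strings are represented as plain character sequences for illustration):

```python
# Sketch of the natural projection P_o: erase unobservable events.
SIGMA_O = {"a", "b", "d"}   # observable events, as in Example 2

def P_o(s):
    """Keep only the observable events of s, preserving their order."""
    return "".join(e for e in s if e in SIGMA_O)

print(P_o("c"))      # '' (the empty string, i.e., epsilon)
print(P_o("cdcda"))  # 'dda'
```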

3. Methodology

In this research, we record the event sequences in a DFPA to form feature vectors that serve as the input of deep learning, so as to obtain the spatial features of the events through a graph convolutional network (GCN). Then, the features derived from the GCN are used as the input of the GRU to find the spatio-temporal features, which are subsequently processed by the attention model, making the model able to distinguish the importance of multiple event combinations. Finally, the output of the attention model is mapped to all the defined events of the system by the Softmax function, in order to obtain the occurrence probability of any event in the system. The schematic of this methodology is shown in Figure 2. Specifically, by setting a probability threshold, the system can detect, within a finite number of steps, the types and locations of events where an attack exists. In what follows, let N+ denote the set of all positive integers.

3.1. Problem Definition

Definition 2. 
Given w ∈ L(G), s_net = P_o(w) ∈ Σ_o* is called an observation of G. The observable language of G is the collection of all of its observations, i.e., L_o(G) = {s_net ∈ Σ_o* | ∃w ∈ L(G) : s_net = P_o(w)}.
Given an observation s_net = σ_1σ_2⋯σ_n (σ_i ∈ Σ_o for all i = 1, 2, …, n) and m ≤ n, it is divided into multiple equal-length subsequences s_1 = σ_1σ_2⋯σ_m, s_2 = σ_2σ_3⋯σ_{m+1}, …, s_k = σ_{n−m+1}σ_{n−m+2}⋯σ_n for some k ∈ N+, where the length of each subsequence is m. In other words, given s_net = σ_1σ_2⋯σ_n, the t-th subsequence of s_net starts with σ_t and ends with σ_{m+t−1}. To facilitate the development of the proposed methodology, we transform an observation subsequence into a directed graph called an event-graph.
Definition 3. 
Given an observation subsequence s = σ_1σ_2⋯σ_m, a directed graph associated with s, denoted by G(s) = (V, E), is said to be an event-graph if V = {σ_1, σ_2, …, σ_m} is the set of nodes and E = {(σ_i, σ_{i+1}) | i ∈ {1, 2, …, m−1}} is the set of edges.
As defined, an event-graph derived from an observation subsequence s = σ_1σ_2⋯σ_m is actually a path starting from node σ_1 and ending with σ_m, as schematically demonstrated in Figure 3. Given a finite number of event-graphs G(s_1), G(s_2), …, G(s_h), let G_seq be the sequence of the event-graphs, denoted by G_seq = G(s_1)G(s_2)⋯G(s_h). By a slight abuse of notation, write G(s_t) ∈ G_seq if G(s_t) is in G_seq, t = 1, 2, …, h. This research collects a group of event-graphs as a block, as shown in Figure 4, in order to accurately estimate the occurrence probability of an event and, furthermore, to predict the detection accuracy of cyber-attacks.
Given Δ ∈ Σ_o*, it is said to be a full observation permutation if |Δ| = |Σ_o| and every symbol of Σ_o appears in Δ, i.e., every symbol in Σ_o appears exactly once in Δ. For Σ_o, its full observation permutations are not unique; one trivial possibility is to produce such a permutation in lexicographical ordering. Let Δ[i] denote the i-th element in Δ, and let us fix a full observation permutation Δ. Then, W(Δ[i]) = i is called the feature value of the element Δ[i] of the fixed Δ, where i ∈ {1, 2, …, |Σ_o|}. Analogously, given an observation s ∈ Σ_o*, s[i] denotes its i-th element.
Definition 4. 
The feature vector of an event-graph G(s), |s| = m, with respect to a fixed full observation permutation Δ is defined as an m-dimensional vector X(s) = [x_1, x_2, …, x_m] for which x_i = W(Δ[j]) if s[i] = Δ[j], where i ∈ {1, 2, …, m} and j ∈ {1, 2, …, |Σ_o|}. In X(s), x_i is said to be the vertex feature of s[i].
Given a finite number of feature vectors X(s_1), X(s_2), …, X(s_h), let X_seq be the sequence of the feature vectors, denoted by X_seq = X(s_1)X(s_2)⋯X(s_h). By a slight abuse of notation, write X(s_t) ∈ X_seq if X(s_t) is in X_seq, t = 1, 2, …, h, where t can also be understood as a time step.
Example 3. 
Consider the DFPA G in Figure 1 with Σ_o = {a, b, d}. Obviously, s_net = ddda is an observation of G. Let m = 3. Then, s_net can be divided into two subsequences s_1 = ddd and s_2 = dda. Fix Δ = adb, i.e., Δ[1] = a, Δ[2] = d, and Δ[3] = b. We have W(Δ[1]) = 1, W(Δ[2]) = 2, and W(Δ[3]) = 3. By the definition of the feature vector of an event-graph, X(s_1) = [2, 2, 2] and X(s_2) = [2, 2, 1].
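The sliding-window division and the feature vectors of Definition 4 can be sketched together; the code below reproduces Example 3 (the function names are illustrative, not from the paper):

```python
# Sketch of Definition 4: feature vectors with respect to a fixed full
# observation permutation Delta, plus the sliding-window division of an
# observation into length-m subsequences. Reproduces Example 3.
def subsequences(s_net, m):
    """All length-m sliding-window subsequences s_1, ..., s_k of s_net."""
    return [s_net[i:i + m] for i in range(len(s_net) - m + 1)]

def feature_vector(s, delta):
    """X(s): replace each event by its feature value W(Delta[i]) = i."""
    W = {sym: i + 1 for i, sym in enumerate(delta)}
    return [W[sym] for sym in s]

s1, s2 = subsequences("ddda", 3)
print(feature_vector(s1, "adb"))  # [2, 2, 2]
print(feature_vector(s2, "adb"))  # [2, 2, 1]
```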
Given an event-graph G(s), an m-dimensional vector is assigned to store all the vertex information in the graph, and a two-dimensional array, called the adjacency matrix of G(s), is defined to represent the incidence relationship between vertices.
Definition 5. 
The adjacency matrix of an event-graph G(s) = (V, E) is defined as
A_s[i, j] = 1 if (σ_i, σ_j) ∈ E, and A_s[i, j] = 0 if (σ_i, σ_j) ∉ E,
where i, j ∈ {1, 2, …, |s|}.
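Since an event-graph is a directed path σ_1 → σ_2 → ⋯ → σ_m, its adjacency matrix depends only on the subsequence length m, which can be sketched as follows (the function name is illustrative):

```python
# Sketch of Definition 5: the adjacency matrix of an event-graph, which
# is a directed path, so only the superdiagonal entries are 1.
def event_graph_adjacency(m):
    A = [[0] * m for _ in range(m)]
    for i in range(m - 1):      # edge (sigma_i, sigma_{i+1})
        A[i][i + 1] = 1
    return A

print(event_graph_adjacency(3))  # [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
```

Because every length-m subsequence yields the same path structure, all divided subsequences share this one matrix, as noted below.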
By definition, the event-graphs of any two divided observation subsequences are isomorphic, and thus they share the same adjacency matrix. For G(s), its adjacency matrix A_s and feature vector X(s) serve as the inputs to a GCN, which aggregates and updates the features of each vertex in G(s). Then, as shown in Figure 2, a block consisting of I event-graphs that are separately processed by corresponding GCNs and GRUs is considered as the input to an attention model. For simplicity, suppose that the whole process is represented by a function g obtained by deep learning, which expresses the spatial correlation of each event-graph and the temporal relationship of the I event-graphs. By doing so, it provides a possibility to predict the features of the future T event-graphs, represented by
[X(s_t), …, X(s_{t+T})] = g(A_s, (X(s_{t−I}), …, X(s_{t−2}), X(s_{t−1}))),
where T is the length of the predicted event-graphs.

3.2. Methodology Overview

This paper proposes a novel method for detecting whether a PDES is subject to attacks by a cyber-attacker (sometimes called an intruder), as well as for locating the attacked events. By observing the evolution of a DFPA, an observable event sequence s_net is obtained, which is divided into multiple observation subsequences of the same length. For any observation subsequence s, its feature vector X(s) and adjacency matrix A_s are fed to a GCN and then a GRU. We denote the output of each GRU as a feature matrix. Then, the I feature matrices output by the I GRUs are input into an attention model, as shown in Figure 2. After being processed by the attention model, they are fused by a multi-layer perceptron and adjusted into a new feature vector as output. This output is compared with the (I+1)-th feature vector, i.e., X(s_t) in Figure 2; a loss function is used to measure the difference between the output and the feature vector of the next observation subsequence. This computing procedure is then back-propagated to optimize the related parameters in our cyber-attack detection model, so as to obtain a more accurate output in the following iteration.

3.3. Spatial Correlation Model

The exploration of the spatial correlation features of event-graphs is critical, as the correlation information hidden in the event-graphs is particularly interesting and useful, as exposed in what follows. A GCN model constructs a filter in the Fourier domain that acts on the vertices of a graph; the spatial correlation features between the vertices are obtained through the first-order neighborhood of the graph. The GCN model is then built by stacking multiple convolutional layers [19], defined as
X(s_t)^(l+1) = ReLU(D̃^(−1/2) Ã_s D̃^(−1/2) X(s_t)^(l) W^(l)),    (2)
where X(s_t)^(l) is the feature vector input of the (l+1)-th GCN layer, X(s_t)^(l+1) is the feature vector output of the (l+1)-th GCN layer, Ã_s = A_s + I_N is the adjacency matrix of the graph with self-loops added, I_N is an identity matrix with appropriate dimensions, D̃ is a degree matrix with D̃[i, i] = ∑_j Ã_s[i, j] and D̃[i, j] = 0 for i ≠ j, and W^(l) is a weight matrix (with appropriate dimensions) of this GCN layer. We use ReLU to denote an activation function. Specifically, given a vector V = [v_1, v_2, …, v_m], ReLU(V) = [ReLU(v_1), ReLU(v_2), …, ReLU(v_m)], with
ReLU(v_i) = v_i if v_i ≥ 0, and ReLU(v_i) = 0 if v_i < 0, for i = 1, 2, …, m.    (3)
By a slight abuse of notation, write v i V , if v i is an entry of V, i { 1 , 2 , , m } .
The operation of a multi-layer GCN is shown in Equation (2). In our GCN model, a two-layer GCN is selected to obtain the spatial correlation features, defined as
f(A_s, X(s_t)) = Â_s ReLU(Â_s X(s_t) W_0) W_1,    (4)
where Â_s = D̃^(−1/2) Ã_s D̃^(−1/2), f(A_s, X(s_t)) ∈ R^(m×T) is the output of the feature vectors of T event-graphs, W_0 ∈ R^(m×w_h) is a weight matrix from the input layer to a hidden layer, with m being the number of elements in the feature vector X(s_t), w_h is the number of hidden units, and W_1 ∈ R^(w_h×T) is a weight matrix from the hidden layer to the output layer.
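The two-layer forward pass of Equation (4) can be sketched with NumPy. All sizes below are illustrative assumptions: each vertex carries a scalar feature, so X(s_t) is stored as an m×1 column and the first weight matrix is 1×w_h (the paper states W_0 ∈ R^(m×w_h) for the flattened vector); the weights are randomly initialized.

```python
import numpy as np

# Sketch of the two-layer GCN f(A_s, X) = A_hat ReLU(A_hat X W0) W1 with
# symmetric normalization A_hat = D^{-1/2} (A + I) D^{-1/2}.
def normalize(A):
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d = A_tilde.sum(axis=1)                 # degree of each vertex
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A, X, W0, W1):
    A_hat = normalize(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)     # ReLU hidden layer
    return A_hat @ H @ W1

rng = np.random.default_rng(0)
A = np.zeros((5, 5))
A[[0, 1, 2, 3], [1, 2, 3, 4]] = 1           # directed path event-graph, m = 5
X = np.array([[3.0, 5.0, 1.0, 4.0, 2.0]]).T # feature vector as a column
W0 = rng.normal(size=(1, 8))                # illustrative: w_h = 8 hidden units
W1 = rng.normal(size=(8, 1))                # illustrative: T = 1
out = gcn_forward(A, X, W0, W1)
print(out.shape)  # (5, 1): one output feature per vertex
```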
The basic idea of a GCN is to average a graph's neighboring features (including its own) and then pass them into linear layers (a linear combination of inputs and weight matrices). The topological relationship between vertices and adjacent vertices on the event-graph G(s_t) is captured by the GCN model. Thus, the spatial correlation features of G(s_t) are obtained.
Example 4. 
Given an event sequence s_t = σ_3σ_5σ_1σ_4σ_2, the event-graph G(s_t) = (V, E) constructed from s_t contains five vertices and four edges, as shown in Figure 5, where V = {σ_1, σ_2, σ_3, σ_4, σ_5} is the set of vertices. If the full observation permutation Δ = σ_1σ_2σ_3σ_4σ_5 is fixed, the feature vector of this event sequence is X(s_t) = [3, 5, 1, 4, 2]. We show the computational procedure of Â_s as follows.
The adjacency matrix A_s, the identity matrix I_N, and the new adjacency matrix Ã_s = A_s + I_N are, respectively:
A_s =
0 0 0 1 1
0 0 0 1 0
0 0 0 0 1
1 1 0 0 0
1 0 1 0 0
I_N =
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
Ã_s = A_s + I_N =
1 0 0 1 1
0 1 0 1 0
0 0 1 0 1
1 1 0 1 0
1 0 1 0 1
The new degree matrix is D̃ = diag(3, 2, 2, 3, 3), from which we compute
D̃^(−1/2) = diag(1/√3, 1/√2, 1/√2, 1/√3, 1/√3).
Normalizing all rows and columns by Â_s = D̃^(−1/2) Ã_s D̃^(−1/2) (rounded to four decimal places):
Â_s =
0.3333 0      0      0.3333 0.3333
0      0.5000 0      0.4082 0
0      0      0.5000 0      0.4082
0.3333 0.4082 0      0.3333 0
0.3333 0      0.4082 0      0.3333
The obtained matrix Â_s is the essence of a GCN model; Â_s can then be brought into Equation (4) to compute the two-layer GCN model, as shown in Figure 6. After the spatial correlation features of the event-graphs composed from the event sequences are extracted by the GCN, the cyber-attack detection model further recognizes the structure of the DFPA by computing the temporal correlation features of these event-graphs. We choose the GRU to extract the temporal correlation features of the event-graphs.
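The normalization in Example 4 can be checked numerically. The sketch below recomputes Â_s for the symmetric adjacency matrix used in the example (the 1/3 ≈ 0.3333 and 1/√6 ≈ 0.4082 entries come out directly):

```python
import numpy as np

# Numeric check of Example 4: A_hat = D^{-1/2} (A_s + I) D^{-1/2} for the
# five-vertex event-graph of Figure 5 (A_s symmetric, as in the example).
A_s = np.array([[0, 0, 0, 1, 1],
                [0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1],
                [1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0]], dtype=float)
A_tilde = A_s + np.eye(5)
d = A_tilde.sum(axis=1)                 # degrees: [3, 2, 2, 3, 3]
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
print(np.round(A_hat, 4))               # matches the matrix in Example 4
```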

3.4. Temporal Correlation Model

The GRU is a kind of recurrent neural network, proposed to solve problems such as long-term memory and vanishing gradients in back-propagation [19]. To better describe the use of temporal correlation features in our cyber-attack detection model, given a feature-vector sequence X_seq, let a subsequence X(s_{t−I})⋯X(s_{t−2})X(s_{t−1}) of X_seq be the input to the GCN model to obtain a sequence of corresponding outputs f(A_s, X(s_{t−I}))⋯f(A_s, X(s_{t−2}))f(A_s, X(s_{t−1})), written as F_seq. By a slight abuse of notation, write f(A_s, X(s_i)) ∈ F_seq if i ∈ {t−1, t−2, …, t−I}. Moreover, we tentatively express the output of each GRU as a feature matrix. Given the sequence F_seq as input, our GRU model processes each function sequentially and generates a corresponding feature matrix as output. Each output is then passed on, together with the next f(A_s, X(s_i)), to compute the next output of the GRU; this is an iterative process. For instance, f(A_s, X(s_{t−I})) serves as an input to the GRU, and its output is used as the input to the GRU together with f(A_s, X(s_{t−I+1})) to compute a new feature matrix. In this way, our cyber-attack detection model acquires temporal correlation features. Each feature matrix extracts the features of the function sequence F_seq, which also means that our GRU model can integrate the temporal correlation features of the I groups of event-graphs in G_seq, as if it had memories of past event-graphs. A feature matrix is also said to be a hidden state, denoted by h_t. The subscript t corresponds to the notation t in X(s_t) and inherits an ordering relation from X_seq; it also reflects a temporal sequential relation. Thus, t can be thought of as a time step, and h_{t−1} is the hidden state at time step t−1.
In the GRU model, we need two kinds of activation functions, Sigmoid and Tanh, which are, respectively, defined as
Sig(a_ij) = 1 / (1 + e^(−a_ij)),    (5)
tanh(b_ij) = (e^(b_ij) − e^(−b_ij)) / (e^(b_ij) + e^(−b_ij)),    (6)
where a_ij is the element in the i-th row and j-th column of a matrix A given as the input of the Sigmoid function, and b_ij is the element in the i-th row and j-th column of a matrix B given as the input of the Tanh function.
The GRU model has two gates and a candidate hidden state: one gate is an "update gate", and the other is a "reset gate". The update gate is responsible for determining the relative weight between a previous hidden state and a current input by employing an activation function; its output is a matrix of values that governs the extent to which the hidden state is influenced by the previous hidden state. The reset gate controls the fusion between the previous hidden state and the current input; it assists in deciding which historical information is discarded and which is retained. We define the specific operational steps of the GRU model as:
u_t = Sig(W_u · [f(A_s, X(s_t)), h_{t−1}] + b_u),    (7)
r_t = Sig(W_r · [f(A_s, X(s_t)), h_{t−1}] + b_r),    (8)
c_t = tanh(W_c · [f(A_s, X(s_t)), (r_t ⊙ h_{t−1})] + b_c),    (9)
h_t = u_t ⊙ h_{t−1} + (1 − u_t) ⊙ c_t,    (10)
where u_t is the update gate; r_t is the reset gate; c_t is the candidate hidden state at time step t of the GRU model; f(A_s, X(s_t)) is the output of the GCN model as shown in Equation (4); the weight matrices W_u, W_r, and W_c are generated by random initialization in the update gate, the reset gate, and the candidate hidden state computation, respectively; the deviations b_u, b_r, and b_c are likewise generated by random initialization; h_t is the hidden state output at time step t; and [·, ·] is an operation that concatenates matrices vertically along their rows. To be concatenated vertically, matrices must have compatible dimensions, i.e., the same number of columns. Such a vertical concatenation operation is denoted by [·, ·] or Cat(·, ·). As an illustration, consider two matrices A and B:
A =
a_1 a_2
a_3 a_4
B =
b_1 b_2
b_3 b_4
b_5 b_6
Then, the vertical concatenation of A and B is
F = [A, B] =
a_1 a_2
a_3 a_4
b_1 b_2
b_3 b_4
b_5 b_6
i.e., F is the matrix obtained by the vertical concatenation.
The GRU model computes the hidden state h_t based on the previous hidden state h_{t−1} as input. The computation expressed by Equations (7)–(10) is diagrammatically shown in Figure 7, where ⊙ represents the multiplication of the corresponding elements of matrices. The process of using our GRU model is shown in Figure 8, and the whole process of combining the GCN and the GRU is shown in Figure 9. Thanks to the two gates and the candidate hidden state, the GRU model retains the changing trend of the occurrence features of historical event-graphs when capturing the functions f. Therefore, this GRU model can extract dynamic time-varying features; namely, the GRU model is introduced in our research to learn the temporal correlation features of the occurrences of the events. The features obtained through the GRU model can be applied to an attention model to calculate attention scores. The attention model assesses the degree to which features deserve more attention and then increases the weight coefficient of h_t. The attention model is introduced below.
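One GRU step of Equations (7)–(10) can be sketched with NumPy. The dimensions below are illustrative assumptions (f(A_s, X(s_t)) and h_{t−1} taken as m×T matrices with m = 5, T = 1), with the weights randomly initialized as in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of one GRU step (Equations (7)-(10)): the GCN output f_t and the
# previous hidden state h_prev are concatenated vertically before each gate.
def gru_step(f_t, h_prev, Wu, Wr, Wc, bu, br, bc):
    z = np.vstack([f_t, h_prev])           # [f(A_s, X(s_t)), h_{t-1}]
    u = sigmoid(Wu @ z + bu)               # update gate
    r = sigmoid(Wr @ z + br)               # reset gate
    zc = np.vstack([f_t, r * h_prev])      # reset gate applied to history
    c = np.tanh(Wc @ zc + bc)              # candidate hidden state
    return u * h_prev + (1.0 - u) * c      # new hidden state h_t

rng = np.random.default_rng(1)
m, T = 5, 1                                # illustrative sizes
f_t, h = rng.normal(size=(m, T)), np.zeros((m, T))
Wu, Wr, Wc = (rng.normal(size=(m, 2 * m)) for _ in range(3))
bu = br = bc = np.zeros((m, T))
h = gru_step(f_t, h, Wu, Wr, Wc, bu, br, bc)
print(h.shape)  # (5, 1)
```

Iterating gru_step over f(A_s, X(s_{t−I})), …, f(A_s, X(s_{t−1})) yields the I hidden states consumed by the attention model.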

3.5. Attention Model

An attention model is usually divided into soft attention and hard attention. In a hard attention model, only the representation of a selected position is transmitted or utilized, whereas in a soft attention model, the representations of all positions are weighted and summed up. In this research, a soft attention model is assigned to learn the weight coefficient of h_t, which expresses the trend of event-graph evolution [22].
The primary motivation for introducing the attention model is to enhance the model’s ability to focus on relevant features across different time steps, thereby improving its generalization capability. In this study, the event sequences in DESs may contain complex dependencies, and not all events contribute equally to predicting the next event. The attention model allows the model to assign different weights to the hidden states, thus emphasizing the most critical events in the sequence. This selective focus is crucial for accurately predicting the occurrence probability of future events, which is the main goal of our cyber-attack detection model.
Given a finite number of hidden states h_{t−I}, …, h_{t−2}, h_{t−1}, their vertical concatenation, called an integrated feature H_{t−I}, is defined as H_{t−I} = Cat(h_{t−I}, …, h_{t−2}, h_{t−1}) = [h_{t−I}, …, h_{t−2}, h_{t−1}]^T, where Cat denotes the vertical concatenation operation. An attention scoring function, implemented as a multi-layer perceptron, computes the score (weight) of each hidden state. A multi-layer perceptron is a neural network composed of multiple neurons organized in layers; by training and adjusting the connection weights between neurons, it can learn to solve prediction problems. Here, each hidden state's attention score is treated as the output of one such neuron. The attention scoring function is defined as
[e_{t−I}, …, e_{t−2}, e_{t−1}] = w^(2)(w^(1) H_{t−I} + b^(1)) + b^(2),  (11)
where [e_{t−I}, …, e_{t−2}, e_{t−1}] is a vector of the attention scores of the I hidden states, with e_i (i ∈ {t−I, …, t−2, t−1}) being the attention score representing the importance of the hidden state at time step i; the weight matrices w^(1) and w^(2) and the biases b^(1) and b^(2) are randomly initialized in the first and second linear layers of the multi-layer perceptron, respectively; and H_{t−I} is the integrated feature matrix. Afterward, the output of the attention scoring function is normalized by the Softmax function, defined as
α_i = exp(e_i) / Σ_{k=t−I}^{t−1} exp(e_k),  (12)
where exp(·) is the exponential function, I is the number of hidden states, and α_i, the normalization of e_i, is called a normalized attention score.
More specifically, given a finite number of hidden states h_{t−I}, …, h_{t−2}, h_{t−1}, let H_seq denote the sequence of hidden states, H_seq = h_{t−I} … h_{t−2} h_{t−1}. Our goal is to predict the occurrence probability of each event at the next time step more accurately, and the weight coefficient reflects how much a hidden state in H_seq contributes to this prediction goal. We therefore use the attention scoring function to compute each hidden state's attention score, which can be understood as a weight coefficient measuring the importance of that state for the prediction target in H_seq (Equation (11)). To facilitate subsequent calculations, the scores are normalized by the Softmax function; see Equation (12).
Finally, the attention model produces a context vector C that reflects the global information of the hidden state sequence H_seq. The context vector C is defined as
C = Σ_{i=t−I}^{t−1} α_i h_i,  (13)
where C is the globally I-accumulated vector (context vector), α_i is the output of the Softmax function at time step i, i ∈ {t−I, …, t−2, t−1}, and h_i is the hidden state at time step i. The whole process is shown in Figure 10.
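The attention scoring, Softmax normalization (Equations (11) and (12)), and context-vector computation can be sketched in NumPy as follows. All weights are random placeholders, and, matching the scoring function above, the two-layer perceptron is applied without a nonlinearity between its layers:

```python
import numpy as np

def attention_context(hidden_states, w1, b1, w2, b2):
    """Soft attention over I hidden states.

    hidden_states: array of shape (I, H) -- the vertical concatenation H_{t-I}.
    The two-layer perceptron maps each hidden state to a scalar score; Softmax
    normalizes the scores; the context vector is the score-weighted sum.
    """
    H = hidden_states
    scores = ((H @ w1.T + b1) @ w2.T + b2).ravel()  # (I,): e_{t-I}, ..., e_{t-1}
    alpha = np.exp(scores - scores.max())           # numerically stable Softmax
    alpha = alpha / alpha.sum()                     # normalized attention scores
    context = alpha @ H                             # context vector C, shape (H,)
    return alpha, context

# toy setup: I = 6 hidden states of dimension 4, MLP hidden width 8
rng = np.random.default_rng(1)
states = rng.standard_normal((6, 4))
w1 = rng.standard_normal((8, 4)); b1 = np.zeros(8)
w2 = rng.standard_normal((1, 8)); b2 = np.zeros(1)

alpha, C = attention_context(states, w1, b1, w2, b2)
```

The normalized scores sum to one, and C has the same dimension as a single hidden state.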

3.6. Loss Function

The dataset constructed from a DFPA, with event labels as features, may be relatively imbalanced: some samples are difficult to classify and others easy. The focal loss improves the accuracy of predicting event labels in our cyber-attack detection model by slightly modifying the cross-entropy [29]. Consequently, the focal loss is chosen as the loss function for the classification task of our cyber-attack detection model. Given a parameter p_f ∈ [0, 1] [29], the predicted probability in an output matrix processed by the Softmax function, the focal loss function F_l is defined as
F_l(p_f) = −Σ_{f=1}^{O} (1 − p_f)^l log(p_f),
where O is the number of event labels in the samples, f indexes the event label of a sample, and l is a hyperparameter called the focus parameter. The larger the focus parameter, the more the loss emphasizes hard-to-classify samples over easy ones. In particular, when the focus parameter is set to 0, the focal loss degenerates to the cross-entropy loss.
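A minimal NumPy sketch of this loss, assuming the standard focal-loss form from [29] with the modulating factor (1 − p_f)^l applied to the log term:

```python
import numpy as np

def focal_loss(p, l=2.0):
    """Focal loss over the probabilities p_f that the model assigns to each
    sample's true event label (already Softmax-normalized).
    l is the focus parameter; l = 0 recovers the cross-entropy loss."""
    p = np.clip(p, 1e-12, 1.0)                 # guard against log(0)
    return -np.sum((1.0 - p) ** l * np.log(p))

# toy probabilities for an easy, a medium, and a hard sample
probs = np.array([0.9, 0.6, 0.1])
```

With l = 0 the value coincides with the cross-entropy; with l > 0 well-classified (high-probability) samples are down-weighted, so the total loss decreases while hard samples dominate.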

3.7. Cyber-Attack Detection Model

In the cyber-attack detection model, spatial correlation features are first extracted by a GCN and then passed to a GRU, which extracts temporal correlation features and captures the trend of event types occurring over time. This GCN–GRU processing is shown in Figure 9. Finally, as shown in Figure 10, the resulting spatio-temporal features are fed into an attention model for generalization, so that the cyber-attack detection model can pay more attention to the features of important event sequences.
Suppose that we have an event sequence s_net handled as multiple event-graphs to form an event-graph sequence G_seq. Each subsequence
G(s_{t−I}) … G(s_{t−3}) G(s_{t−2}) G(s_{t−1}) ∈ G_seq
is input into the GCN model and then fed into the GRU model to obtain I hidden states, denoted h_{t−I} … h_{t−2} h_{t−1}, which cover the spatio-temporal features. The integrated feature H_{t−I}, obtained by concatenating the hidden states, is input into the attention model to compute a context vector C covering the information of all I hidden states. Specifically, a multi-layer perceptron and the Softmax function compute the normalized attention scores of all hidden states, denoted (α_{t−I}, …, α_{t−2}, α_{t−1}); the context vector C is then calculated from these normalized attention scores and the corresponding hidden states. Finally, the focal loss compares the feature vector of the predicted event-graph at the next time step with the actual feature vector, and the loss is used to update the weight matrices of our cyber-attack detection model by back-propagation. In this way, the cyber-attack detection model can predict each event's occurrence probability by mapping the feature vectors of event-graphs back to event types in s_net. For instance, given the predicted probability of an event-graph feature vector X(s_t), we know the occurrence probability of each event at the time step following s_{t−1}.
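Putting the pieces together, the end-to-end flow of this subsection can be sketched as follows. The gcn, gru, and attn callables are untrained stand-ins for the actual sub-models (all names hypothetical), so the output is a valid probability distribution but has no predictive meaning:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next_event(graph_seq, gcn, gru, attn, out_w):
    """Skeleton of the detection pipeline: GCN -> GRU -> attention -> Softmax.

    graph_seq: list of (A_hat, X) pairs -- the normalized adjacency and
    feature vector of each event-graph in the block of I graphs.
    out_w projects the context vector to the six observable event types.
    """
    hidden, states = None, []
    for A_hat, X in graph_seq:
        spatial = gcn(A_hat, X)          # spatial features of one event-graph
        hidden = gru(spatial, hidden)    # spatio-temporal hidden state h_t
        states.append(hidden)
    context = attn(np.stack(states))     # context vector C over the I states
    return softmax(out_w @ context)      # occurrence probability of each event

# toy stand-ins (hypothetical, untrained)
rng = np.random.default_rng(0)
gcn = lambda A_hat, X: A_hat @ X
gru = lambda s, h: np.tanh(s if h is None else s + h)
attn = lambda S: S.mean(axis=0)          # uniform attention as a placeholder
out_w = rng.standard_normal((6, 8))
graph_seq = [(np.eye(8), rng.standard_normal(8)) for _ in range(5)]

probs = predict_next_event(graph_seq, gcn, gru, attn, out_w)
```

An event is then judged attacked when its predicted probability falls below the chosen threshold.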

4. Experiments

4.1. Data Description

Most existing datasets do not adequately represent the relationship between events and states in a DFPA. We therefore construct a PDES dataset for the cyber-attack detection model; this dataset provides a visual and concise way to assess the cyber-attack detection capabilities of various deep learning methods. The PDES dataset, available through GitHub [30], was constructed by randomly generating event sequences that simulate the behavior of a DFPA G, shown in Figure 11. In this PDES dataset, the event sequence s_net contains 39,991 events. The event-graph sequence G_seq constructed from s_net by Definition 3 contains 39,991 event-graphs, and the number of vertices in each event-graph G(s_t) is set to eight. In total, the PDES dataset contains 319,928 data entries, organized as 39,991 rows of event-graphs.
In other words, we use a DFPA G as the source of the event-graph sequences G_seq that form the PDES dataset. In fact, the structure of the DFPA G is not available in the experiments; only the event sequence s_net is captured. We therefore call G a hidden DFPA. To form the PDES dataset, consider the hidden DFPA G in Figure 11 with initial state q_0. We have Σ_o = {a, b, c, d, e, g} and Σ_uo = {f}. The features of every eight events form an event-graph G(s_t) by Definition 3. For X(s_t) ∈ X_seq, t = 1, 2, …, h, the event-graph feature vector of each event-graph is defined as X(s_t) = X^t_{8×1} = [x_1, x_2, …, x_8]. The PDES dataset consists of 39,991 such X(s_t)'s.
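As a hypothetical sketch of how s_net can be turned into per-event feature vectors (Definition 3 itself is not reproduced here, and the integer encoding below is an assumption), the following builds one eight-dimensional X(s_t) per event, matching the counts reported above (N events yield N event-graph feature vectors):

```python
import numpy as np

EVENTS = ['a', 'b', 'c', 'd', 'e', 'g']          # observable alphabet of G
ENC = {e: i + 1 for i, e in enumerate(EVENTS)}   # hypothetical integer encoding

def event_graph_features(s_net, window=8):
    """Build one feature vector X(s_t) per event of s_net, each covering the
    `window` most recent events; early windows are left-padded with 0."""
    codes = [ENC[e] for e in s_net]
    graphs = []
    for t in range(1, len(codes) + 1):
        win = codes[max(0, t - window):t]
        graphs.append(np.array([0] * (window - len(win)) + win))
    return graphs

X_seq = event_graph_features("gabcde" * 4)       # 24-event toy sequence
```

Each vector holds the eight most recent event codes, so consecutive vectors overlap by seven events.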
To evaluate the effectiveness of the cyber-attack detection model, we construct an attack DFPA G_a, shown in the red dashed box in Figure 11. To form a dataset attacked by G_a, we replicate the structure of the DFPA G in the attack DFPA G_a, with the same numbers of observable and unobservable events. It is assumed that the attacker has complete knowledge of the structure of the hidden DFPA G and has stolen the privileges of the initial state q_0 and the state q_10 in G, meaning that the attacker can enter and leave G unobserved at q_0 and q_10, respectively. By enabling the unobservable event f in the initial state q_0, the attacker can steer the DFPA G into the designed attack DFPA G_a. The attacker can also stop the attack at state q_26 in G_a and return to state q_10 in G by enabling the unobservable event f.
Another dataset obtained from the attack DFPA G_a, containing 1500 s_net's corrupted by attacks, is called the attacked dataset. Event sequences corrupted by attacks generated by G_a and constructed as event-graph sequences are harder for the cyber-attack detection model to detect than attacks generated by a non-probabilistic automaton. Thus, detecting the attack events generated by G_a reflects the reliability of our cyber-attack detection model.

4.2. Evaluation Metrics

The minimum probability threshold for the occurrence of each event is defined as P_a: if an event's predicted occurrence probability is lower than the threshold, we judge the event to be corrupted by an attack; otherwise, it is judged to be a normal event. Since the occurrence probability of a normal event in the DFPA G is at least 1%, the initial probability threshold P_a is also set to 1%. We classify the detection outcomes into four categories:
(1) If the current event is corrupted by an attack and the model predicts that its occurrence probability is less than P_a, the model makes a correct judgment.
(2) If the current event is a normal event and the model predicts that its occurrence probability is less than P_a, the model makes an incorrect judgment.
(3) If the current event is corrupted by an attack and the model predicts that its occurrence probability is higher than P_a, the model makes an incorrect judgment.
(4) If the current event is a normal event and the model predicts that its occurrence probability is higher than P_a, the model makes a correct judgment.
We denote the number of correctly judged events by n_correct and the total number of events input into the cyber-attack detection model for prediction by n_total. The accuracy of detecting whether an event is corrupted by an attacker is defined as
Acc = n_correct / n_total,
where Acc reflects whether the cyber-attack detection model is able to detect attacked events.
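The four cases and the resulting accuracy can be sketched as follows (the helper names judge and accuracy are hypothetical):

```python
def judge(is_attacked, predicted_prob, p_a=0.01):
    """Cases (1)-(4): the judgment is correct exactly when an attacked event
    falls below the threshold p_a, or a normal event does not."""
    flagged = predicted_prob < p_a
    return flagged == is_attacked

def accuracy(events, p_a=0.01):
    """Acc = n_correct / n_total over (is_attacked, predicted_prob) pairs."""
    n_correct = sum(judge(a, p, p_a) for a, p in events)
    return n_correct / len(events)

# toy events illustrating cases (1), (4), (3), and (2), respectively
events = [(True, 3.1141e-11), (False, 0.27246), (True, 0.02), (False, 0.005)]
```

With this toy list, two of the four judgments are correct, so Acc = 0.5.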
We use the same cyber-attack detection model on both datasets, the PDES dataset and the attacked dataset. The model should detect events in the PDES dataset as normal events and events tampered with in the attacked dataset as attacked events. Our cyber-attack detection model must first ensure that all attack-tampered events generated by the DFPA G_a are detected, since such events are the more dangerous for a PDES, and then ensure that normal events generated by the DFPA G are correctly identified as normal; this guarantees the usability of the model. In the following subsection, we analyze all possible attack-tampered events in G_a by enumeration. We first analyze the relationship between the attack threshold P_a and the accuracy Acc on the PDES dataset. As shown in the red rectangular box in Figure 12, when 1% ≤ P_a ≤ 5%, the accuracy Acc of the cyber-attack detection model exceeds 99.9% on the PDES dataset, meaning that almost all events in the DFPA G are correctly predicted under this probability threshold. In particular, when 1% ≤ P_a ≤ 2%, the accuracy Acc reaches 100% on the PDES dataset. The relationship between P_a and Acc on the attacked dataset is shown in Figure 13. From these results, it is evident that moving the threshold below or above 1% leads to a corresponding decrease in the accuracy Acc. To impose a more stringent constraint, the probability threshold P_a is set to 1% in this subsection.
Beyond accuracy, we also evaluate the overall performance of the cyber-attack detection model using receiver operating characteristic (ROC) curves, as shown in Figure 14. The ROC curves illustrate the model's ability to differentiate between event categories. The model addresses a six-class classification task, where the occurrence probability of each event is computed to determine whether the event has been tampered with. The area under the curve (AUC) values range from 0.71 to 0.93, indicating strong performance across most event categories, with Class 3 (c) achieving the highest AUC value of 0.93.
The overall performance of the proposed cyber-attack detection model is summarized by the average F1-score and weighted F1-score, which are 0.7049 and 0.8126, respectively. The weighted F1-score considers the support of each event category, making it a more reliable measure of performance, particularly in scenarios with imbalanced class distributions. These metrics indicate that our model not only maintains high accuracy but also achieves balanced performance across all event categories. To further validate the effectiveness of the model, we have compared it against models using GRU or long short-term memory (LSTM) [19] alone. Both GRU and LSTM models are trained under the same experimental settings, and their ROC curves are illustrated in Figure 15, highlighting the advantages of our approach.
As observed from Table 1, both the GRU and LSTM models are slightly worse at distinguishing event categories than the proposed model. Specifically, the GRU model achieves an average F1-score of 0.6277 and a weighted F1-score of 0.7061, while the LSTM model obtains an average F1-score of 0.6313 and a weighted F1-score of 0.6979. In contrast, the proposed cyber-attack detection model achieves a noticeably higher average F1-score of 0.7049 and a weighted F1-score of 0.8126, outperforming both.
These results indicate that although GRU and LSTM have advantages in handling sequential data, their performance on the cyber-attack detection task remains inferior to that of the proposed model, which exhibits both superior AUC values and better F1-scores, particularly when class imbalance is taken into account. These experimental findings further confirm the superiority of the proposed cyber-attack detection model.

4.3. Experimental Results

Considering the DFPA G and the attack DFPA G_a shown in Figure 11, suppose that the first attack-tampered event has occurred and an observable event sequence s_net ends in G_a. Denoting by S_attack the set of these observable event sequences, one has
S_attack = {b, ge, gad, gdd, ggc, gaba, gabd, gabba, gabce, gabbce, gabcad, gabbcad, gabcdgc, gabcdedgc, gabbcdedgc}.
In particular, if an event sequence ω is attacked, then the observable string ω′ from the first event of ω to the first tampered event of ω is contained in S_attack. In this example, each attack category of the attack DFPA G_a is represented in S_attack, since S_attack is the set of all such prefixes of attack-tampered event sequences generated by the attack DFPA G_a. For instance, the string gad ∈ S_attack represents a category of event sequences starting with event g and having event d as the first attack-tampered event. The percentages of each event sequence s_att(i) ∈ S_attack (i ∈ {1, 2, …, 15}) in the attacked dataset are shown in Figure 16.
Among the 1500 event sequences with attack-tampered events in the attacked dataset, 657 sequences s_att(1) contain b, with b ∈ S_attack. We input s_net = ε into the cyber-attack detection model to predict the probability of event b. As shown in Table 2, the predicted probability of b after s_net = ε is 3.1141 × 10^−11, which is lower than 1%. Therefore, the model decides that event b was tampered with by the attacker in these situations.
The remaining 14 possibilities are also listed in Table 2, where the probabilities of attack-tampered events are highlighted, and all of them are lower than 1%. Thus, all possible attacks generated by G_a can be detected. Since 1% is a relatively strict condition and the occurrence probabilities of all attacks detected by the cyber-attack detection model lie well below 1%, combining this with Figure 12 shows that the threshold P_a can be relaxed to 5% depending on the actual situation.
In the analysis of a test set comprising 7997 groups of data, each group contains eight events, and each event has six possible outcomes. Figure 17 provides a visual representation of the predicted probability distribution for each possible outcome. More specifically, each subplot shows the model's predicted probability that a given event category is the next occurring event. These probability values reveal the model's confidence in its predictions: taller probability bars indicate stronger confidence in the occurrence of a specific event.
Within each subplot (Figure 17), the height of a bar represents the average predicted probability of that event being the true next event over the test set. The error bars represent the standard deviation of the predicted probabilities, which indicates the stability of the prediction distribution. For instance, in the subplot for event c, the model shows a high predicted probability along with a significant standard deviation, suggesting that the model is confident in predicting this event, albeit with lower consistency across predictions. Conversely, the subplot for event a shows lower predicted probabilities but higher consistency, as evidenced by the relatively small error bars.

4.4. Experimental Complexity Analysis

In this section, we provide a detailed analysis of the computational complexity of our proposed method, which mainly involves the two-layer GCN and the GRU model. Let M be the number of nodes (events), L the number of GCN layers, H the hidden state dimension of the GRU, and I the number of event-graphs in the GRU's input. The time complexity of GCN feature extraction can be expressed as O(L · (2M − 1)). As we process I graphs as one input sequence, the time complexity of GRU feature extraction is O(I² · M · H). In practice, our model requires an average prediction time of only 0.0281 s to detect attacks during the testing phase, a speed that enables it to process large-scale event data in real time and swiftly assess security threats.
Furthermore, we have evaluated the computational efficiency of the model on a test set, where the total inference time for the entire dataset was 3.0438 seconds. This results in an average inference time per sample of 0.000396 seconds, demonstrating the model’s ability to handle individual event predictions with minimal latency. The throughput is measured at 2523.14 samples per second, highlighting the model’s capability to process a substantial volume of event data in real-time scenarios. These metrics underscore the model’s suitability for deployment in environments where rapid response and high processing capacity are critical.
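A simple way to reproduce such latency and throughput measurements on one's own hardware is sketched below; the stand-in model here is a single matrix product, not the actual detection model:

```python
import time
import numpy as np

def measure_inference(model, samples):
    """Time per-sample inference and report total time, average latency,
    and throughput (samples per second)."""
    start = time.perf_counter()
    for s in samples:
        model(s)                      # one forward pass per sample
    total = time.perf_counter() - start
    return total, total / len(samples), len(samples) / total

# stand-in model: a single 6x8 matrix product per sample (hypothetical)
w = np.random.default_rng(0).standard_normal((6, 8))
samples = [np.ones(8)] * 1000
total, latency, throughput = measure_inference(lambda x: w @ x, samples)
```

Throughput and average latency are reciprocal by construction, so throughput × total time recovers the sample count.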

5. Conclusions

This paper proposes a cyber-attack detection model within the framework of DES, modeled using a DFPA. The model employs deep learning techniques to extract and train features from event sequences. Traditional attack detection methods face challenges with state explosion when dealing with large-scale systems, which complicates real-time and effective computation. In contrast, our study automates the learning and recognition of features from event sequences, effectively reducing the computational burden caused by a large number of states and complex synchronous product computations.
A limitation of the study is the use of a case-by-case training model without the inclusion of real-world datasets, which limits the model’s generalizability to other DFPAs. Nevertheless, the reported approach aims to reduce the dependence on comprehensive knowledge of all system states, unobservable events, and state transition functions. By doing so, it addresses some challenges associated with attack detection under conditions of unclear system states, lack of sufficient prior knowledge, and frequent state explosions. While this model demonstrates potential in identifying whether a DES has been subjected to a cyber-attack and in locating anomalous events, thereby improving diagnostic precision and practicality, further validation with more diverse datasets would be beneficial to fully establish its robustness.
In future work, we aim to integrate considerations of opacity—an aspect of DESs focused on ensuring the confidentiality of certain information within the system to prevent unauthorized access. Incorporating opacity into the cyber-attack detection model will address potential ethical concerns, such as balancing effective detection with the protection of sensitive information.
Additionally, the methods developed in this research are also applicable to fields such as fault diagnosis and opacity analysis. They are capable of processing actual operational event sequences, providing predictive insights into all events that might be triggered post-attack and the potential events at each step of any event sequence. Overall, the cyber-attack detection model introduced in this paper offers a novel and efficient solution for enhancing security and monitoring within DESs.

Author Contributions

Conceptualization, S.D., L.Y. and J.W.; methodology, S.D. and L.Y.; software, S.D.; validation, G.L.; data curation, S.D.; writing—original draft preparation, S.D. and G.L.; writing—review and editing, S.D., L.Y. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Fund, FDCT, Macau SAR, under grants 0101/2022/A and 0029/2023/RIA1.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Cardenas, A.; Amin, S.; Sastry, S. Secure Control: Towards Survivable Cyber-Physical Systems. In Proceedings of the 2008 The 28th International Conference on Distributed Computing Systems Workshops, Beijing, China, 17–20 June 2008; pp. 495–500. [Google Scholar]
  2. Lefebvre, D.; Seatzu, C.; Hadjicostis, C.N.; Giua, A. Correction to: Probabilistic state estimation for labeled continuous time Markov models with applications to attack detection. Discret. Event Dyn. Syst. 2022, 32, 539–544. [Google Scholar] [CrossRef]
  3. Zacchia Lun, Y.; D’Innocenzo, A.; Smarra, F.; Malavolta, I.; Di Benedetto, M.D. State of the art of cyber-physical systems security: An automatic control perspective. J. Syst. Softw. 2019, 149, 174–216. [Google Scholar] [CrossRef]
  4. Keroglou, C.; Hadjicostis, C.N. Verification of detectability in Probabilistic Finite Automata. Automatica 2017, 86, 192–198. [Google Scholar] [CrossRef]
  5. Lefebvre, D.; Seatzu, C.; Hadjicostis, C.N.; Giua, A. Probabilistic verification of attack detection using logical observer. IFAC-PapersOnLine 2020, 53, 95–100. [Google Scholar] [CrossRef]
  6. Meira-Góes, R.; Kwong, R.; Lafortune, S. Synthesis of sensor deception attacks for systems modeled as probabilistic automata. In Proceedings of the 2019 American Control Conference (ACC), Philadelphia, PA, USA, 9–11 July 2019; pp. 5620–5626. [Google Scholar]
  7. Fritz, R.; Zhang, P. Modeling and detection of cyber attacks on discrete event systems. IFAC-PapersOnLine 2018, 51, 285–290. [Google Scholar] [CrossRef]
  8. Gao, C.; Seatzu, C.; Li, Z.; Giua, A. Multiple attacks detection on discrete event systems. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 2352–2357. [Google Scholar]
  9. Desgeorges, L.; Georges, J.P.; Divoux, T. Detection of cyber-attacks in network control planes using Hidden Markov Model. IFAC-PapersOnLine 2022, 55, 66–72. [Google Scholar] [CrossRef]
  10. Carvalho, L.K.; Wu, Y.C.; Kwong, R.; Lafortune, S. Detection and mitigation of classes of attacks in supervisory control systems. Automatica 2018, 97, 121–133. [Google Scholar] [CrossRef]
  11. Lin, L.; Su, R. Synthesis of covert actuator and sensor attackers. Automatica 2021, 130, 109714. [Google Scholar] [CrossRef]
  12. Su, R. Supervisor synthesis to thwart cyber attack with bounded sensor reading alterations. Automatica 2018, 94, 35–44. [Google Scholar] [CrossRef]
  13. Meira-Góes, R.; Kang, E.; Kwong, R.H.; Lafortune, S. Synthesis of sensor deception attacks at the supervisory layer of cyber–physical systems. Automatica 2020, 121, 109172. [Google Scholar] [CrossRef]
  14. Shi, D.; Elliott, R.J.; Chen, T. Event-based state estimation of discrete-state hidden Markov models. Automatica 2016, 65, 12–26. [Google Scholar] [CrossRef]
  15. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  16. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  17. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  18. Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
  19. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858. [Google Scholar] [CrossRef]
  20. Lavrova, D.; Zegzhda, D.; Yarmak, A. Using GRU neural network for cyber-attack detection in automated process control systems. In Proceedings of the 2019 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Sochi, Russia, 3–6 June 2019; pp. 1–3. [Google Scholar]
  21. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017, arXiv:1707.01926. [Google Scholar]
  22. Bai, J.; Zhu, J.; Song, Y.; Zhao, L.; Hou, Z.; Du, R.; Li, H. A3t-gcn: Attention temporal graph convolutional network for traffic forecasting. ISPRS Int. J. Geo-Inf. 2021, 10, 485. [Google Scholar] [CrossRef]
  23. Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar] [CrossRef]
  24. Fu, G.; Han, G.Q.; Lu, F.; Xu, Z.X. Short-term traffic flow forecasting model based on support vector machine regression. Hua Nan Li Gong Da Xue Xue Bao. Zi Ran Ke Xue Ban 2013, 41, 71–76. [Google Scholar]
  25. Wu, C.H.; Ho, J.M.; Lee, D.T. Travel-time prediction with support vector regression. IEEE Trans. Intell. Transp. Syst. 2004, 5, 276–281. [Google Scholar] [CrossRef]
  26. Yin, H.; Wong, S.; Xu, J.; Wong, C. Urban traffic flow prediction using a fuzzy-neural approach. Transp. Res. Part C Emerg. Technol. 2002, 10, 85–98. [Google Scholar] [CrossRef]
  27. Sun, S.; Zhang, C.; Yu, G. A Bayesian network approach to traffic flow forecasting. IEEE Trans. Intell. Transp. Syst. 2006, 7, 124–132. [Google Scholar] [CrossRef]
  28. Lawford, M.; Wonham, W.M. Supervisory control of probabilistic discrete event systems. In Proceedings of the 36th Midwest Symposium on Circuits and Systems, Detroit, MI, USA, 16–18 August 1993; pp. 327–331. [Google Scholar]
  29. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  30. Ding, S. Cyber Attack Detection, GitHub Repository. Available online: https://github.com/dingsichen/Cyber_Attack_Detection (accessed on 29 April 2024).
Figure 1. A DFPA G.
Figure 1. A DFPA G.
Mathematics 12 02635 g001
Figure 2. The general schematic of the cyber-attack detection model.
Figure 2. The general schematic of the cyber-attack detection model.
Figure 3. An event-graph G(s) derived from s = σ_1 σ_2 ⋯ σ_n.
Figure 4. A block consisting of I event-graphs ordered by the index of an event-graph sequence.
Figure 5. The feature model of a G(s_t).
Figure 6. A framework for the two-layer GCN model.
Figure 7. A GRU model.
Figure 8. A group of GRUs.
Figure 9. The GCN and GRUs processing event sequences.
Figure 10. An attention model framework.
Figure 11. A DFPA G, the source of the PDES dataset (upper part), and an attack DFPA G_a (lower part, in a red dashed box).
Figure 12. The relationship between the attack threshold P_a and the accuracy rate Acc in the PDES dataset.
Figure 13. The relationship between the attack threshold P_a and the accuracy rate Acc in the attacked dataset.
Figure 14. ROC curves of the cyber-attack detection model.
Figure 15. ROC curves for GRU and LSTM models.
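ROC curves such as those in Figures 14 and 15 are obtained by sweeping a decision threshold over per-sequence attack scores. A minimal sketch of that computation, assuming binary labels (1 = attacked, 0 = normal) and scores where higher means more suspicious; all names are illustrative, not the paper's code:

```python
def roc_curve(labels, scores):
    """Compute ROC points (FPR, TPR) by sweeping a threshold over scores.

    labels: 1 = attacked sequence, 0 = normal; scores: higher = more suspicious.
    For a sketch, ties in scores are not grouped.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Perfectly separated toy scores give AUC = 1.0.
points = roc_curve([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```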
Mathematics 12 02635 g015
Figure 16. The distribution of different s a t t ( i ) in the attacked dataset.
Figure 16. The distribution of different s a t t ( i ) in the attacked dataset.
Mathematics 12 02635 g016
Figure 17. Cyber-attack detection model’s prediction of probability distribution for each possible next event.
Figure 17. Cyber-attack detection model’s prediction of probability distribution for each possible next event.
Mathematics 12 02635 g017
Table 1. Comparison of the cyber-attack detection model, GRU, and LSTM models.

| Model                        | Average F1-Score | Weighted F1-Score |
| ---------------------------- | ---------------- | ----------------- |
| Cyber-Attack Detection Model | 0.7049           | 0.8126            |
| GRU                          | 0.6277           | 0.7061            |
| LSTM                         | 0.6313           | 0.6979            |
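The average (macro) and weighted F1-scores in Table 1 differ only in how per-class F1 values are combined: the macro score is an unweighted mean, while the weighted score weights each class by its support. A minimal sketch of both aggregations (function and variable names are illustrative):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1, plus macro (unweighted mean) and support-weighted averages."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class, macro, weighted = {}, 0.0, 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        per_class[c] = f1
        macro += f1 / len(classes)               # every class counts equally
        weighted += f1 * support[c] / len(y_true)  # classes weighted by support
    return per_class, macro, weighted
```

With imbalanced classes the two averages diverge, which is why Table 1 reports both.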
Table 2. The probability of the attacked event (shown in bold in the original) in the event sequences in S_attack.

| Sequence of events s_net in S_attack | a | b | c | d | e | g |
| --- | --- | --- | --- | --- | --- | --- |
| s_net = ε | 2.7246 × 10^−1 | 3.1141 × 10^−11 | 1.0720 × 10^−12 | 4.2394 × 10^−1 | 9.3816 × 10^−13 | 3.0360 × 10^−1 |
| s_net = g | 1.9997 × 10^−1 | 5.9786 × 10^−17 | 7.6584 × 10^−20 | 6.2751 × 10^−1 | 3.8545 × 10^−9 | 1.7252 × 10^−1 |
| s_net = g a | 3.9776 × 10^−10 | 7.4531 × 10^−1 | 2.5437 × 10^−1 | 1.2297 × 10^−14 | 6.8716 × 10^−9 | 3.1642 × 10^−4 |
| s_net = g d | 3.3856 × 10^−3 | 2.9502 × 10^−13 | 5.5116 × 10^−12 | 1.3036 × 10^−4 | 8.0290 × 10^−1 | 1.9358 × 10^−1 |
| s_net = g g | 1.2152 × 10^−1 | 5.3249 × 10^−27 | 3.4663 × 10^−30 | 7.8465 × 10^−1 | 3.9487 × 10^−4 | 9.3434 × 10^−2 |
| s_net = g a b | 8.0296 × 10^−28 | 5.0607 × 10^−1 | 4.9393 × 10^−1 | 1.5381 × 10^−33 | 7.0198 × 10^−37 | 2.8788 × 10^−26 |
| s_net = g a b b | 1.1120 × 10^−21 | 5.0890 × 10^−1 | 4.9110 × 10^−1 | 1.8804 × 10^−27 | 1.3971 × 10^−15 | 2.6832 × 10^−22 |
| s_net = g a b c | 7.7555 × 10^−2 | 6.3743 × 10^−6 | 4.0476 × 10^−5 | 8.6314 × 10^−1 | 1.0257 × 10^−4 | 5.9153 × 10^−2 |
| s_net = g a b b c | 1.3825 × 10^−1 | 1.3482 × 10^−14 | 4.3889 × 10^−13 | 7.0813 × 10^−1 | 2.0630 × 10^−25 | 1.5362 × 10^−1 |
| s_net = g a b c a | 3.1315 × 10^−41 | 5.1623 × 10^−1 | 4.8377 × 10^−1 | 0.0000 | 1.0692 × 10^−38 | 9.8392 × 10^−41 |
| s_net = g a b b c a | 3.9102 × 10^−3 | 2.4155 × 10^−1 | 7.4626 × 10^−1 | 5.5154 × 10^−3 | 6.7057 × 10^−10 | 2.7653 × 10^−3 |
| s_net = g a b c d g | 6.9098 × 10^−5 | 5.2647 × 10^−10 | 3.6647 × 10^−6 | 1.3030 × 10^−4 | 7.0593 × 10^−1 | 2.9387 × 10^−1 |
| s_net = g a b c d e d g | 3.5789 × 10^−4 | 1.7910 × 10^−13 | 1.8230 × 10^−9 | 3.1316 × 10^−4 | 9.2634 × 10^−1 | 7.2987 × 10^−2 |
| s_net = g a b b c d e d g | 3.7429 × 10^−4 | 1.5784 × 10^−13 | 1.6159 × 10^−9 | 3.2754 × 10^−4 | 9.2359 × 10^−1 | 7.5707 × 10^−2 |
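The adjustable attack threshold P_a described in the abstract can be applied directly to rows like those in Table 2: an observed next event whose predicted probability falls below P_a is flagged as attacked. A minimal sketch of this decision rule, using the Table 2 row for s_net = g a b (the exact rule in the paper may differ; names here are illustrative):

```python
# Events in the order used by the Table 2 columns.
EVENTS = ["a", "b", "c", "d", "e", "g"]

def detect_attack(predicted_probs, observed_event, p_a=0.01):
    """Flag the observed next event as attacked when its predicted
    probability is below the attack threshold p_a (1% by default)."""
    return predicted_probs[EVENTS.index(observed_event)] < p_a

# Predicted next-event probabilities after observing s_net = g a b (Table 2):
# only b and c receive non-negligible probability.
row_gab = [8.0296e-28, 5.0607e-1, 4.9393e-1, 1.5381e-33, 7.0198e-37, 2.8788e-26]

detect_attack(row_gab, "b")  # False: b is an expected next event
detect_attack(row_gab, "d")  # True: d is flagged as attacked
```

Raising p_a makes the detector more aggressive (more events flagged), while lowering it reduces false alarms, which is the accuracy-versus-threshold trade-off plotted in Figures 12 and 13.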
Share and Cite

MDPI and ACS Style

Ding, S.; Liu, G.; Yin, L.; Wang, J.; Li, Z. Detection of Cyber-Attacks in a Discrete Event System Based on Deep Learning. Mathematics 2024, 12, 2635. https://doi.org/10.3390/math12172635
