
STPA-RL: Integrating Reinforcement Learning into STPA for Loss Scenario Exploration

1 Department of SW Safety and Cyber Security, Kyonggi University, Suwon-si 154-42, Gyeonggi-do, Republic of Korea
2 Department of Computer Science, Kyonggi University, Suwon-si 154-42, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(7), 2916; https://doi.org/10.3390/app14072916
Submission received: 20 February 2024 / Revised: 21 March 2024 / Accepted: 26 March 2024 / Published: 29 March 2024

Abstract:
Experience-based methods like reinforcement learning (RL) are often deemed less suitable for the safety field due to concerns about potential safety issues. To bridge this gap, we introduce STPA-RL, a methodology that integrates RL with System-Theoretic Process Analysis (STPA). STPA is a safety analysis technique that identifies causative factors leading to unsafe control actions and system hazards through loss scenarios. In the context of STPA-RL, we formalize the Markov Decision Process based on STPA analysis results to incorporate control algorithms into the system environment. The agent learns safe actions through reward-based learning, tracking potential hazard paths to validate system safety. Specifically, by analyzing various loss scenarios related to the Platform Screen Door, we assess the applicability of the proposed approach by evaluating hazard trajectory graphs and hazard frequencies in the system. This paper streamlines the RL process for loss scenario identification through STPA, contributing to self-guided loss scenarios and diverse system modeling. Additionally, it offers effective simulations for proactive development to enhance system safety and provide practical assistance in the safety field.

1. Introduction

Modern systems and processes are becoming increasingly complex and have a growing impact on safety-related issues [1]. Safety analysis is an important step in ensuring the reliability of the system, but existing safety analysis methods have limitations in addressing the various hazardous problems that can arise in very complex systems [2]. Recently, reinforcement learning (RL) has emerged as one of the Machine Learning techniques that provide high levels of automation and decision-making power [3]. However, the application of trial-and-error-based techniques, such as RL, in the safety field is relatively uncommon [4]. Consequently, there is a growing need for research to effectively incorporate RL into safety analysis. This study endeavors to bridge this gap by exploring the integration of RL with System-Theoretic Process Analysis (STPA).
STPA is a safety analysis methodology that originates from the idea that risks in a system arise from control problems. It identifies unsafe control actions to derive appropriate safety constraints for system operation and focuses on analyzing the faulty interactions among components to prevent the occurrence of such actions [1,5]. Establishing similarities between STPA’s model and RL is quite intuitive [6]. In the flow of control problems, the controller’s Process Model can be mapped to the Markov Decision Process (MDP) model that the agent must learn, and the control algorithm can be mapped to a policy that the agent must devise [7]. RL rewards an agent for taking correct actions while resolving problems through experience-based learning [8]. RL can also be applied to control systems to train the control algorithm to prevent risks by implementing the actual system as an environment. In this way, the environment models how the system state changes in response to feedback from its components, as it would in the real world.
The key activity of STPA is identifying loss scenarios, which explain how unsafe control actions arise and the possible causes of risk [9]. STPA has the advantage of improving the overall safety and reliability of the system by proactively identifying system vulnerabilities and preventing safety accidents [10]. A human safety analyst’s manual, trial-and-error examination of combinations of risk variables becomes impractical, especially for highly complex systems [11]. In addressing this challenge, the utilization of RL agents in lieu of humans can efficiently contribute to the automated derivation of loss scenarios.
Furthermore, similar to potential bugs in software, systems involve faults and hazards. There are techniques that utilize RL to automate the detection of sequences leading to system crashes or halts, resembling bug scenarios, as reported by Durmaz et al. [12]. In this context, our objective was to leverage RL to derive loss scenarios related to potential hazards in systems analyzed through STPA. If the agent is trained to an acceptable level, it can learn to trace the preceding system states and the interactions that led to the situation in which risks occurred in the system. As loss scenarios represent possible events or sequences of events that may result in undesired outcomes or compromised safety in a system, analysts must assess the likelihood and consequences of each potential hazard. In this regard, RL provides a more advantageous approach by enabling the evaluation of all these factors.
In this paper, we present an approach named STPA-RL. By deriving loss scenarios and tracking changes in the system state, we identify the flow in which control commands are performed incorrectly and help inform design decisions and manage risk to ensure the safety of complex systems. By implementing an RL environment with MDP and rewarding safe actions, our approach allows for the identification of self-guided loss scenarios that lead to hazardous actions, enabling the discovery of potential loss scenarios at the system level. It provides the ability to simulate system behavior under different environmental conditions using RL before applying it to the actual system.
In summary, our contribution is threefold: First, we present STPA-RL, which combines the safety analysis technique STPA with reinforcement learning. Second, we implement the RL environment using MDPs based on the system analysis results obtained by STPA. Third, we derive loss scenarios for the Platform Screen Door system from the state hazard trajectories and compare the probability of hazard occurrence.
The structure of this paper is as follows: Section 2 provides background information. In Section 3, we detail the methodology of STPA-RL. Section 4 and Section 5 present a case study involving the Platform Screen Door system and showcase experimental results by applying our proposed methodology. The subsequent Section 6 and Section 7 delve into the discussion of our findings, leading to the paper’s conclusion in Section 8.

2. Background

2.1. System-Theoretic Process Analysis

STPA is a systems-thinking-based hazard analysis and risk assessment method that can be used to proactively identify control and feedback failures within safety-critical systems [13]. It aims to identify unsafe controls that may pose risks in control relationships by considering interactions rather than individual components of the system. Therefore, in STPA, the system is considered a collection of interacting control loops [5].
STPA is composed of four stages, as shown in Figure 1. First, we define the accidents and hazards to determine the scope of the system that is controllable. A hazard is a system state or condition that may cause accidents. System-level safety constraints refer to the states or behaviors of the system that prevent accidents or losses, corresponding to the previously defined hazards. Second, the subjects and objects of the system’s control relationships are configured, along with the controls and responses, using the schematic of the control structure. A control algorithm and Process Model are defined for the controller to control the controlled process. The Process Model contains information that the controller uses to command control actions, e.g., the state of the controlled process, the state of the environment, and the state of other system components [14]. Each control action is assessed for its potential contribution to the hazard. Third, Unsafe Control Actions (UCAs), unsafe forms of control commands that can cause system risks, are identified. STPA views the hazard state as a result of UCAs. These UCAs may or may not exist in the actual system; this is a hypothesis that should be validated or rejected based on an investigation of actions when the system is designed and constructed. Therefore, to ensure a comprehensive assessment, each control action must be examined in turn [15]. Fourth, loss scenarios are derived to trace the causal factors of the UCAs performed, creating scenarios that caused the loss based on the factors leading to improper performance or failure of the provided control command [16]. To identify loss scenarios in STPA, analysts have generally used context tables [1], which offer a tabular framework for organizing system states, control actions, controlled variables, and hazardous states. Each row of the context table corresponds to a specific scenario, describing potential sequences of events or conditions leading to hazardous situations [17]. By populating the table with relevant information about the system’s behavior, the analysis identifies causal factors and their combinations contributing to hazardous states.
In this paper, we focus on the fourth stage of STPA and utilize reinforcement learning as a distinct approach to trace and derive loss scenarios by identifying causes from state transitions, rather than relying on the analyst’s manual suggestions.

2.2. Reinforcement Learning

RL is a sub-field of Machine Learning that involves an agent learning to make decisions by interacting with an environment [18]. RL is based on the concept of trial and error, where the agent receives feedback in the form of rewards or penalties based on its actions.
An MDP is used to mathematically define and model RL problems. It is defined as a tuple consisting of a set of states $S$, a set of actions $A$, a reward function $R(s, a, s')$, and a transition probability function $P(s' \mid s, a)$ [19]. The MDP supports decision-making by modeling the probabilistic relationship between the state and behavior of the system, while STPA explores the causal relationship between cause and effect to analyze system safety [7].
We conduct a safety analysis of the system by considering it as an MDP problem through the information obtained with STPA. The events and conditions analyzed in STPA, including system states, are mapped to MDP states representing specific situations. The action that an agent can take in the MDP is mapped to a control action or UCA in STPA. The reward in the MDP relates to the agent’s performance in each state and maps to the effectiveness and risk reduction of safety measures in STPA. The transition probability represents the probability of moving to the next state after performing a particular action in a specific state; in STPA, the system state change, considering component interactions, defines this transition probability.
The goal of RL is to maximize the expected cumulative reward, where $R$ is the cumulative reward, $T$ is the time horizon (the number of steps in the episode), $r_t$ is the reward obtained at time $t$, and $\gamma$ is a discount factor that determines the importance of future rewards relative to immediate rewards [20]:

$R = \sum_{t=0}^{T} \gamma^{t} r_{t}$
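For illustration, a minimal sketch of this computation in Python, using arbitrary example rewards and $\gamma = 0.99$ (the values are not taken from the paper):

```python
# Discounted return R = sum over t of gamma^t * r_t for one finite episode.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: an episode with rewards 1, 0, 1.
print(discounted_return([1, 0, 1]))  # 1 + 0.99 * 0 + 0.99**2 * 1 = 1.9801
```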
RL agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments [21]. The agent operates in a dynamic environment, where the state of the environment can change as a result of the agent’s actions. This requires the agent to continuously adapt its behavior in response to new information.
The Advantage Actor–Critic (A2C) algorithm in Figure 2 is an RL technique that combines the benefits of both policy-based and value-based methods [22]. In the A2C algorithm, the MLPPolicy refers to the policy based on a Multi-Layer Perceptron (MLP). This policy function outputs the probabilities of all possible actions given a state, providing a stochastic way of selecting actions based on the given state. The MLPPolicy is part of the Actor (Policy Network), which represents the policy [23].
The actor is responsible for choosing actions and provides the advantage of computing continuous actions without the need for optimization procedures on a value function [24]. The critic (value network) estimates the expected return and provides low-variance knowledge of the performance [25]. The actor is updated using the policy gradient method: it collects episodes using the current policy and then computes gradients of the returns and the probabilities of the actions selected at each state. Here, $\theta_{actor}$ and $\theta_{critic}$ are the parameters of each network, $\alpha$ is the learning rate, and $J(\theta_{actor})$ is the objective function for the average reward of the policy:

$\theta_{actor} \leftarrow \theta_{actor} + \alpha_{actor} \nabla_{\theta_{actor}} J(\theta_{actor})$

The critic is updated using temporal difference (TD) learning, which updates the value of the current state using the value of the next state. The critic is trained to predict the state–value function, and its parameters are updated to minimize the TD error [26]. $L(\theta_{critic})$ is the objective function for the TD error:

$\theta_{critic} \leftarrow \theta_{critic} + \alpha_{critic} \nabla_{\theta_{critic}} L(\theta_{critic})$
A2C is particularly useful for solving problems with more complex and large state spaces, as it can learn a good policy without exhaustively searching the state space [27]. Moreover, it can be extended to handle continuous action spaces, making it suitable for many real-world applications. Thus, applying A2C to identify loss scenarios enables the learning model to determine the cause-and-effect relationships in the environment. Furthermore, since it can handle partial observability, it can evaluate the effectiveness of existing controls in preventing or mitigating losses.

3. Proposed Approach: STPA-RL

We leveraged RL to derive loss scenarios for STPA, taking advantage of its usability and benefits. The initial three stages of STPA involve human analysts defining the system’s control structure and safety constraints based on the data, while the fourth stage entails deriving loss scenarios so that countermeasures against potential system risks can be prepared. RL is used to examine, through learning and testing, the potential risks that may occur in the actual system environment and the changes in the state of the hazards. The control commands of the system are designated as actions, and an RL environment is implemented. Then, we select an RL algorithm suitable for the system environment; in this paper, the A2C model is used for training. Finally, loss scenarios are derived by storing paths that reach hazards during testing. The procedure for applying RL to STPA in the process of deriving loss scenarios is outlined below. We call this approach “STPA-RL”, and its overall flow is shown in Figure 3.

3.1. System Environment Modeling

3.1.1. Definition of Process Model for State Representation

In STPA, the Process Model represents the system state information necessary for the controller to provide control actions. It defines the environment’s states based on the available information. The Process Model specifies the system’s operational characteristics, input and output variables, operational constraints, and other relevant factors. If the Process Model for constructing the system’s learning environment is not fully defined, additional Process Model definitions can be derived from specific situations or contexts obtained during the derivation of UCAs to complement the existing STPA analysis results.
We model our problem as an MDP. Following the MDP formulation, the state space is denoted by $S$. The Process Model defines $S$, which represents the system’s different states. It can be denoted as $S = \{s_1, \ldots, s_m\}$, where each $s$ represents an individual state of the system.

3.1.2. Definition of Control Actions for Action Representation

Control Actions play a crucial role in adjusting or modifying the system’s behavior to maintain a safe state. Once the Control Actions are determined, they become the set of actions within the RL environment. Actions can be defined as system control variables or operational commands, and they will be learned by the RL algorithm to achieve optimal performance.
The action space is denoted by $A$. Control Actions are the actions that the agent can take to interact with the system. It can be denoted as $A = \{a_1, \ldots, a_n\}$, where each $a$ represents an available control action in the environment.

3.1.3. Construction and Implementation of the Control Algorithm

The Control Algorithm involves constructing and implementing rules or policies that determine the Control Actions. It focuses on generating an environment that can derive loss scenarios by addressing the question of “Why does UCA occur?”. Additionally, the Control Algorithm sets up controlled states and conditions for each action to ensure precise operation.
The control algorithm can be represented as a policy function denoted by $\pi$, which maps states to actions. In the mathematical notation $\pi(s) \rightarrow a$, $\pi(s)$ represents the action chosen by the policy function $\pi$ in state $s$.

3.1.4. Hazard Reclassification and Prioritization

Hazards are defined as potential threats within the system and must undergo a process of reclassification and prioritization. Hazards identified by STPA are reclassified using insights from RL simulations, potentially revealing new unsafe conditions. UCAs are used as conditions for reaching hazardous states. Based on the state conditions, the severity of the hazards is assessed, and a priority is assigned. Hazards are prioritized based on their frequency of occurrence. The hazards are further detailed and applied to the control algorithm. The hazards can be represented as states that lead to undesirable outcomes.
Let $H$ be the set of hazardous states, $H = \{h_1, \ldots, h_k\}$, where each $h$ represents a state in which a hazard is reached.

3.1.5. Setting the Environment State Transition

Various scenarios that can occur in the actual system are considered to set the environment’s state transition. This includes both typical and exceptional behaviors, representing the system transitions comprehensively. The state transition process enables the RL model to simulate the system’s behavior according to real-world situations.
The state transition function is denoted by $P$, which represents the probability of reaching the next state $s'$ given the current state $s$ and action $a$. In mathematical notation, $P(s, a, s') = \Pr[S(t+1) = s' \mid S(t) = s, A(t) = a]$, where $S(t)$ is the state at time $t$, $A(t)$ is the action at time $t$, and $\Pr$ denotes probability. It works as shown in Figure 4.
We can represent the system environment modeling phase in a more formal and structured way, considering the state space, action space, control policy, hazardous states, and state transition probabilities.
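To make the modeling phase concrete, the following is a minimal sketch of such an environment in Python, assuming a Gymnasium-style interface; the state variables, hazard checks, and transition logic are illustrative placeholders rather than the authors’ implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ControlLoopEnv(gym.Env):
    """Skeleton RL environment assembled from STPA results: S, A, H, and P."""

    def __init__(self):
        # S: Process Model variables define the state space (sizes are placeholders).
        self.observation_space = spaces.MultiDiscrete([4, 3, 2])
        # A: control actions the controller can issue.
        self.action_space = spaces.Discrete(2)
        self.state = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.array([0, 0, 0])      # initial Process Model values
        return self.state, {}

    def _is_hazard(self, state, action):
        # H: UCA conditions from the STPA analysis mark hazardous states (placeholder).
        return False

    def _transition(self, state, action):
        # P: transition logic derived from the control algorithm and feedback (placeholder).
        return state

    def step(self, action):
        hazard = self._is_hazard(self.state, action)
        reward = -1.0 if hazard else 1.0      # reward safe actions, penalize UCAs
        self.state = self._transition(self.state, action)
        terminated = hazard                   # end the episode when a hazard is reached
        return self.state, reward, terminated, False, {}
```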

3.2. Training with Reinforcement Learning

3.2.1. Selection of Algorithm and Implementation for RL

An appropriate RL algorithm is selected for the implemented environment. The chosen learning algorithm is used to perform RL on the system. The algorithm observes the state of the RL environment and selects control actions to guide the system towards safe operations. Learning is carried out through a series of episodes, and iterative learning is performed to improve performance. The goal is to generate a trained model with state transition data flow of an appropriate length, which facilitates analysis, and to verify performance based on the average reward value.
In the A2C algorithm, we use a neural network to represent both the actor and critic networks [26]. The actor network, parameterized by $\theta_{actor}$, is responsible for selecting actions based on the state, and the critic network, parameterized by $\theta_{critic}$, estimates the state-value function. The state $s$ is represented as the input to the neural network.
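As a concrete illustration, a minimal PyTorch sketch of the two networks for a discrete action space is shown below; the layer sizes are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """MLP policy network: state -> probability distribution pi(a | s, theta_actor)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    """MLP value network: state -> estimated state value V(s; theta_critic)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)
```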

3.2.2. Rewarding Safe Actions Based on State Conditions

Based on the context and state conditions, rewards are assigned when safe actions are performed. This encourages RL to favor safe actions and maximize rewards. Given the current state $s$, the actor network outputs a probability distribution over actions $a$, denoted as $\pi(a \mid s, \theta_{actor})$. When a safe action $a_{safe}$ is selected from the distribution, a reward $R_{safe}$ is provided to encourage the learning of safe actions, and it is expressed as

$R_{safe}(s) = \pi(a_{safe} \mid s, \theta_{actor})$
The reward-based learning mechanism of STPA-RL plays a crucial role in teaching agents about safe behavior within the system. By balancing rewards that reflect the system’s safety priorities, it motivates agents to explore a range of safe behaviors without leaning towards conservatism or risky actions.

3.2.3. Negative Rewarding UCA Based on State Conditions

Considering the context and state conditions, UCAs leading to hazardous situations are identified, influencing episode termination. This assists RL in avoiding unsafe actions and reinforcing safe actions. Similarly, when an unsafe action $a_{unsafe}$ is chosen, we denote the probability of selecting this action as $\pi(a_{unsafe} \mid s, \theta_{actor})$. In A2C, we assign a negative reward $R_{unsafe}$ to discourage the selection of unsafe actions:

$R_{unsafe}(s) = -\pi(a_{unsafe} \mid s, \theta_{actor})$
The A2C algorithm aims to maximize the total expected reward, a combination of rewards for safe actions and penalties for unsafe actions, enhancing the policy.

3.3. Model Simulation and Analysis

3.3.1. Recording Environmental State Trajectories

To trace the environmental state transitions during the trained model simulation process, lists of trace paths are generated and stored. When a safe action is performed from the current state, the corresponding state and action are linked and stored in the trace path. When an unsafe action ($a_{unsafe}$) is performed, it is linked to the hazard denoted as H{num} in the trace path, where {num} represents the hazard number. Additionally, the length of the path up to the current episode termination is also stored.
During the evaluation of the A2C model, we trace the state–action transitions in the environment. Starting from the initial state $s_0$, the actor network outputs a probability distribution over actions $\pi(a \mid s, \theta_{actor})$. Based on this distribution, we select an action $a_t$ and observe the next state $s_{t+1}$ and the corresponding reward $r_{t+1}$. The trajectory of state transitions, actions, and rewards is represented as $\sigma$:

$\sigma = (s_0, a_0, r_1, s_1), (s_1, a_1, r_2, s_2), \ldots, (s_t, a_t, r_{t+1}, s_{t+1})$

However, to be identified as a loss scenario, the transition trace path must end with an unsafe action that reaches a hazardous state. The hazard trajectory of a loss scenario is represented as $\sigma_H$:

$\sigma_H = (s_0, a_{0,safe}, r_1, s_1), (s_1, a_{1,safe}, r_2, s_2), \ldots, (s_t, a_{t,unsafe}, r_{t+1,unsafe}, s_{t+1,hazard})$
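A sketch of how such trace paths could be recorded during evaluation is shown below, assuming the Gymnasium-style environment above and a trained model exposing a `predict` method (as in stable-baselines3); the bookkeeping details are illustrative.

```python
def record_trajectories(model, env, n_episodes=1000):
    """Roll out the trained policy and store each episode's (s, a, r, s') trace path."""
    trajectories, hazard_trajectories = [], []
    for _ in range(n_episodes):
        state, _ = env.reset()
        path = []
        terminated = truncated = False
        while not (terminated or truncated):
            action, _ = model.predict(state, deterministic=False)
            next_state, reward, terminated, truncated, _ = env.step(action)
            path.append((tuple(state), int(action), float(reward), tuple(next_state)))
            state = next_state
        trajectories.append(path)                 # sigma: full state-action-reward trajectory
        if terminated and path and path[-1][2] < 0:
            hazard_trajectories.append(path)      # sigma_H: ends with an unsafe action and a hazard
    return trajectories, hazard_trajectories
```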

3.3.2. Deriving Loss Scenarios and System Hazard Analysis

Based on the stored state–action transition trace path list, loss scenarios are derived to identify potential hazards and analyze system risks. The loss scenario is represented as $\sigma_{LS}$:

$\sigma_{LS} = (s_{t-1}, a_{t-1,safe}, r_{t,safe}, s_t), (s_t, a_{t,unsafe}, r_{t+1,unsafe}, s_{t+1,hazard})$
Then, we analyze the system’s behavior, identify potential loss scenarios, and assess hazards associated with unsafe actions by using the recorded trajectories. This process is crucial for understanding the effectiveness of the A2C algorithm in handling safe and hazardous situations within the given environment.

4. Case Study: Platform Screen Door

4.1. Definition of Accident and Hazard

The Platform Screen Door (PSD) system is a critical safety feature implemented in modern railway transportation systems. Its primary function is to separate the area where trains pass from the interior of the station [28]. When the train stops, the PSD opens its doors and closes them again after passengers have boarded or alighted. The PSD provides benefits for passenger safety and efficient train operation. However, if errors occur in the PSD system, it can disrupt other train operations and lead to passenger safety incidents and boarding/alighting failures. Potential losses in the system are defined as human and economic losses [29]. To prevent safety-related accidents, STPA is utilized to conduct a risk analysis of the PSD system.
We define the following accidents. A01: Death or injury due to exposure to a train in operation because the door is not closed. A02: A person or object is caught in the door when it is closed while a person or object is present. A03: Passengers fail to get on and off because the door does not open when the train stops. Next, based on these accidents, we define the hazards.
  • H01: The door opens while the train is in operation (A01).
  • H02: The door closes when a person or object is present (A02).
  • H03: The door does not open after the train stops (A03).
  • H04: The train does not close the door as it prepares to depart (A01, A03).
Then, we derive safety constraints. Safety constraints are specific limitations placed on a system to ensure that it operates within acceptable safety limits [5]. SC01: The door must remain closed while the train is moving (H01). SC02: The door must not close if an obstacle is detected (H02). SC03: The door must open after the train stops (H03). SC04: The door must be closed when the train is ready to depart (H04).

4.2. Schematizing Control Structure

The PSD system is composed of several components and interactions, including an input value called ‘train motion’ that provides information about the train operation, an ‘obstacle sensor’ that detects people and objects passing through doors, a ‘door position sensor’ that provides feedback on the current state of the door, a ‘door controller’ that determines the control command to open and close the doors, and a ‘door actuator’ that moves the actual physical door with mechanical force [30].
To identify the Process Model of the PSD system, the command from the ‘door controller’ to open and close the door is determined by the values of the ‘door position sensor’, the ‘train motion’ state, and the ‘obstacle sensor’. The system also utilizes three types of feedback. First, the ‘train motion’ state is identified by inputting the state of stopped, ready, and moving. Second, the ‘door position sensor’ recognizes whether the door is opened, opening, closed, or closing and provides feedback accordingly. Third, the ‘obstacle sensor’ detects the presence or absence of an object and feeds it back to the system. Figure 5 displays the control structure of the PSD system.

4.3. Identification of Unsafe Control Actions

Identifying UCAs involves recognizing control actions that could lead to hazards and potential accidents in the PSD system. A UCA is an unsafe action, so identifying UCAs is critical to ensuring the safe and reliable operation of complex systems [10]. There are four types of control actions that can be unsafe. Type A: not providing the control action causes a hazard; Type B: providing the control action causes a hazard; Type C: providing a potentially safe control action but too late, too soon, or in the wrong order; and Type D: the control action is stopped too soon or applied too long (for continuous control actions) [1]. However, in the context of the PSD system, Type D UCAs were not identified, while the other types of UCAs for the PSD system are detailed in Table 1 [16]. Through STPA-RL, these UCAs are systematically analyzed, with RL models simulating various scenarios to evaluate the impact and likelihood of these unsafe actions, thus aiding in their prioritization and mitigation.

4.4. Applying STPA-RL

To derive a loss scenario, STPA-RL is utilized following the steps below as in Figure 3.
1. Implement the PSD environment with the results of STPA stages 1–3. Define the states, actions, hazards, control algorithm, and reward structure of the environment.
2. Train the model by using the A2C algorithm and rewarding safe actions. The algorithm can be chosen differently depending on the characteristics of the system.
3. Derive loss scenarios from the state trajectories that reach a hazard state through an unsafe action.
Initially, state variables and inputs are defined to implement the PSD environment with the system’s control algorithm and feedback. These variables are structured in a 4-tuple format called Step, comprising the three state variables of the Process Model and the agent action. Formally, we define a Step as follows:
  • Step = S × A
  • S = door_position × train_motion × obstacle_state
    door_position = {fully_closed, opening, fully_opened, closing}
    train_motion = {stop, ready, moving}
    obstacle_state = {not_exist, exists}
  • A = {close, open}
Informally, a Step indicates which action the agent selected given the Process Model values at a specific time. The values of a Step are all non-negative integers, listed starting from zero and increasing sequentially. In the system environment, the agent can perform two actions: close or open. Performing these actions causes a change in the door_position state. It changes from fully_closed to opening when the action open is taken, and from opening to fully_opened when open is taken again. When it is fully_opened and the action close is taken, it changes to closing, and when close is taken again, it becomes fully_closed. train_motion varies depending on the current door_position state and the state of the train. obstacle_state allows objects to exist or not exist at random rates. In particular, the environment is implemented following the hazards and safety constraints defined when identifying risks. Therefore, if the agent performs the action close, H02 is the priority hazard, followed by H03. If the action open is performed, H04 is the priority, followed by H01. The agent receives a reward when the action open is performed while obstacle_state is 1 or train_motion is 0. In addition, when obstacle_state is 0 and train_motion is 1 or 2, the action close is rewarded.
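The reward and hazard-priority logic described above could be encoded roughly as follows; this is an illustrative sketch derived from the description in this section rather than the authors’ exact implementation (encodings follow the text: train_motion 0 = stop, 1 = ready, 2 = moving; obstacle_state 1 = exists; action 0 = close, 1 = open).

```python
CLOSE, OPEN = 0, 1

def reward_and_hazard(train_motion, obstacle_state, action):
    """Return (reward, hazard_label) for one step of the PSD environment sketch."""
    # Safe-action rewards as stated in the text.
    if action == OPEN and (obstacle_state == 1 or train_motion == 0):
        return 1.0, None
    if action == CLOSE and obstacle_state == 0 and train_motion in (1, 2):
        return 1.0, None

    # Otherwise, check hazards in the stated priority order.
    if action == CLOSE:
        if obstacle_state == 1:   # H02: door closes while a person or object is present
            return -1.0, "H02"
        if train_motion == 0:     # H03: door does not open after the train stops
            return -1.0, "H03"
    else:                         # action == OPEN
        if train_motion == 1:     # H04: door not closed while the train prepares to depart
            return -1.0, "H04"
        if train_motion == 2:     # H01: door opens while the train is moving
            return -1.0, "H01"
    return 0.0, None
```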
This case study used the A2C algorithm to train the agent in the PSD environment. The learning model is trained with the MLP (Multi-Layer Perceptron) policy [23]. The training process is conducted for a total of 1000 episodes, each with 1000 timesteps. The learning rate, a hyperparameter that controls the magnitude of weight updates during the training phase, is set to 0.0007. The discount factor ($\gamma$) is set to 0.99. The factor for the bias-vs.-variance trade-off of the Generalized Advantage Estimator is set to 1. The value function coefficient for the loss calculation is set to 0.5, and the maximum value for gradient clipping is set to 0.5. The agent utilizes a policy gradient method to learn from experience and improve its policy over time [26]. The agent learns not to fall into the hazards defined in the system’s environment and is rewarded if it performs safe control actions. Training data are collected per episode by running the agent in the environment. The A2C algorithm allows the PSD system to continuously adapt and improve its decision-making processes based on experience gained through interactions with the environment.
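The paper does not name the A2C implementation it uses; assuming a standard library such as stable-baselines3, the hyperparameters listed above would map onto a configuration roughly like the following:

```python
from stable_baselines3 import A2C

# Hyperparameters from the case study: 1000 episodes of 1000 timesteps,
# learning rate 0.0007, discount factor 0.99, GAE lambda 1,
# value-function coefficient 0.5, gradient clipping 0.5.
model = A2C(
    "MlpPolicy",
    env,                     # the PSD environment (e.g., the sketch above)
    learning_rate=0.0007,
    gamma=0.99,
    gae_lambda=1.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
)
model.learn(total_timesteps=1000 * 1000)
model.save("psd_a2c")
```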

5. Results and Analysis

Model simulations were conducted over 1000 episodes. The model can now identify loss scenarios that lead to the hazard state, tracking previous paths—the trajectories of states and actions until UCA. Initially, we analyze the testing results to derive loss scenarios for the PSD system, focusing on hazard frequency. When there is a lack of statistically accumulated data for the system, engineers sometimes evaluate the system based on intuition or experience. To address this, our goal is to provide quantitative metrics for safety design decisions by deriving the frequency of loss scenarios through simulations. Hazard frequency refers to the occurrence of a hazard corresponding to the door position, train motion, obstacle state, and action for hazards in H01 to H04. In Figure 6a, all possible states of the PSD system are depicted, half of which are safety states, and the other half are the hazard states. Among the hazard states, H02 comprises the majority. However, in Figure 6b, during the system testing, it was discovered that H04 hazards occurred most frequently, accounting for 38 % of the total hazard frequency.
The hazard frequency analysis method identifies critical PSD system states by measuring each state’s occurrence frequency and corresponding hazards in the RL model simulation, estimating hazard probabilities, prioritizing them, and repeatedly running the model to ensure result consistency. Eight examples of hazard trajectories ($\sigma_H$) and their corresponding loss scenarios ($\sigma_{LS}$) are presented in Table 2 and Table 3. As shown in the last row of Table 2, when the door position is fully opened, the train motion is ready, and no obstacle is detected, the action open results in a hazard with a frequency of 52%, making up 59% of all H01 hazards. The third row indicates that when learning is highly effective, the safe action is selected with a frequency of 83%. Identifying critical system states with a high risk of hazards is possible through this analysis. Table 3 presents a list of loss scenario paths for the PSD system, which show the sequence of normal actions and states that lead to the hazard state. These paths help identify the specific state changes that caused the hazard to occur. The final step in each path corresponds to the hazard state reached as a result of the agent’s action. To determine the cause of the hazard, the preceding steps in the path are analyzed. From these paths, we can find design mistakes in the control algorithm or the origin of the loss affecting the safety of the system. The last row shows an example in which the loss scenario (1,0,1,1), accounting for 11.31% of the total, resulted in the H04 hazard state (2,1,0,1).
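A hazard-frequency table of this kind can be obtained from the recorded hazard trajectories with a simple count; the sketch below uses the trajectory format recorded earlier, and `hazard_of` is a hypothetical helper that labels the terminal state–action pair with its hazard (H01–H04).

```python
from collections import Counter

def hazard_frequencies(hazard_trajectories, hazard_of):
    """Relative frequency of each (state, action, hazard) combination that ends an episode."""
    counts = Counter()
    for path in hazard_trajectories:
        state, action, _, _ = path[-1]                    # final step reached the hazard
        counts[(state, action, hazard_of(state, action))] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}
```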

5.1. Loss Scenarios

The last safe step before reaching the hazard step was categorized as a loss scenario ($\sigma_{LS}$) and labeled ‘LS’. A step is a state and action pair. Table 4 summarizes the loss scenarios that occur with high probability among LS01 to LS38; the values are rounded to two decimal places. First, the loss scenarios for reaching the H01 hazard are as follows:
LS01. The train is in operation (moving), the door is fully closed, and no obstacle is detected; the door has been kept closed, but it is then opened in the same state, resulting in a loss (UCA02, UCA09).
LS02. The train is ready for departure and no obstacle is detected while the door is closing, so the door continues to close; after the door is fully closed, the train changes to moving, and the door is opened, resulting in a loss (UCA02, UCA09).
LS04. An obstacle is detected while the train is moving and the door is fully closed; the door is opened and changes to the opening state, and then the door is opened again even though the obstacle is no longer detected and the train is moving, resulting in a loss (UCA02, UCA09).
LS05. An obstacle is detected while the train is moving and the door is opening; the door changes to fully opened by opening, and then, with no obstacle detected and the train still moving, the door is opened again, causing a loss (UCA02, UCA09).
Second, the loss scenarios for reaching the H02 hazard are as follows:
LS10. A loss occurs when the train is ready to depart and no object is detected while the door is closing; the door continues to close, the train remains ready after the door is fully closed, and the door is kept closed even when an obstacle is detected (UCA03).
LS13. The train is stopped and the door continues to open while in the opening state; after the door changes to fully opened, the train becomes ready to depart, but the door is closed when an obstacle is detected and the door needs to be opened (UCA03).
LS17. The train is ready for departure, the door is fully opened, and no obstacle is detected, so the door changes to the closing state by closing; however, when an obstacle is then detected and the door needs to be opened, the door is closed instead, resulting in a loss (UCA03, UCA05, UCA07).
Third, the loss scenarios for reaching the H03 hazard are as follows:
LS20. The train is moving, the door is fully closed, and no obstacle is detected; the door is kept closed, and when the train motion changes to stopped, the door is not opened, causing a loss (UCA04).
LS24. The train is stopped, the door is fully closed, and no obstacle is detected; the door is opened and changes to the opening state, but while the train is stopped, the door is closed, resulting in a loss (UCA04).
Last, the loss scenarios for reaching the H04 hazard are as follows:
LS30. The train is stopped, the door is opening, and no obstacle is detected; the door continues to open and changes to fully opened, and then, when the train is ready to depart, the door is opened without being closed, causing a loss (UCA01, UCA09, UCA11).
LS33. The train is ready for departure, the door is fully opened, an obstacle is detected, and the door opens; then, when the obstacle is no longer detected and the door must be closed, the door is opened instead, resulting in a loss (UCA01, UCA09, UCA11).
LS38. The train is ready for departure, the door is fully opened, and no obstacle is detected, so the door is closed and changes to closing; however, although the door must be closed for departure, the door is opened, causing a loss (UCA01, UCA09, UCA11).
In Figure 6, H04 is identified as the most frequent hazard, and it is crucial to examine the previous path that led to the hazard. Therefore, for LS30 to LS37, particularly LS30, where the train motion changes from stopped to ready for departure, the PSD system must close the door to reduce the hazard. It is pivotal to emphasize that these loss scenarios stem from specific system behaviors, including obstacle detection and state transitions between opening and closing. Through a comprehensive analysis of these loss scenarios, we can pinpoint critical areas within the system that require enhancements to ensure safety and mitigate potential hazards during train operations. This underscores how RL can simulate the hazards and scenarios that may unfold in the system’s environment.
When a state transition occurs due to performing a safe action, the agent receives a reward of 1. Consequently, our model achieved an average reward of approximately 200 after 1000 iterations. The observed average reward can be attributed to the model’s intentional stochastic selection of hazard states, even if it has been effectively trained to prioritize safe actions. Analyzing the trajectory of different paths offers insights into control algorithms that require enhancements within the system.

5.2. Hazard Trajectories

Loss scenarios LS01, LS17, LS25, and LS34, corresponding to H01 to H04, are graphically represented as $\sigma_H$ in Figure 7. This approach is akin to simulations that visualize the patterns of transitions from the initial state of the system. To understand state transition patterns and improve faulty designs, state–action pairs are represented as nodes and transitions as edges in a graph. The initial state–action pair is the green-colored node, and the red-colored node is the hazard state reached by an unsafe action.
In our PSD system, there are 48 possible state–action pairs. Therefore, if a node number is the same across different hazard trajectories, it indicates an identical state–action situation. Each LS path depicts the flow from the initial state to reaching a hazard through unsafe actions via a diagram, aiming to capture various characteristics of system behavior. Upon examining the transition flow in the graph, nodes that transition recursively are observed; this occurs when performing an action in a particular state does not alter the state, prompting a repeated selection of the optimal action. By observing that the sequence 13→12→23→24→1→17→18→23 is consistent across all examples, it becomes evident that this represents the most fundamental system flow. Specifically, for LS25, H03 is denoted as (1,0,0,0), indicating that node 7 is the immediately preceding step in this scenario. Using the discovered relationship information, a detailed analysis of whether issues exist in the current control relationships of the system is conducted.
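Such a hazard trajectory graph can be assembled from a recorded path with a directed-graph library; the sketch below uses networkx, where `node_id` is a hypothetical helper mapping a state–action pair to its index among the 48 possible pairs.

```python
import networkx as nx

def build_trajectory_graph(path, node_id):
    """Build a directed graph of state-action nodes and transition edges from one hazard trajectory."""
    G = nx.DiGraph()
    nodes = [node_id(state, action) for state, action, _, _ in path]
    G.add_nodes_from(nodes)
    for src, dst in zip(nodes, nodes[1:]):
        G.add_edge(src, dst)                # transition between consecutive state-action pairs
    G.nodes[nodes[0]]["role"] = "initial"   # green node in Figure 7
    G.nodes[nodes[-1]]["role"] = "hazard"   # red node reached by the unsafe action
    return G
```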

6. Threats to Validity

Threats to the internal validity of the STPA-RL approach, such as biases in system modeling, are carefully addressed by ensuring a thorough analysis of the system’s control structure and validating the RL models against real-world scenarios and operational data. During the initial implementation of the RL environment, we found that it is influenced by the contents analyzed by STPA and the control algorithms constructed, which can lead to an inaccurate environment. However, performing STPA analysis during the conceptual design phase enables the discovery of potential vulnerabilities through RL, facilitating the proactive identification of such issues and enhancing the generalizability of results. This process involves iterative testing and refinement to closely align the model with actual system behavior.
Threats to external validity, including the generalization problem due to the empirical study’s size, are acknowledged. Although STPA-RL was applied on the small scale of the PSD system, it provided sufficient research findings. Yet it is crucial to consider the target system’s characteristics and environment. Conducting further validation and experiments on larger-scale systems is indispensable for ensuring external validity in practical applications. This method effectively reduces biases in system modeling and strengthens the validity of the findings through iterative testing and refinement of the model to closely mirror actual system behavior.

7. Discussion

7.1. Safety Analysis with Reinforcement Learning

STPA has been recommended as an international standard for evaluations of automotive-related safety-critical systems, as indicated by SAE J3187 (SAE, Sydney, Australia) [31]. Therefore, there is a need for various studies exploring the diverse applications and utilization of STPA in the context of safety assessments. STPA, as a structured and systematic safety analysis method, can serve as a solid foundation for guiding the integration of Machine Learning (ML). Its emphasis on identifying unsafe control actions and analyzing causal factors aligns well with the requirements of ML algorithms in understanding system behaviors [32].
The system modeling method using MDPs for RL is not subject to major formal restrictions and helps improve the implementation of control algorithms in current systems. Zacharaki et al. [7] introduced a method that capitalizes on partially observable Markov Decision Processes to amalgamate nominal actions of the system with the unsafe control actions posed by STPA. Li et al. [33] use the STPA method to formulate a parallel control model for maritime autonomous surface ship systems and simulate it with Markov Chains.
Modern systems feature complex interactions and numerous subsystems, making control challenging and verification costly. Nevertheless, our proposed method can handle these systems that require many states and continuously identify state changes that lead to losses in control algorithms by deriving loss scenarios for potential risks while testing the learning results. RL is highly adaptable to various environmental and system changes and provides continuous learning from experience through trial and error [8], enabling the performance improvement of the system. It is available in situations where the transition probability is unknown and provides a measure of learning accuracy to derive loss scenarios. However, during the process of implementing the system into an environment, accuracy is dependent on the accuracy of the control algorithm and the information provided by STPA.
Despite this limitation, this method can be implemented close to the actual system, enabling the identification of the causes of state changes for possible real-life hazards. Formal methods, such as Model Checking on STPA [15,34], are idealistic and systematic but require a high level of knowledge and experience to be performed by engineers of the system in the real world. In contrast, RL agents’ ability to explore and learn from various scenarios contributes to accurately identifying scenarios that could be missed in conventional analysis and design. The control algorithm for the STPA-based system must be detailed enough to facilitate the modeling of the system environment for the RL algorithm.
In [35], which investigates test-input generation using RL, the focus is mainly on exploring and improving specific behaviors of software systems or detecting bugs by maximizing rewards. When applying STPA-RL, the primary objective is to emphasize safety and identify unsafe states or actions within the system to derive and enhance loss scenarios.

7.2. Loss Scenarios of STPA

By implementing a current control environment and rewarding safe actions with RL, we identified self-guided loss scenarios that lead to hazardous actions. This helps shorten the time required to meet safety requirements and avoid hazards. The context table presented in [36] is used to systematically analyze system components and interaction control structures, identify loss scenarios, and discover potential loss scenarios in the system. Abdulkhaleq et al. [37] derived unsafe scenarios through a context table to generate software safety requirements and created software safety constraints for each context using the Boolean operators AND and OR. Such an approach can cover all aspects of the system’s possible risks, but developing safety requirements for all of them could be time-consuming and may prove unnecessary if the hazards never appear in practice.
This could be a challenge for organizations with limited resources or expertise in risk management, so to reduce the time and resources needed for system hazard analysis, we recommend our proposed method. Chang et al. [38] previously aimed to introduce a method to combine STPA and RL. Building on this, we further formalized the implementation of the RL environment, providing a more standardized approach for practical utilization. Additionally, we conducted analyses concerning hazard frequencies, along with the derivation of loss scenarios. Zeleskidis et al. [39] introduce a transformation of STPA analysis into acyclic diagrams. We propose the hazard trajectory diagram to graphically indicate every sequence of steps that leads to a loss.

8. Conclusions

Infusing trial-and-error-based techniques such as RL into the realm of safety analysis remains a relatively uncommon practice, underscoring the urgent need for research in this domain. In light of this gap, our study endeavors to bridge this divide by delving into the integration of RL with STPA. Recognizing the paramount importance of deriving loss scenarios within STPA for bolstering safety, our research focuses on identifying potential hazards and their consequential outcomes within a system. Traditionally, STPA analysts cataloged loss scenarios using context tables with the combinations of causal factors. However, this approach has drawbacks as it is difficult to cover all the risks that may arise in practice and may not adequately prioritize risks based on their impact and likelihood of occurrence. Therefore, this paper proposes the applicability of RL to STPA and presents the experimentation result with the case study of the PSD system. By implementing the current control environment and rewarding safe actions, RL enables the identification of self-guided loss scenarios that lead to hazards. RL allows us to freely model various systems and interactions when implementing a control environment. The choice of RL algorithms to suit the requirements of the system can provide the ability to track loss scenarios that efficiently reach hazards and reflect real-world systems’ uncertainty.
In this paper, we derived 4 hazards, 11 UCAs, and 38 loss scenarios by using STPA-RL for the PSD system. These loss scenarios identify the flow in which control actions are performed incorrectly and determine the frequency of hazards that may occur. Tracking the causes of hazards is efficient when searching for potentially unsafe situations in the system. Observing what actually happens, rather than guessing the likelihood of the presence of causal factors, can point to the specific states in the flows. Based on the testing results, this derived information supports changing the system control algorithm where high-frequency hazards appear and setting safety constraints to reduce hazards before developing the system. This helps prioritize which hazards deserve attention when providing safety.
Through system simulation with STPA-RL, we provide an easy test to discover problems, tweak control algorithms, and devise a design-change method. It is particularly suitable for application in the early stages of system development, allowing for the identification of potential risks and loss scenarios before the actual system is constructed and deployed. In conclusion, our approach not only enhances system safety through simulation but also allows the agent to learn about conditions leading to unsafe situations and appropriate responses. This enables the swift development of proactive measures, improving safety and reducing costs associated with system failures. Future studies should include verifying the suitability of the RL model for the STPA-analyzed system and comparing loss scenario results for larger systems.

Author Contributions

Conceptualization, J.C., R.K. and G.K.; methodology, J.C. and R.K.; validation, R.K. and G.K.; writing—original draft preparation, J.C. and R.K.; writing—review and editing, R.K. and G.K.; visualization, J.C.; supervision, G.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-00122, Safety Analysis and Verification Tool Technology Development for High Safety Software Development).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Leveson, N.G. Engineering a Safer World: Systems Thinking Applied to Safety; The MIT: Cambridge, MA, USA, 2016. [Google Scholar]
  2. Ericson, C.A. Hazard Analysis Techniques for System Safety; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  3. Peters, M.; Ketter, W.; Saar-Tsechansky, M.; Collins, J. A reinforcement learning approach to autonomous decision-making in smart electricity markets. Mach. Learn. 2013, 92, 5–39. [Google Scholar] [CrossRef]
  4. Fisac, J.F.; Lugovoy, N.F.; Rubies-Royo, V.; Ghosh, S.; Tomlin, C.J. Bridging hamilton-jacobi safety analysis and reinforcement learning. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 8550–8556. [Google Scholar] [CrossRef]
  5. Ishimatsu, T.; Leveson, N.G.; Thomas, J.; Katahira, M.; Miyamoto, Y.; Nakao, H. Modeling and hazard analysis using STPA. In Proceedings of the 4th IAASS Conference, Huntsville, Alabama, USA, 19–21 May 2010. [Google Scholar]
  6. Faria, J.M. Machine learning safety: An overview. In Proceedings of the 26th Safety-Critical Systems Symposium, York, UK, 6–8 February 2018; pp. 6–8. [Google Scholar]
  7. Zacharaki, A.; Kostavelis, I.; Dokas, I. Decision Making with STPA through Markov Decision Process, a Theoretic Framework for Safe Human-Robot Collaboration. Appl. Sci. 2021, 11, 5212. [Google Scholar] [CrossRef]
  8. Wiering, M.A.; Van Otterlo, M. Reinforcement learning. Adapt. Learn. Optim. 2012, 12, 729. [Google Scholar]
  9. De Souza, N.P.; César, C.d.A.C.; de Melo Bezerra, J.; Hirata, C.M. Extending STPA with STRIDE to identify cybersecurity loss scenarios. J. Inf. Secur. Appl. 2020, 55, 102620. [Google Scholar] [CrossRef]
  10. Lee, S.H.; Shin, S.M.; Hwang, J.S.; Park, J. Operational vulnerability identification procedure for nuclear facilities using STAMP/STPA. IEEE Access 2020, 8, 166034–166046. [Google Scholar] [CrossRef]
  11. Gertman, D.I.; Blackman, H.S. Human Reliability and Safety Analysis Data Handbook; John Wiley & Sons: Hoboken, NJ, USA, 1993. [Google Scholar]
  12. Durmaz, E.; Tümer, M.B. Intelligent software debugging: A reinforcement learning approach for detecting the shortest crashing scenarios. Expert Syst. Appl. 2022, 198, 116722. [Google Scholar] [CrossRef]
  13. Salmon, P.M.; Stanton, N.A.; Walker, G.H.; Hulme, A.; Goode, N.; Thompson, J.; Read, G.J. The Systems Theoretic Process Analysis (STPA) Method. In Handbook of Systems Thinking Methods; CRC Press: Boca Raton, FL, USA, 2022; pp. 71–89. [Google Scholar]
  14. Leveson, N.G.; Thomas, J.P. New Approach to Hazard Analysis. In Guide of Hazard Analysis Using STPA; Telecommunication Technology Association: Seongnam, Republic of Korea, 2018. [Google Scholar]
  15. Dakwat, A.L.; Villani, E. System safety assessment based on STPA and model checking. Saf. Sci. 2018, 109, 130–143. [Google Scholar] [CrossRef]
  16. Thomas, J.P., IV. Extending and Automating a Systems-Theoretic Hazard Analysis for Requirements Generation and Analysis. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2013. [Google Scholar]
  17. Gurgel, D.L.; Hirata, C.M.; Bezerra, J.D.M. A rule-based approach for safety analysis using STAMP/STPA. In Proceedings of the 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), Prague, Czech Republic, 13–17 September 2015; IEEE: New York, NY, USA, 2015; p. 7B2-1. [Google Scholar] [CrossRef]
  18. Liu, R.; Nageotte, F.; Zanne, P.; de Mathelin, M.; Dresp-Langley, B. Deep reinforcement learning for the control of robotic manipulation: A focussed mini-review. Robotics 2021, 10, 22. [Google Scholar] [CrossRef]
  19. Kane, D.; Liu, S.; Lovett, S.; Mahajan, G. Computational-Statistical Gap in Reinforcement Learning. In Proceedings of the Thirty Fifth Conference on Learning Theory, London, UK, 2–5 July 2022; Volume 178, pp. 1282–1302.
  20. Quah, K.H.; Quek, C. Maximum reward reinforcement learning: A non-cumulative reward criterion. Expert Syst. Appl. 2006, 31, 351–359. [Google Scholar] [CrossRef]
  21. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  22. Gerpott, F.T.; Lang, S.; Reggelin, T.; Zadek, H.; Chaopaisarn, P.; Ramingwong, S. Integration of the A2C algorithm for production scheduling in a two-stage hybrid flow shop environment. Procedia Comput. Sci. 2022, 200, 585–594. [Google Scholar] [CrossRef]
  23. Kao, S.C.; Krishna, T. E3: A hw/sw co-design neuroevolution platform for autonomous learning in edge device. In Proceedings of the 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Stony Brook, NY, USA, 28–30 March 2021; IEEE: New York, NY, USA, 2021; pp. 288–298. [Google Scholar] [CrossRef]
  24. Li, R.; Wang, C.; Zhao, Z.; Guo, R.; Zhang, H. The LSTM-based advantage actor-critic learning for resource management in network slicing with user mobility. IEEE Commun. Lett. 2020, 24, 2005–2009. [Google Scholar] [CrossRef]
  25. Grondman, I.; Busoniu, L.; Lopes, G.A.; Babuska, R. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Syst. Man, Cybern. Part C (Appl. Rev.) 2012, 42, 1291–1307. [Google Scholar] [CrossRef]
  26. Fu, H.; Liu, W.; Wu, S.; Wang, Y.; Yang, T.; Li, K.; Xing, J.; Li, B.; Ma, B.; Fu, Q.; et al. Actor-critic policy optimization in a large-scale imperfect-information game. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  27. Hu, H.; Wang, Q. Implementation on benchmark of SC2LE environment with advantage actor–critic method. In Proceedings of the 2020 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 1–4 September 2020; IEEE: New York, NY, USA, 2020; pp. 362–366. [Google Scholar] [CrossRef]
  28. Department of Transportation Republic of the Philippines. Metro Manila Subway Project (MMSP) Valenzuela - Paranaque Phase 1: Part 2 - Employer’s Requirements Section VI. In Platform Screen Door (PSD) System at Stations; 2019. Available online: https://www.ps-philgeps.gov.ph/home/images/BAC/ForeignAssitedProjects/2019/PH-P267/CP106/07PSD_12Dec2019(PA).pdf (accessed on 28 September 2023).
  29. Lee, S.; Shin, S. Analysis on Risk Factors to Platform Screen Door Failure Based on STPA. J. Korean Soc. Railw. 2021, 24, 931–943. [Google Scholar] [CrossRef]
  30. Hirata, C.; Nadjm-Tehrani, S. Combining GSN and STPA for safety arguments. In Proceedings of the Computer Safety, Reliability, and Security: SAFECOMP 2019 Workshops, ASSURE, DECSoS, SASSUR, STRIVE, and WAISE, Turku, Finland, 10 September 2019; Proceedings 38. Springer: Berlin/Heidelberg, Germany, 2019; pp. 5–15. [Google Scholar] [CrossRef]
  31. SAE International. System Theoretic Process Analysis (STPA) Recommended Practices for Evaluations of Automotive Related Safety-Critical Systems J3187. 2022. Available online: https://www.sae.org/standards/content/j3187_202202/ (accessed on 28 September 2023).
  32. Acar Celik, E.; Cârlan, C.; Abdulkhaleq, A.; Bauer, F.; Schels, M.; Putzer, H.J. Application of STPA for the Elicitation of Safety Requirements for a Machine Learning-Based Perception Component in Automotive. In Proceedings of the International Conference on Computer Safety, Reliability, and Security, Munich, Germany, 6–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 319–332. [Google Scholar]
  33. Li, W.; Chen, W.; Hu, S.; Xi, Y.; Guo, Y. Risk evolution model of marine traffic via STPA method and MC simulation: A case of MASS along coastal setting. Ocean Eng. 2023, 281, 114673. [Google Scholar] [CrossRef]
  34. Tsuji, M.; Takai, T.; Kakimoto, K.; Ishihama, N.; Katahira, M.; Iida, H. Prioritizing scenarios based on STAMP/STPA using statistical model checking. In Proceedings of the 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Porto, Portugal, 24–28 October 2020; IEEE: New York, NY, USA, 2020; pp. 124–132. [Google Scholar] [CrossRef]
  35. Kim, J.; Kwon, M.; Yoo, S. Generating test input with deep reinforcement learning. In Proceedings of the 11th International Workshop on Search-Based Software Testing, Gothenburg, Sweden, 28–29 May 2018; pp. 51–58. [Google Scholar] [CrossRef]
  36. Yang, H.; Kwon, G. Identifying Causes of an Accident in STPA Using the Scenario Table. J. KIISE 2019, 46, 787–799. [Google Scholar] [CrossRef]
  37. Abdulkhaleq, A.; Wagner, S.; Leveson, N. A comprehensive safety engineering approach for software-intensive systems based on STPA. Procedia Eng. 2015, 128, 2–11. [Google Scholar] [CrossRef]
  38. Chang, J.; Kwon, R.; Kwon, G. STPA-RL: Analyzing Loss Scenarios in STPA with Reinforcement Learning. J. Korean Inst. Inf. Technol. 2023, 21, 39–48. [Google Scholar] [CrossRef]
  39. Zeleskidis, A.; Dokas, I.M.; Papadopoulos, B. A novel real-time safety level calculation approach based on STPA. In Proceedings of the MATEC Web of Conferences, Amsterdam, The Netherlands, 9–11 October 2019; EDP Sciences: Les Ulis, France, 2020; Volume 314, p. 01001. [Google Scholar] [CrossRef]
Figure 1. Procedure of STPA. Source: http://psas.scripts.mit.edu/home/materials/ (accessed on 15 February 2024).
Figure 2. Simplified Advantage Actor–Critic.
Figure 3. Procedure of STPA-RL.
Figure 4. State transition by safe and unsafe actions.
Figure 5. Control structure of the platform screen door.
Figure 6. Result of hazard proportion. (a) All possible states of PSD system, (b) hazard steps of loss scenarios.
Figure 7. Cases of hazard trajectories from H01 to H04 (LS01, LS17, LS25, LS34). (a) The list of steps corresponding to each node. (b) The trajectory of the steps for the loss scenario reaching the hazard step. See Table 4 for the hazard steps of H01 to H04.
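Figure 2 depicts the simplified advantage actor–critic (A2C) loop used for training. As a rough, self-contained illustration of that kind of update (a minimal sketch only; the PyTorch usage, network sizes, and one-step temporal-difference target are our assumptions, not the authors' implementation), a single A2C step can be written as follows.

```python
# Minimal A2C update sketch: actor loss weighted by the advantage, critic regressed
# toward a one-step TD target. Illustrative only; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, n_states: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # policy head: action logits
        self.critic = nn.Linear(hidden, 1)          # value head: state value V(s)

    def forward(self, state):
        h = self.shared(state)
        return self.actor(h), self.critic(h)

def a2c_update(model, optimizer, state, action, reward, next_state, done, gamma=0.99):
    """One-step A2C update: advantage = r + gamma * V(s') - V(s)."""
    logits, value = model(state)
    with torch.no_grad():
        _, next_value = model(next_state)
        target = reward + gamma * next_value * (1.0 - done)
    advantage = (target - value).detach()
    log_prob = F.log_softmax(logits, dim=-1)[action]
    actor_loss = -log_prob * advantage           # policy gradient weighted by advantage
    critic_loss = F.mse_loss(value, target)      # value regression toward the TD target
    loss = (actor_loss + critic_loss).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (hypothetical encoding: a four-component state vector as in Tables 3 and 4,
# two actions close/open):
# model = ActorCritic(n_states=4, n_actions=2)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# a2c_update(model, optimizer, torch.tensor([2., 1., 0., 0.]), 0, -1.0,
#            torch.tensor([3., 1., 1., 1.]), 0.0)
```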
Table 1. Unsafe control actions.
Control Action: Close Door
- Type A (Not Providing Causes Hazard): UCA01: not closing door when train is ready and detected that obstacle does not exist (H04). UCA02: not closing door while train is moving (H01).
- Type B (Providing Causes Hazard): UCA03: closing door after detecting obstacle exists (H02). UCA04: closing door when the door needs to be opened because the train is stopped (H03).
- Type C (Too Late/Too Soon/Out of Order): UCA05: closed the door too late that obstacle appeared while closing (H02). UCA06: closed the door too late while train is ready, so the train started to move (H01).
- Type D (Stopped Too Soon/Applied Too Long): N/A
Control Action: Open Door
- Type A (Not Providing Causes Hazard): UCA07: not opening door after detecting obstacle exists and while door is closing (H02). UCA08: not opening door after the train is stopped (H03).
- Type B (Providing Causes Hazard): UCA09: opening door when train is not stopped (H01). UCA10: opening door when obstacle is not-detected while train is ready (H01).
- Type C (Too Late/Too Soon/Out of Order): UCA11: detected obstacle and try to open door, but it was too late that the door got closed (H02).
- Type D (Stopped Too Soon/Applied Too Long): N/A
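For automated checking during simulation, a table of this kind can be captured as a simple lookup structure. The sketch below shows one possible Python encoding (the data layout and field order are our assumptions; the paper does not prescribe a particular representation, and the UCA descriptions are lightly paraphrased).

```python
# Possible encoding of Table 1 as data: (control action, UCA type) -> list of
# (UCA id, description, associated hazard). Structure is an illustrative assumption.
from typing import Dict, List, Tuple

UCA_TABLE: Dict[Tuple[str, str], List[Tuple[str, str, str]]] = {
    ("Close Door", "Not Providing"): [
        ("UCA01", "not closing door when train is ready and no obstacle detected", "H04"),
        ("UCA02", "not closing door while train is moving", "H01"),
    ],
    ("Close Door", "Providing"): [
        ("UCA03", "closing door after detecting obstacle exists", "H02"),
        ("UCA04", "closing door when it must open because the train is stopped", "H03"),
    ],
    ("Close Door", "Too Late/Too Soon/Out of Order"): [
        ("UCA05", "closed the door too late so an obstacle appeared while closing", "H02"),
        ("UCA06", "closed the door too late while the train was ready, so it started moving", "H01"),
    ],
    ("Open Door", "Not Providing"): [
        ("UCA07", "not opening door after detecting an obstacle while the door is closing", "H02"),
        ("UCA08", "not opening door after the train has stopped", "H03"),
    ],
    ("Open Door", "Providing"): [
        ("UCA09", "opening door when the train is not stopped", "H01"),
        ("UCA10", "opening door when no obstacle is detected while the train is ready", "H01"),
    ],
    ("Open Door", "Too Late/Too Soon/Out of Order"): [
        ("UCA11", "detected an obstacle and tried to open, but the door had already closed", "H02"),
    ],
}

def hazards_for(control_action: str, uca_type: str) -> List[str]:
    """Return the hazard identifiers associated with a (control action, UCA type) cell."""
    return [hazard for _, _, hazard in UCA_TABLE.get((control_action, uca_type), [])]

# Example: hazards_for("Close Door", "Providing") -> ["H02", "H03"]
```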
Table 2. Sample cases of hazard frequency and action ratio.
Hazard | Door Position | Train Motion | Obstacle State | Action = Close | Action = Open | Hazard Frequency
H01 | fully closed | moving | not exist | 72% (safe) | 28% (hazard) | 61% of H01
H01 | opening | moving | not exist | 43% (safe) | 57% (hazard) | 31% of H01
H02 | closing | moving | exists | 17% (hazard) | 83% (safe) | 1% of H02
H02 | fully closed | moving | exists | 52% (hazard) | 48% (safe) | 39% of H02
H03 | opening | stop | exists | 57% (hazard) | 43% (safe) | 36% of H03
H03 | fully closed | stop | not exist | 72% (hazard) | 28% (safe) | 54% of H03
H04 | closing | ready | not exist | 35% (safe) | 65% (hazard) | 40% of H04
H04 | fully opened | ready | not exist | 48% (safe) | 52% (hazard) | 59% of H04
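Statistics of this kind can be obtained by tallying the agent's logged transitions. The following sketch shows one way to compute per-state action ratios and hazard shares from a list of records; the record layout (state, action, hazard-or-None) is an assumption made only for illustration.

```python
# Tally per-state action ratios and per-hazard state shares from logged transitions.
# The record layout (state, action, hazard_or_None) is an illustrative assumption.
from collections import Counter, defaultdict

def summarize(records):
    """records: iterable of (state, action, hazard) where hazard is e.g. 'H01' or None."""
    actions_per_state = defaultdict(Counter)    # state -> Counter of chosen actions
    hazard_state_counts = defaultdict(Counter)  # hazard -> Counter of states where it occurred
    for state, action, hazard in records:
        actions_per_state[state][action] += 1
        if hazard is not None:
            hazard_state_counts[hazard][state] += 1

    # Action ratio per state, e.g. {'close': 0.72, 'open': 0.28}
    action_ratio = {
        state: {a: n / sum(cnt.values()) for a, n in cnt.items()}
        for state, cnt in actions_per_state.items()
    }
    # Share of each hazard's occurrences attributable to each state, e.g. "61% of H01"
    hazard_share = {
        hazard: {s: n / sum(cnt.values()) for s, n in cnt.items()}
        for hazard, cnt in hazard_state_counts.items()
    }
    return action_ratio, hazard_share

# Example:
# logs = [(("fully closed", "moving", "not exist"), "close", None),
#         (("fully closed", "moving", "not exist"), "open", "H01")]
# ratios, shares = summarize(logs)
```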
Table 3. Sample cases of hazard trajectories for PSD system.
Hazard | H01 | H01 | H02 | H02 | H03 | H03 | H04 | H04
UCAs | UCA02, UCA09 | UCA02, UCA09 | UCA03, UCA05, UCA07 | UCA03 | UCA04 | UCA04 | UCA01, UCA09, UCA11 | UCA01, UCA09, UCA11
step 1: (2,1,0,0) (2,2,0,0) (0,2,1,1) (2,2,1,1) (2,1,0,0) (1,2,1,1) (2,1,0,0) (1,2,1,1)
step 2: (3,1,1,1) (3,0,0,1) (1,2,1,1) (2,0,1,1) (3,1,1,1) (2,0,1,1) (3,1,0,0) (2,0,0,0)
step 3: (2,1,1,1) (2,1,0,0) (2,2,0,0) (2,1,0,0) (2,1,0,0) (2,1,0,0) (0,2,0,0) (2,1,1,1)
step 4: (2,1,0,0) (3,1,0,0) (3,2,1,0) (3,1,0,0) (3,1,0,0) (3,1,0,0) (0,0,0,1)
step 5: (3,1,0,0) (0,2,0,0) (0,2,1,0) (0,2,0,0) (0,2,0,0) (1,0,1,1) (2,1,0,0)
step 6: (0,2,0,0) (0,2,1,1) (0,2,0,0) (0,0,0,0) (2,1,0,0) (3,1,0,0)
step 7: (0,2,0,0) (1,2,0,1) (0,0,0,1) (3,1,0,1) (0,2,0,0)
step 8: (0,2,0,1) (1,0,0,0) (0,0,0,1)
step 9: (1,0,1,1)
step 10: (2,1,0,1)
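Trajectories like these can be aggregated into the hazard trajectory graphs shown in Figure 7 by treating each state tuple as a node and each observed transition as an edge. The sketch below is one simple way to build such a graph from recorded episodes; the episode format (a list of state tuples whose last element is the hazard step) is an assumption for illustration, not the authors' data format.

```python
# Build a hazard trajectory graph: nodes are state tuples, edges are observed transitions.
# Episode format (list of state tuples ending at the hazard step) is an assumption.
from collections import defaultdict

def build_trajectory_graph(episodes):
    """Return (edges, hazard_nodes): edge counts and the set of terminal hazard states."""
    edges = defaultdict(int)   # (src_state, dst_state) -> number of times observed
    hazard_nodes = set()
    for trajectory in episodes:
        for src, dst in zip(trajectory, trajectory[1:]):
            edges[(src, dst)] += 1
        if trajectory:
            hazard_nodes.add(trajectory[-1])   # last step of a loss scenario is the hazard step
    return edges, hazard_nodes

# Example with two short hypothetical trajectories reaching hazard step (0,2,0,1):
# eps = [[(2,1,0,0), (3,1,1,1), (0,2,0,1)], [(2,2,0,0), (0,2,0,0), (0,2,0,1)]]
# edges, hazards = build_trajectory_graph(eps)
```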
Table 4. Loss scenarios for H01 to H04.
Hazard | Hazard Step | Frequency | Loss Scenarios before Reaching Hazard
H01 | (0,2,0,1) | 62.40% | LS01 (0,2,0,0) 43.59%; LS02 (3,1,0,0) 48.72%; LS03 (1,2,0,0) 7.69%
H01 | (1,2,0,1) | 31.20% | LS04 (0,2,1,1) 100%
H01 | (2,2,0,1) | 4.80% | LS05 (1,2,1,1) 66.67%; LS06 (3,2,1,1) 33.33%
H01 | (3,2,0,1) | 1.60% | LS07 (2,2,0,0) 100%
H02 | (0,2,1,0) | 39.62% | LS08 (1,2,0,0) 5.95%; LS09 (0,2,0,0) 40.48%; LS10 (3,1,0,0) 53.57%
H02 | (1,0,1,0) | 0.47% | LS11 (0,2,1,1) 100%
H02 | (2,1,1,0) | 27.36% | LS12 (2,1,1,1) 31.03%; LS13 (1,0,0,1) 32.76%; LS14 (1,0,1,1) 10.34%; LS15 (3,1,1,1) 24.14%; LS16 (2,0,1,1) 1.73%
H02 | (3,1,1,0) | 25.94% | LS17 (2,1,0,0) 100%
H02 | (1,2,1,0) | 6.13% | LS18 (0,2,1,1) 100%
H02 | (2,2,1,0) | 0.47% | LS19 (1,2,1,1) 100%
H03 | (0,0,0,0) | 53.87% | LS20 (0,2,0,0) 86.27%; LS21 (3,2,0,0) 1.31%; LS22 (1,2,0,0) 12.42%
H03 | (1,0,0,0) | 36.62% | LS23 (0,2,1,1) 23.08%; LS24 (0,0,0,1) 56.73%; LS25 (0,0,1,1) 20.19%
H03 | (2,0,0,0) | 6.69% | LS26 (2,2,1,1) 15.79%; LS27 (1,2,1,1) 68.42%; LS28 (3,2,1,1) 15.79%
H03 | (3,0,0,0) | 2.82% | LS29 (2,2,0,0) 100%
H04 | (2,1,0,1) | 59.41% | LS30 (1,0,0,1) 27.60%; LS31 (1,0,1,1) 11.31%; LS32 (2,0,0,1) 2.26%; LS33 (2,1,1,1) 31.22%; LS34 (3,1,1,1) 25.34%; LS35 (3,0,0,1) 0.91%; LS36 (3,0,1,1) 0.45%; LS37 (2,0,1,1) 0.91%
H04 | (3,1,0,1) | 40.59% | LS38 (2,1,0,0) 100%