Our primary research addresses the security concerns of semi-supervised node classification tasks in GNNs. Given that GNNs are vulnerable to adversarial injection attacks, and considering the partial observability under which defenses must operate (defenders are only aware of the graph structure and node attributes after the attack, without knowledge of which nodes were adversarially injected), we propose a GNN secure training strategy that operates when only the graph structure and node attributes are observable. This strategy does not just protect a specific node; instead, it takes into account the security of all nodes across the entire graph. The steps are shown in Figure 4 and can be divided into three parts: (1) model the problem as a POMDP; (2) use reinforcement learning to solve for the optimal strategy of the POMDP; and (3) control the convolutional scope based on the optimal strategy, thereby obtaining a secure training process. Specifically, first, the graph security issues are converted into a POMDP. Then, using Graph Convolutional Memory (GCM), the observations $o \in O$ of the POMDP are transformed into a state with temporal memory, denoted by $\hat{s}$. After that, $\hat{s}$ is taken as the state in reinforcement learning to solve for the optimal strategy of the POMDP problem. Finally, the convolutional scope is controlled by the optimal strategy to avoid learning from malicious nodes.
4.1. Definition of Problem
We focus on the semi-supervised node classification task on undirected graphs. We define an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The graph $G$ has a total of $N$ nodes. Moreover, to capture the structure of the graph, we define the binary adjacency matrix $A \in \{0, 1\}^{N \times N}$: $A_{ij} = 1$ indicates the existence of an undirected edge between the nodes $v_i$ and $v_j$, while $A_{ij} = 0$ indicates that there is no edge between them. $X \in \mathbb{R}^{N \times F}$ is the feature matrix that represents the features of all nodes, where the $i$-th row of $X$ is the $F$-dimensional feature vector of the node $v_i$. Each node has a label $y_i \in \{1, \dots, k\}$, where $k$ is the number of classes. In summary, we define the dataset $\mathcal{G} = (A, X, Y)$, where $Y$ is the set of node labels. GNNs are trained on labeled and unlabeled nodes to obtain a classifier $f_\theta$. Given that GNNs may be vulnerable to poisoning attacks, attackers introduce a small perturbation $\Delta$ to the original structure $A$ to create $A' = A + \Delta$. The resulting dataset becomes $\mathcal{G}' = (A', X, Y)$. In this case, a classifier trained on the perturbed dataset fails to correctly predict the label $y$ of a node $v$. Using the message-passing mechanism of GNNs and existing GNN models as input, our goal is to learn a structure for message passing that mitigates the negative impact of the attack.
Based on the above symbolic definitions, we define the problem of defending against graph poisoning attacks: attackers apply an unknown perturbation $\Delta$ to the original graph $G$, resulting in a perturbed graph $G'$ that reduces the performance of the GNN model. We need to find a new GNN $f_{\theta'}$ that accounts for the security of all nodes in the graph.
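To make the notation concrete, the following minimal sketch builds a small attributed graph and a poisoned variant; NumPy is assumed, and the sizes, edges, and the perturbation $\Delta$ are illustrative only:

```python
import numpy as np

N, F = 6, 4                      # toy sizes: 6 nodes, 4-dimensional features
rng = np.random.default_rng(0)

# Binary symmetric adjacency matrix A of the clean undirected graph.
A = np.zeros((N, N), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1

X = rng.normal(size=(N, F))      # feature matrix: row i holds the features of v_i
y = rng.integers(0, 3, size=N)   # node labels, k = 3 classes

# A poisoning attacker adds an unknown perturbation Delta (here one injected
# edge); the defender observes only the perturbed structure A_prime.
Delta = np.zeros_like(A)
Delta[0, 5] = Delta[5, 0] = 1
A_prime = np.clip(A + Delta, 0, 1)
```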
4.2. POMDP Modeling
In fact, adversarial injection attacks may inject multiple attack nodes simultaneously, necessitating defense mechanisms for all nodes. A black-box state comprising multiple attack nodes fits naturally as the state space of a POMDP, since it is not directly observable. Moreover, as the evaluation of transitioning from the state $s$ to $s'$ relies on accuracy, the model's accuracy serves as an indirect observation of the system state rather than providing direct state information; in this context, accuracy acts as an observation outcome that reflects potential changes in state. The collection of edges between nodes, by contrast, offers directly obtainable information about the current system and is thus suitable as the observation space of a POMDP. Given the problem's indirect observability and inherent uncertainty, a POMDP offers a fitting framework to model and address this issue.
To address the above issues, we model the graph security problem as a POMDP. As shown in Table 1, in our POMDP, the state space $S$ represents the attacks by malicious nodes; the action space $A$ consists of edge modifications (adding or removing an edge); and the observation space is $O = E$, where $E$ is the set of edges between nodes in the graph. We take an action $a$ from the policy $\pi$, transition to the next state $s'$ according to the transition probability $T(s' \mid s, a)$, and receive a reward $R(s, a, s')$.
More formally, the proposed POMDP framework, delineated in Algorithm 1, encompasses the following components:
An observation space $O$, where each $o \in O$ signifies the set of inter-node edges within the graph;
A state space $S$, representing the subset of edges clandestinely injected by adversarial nodes, which remains unobservable;
An action space $A$, delineating the permissible modifications to the graph's topology, namely adding or removing an edge;
A transition probability $T(s' \mid s, a)$, which is instantiated by the classification accuracy ACC corresponding to the graph's current state, postulating that actions leading to an enhancement in ACC are associated with a higher transition probability to the resultant state;
An observation probability $Z$, set unequivocally to 1, ensuring that any action taken results in a deterministic observational outcome;
A reward function $R(s, a, s')$ that aligns with the reward derived from the Advantage Actor–Critic (A2C) reinforcement learning algorithm, quantifying the reward obtained upon transitioning from the state $s$ to the state $s'$ via the action $a$;
A discount factor $\gamma \in [0, 1]$, adopted from the A2C algorithm and employed to calibrate the significance accorded to immediate versus prospective rewards; a value of $\gamma$ proximal to 1 signifies a predilection for long-term strategies and the consequential future rewards.
The POMDP model operates by selecting an action $a$ predicated on the policy $\pi$, undergoing a state transition to $s'$ according to the defined transition probability $T$, and acquiring a reward $R$; the policy is then updated based on the obtained reward. Concisely, the state and observation transition dynamics within the POMDP framework are articulated as follows:
For a current state $s$, executing an action $a$ leads to a transition to the next state $s'$ according to the following transition probability $T$:

$$T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$
For a current observation $o$, executing an action $a$ results in a change in the observation to $o'$, as follows:

$$o' = \begin{cases} o \cup \{e\}, & \text{if } a \text{ adds the edge } e \\ o \setminus \{e\}, & \text{if } a \text{ removes the edge } e \end{cases}$$
This means that a state corresponds to multiple observations. For a state at a certain moment, executing an action may or may not change the state, but the corresponding observation will definitely change.
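To illustrate these dynamics, the following sketch (our own simplification, with hypothetical edge sets) implements the deterministic observation update and the conditional state change: an edge edit always changes the observation $o$, but changes the hidden state $s$ only if the edited edge was injected by the attacker:

```python
class GraphPOMDP:
    """Toy environment mirroring the dynamics above (a simplification,
    not the paper's implementation)."""

    def __init__(self, all_edges, injected_edges):
        norm = lambda e: tuple(sorted(e))
        self.o = {norm(e) for e in all_edges}       # observation: visible edge set
        self.s = {norm(e) for e in injected_edges}  # hidden state: injected edges

    def step(self, action, edge):
        edge = tuple(sorted(edge))
        if action == "remove":
            self.o.discard(edge)   # the observation always changes (Z = 1)
            self.s.discard(edge)   # the state changes only if the edge was injected
        elif action == "add":
            self.o.add(edge)
        return self.o

# Mirroring Figure 5: removing e(3,4) leaves the hidden state untouched,
# while removing the injected edge e(A,3) shrinks it.
env = GraphPOMDP(all_edges={("3", "4"), ("1", "3"), ("A", "3")},
                 injected_edges={("A", "3")})
env.step("remove", ("3", "4"))   # o changes, s unchanged
env.step("remove", ("A", "3"))   # o changes, s changes
```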
As shown in Figure 5, we present a simple example to introduce our POMDP model for the graph security problem. For the initial state $s_0$, the state is defined by the set of edges injected by the attack nodes A and B, and its corresponding observation is $o_0 = E$, where $E$ is the set of all edges in the graph. Executing the action "remove e(3,4)" and then evaluating that this edge may not be one injected by the attacker, the state $s_0$ does not change; however, removing an edge changes the observation $o_0$ to $o_1 = o_0 \setminus \{e(3,4)\}$. Similarly, executing the action "remove e(A,3)" and evaluating this edge as potentially dangerous and injected by the attacker, the state $s_0$ changes to $s_1 = s_0 \setminus \{e(A,3)\}$, and the observation $o_1$ changes to $o_2 = o_1 \setminus \{e(A,3)\}$. If we obtain the optimal strategy for the POMDP, then, as illustrated, we end up with a secure graph from which the edges maliciously injected by the attack nodes have been removed.
It should be noted that the state space is not directly observable to us: in reality, we do not know whether nodes A and B are attack nodes, nor what their attack actions are. To facilitate the understanding of how we model the graph security problem as a POMDP and how our POMDP model operates, we explicitly depict the attack nodes and the edges they inject in Figure 5.
Specifically, we first model the graph security issues as a POMDP. For a graph $G$, after it is attacked, the initial state $s_0$ is the disturbance added by an attack node $x$, but both the attack node and the disturbance are unknown. The observation $o$ is known: it is the set of edges between nodes in the graph, and the initial observation $o_0$ includes the edges injected by the attack nodes. We attempt to change the state $s$ by taking an action $a$ that adds or removes an edge in the graph. If the operated edge is one injected by an attack node, then the state $s$ changes. Moreover, each action taken leads to a transition from the observation $o$ to $o'$.
In brief, in our defined POMDP model, the agent's current observation and the action it takes determine the transition to the next observation. The action may or may not change the state, which is evaluated through the defined reward function (Equation (4)).
4.3. Determining the Optimal Strategy
In the previous section, we modeled graph security issues as a POMDP. In an MDP, the policy uses the real state $s$; in a POMDP, due to its partially observable nature, the real state is unavailable and solving the POMDP is challenging, so we rely on Graph Convolutional Memory (GCM) [20]. The GCM's goal is to construct an approximation $\hat{s}$ of the state from the observation sequence $o_0, o_1, \dots, o_t$. The GCM reconstructs a graph: for an input observation $o_t$, the GCM treats $o_t$ as a node placed within the graph, computes its neighbor nodes using topological priors and updates its edges, and finally a GNN summarizes this graph to generate the approximate state $\hat{s}$. This approximate state $\hat{s}$ is then used for decision making in the policy $\pi(a \mid \hat{s})$.
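A minimal sketch of this idea in PyTorch follows. It is our paraphrase of the GCM mechanism rather than the reference implementation: the temporal prior (connecting each new observation node to its $k$ most recent predecessors), the layer shapes, and the mean aggregation are all assumptions.

```python
import torch
import torch.nn as nn

class GCMLike(nn.Module):
    """Sketch of a GCM-style memory: each observation embedding becomes a node
    in a memory graph, edges come from a temporal topological prior, and a
    mean over neighbors stands in for the summarizing GNN."""

    def __init__(self, obs_dim, hid_dim, k=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hid_dim)
        self.summarize = nn.Linear(2 * hid_dim, hid_dim)  # self + neighborhood
        self.k = k
        self.memory = []   # embeddings of past observations (memory-graph nodes)

    def forward(self, o_t):
        h = torch.relu(self.embed(o_t))
        # Topological prior: connect the new node to its k most recent predecessors.
        neighbors = self.memory[-self.k:] if self.memory else [h]
        m = torch.stack(neighbors).mean(dim=0)
        s_hat = torch.relu(self.summarize(torch.cat([h, m], dim=-1)))
        self.memory.append(h.detach())
        return s_hat   # approximate state used by the policy
```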
To solve the POMDP, we use the Advantage Actor–Critic (A2C) algorithm [27] as the reinforcement learning method to find the optimal strategy. A2C uses two networks: one decides actions (the Actor), and the other evaluates the quality of these actions (the Critic); both are updated simultaneously so that tasks are learned faster and more stably. The Actor generates actions based on the current policy $\pi(a \mid \hat{s})$, while the Critic evaluates the quality of the current policy, i.e., the state-value function $V(s)$ or the action-value function $Q(s, a)$. Specifically, the Actor proposes actions based on the current approximate state, aiming to change the graph's structure to mitigate or avoid the impact of attacks; the Critic then evaluates the expected effect of these actions, providing feedback on the effectiveness of the current strategy.
The core of the A2C algorithm lies in how it uses the feedback (i.e., reward values) obtained from the environment to update the policy. By calculating the advantage function $A(s, a) = Q(s, a) - V(s)$, defined as the difference between the action-value function $Q(s, a)$ and the state-value function $V(s)$, A2C can assess the additional value of taking a specific action $a$ in a given state $s$ compared with following the average policy. The essence of this process is the reward value, the environment's immediate feedback on the agent's actions, which directly affects the calculation of $Q(s, a)$ and $V(s)$. The reward value is defined as follows:

$$r_t = acc_t - \frac{1}{n} \sum_{i=t-n}^{t-1} acc_i$$
At the current time $t$, $acc_t$ signifies the model's accuracy. We define the baseline as the set of accuracy rates from the preceding $n$ moments, expressed as $B = \{acc_{t-n}, \dots, acc_{t-1}\}$, where $n \le b$ and $b$ is the maximum number of entries that the baseline can contain. The sum $\sum_{i=t-n}^{t-1} acc_i$ represents the aggregate of the accuracy rates over these $n$ moments in the baseline. When a new accuracy value $acc_t$ is to be included in the baseline and the condition $n = b$ holds true, the earliest element $acc_{t-n}$ is excised from the baseline before the inclusion of $acc_t$. This method is delineated by the following update rule:

$$B \leftarrow \begin{cases} B \cup \{acc_t\}, & n < b \\ \left(B \setminus \{acc_{t-n}\}\right) \cup \{acc_t\}, & n = b \end{cases}$$
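The sliding-window baseline and a reward derived from it can be sketched as follows; comparing $acc_t$ against the baseline mean is our reading of Equation (4) and may differ from the exact form:

```python
from collections import deque

class AccuracyBaseline:
    """Keeps the accuracies of the preceding moments, at most b entries;
    the earliest entry is evicted automatically when the window is full."""

    def __init__(self, b):
        self.window = deque(maxlen=b)

    def reward(self, acc_t):
        # Reward: improvement of the current accuracy over the baseline mean
        # (an assumption consistent with the text around Equation (4)).
        baseline = sum(self.window) / len(self.window) if self.window else acc_t
        r_t = acc_t - baseline
        self.window.append(acc_t)   # include acc_t; deque handles the eviction
        return r_t

baseline = AccuracyBaseline(b=10)
r = baseline.reward(0.82)
```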
Within this context, $acc$ stands for the model's accuracy. After each policy update step, we compute the model's current accuracy at time $t$, denoted as $acc_t$ and defined as follows:

$$acc = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left(\operatorname*{arg\,max}_{j} Z_{ij} = y_i\right)$$
In this context, $N$ denotes the total number of samples; $Z_{ij}$ represents the score that the model predicts for the $i$-th sample belonging to class $j$; $y_i$ is the actual class label of the $i$-th sample; $\operatorname*{arg\,max}_{j} Z_{ij}$ indicates the model's prediction for the $i$-th sample, namely the class $j$ with the highest score; and $\mathbb{1}(\cdot)$ is the indicator function, which outputs 1 if the condition within the parentheses is true, that is, if the predicted class matches the true class, and 0 otherwise.
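In code, this accuracy is simply the fraction of correctly classified samples (a direct transcription of the formula, assuming a score matrix $Z$ of shape $N \times k$):

```python
import numpy as np

def accuracy(Z, y):
    """acc = (1/N) * sum_i 1(argmax_j Z_ij == y_i)."""
    y_pred = Z.argmax(axis=1)        # class with the highest predicted score
    return float((y_pred == y).mean())
```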
An agent, upon executing an action within the environment, receives a reward signal, which serves as an immediate indication of the action's utility. The A2C algorithm employs these reward signals to refine the agent's behavioral strategy. More precisely, by optimizing the advantage function $A(s, a)$, the Actor is directed to select actions that are anticipated to provide superior rewards. Concurrently, the Critic, by assessing the disparity between the actual rewards and those predicted under the current policy (the temporal-difference, or TD, error), systematically calibrates its estimate of the state-value function $V(s)$. This reward-based error signal is instrumental in the strategy adjustment process, as it furnishes an explicit evaluation of the policy's efficacy, thereby enabling the Critic to provide more accurate feedback to the Actor and enhancing the strategy optimization process.
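One step of this reward-driven calibration can be sketched as follows; this is the standard one-step A2C/TD update with a shared optimizer over both networks, with the paper-specific details abstracted away:

```python
import torch

def a2c_update(critic, optimizer, r, s_hat, s_hat_next, log_prob, gamma=0.99):
    """One-step A2C update; the TD error serves as the advantage estimate."""
    with torch.no_grad():
        td_target = r + gamma * critic(s_hat_next)   # bootstrap from the next state
    advantage = td_target - critic(s_hat)            # A ~ r + gamma*V(s') - V(s)

    actor_loss = -log_prob * advantage.detach()      # favor high-advantage actions
    critic_loss = advantage.pow(2)                   # regress V(s) toward the target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```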
With this iterative process of interaction and refinement, the A2C algorithm strikes a balance between exploring new actions to uncover more efficacious strategies and selecting the most appropriate action predicated on accumulated knowledge. As depicted in Algorithm 1, the reinforcement learning training protocol incrementally acquires an optimal strategy that upholds safety within a specified graph structure and under potential attack scenarios, effectively neutralizing the threat of graph poisoning attacks. This methodology constitutes a secure convolution process, accounting for not only the immediate impact of the current state of the graph and its actions but also, through the synergy of GCM and A2C, the overarching structure and enduring safety of the entire graph, thereby offering a holistic and dynamic defense mechanism for graph data.
Algorithm 1: Optimal strategy solving algorithm.
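Combining the pieces above, the training loop that Algorithm 1 describes can be sketched as follows; this is our reconstruction from the text, not the authors' exact listing, and the `env.reset`/`env.step` interface is assumed:

```python
def train_secure_gnn(env, gcm, actor, critic, optimizer, episodes, steps):
    """Sketch of Algorithm 1: GCM turns observations into approximate states,
    and A2C learns which edge edits keep the graph secure."""
    for _ in range(episodes):
        o = env.reset()                    # initial observation: the current edge set,
        s_hat = gcm(o)                     # assumed to arrive encoded as a feature vector
        for _ in range(steps):
            dist = actor(s_hat)            # pi(a | s_hat)
            a = dist.sample()
            o_next, r = env.step(a)        # edit an edge; reward from the accuracy baseline
            s_hat_next = gcm(o_next)
            a2c_update(critic, optimizer, r, s_hat, s_hat_next, dist.log_prob(a))
            s_hat = s_hat_next
    return actor                           # the learned secure-training policy
```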