2.1.1. Encoding Layer
In this paper, the model vectorizes text through a word embedding layer, generating input character vectors and input word vectors. During model training, an adversarial training method is incorporated, specifically the Fast Gradient Method (FGM) proposed by Goodfellow et al. [22]. FGM was selected for adversarial training due to its efficiency and effectiveness in generating adversarial examples. It is computationally less intensive than methods such as Projected Gradient Descent (PGD) or the Carlini and Wagner (C&W) attack, allowing faster training without significantly compromising robustness. Its simplicity and strong record of improving model generalization and resilience against adversarial attacks make it a preferred choice. By perturbing the input data with the gradient of the loss function, FGM exposes the model to a variety of adversarial scenarios, enhancing its ability to generalize to unseen examples. The standard formulation of adversarial training is as follows:
$$\min_{\theta} \mathbb{E}_{(x, y) \sim D} \left[ \max_{\Delta x \in \Omega} L(x + \Delta x, y; \theta) \right]$$

In the formalism, $D$ represents the training dataset, $x$ denotes the input, $y$ is the label, $\theta$ is the model parameters, and $L(x, y; \theta)$ is the loss for a single sample. $\Delta x$ is the adversarial perturbation, and $\Omega$ is the space of perturbations. A common constraint is $\|\Delta x\| \le \epsilon$, where $\epsilon$ is a constant. For each sample, an adversarial example $x_{\mathrm{adv}} = x + \Delta x$ is constructed, and the pair $(x_{\mathrm{adv}}, y)$ is used to minimize the loss and update the parameters $\theta$ via gradient descent. The adversarial perturbation $\Delta x$ used is as follows:

$$\Delta x = \epsilon \cdot \frac{g}{\|g\|_2}$$

For $g = \nabla_x L(x, y; \theta)$, standardizing by the $L_2$ norm fixes the magnitude of the perturbation at $\epsilon$ while preserving the gradient's direction.
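The perturbation step above can be expressed as a minimal NumPy sketch; the function name is illustrative, and the gradient is assumed to have been computed beforehand with respect to the input embeddings:

```python
import numpy as np

def fgm_perturbation(grad, epsilon=1.0):
    """FGM perturbation: delta = epsilon * g / ||g||_2.

    grad: gradient of the loss w.r.t. the input embeddings (g above).
    The L2 normalization fixes the perturbation magnitude at epsilon.
    """
    norm = np.linalg.norm(grad)
    if norm == 0.0:               # no gradient signal -> no perturbation
        return np.zeros_like(grad)
    return epsilon * grad / norm

g = np.array([3.0, 4.0])          # toy gradient
delta = fgm_perturbation(g)       # -> [0.6, 0.8], L2 norm 1.0
```

In training, the perturbed embeddings $x + \Delta x$ are passed through a second forward pass and the resulting adversarial loss is used to update the parameters, matching the min-max formulation above.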
In our model, TextCNN is employed to extract features from the input text. TextCNN uses three convolutional layers with kernel sizes of three, five, and seven, respectively. Each convolutional layer aggregates a neighborhood of the same size as its kernel, so the output perceives a broader span of the input. The model's input is a sentence along with all vocabulary items that match contiguous subsequences of the sentence. We represent the input sentence as $s = \{c_1, c_2, \ldots, c_n\}$, where $c_i$ denotes the i-th character, and the vocabulary matched to the sentence as $w = \{w_1, w_2, \ldots, w_m\}$. Each character $c_i$ is represented by a vector $x_i^c = e^c(c_i)$, obtained by looking up the pre-trained character embedding matrix, where $e^c$ is a lookup table for character embeddings.
By employing TextCNN on the sequence $\{x_1^c, x_2^c, \ldots, x_n^c\}$, we extract features, resulting in a sentence representation $H = \{h_1, h_2, \ldots, h_n\}$, where each $h_i$ is a feature vector corresponding to the i-th position in the input sequence. To represent the semantic information of characters, we retrieve word embeddings from a pre-trained word embedding matrix. Each vocabulary item $w_j$ is represented as a semantic vector $x_j^w = e^w(w_j)$, where $e^w$ is a lookup table for word embeddings. This representation facilitates a richer semantic understanding of the text, integrating both character-level and word-level information.
The sentence representation matrix $H$ and the word embedding matrix $X^w = \{x_1^w, \ldots, x_m^w\}$ are concatenated to form the output of the encoding layer. This combined matrix is denoted as $X = [H; X^w]$, serving as a comprehensive feature set that captures both the contextual cues from the sentence and the semantic attributes of the individual words. This concatenated matrix provides a robust foundation for subsequent layers to perform more sophisticated analyses and classifications.
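The encoding step can be sketched as follows. This assumes "same" padding so each convolutional branch yields one feature vector per character position; combining the three branches by summation and the placeholder random filter weights are illustrative choices, not necessarily the paper's:

```python
import numpy as np

def conv1d_same(x, weights):
    """1-D convolution over a (n, d) sequence with 'same' padding,
    producing one d-dim feature vector per character position."""
    k = weights.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[i:i + k] * weights).sum(axis=0)
                     for i in range(x.shape[0])])

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 16))             # 10 characters, 16-dim embeddings
branches = [conv1d_same(x, rng.standard_normal((k, 16)) * 0.1)
            for k in (3, 5, 7)]               # kernel sizes 3, 5, 7
H = sum(branches)                             # sentence representation h_1..h_n
w = rng.standard_normal((4, 16))              # 4 matched word embeddings (placeholder)
X = np.concatenate([H, w], axis=0)            # encoding-layer output, (n + m, d)
```

The odd kernel sizes make the "same" padding exact, so every branch produces a feature vector aligned with each character position before the word embeddings are appended.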
2.1.2. Graph Attention Layer
To integrate information from self-matching vocabulary and the immediate contextual vocabulary, this paper employs three graph attention network layers to structure the model. These layers are as follows:
Word–Character Containing Graph Network: This network is designed to aid characters in acquiring boundary information and semantic information from self-matching vocabulary. It establishes connections between characters and their directly associated words, enriching the characters with deeper lexical insights.
Word–Character Transition Graph Network: This network captures the semantic information of the nearest contextual vocabulary. It transitions between characters and words that form contextual relationships, helping to understand the flow and connection of ideas within the text.
Word–Character Lattice Graph Network: Inspired by Zhang Yue's [23] use of a lattice structure in LSTM models to integrate vocabulary knowledge, this paper extracts the lattice structure to form the third attention network layer. This structure allows a more flexible and interconnected approach to handling complex character–word relationships, providing a mesh-like framework that captures broader lexical fields.
Each of these network layers shares the same set of vertices, which consist of characters from the sentence and their matching vocabulary. However, the sets of edges are entirely distinct, facilitating specialized processing by each layer.
To represent the sets of edges, adjacency matrices are introduced. An element in an adjacency matrix indicates whether the vertices in the graph are adjacent; ‘1’ denotes adjacency, while a ‘0’ denotes non-adjacency. This matrix-based approach allows for efficient representation and processing of graph data, facilitating effective attention-based learning across the different layers.
In the Word–Character Containing Graph Network (WCCGN), characters within a sentence can capture the boundary and semantic information of self-matching vocabulary. As demonstrated in Figure 2, if a vocabulary item m contains a character n, then the entry (m, n) of the adjacency matrix C for the WCCGN is assigned a value of 1. This indicates a direct relationship where the vocabulary item encompasses the character, thus linking characters to the words they help form. This connection facilitates the effective capture of both boundary and deeper semantic layers of information, enhancing the text's representational richness in the network.
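A small sketch of how the containing-graph adjacency matrix C might be built, with matched words given as character spans; representing words by (start, end) spans and adding self-loops are assumptions here (self-loops are a common GAT convention):

```python
import numpy as np

def containing_adjacency(n_chars, word_spans):
    """Adjacency matrix C for the word-character containing graph.

    Vertices 0..n-1 are characters; n..n+m-1 are matched words given
    as (start, end) character spans (end exclusive). C[n+k, i] = 1
    whenever word k contains character i.
    """
    m = len(word_spans)
    C = np.eye(n_chars + m, dtype=int)       # self-loops (assumption)
    for k, (start, end) in enumerate(word_spans):
        for i in range(start, end):
            C[n_chars + k, i] = C[i, n_chars + k] = 1
    return C

# 3 characters, one matched word covering positions 0..2 (hypothetical)
C = containing_adjacency(3, [(0, 3)])        # 4x4 symmetric 0/1 matrix
```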
The Word–Character Transition Graph Network (WCTGN) facilitates the capture of semantic information from the nearest contextual vocabulary for each character. As illustrated in Figure 3, if a vocabulary item m or a character n matches the subsequence immediately preceding or succeeding a character j, the corresponding entry (m, j) or (n, j) of the adjacency matrix T for the WCTGN is assigned a value of 1. This establishes a direct semantic link between characters and their adjacent vocabulary items, reflecting immediate linguistic contexts.
Furthermore, if a vocabulary item is contextually related to another vocabulary item as part of the preceding or succeeding context, the adjacency matrix entry (m, k) is also assigned a value of 1. This linkage captures broader contextual relationships, ensuring that the semantic flow between closely related vocabulary items within the text is maintained and effectively represented in the model. This structure enhances the model’s ability to comprehend and integrate contextual nuances, significantly improving its text classification capabilities.
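Under the same span-based representation (an assumption), the transition edges can be sketched as follows; here a word-word transition edge links a word to any word that starts exactly where it ends:

```python
import numpy as np

def transition_adjacency(n_chars, word_spans):
    """Adjacency matrix T for the word-character transition graph.

    A word is linked to the character immediately before its span and
    to the character immediately after it; adjacent characters are
    linked; two words are linked when one begins where the other ends.
    """
    m = len(word_spans)
    T = np.eye(n_chars + m, dtype=int)       # self-loops (assumption)
    for i in range(n_chars - 1):             # adjacent characters
        T[i, i + 1] = T[i + 1, i] = 1
    for k, (start, end) in enumerate(word_spans):
        if start > 0:                        # immediately preceding character
            T[n_chars + k, start - 1] = T[start - 1, n_chars + k] = 1
        if end < n_chars:                    # immediately succeeding character
            T[n_chars + k, end] = T[end, n_chars + k] = 1
    for k, (_, e) in enumerate(word_spans):  # word-to-word transitions
        for l, (s, _) in enumerate(word_spans):
            if k != l and s == e:
                T[n_chars + k, n_chars + l] = T[n_chars + l, n_chars + k] = 1
    return T
```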
The Word–Character Lattice Graph Network (WCLGN) is designed to implicitly capture the semantic information of the nearest contextual vocabulary and some self-matching vocabulary terms. As shown in Figure 4, if a character n immediately precedes or succeeds another character j, the entry (n, j) of the adjacency matrix L for the WCLGN is assigned a value of 1. This assignment explicitly connects characters that are adjacent within the text, facilitating the capture of local contextual information.
Additionally, if a character n matches the starting or ending character of a vocabulary item m, the adjacency matrix entry (n, m) is also assigned a value of 1. This linkage captures not just adjacency but also the boundary alignment between characters and vocabulary terms. By mapping these relationships, the WCLGN can effectively integrate and represent both the direct context and the boundary information of vocabulary items, enhancing the model's capability to understand and process textual data comprehensively. This dual capture mechanism ensures that the text classification model can leverage both local and broader semantic cues efficiently.
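A sketch of the lattice adjacency under the same span-based conventions; a character is linked to a word when it sits at the word's start or end position:

```python
import numpy as np

def lattice_adjacency(n_chars, word_spans):
    """Adjacency matrix L for the word-character lattice graph.

    Adjacent characters are linked; a character is linked to a word
    when it is that word's starting or ending character.
    """
    m = len(word_spans)
    L = np.eye(n_chars + m, dtype=int)       # self-loops (assumption)
    for i in range(n_chars - 1):             # adjacent characters
        L[i, i + 1] = L[i + 1, i] = 1
    for k, (start, end) in enumerate(word_spans):
        for i in (start, end - 1):           # boundary characters of word k
            L[i, n_chars + k] = L[n_chars + k, i] = 1
    return L
```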
In the construction of the model using three graph attention network layers, the l-th layer of a graph attention network (GAT) receives an input set of node features $H^l = \{h_1^l, h_2^l, \ldots, h_N^l\}$, $h_i^l \in \mathbb{R}^{F_l}$, along with an adjacency matrix $A \in \{0, 1\}^{N \times N}$, where $N$ denotes the number of nodes and $F_l$ represents the dimensionality of the features at the l-th layer. The output of the l-th layer is a new set of node features $H^{l+1} = \{h_1^{l+1}, \ldots, h_N^{l+1}\}$.
A GAT employs $K$ independent attention mechanisms (heads), each of which computes the importance of node j's features to node i's features. The attention mechanism in GAT is used to weigh the influence of each node's features on the others according to the structure specified by the adjacency matrix $A$. The computation of the new feature for each node involves aggregating features from its neighborhood, weighted by attention scores. The attention scores are computed as follows:
1. Linear Transformation: Each node feature $h_i$ is first transformed by a weight matrix $W$, which is shared across the network but specific to each attention head. This step projects the features into a space where attention coefficients can be more effectively learned:

$$z_i = W h_i$$

2. Attention Coefficient Calculation: The attention coefficients that indicate the importance of node j's features for node i are calculated using a pairwise attention mechanism on the transformed features. Typically, this applies the softmax function to a nonlinear (LeakyReLU) transformation of a linear combination of features:

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\mathrm{LeakyReLU}\!\left(a^{\top} [W h_i \,\|\, W h_j]\right)\right) = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top} [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top} [W h_i \,\|\, W h_k]\right)\right)}$$

where $a$ is a weight vector, $\|$ denotes concatenation, and $\mathcal{N}_i$ includes node i and its neighbors as specified by $A$.

3. Feature Update: The new features for each node are then computed as a weighted sum of the transformed features of the neighboring nodes, scaled by the computed attention coefficients:

$$h_i' = \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$$

where $\sigma$ is an activation function.
This framework enables each node to dynamically adjust the influence of its neighboring nodes' features based on the overall structure of the graph, thereby effectively capturing both local and global structural information in the feature updates. This process is repeated for each of the $K$ attention heads, and the results can be concatenated or averaged, depending on the specific architecture, to form the final output features for each node.
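The three steps above can be sketched in NumPy as a single-head layer; the LeakyReLU slope (0.2) and the ReLU output activation follow the common GAT formulation and are assumptions here:

```python
import numpy as np

def gat_layer(H, A, W, a):
    """Single-head graph attention layer.

    H: (N, F) node features; A: (N, N) 0/1 adjacency with self-loops;
    W: (F, Fp) shared weight matrix; a: (2*Fp,) attention weight vector.
    """
    Z = H @ W                                       # 1. linear transformation
    N = Z.shape[0]
    e = np.empty((N, N))
    for i in range(N):                              # 2. attention logits
        for j in range(N):
            s = np.concatenate([Z[i], Z[j]]) @ a    # a^T [Wh_i || Wh_j]
            e[i, j] = s if s > 0 else 0.2 * s       # LeakyReLU
    e = np.where(A > 0, e, -1e9)                    # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)       # row-wise softmax
    return np.maximum(alpha @ Z, 0)                 # 3. weighted sum + ReLU
```

Masking with a large negative constant before the softmax drives the attention weight of non-neighbors to (numerically) zero, so each node aggregates only over its neighborhood $\mathcal{N}_i$.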
To construct models for three entirely distinct character–word graphs, this paper utilizes three independent graph attention networks (GATs), designated as GAT_1, GAT_2, and GAT_3. Since all three GATs share the same set of vertices, the input node features for each GAT are provided by the matrix $X = [H; X^w]$. This shared matrix initializes the node features across all networks. The formula is as follows:

$$G_k = \mathrm{GAT}_k(X, A_k), \quad k \in \{1, 2, 3\}$$

Among them, $A_1 = C$, $A_2 = T$, and $A_3 = L$ are the adjacency matrices of the containing, transition, and lattice graphs, respectively. We keep the first n columns of each output $G_k$ and discard the last m columns, because only the character nodes correspond to decoding labels.
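Putting the pieces together, a sketch of the fan-out over the three graphs; nodes are stored as rows here (the paper's matrix orientation may differ), and the stub GAT standing in for a trained layer is purely illustrative:

```python
import numpy as np

def run_three_gats(X, adjacencies, gats, n):
    """Apply three independent GATs to the shared node features X of
    shape (n + m, d) and keep only the first n node outputs, i.e. the
    character nodes that carry decoding labels."""
    outputs = [gat(X, A) for gat, A in zip(gats, adjacencies)]
    return [G[:n] for G in outputs]          # drop the m word nodes

identity_gat = lambda X, A: A @ X            # stand-in for a trained GAT_k
X = np.random.default_rng(0).standard_normal((7, 4))   # n=5 chars + m=2 words
A = np.eye(7)
G1, G2, G3 = run_three_gats(X, [A, A, A], [identity_gat] * 3, n=5)
# each G_k has shape (5, 4)
```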
2.1.3. Adjacency Matrix Processing
In the context of graph neural networks used for text classification, dealing with common issues like overfitting and oversmoothing is crucial, especially when handling complex data such as text. Overfitting occurs when a model performs excellently on training data but poorly on unseen test data, often due to insufficient training data or excessive model complexity, leading the model to learn the training data features too specifically at the expense of generalization. Oversmoothing, on the other hand, happens when an increase in network layers causes node representations to become overly similar, resulting in information loss and gradient vanishing.
To address these challenges, the DropEdge technique is introduced, which randomly removes some non-zero elements of the adjacency matrices by setting them to zero. This random alteration preserves the original features while enhancing data diversity, which helps reduce model complexity and prevent overfitting.
Furthermore, DropEdge slows the convergence of the network, which helps mitigate the oversmoothing issue. By reducing direct connections between nodes, the model can more effectively retain crucial information from the input data and prevent the homogenization of node representations, thereby enhancing generalizability. Thus, DropEdge not only reduces computational complexity but also allows for deeper network layers, enabling the extraction of more profound features and enhancing the expressiveness of the network. This, in turn, improves the model's accuracy and robustness.
Figure 5 illustrates the process of applying DropEdge to the adjacency matrices, demonstrating how this technique modifies the network structure to improve model performance and stability in text classification tasks.
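A sketch of the DropEdge operation on a symmetric 0/1 adjacency matrix; keeping self-loops and dropping edges symmetrically are assumptions about the implementation:

```python
import numpy as np

def drop_edge(A, p, seed=None):
    """Randomly zero a fraction p of the off-diagonal edges of A.

    Edges are removed symmetrically so the graph stays undirected,
    and self-loops are kept so every node still attends to itself.
    """
    rng = np.random.default_rng(seed)
    A = A.copy()
    iu = np.triu_indices_from(A, k=1)            # upper-triangle positions
    edges = np.flatnonzero(A[iu])                # indices of existing edges
    drop = rng.choice(edges, size=int(len(edges) * p), replace=False)
    rows, cols = iu[0][drop], iu[1][drop]
    A[rows, cols] = A[cols, rows] = 0            # remove both directions
    return A

A = np.ones((4, 4), dtype=int)                   # fully connected toy graph
B = drop_edge(A, 0.5, seed=0)                    # 3 of the 6 edges removed
```

Resampling the dropped edges at every training epoch is what produces the data-augmentation effect described above; at inference time the full adjacency matrices are used.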