1. Introduction
Today, the Internet of Things (IoT) impacts almost every aspect of societal needs [1]. With the rapid development of network and information technology, new cyber threats (e.g., session hijacking, masquerade attacks, and interruption) [2] show a steadily rising trend. The increasing complexity of attack strategies and the ever-changing attack scenarios make them difficult for traditional network defenses, such as firewalls, to resist. In 2019, more than 10,000 new types of cybercrime were committed in Russia [3]. In February 2022, Ukrainian government agencies and banking websites were targeted by large-scale distributed denial-of-service (DDoS) attacks, which took at least 10 websites offline [4]. To achieve better command of threat situations and coordinate responses to unknown threats, security experts proposed cyber threat intelligence (CTI) for network defense. Gartner [5] first proposed that CTI is knowledge of existing or emerging threats against assets, including scenarios, mechanisms, indicators, and actionable recommendations, which can provide defenders with countermeasures.
Knowledge of threat intelligence originates from security analysis reports, blogs, social media, etc., which provide powerful data support for situational awareness and active network defense [6]. However, threat intelligence is mainly written in natural language and contains a large amount of unstructured data, so it is difficult to visualize the internal relations among its crucial elements. To help researchers quickly understand the semantic associations among these elements, it is necessary to design algorithms that mine entities and the relations between them from large-scale threat intelligence documents to construct a knowledge graph.
Relation extraction aims to identify relations between entities in a given text [7]. As shown in Figure 1, the head entity Attacker “Mealybug” and the tail entity Trojan “Trojan.Emotet” express the relation “Use”. Although relation extraction in the general domain has achieved satisfactory results, mainstream models present the following limitations in the cybersecurity domain: (1) there is a lack of open-source datasets on threat intelligence; (2) threat intelligence contains many terms, such as vulnerability numbers, malware names, and advanced persistent threat (APT) groups, leading to a serious out-of-vocabulary (OOV) problem; (3) threat intelligence documents are complex in structure, and the frequency of entities in a sentence is extremely low, leading to a serious imbalance in the distribution of data labels. In addition, current work mainly focuses on text mining at the sentence level; however, in practical scenarios, an entity may have multiple mentions, and the relations between entities often require at least two sentences to infer [8].
To this end, this paper proposes a novel feature-enhanced document-level relation extraction model (FEDRE) that integrates new features to improve in-domain performance on threat intelligence. We then introduce a teacher–student model to achieve knowledge distillation (FEDRE-KD). In summary, we present a practical model that converts threat intelligence documents into structured data and constructs a knowledge graph, which can be further utilized in threat hunting and decision making. Our contributions can be summarized as follows:
(1) We captured the part-of-speech (POS) of entities, the width of mentions, the distance between entity pairs, and the type of entities as new features for document-level threat intelligence relation extraction. The pre-trained bidirectional encoder representations from transformers (BERT) model was applied as the encoder to alleviate the OOV problem.
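A minimal sketch of how such feature embeddings could be combined with BERT output; the dimensions, vocabulary sizes, and module names here are illustrative assumptions, not the paper's implementation (only the feature-embedding size of 25 follows the experimental setup):

```python
import torch
import torch.nn as nn

class FeatureEnhancedRepr(nn.Module):
    """Concatenates BERT token representations with POS, entity-type,
    mention-width, and entity-pair-distance embeddings, then projects
    back to the hidden size. All sizes are hypothetical placeholders."""
    def __init__(self, hidden=768, feat_dim=25,
                 n_pos=50, n_types=10, max_width=20, max_dist=512):
        super().__init__()
        self.pos_emb = nn.Embedding(n_pos, feat_dim)
        self.type_emb = nn.Embedding(n_types, feat_dim)
        self.width_emb = nn.Embedding(max_width, feat_dim)
        self.dist_emb = nn.Embedding(max_dist, feat_dim)
        self.proj = nn.Linear(hidden + 4 * feat_dim, hidden)

    def forward(self, bert_out, pos_ids, type_ids, width_ids, dist_ids):
        # Look up each feature embedding and append it to the token vector.
        feats = torch.cat([bert_out,
                           self.pos_emb(pos_ids),
                           self.type_emb(type_ids),
                           self.width_emb(width_ids),
                           self.dist_emb(dist_ids)], dim=-1)
        return self.proj(feats)
```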
(2) We introduced a teacher–student model that gathers effective information from texts through soft labels, which retain the associations between classes while eliminating invalid, redundant information. We thereby achieved knowledge distillation and further improved performance.
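A standard soft-label distillation loss can illustrate the idea; the temperature `T` and weight `alpha` are illustrative assumptions (the experiments weight the teacher and student losses 1:1), and this is a generic sketch rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=2.0, alpha=0.5):
    """Combines hard-label cross-entropy with a soft-label KL term.
    The soft labels preserve inter-class associations from the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # rescale gradients
    return alpha * hard + (1 - alpha) * soft
```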
(3) We collected 227 threat intelligence documents and manually annotated them based on an ontology we defined. We systematically compared the performance of our model with mainstream neural network models on the document-level relation extraction task, and the experimental results demonstrate the effectiveness of our model. The extraction results were integrated to construct a threat intelligence knowledge graph, realizing the visualization of correlations among key elements.
4. Experiment
4.1. Dataset
We annotated 227 threat intelligence documents manually, 151 of which were selected as the training set and the remaining 76 as the test set. The training set contained 1610 entities and 949 relations. Definitions of the entities and the relations between them are shown in Table 1 and Table 2. The threat intelligence ontology we constructed is shown in Figure 4.
4.2. Experiment Setup
Our model was trained on an Nvidia GeForce RTX 3090 GPU using PyTorch 1.7.1. We used cased BERT-base as the pre-trained encoder for threat intelligence. We trained the model for 100 epochs with a batch size of 8, using the AdamW optimizer with warmup and an early-stopping strategy (if performance did not improve for 20 consecutive epochs, training was stopped). The learning rate was set to 5 × 10−5 for BERT and 1 × 10−4 for the other layers. The loss weights of the teacher model and the student model were set to 1:1. We chose 25 as the size of the POS embedding, type embedding, width embedding, and distance embedding.
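The early-stopping rule described above can be sketched as a small helper; the class name and interface are hypothetical, but the patience of 20 epochs follows the setup:

```python
class EarlyStopper:
    """Stops training when the monitored metric (e.g., F1 on the dev
    split) has not improved for `patience` consecutive epochs."""
    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; returns True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```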
To tackle the imbalance of the dataset, we adopted random oversampling to duplicate minority classes before training. Specifically, tokens were replaced by their synonyms to create new samples.
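A minimal sketch of this oversampling-with-synonym-replacement step; the function name and the synonym table are hypothetical stand-ins for whatever resource the pipeline actually uses:

```python
import random

def oversample_with_synonyms(samples, synonyms, target_count, seed=0):
    """Duplicates minority-class token sequences until `target_count`
    is reached, swapping one token per copy for a synonym when the
    synonym table (hypothetical here) offers one."""
    rng = random.Random(seed)
    out = list(samples)
    while len(out) < target_count:
        tokens = list(rng.choice(samples))           # copy a random sample
        idxs = [i for i, t in enumerate(tokens) if t in synonyms]
        if idxs:                                     # replace one token
            i = rng.choice(idxs)
            tokens[i] = rng.choice(synonyms[tokens[i]])
        out.append(tokens)
    return out
```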
Following prior studies, we adopted the metrics commonly used in relation extraction to evaluate our model, i.e., precision (P), recall (R), and F1-score (F1), with F1 as the main evaluation metric. We also report time overhead as an additional index. Meanwhile, we calculated the model's performance on each kind of relation at a more granular level.
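For reference, micro-averaged P/R/F1 over extracted triples can be computed as below; treating predictions as (head, tail, relation) triples is an assumption about the evaluation granularity:

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over predicted
    (head, tail, relation) triples against the gold annotations."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # correctly predicted triples
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```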
We compared our model with three excellent works: SSAN [15], GAIN [31], and ATLOP [8]. For fair comparison, we used cased BERT-base as the base encoder for all methods.
4.3. Result and Analysis
4.3.1. Model Comparison
Table 3 presents the relation extraction results of our model and the baseline models on our dataset. First, compared with ATLOP, FEDRE improved performance significantly, by 21.01/22.61/22.38 in P/R/F1 on the test set, demonstrating the usefulness of the additional features during inference. In addition, FEDRE-KD outperformed FEDRE by 4.51 in F1, proving that knowledge distillation effectively promotes performance. The experimental results also show that FEDRE-KD outperformed all the baseline models: its F1 was 21.07 higher than that of SSAN and 20.06 higher than that of GAIN.
4.3.2. Ablation Study
We conducted ablation studies to further analyze the utility of each module in FEDRE. The results are shown in Table 4.
We first removed the POS embeddings and the width embeddings, denoted as NoPOS and NoWidth, respectively. Performance dropped when either feature was removed, indicating that POS and width information is important for relation prediction. Specifically, we found that verbs and nouns were more likely to be associated with other tokens in threat intelligence documents. Meanwhile, integrating the width embedding enriches the representation at the mention level.
Then, we removed the entity-type embeddings, denoted as NoType. Performance dropped sharply, by 11.49. Different kinds of entities exhibit different relations; for instance, “Patched” appears only when the head entity belongs to “Vulnerability” and the tail entity belongs to “Time”. Therefore, integrating type embeddings enriches the representation at the entity level.
Finally, we removed the distance embeddings, denoted as NoDistance. Performance dropped by 11.47, further demonstrating that the distance between two entities enriches the representation at the entity-pair level.
4.3.3. Fine-Grained Performance Comparison
To further observe the ability of the additional features and knowledge distillation to fit different types of data, Table 5 shows the fine-grained performance in detail. Combined with the distribution of relations in Table 2, it can be seen intuitively that introducing the additional features significantly improved the classification of most relation types, such as “Target”, “Perform”, and “Use”. Meanwhile, introducing knowledge distillation brought further improvement, with a maximum gain of 21.16.
4.3.4. Choice of Sampling Technique
To alleviate the imbalance of the dataset, oversampling and undersampling were introduced. The results in Table 6 show that the oversampling algorithm significantly improved performance, whereas the undersampling algorithm risked unreasonably removing instances and thereby losing important information.
4.4. Threat Intelligence Knowledge Graph Construction
We fed threat intelligence documents with annotated entities into the trained FEDRE-KD model. The model predicted relations from the predefined relation set for all entity pairs. Then we inserted the resulting entity–relation set into the knowledge graph using the neo4j-admin command. The results are shown in Figure 5.
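The import step above can be sketched as follows: extracted triples are written into the node and relationship CSV files that the neo4j-admin bulk import tool consumes. The file names and the single `name` property are illustrative assumptions; only the `:ID`/`:START_ID`/`:END_ID`/`:TYPE` header conventions come from the import tool's format:

```python
import csv

def write_import_csvs(triples, node_path="entities.csv",
                      rel_path="relations.csv"):
    """Writes (head, relation, tail) triples into two CSV files in the
    header format expected by neo4j-admin bulk import."""
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    with open(node_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["entityId:ID", "name"])      # one node row per entity
        for e in entities:
            w.writerow([e, e])
    with open(rel_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([":START_ID", ":END_ID", ":TYPE"])
        for h, r, t in triples:                  # one row per relation
            w.writerow([h, t, r])
```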