5.2. Construction of Cybersecurity Dataset
To verify the effectiveness of the model, this paper builds a text intelligence training dataset from mainstream Chinese information security communities. The collected data are cleaned, standardized, and structured to ensure clarity and accuracy.
(1) Data Crawling
All the data required for model training come from text information on cybersecurity intelligence community websites. A Python web crawler is used to collect a large amount of data from major website platforms, including 10 Chinese sites such as “Anquanke”, the “National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC)”, “WWW.YOUXIA.ORG”, “CNNVD”, “ZDNet”, and “Hacker News”, as well as 6 English websites, totaling over 10,000 valid data entries. To further standardize the data format, the crawler focuses on fields such as title, date, author, source, category, views, comments, tags, content, link, and summary.
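A minimal sketch of the crawling step is given below, assuming a generic article page whose fields can be located by CSS selectors; the URL and selectors are hypothetical placeholders, since each of the 16 source sites has its own page structure.

```python
# Minimal crawling sketch: fetch one article page and extract structured fields.
# The URL and CSS selectors below are hypothetical placeholders, not the actual
# page structure of the listed sites.
import requests
from bs4 import BeautifulSoup

def crawl_article(url: str) -> dict:
    resp = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "date": soup.select_one(".post-date").get_text(strip=True),
        "author": soup.select_one(".author").get_text(strip=True),
        "content": soup.select_one(".article-body").get_text(" ", strip=True),
        "link": url,
    }

if __name__ == "__main__":
    record = crawl_article("https://example.com/security-news/12345")  # placeholder URL
    print(record["title"])
```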
(2) Data Annotation
After the text format of the data is cleaned, the data are imported into a MySQL database for shared storage, which forms unified training and test set samples and facilitates subsequent annotation and classification.
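As a rough illustration of this shared-storage step, the sketch below inserts one cleaned record into a MySQL table; the connection parameters, table name, and column set are assumptions for illustration only.

```python
# Sketch of importing a cleaned record into MySQL for shared storage.
# Connection parameters and the table schema are illustrative assumptions.
import pymysql

conn = pymysql.connect(host="localhost", user="user", password="password",
                       database="intel", charset="utf8mb4")
record = {"title": "...", "date": "2023-01-01", "source": "Anquanke",
          "content": "...", "link": "https://example.com/article"}
with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO intelligence (title, pub_date, source, content, link) "
        "VALUES (%s, %s, %s, %s, %s)",
        (record["title"], record["date"], record["source"],
         record["content"], record["link"]),
    )
conn.commit()
conn.close()
```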
To determine the features of the training and test sets, manual and machine annotation are employed to label object entities, attributes, and relations within the processed text. For this model, a total of 13,903 text sentences have been annotated. To better reflect the “big data, few-shot” characteristics of cybersecurity data, this chapter does not adopt the traditional 8:1:1 training–validation–test split, but instead uses a 1:1:8 ratio to better simulate real situations and verify the effectiveness of the model on small samples. The data are annotated with entity boundaries and relationship types, and the processed data format is shown in Figure 13.
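For illustration, one annotated instance might be represented as below; the field names are assumptions made for readability, while the actual schema is the one shown in Figure 13.

```python
# Illustrative annotated instance (field names are assumptions; the actual
# schema follows Figure 13): entity boundaries are token spans and the
# relation is one of the 12 labels defined in Table 4.
sample = {
    "tokens": ["Sendmail", "versions", "8.8.0", "and", "8.8.1", "have", "a",
               "MIME", "buffer", "overflow", "vulnerability"],
    "head": {"span": [0, 0], "text": "Sendmail"},
    "tail": {"span": [1, 4], "text": "versions 8.8.0 and 8.8.1"},
    "relation": "has_version",
}
```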
A total of 12 relation types are designed: has_version, has_element, because_of, is_product_of, has_vul, use_means_of, lead_to_consequence, exploit, develope, cooperate_with, target_at, and no_relation. The relation definitions are shown in Table 4.
has_version: This relation links the entities of a product and version number(s), indicating the corresponding version(s) of the product. For example, in the sentence “Sendmail versions 8.8.0 and 8.8.1 have a MIME buffer overflow vulnerability that can be exploited to gain root access”, the vulnerable versions of Sendmail are 8.8.0 and 8.8.1, thus there is a “has_version” relation between “Sendmail” and “versions 8.8.0 and 8.8.1”.
has_element: This relation connects a product and its element(s), indicating the subordinate relationship between the element(s) and the product. For example, in the sentence “Sendmail versions 8.8.0 and 8.8.1 have a MIME buffer overflow vulnerability that can be exploited to gain root access”, MIME is an element of Sendmail, thus there is a “has_element” relation between “Sendmail” and “MIME”.
because_of: This relation links a vulnerability with its cause, indicating the cause of the vulnerability. For example, in the sentence “Versions of Screen prior to 3.9.1.0 have a vulnerability related to multi-attach error”, the multi-attach error is a cause of the vulnerability, thus there is a “because_of” relation between “vulnerability” and “multi-attach error”.
is_product_of: This relation connects a product with its manufacturer, indicating that a certain product belongs to a certain manufacturer. For example, in the sentence “Microsoft Windows NT is a large-scale computer network operating system developed by the Microsoft Corporation”, Microsoft Windows NT is a product of Microsoft, thus there is an “is_product_of” relation between the entity “Microsoft Windows NT” and the entity “Microsoft Corporation”.
has_vul: This relation links a product with the vulnerability(ies), indicating the vulnerability(ies) present in the product. For example, in the sentence “Sendmail versions 8.8.0 and 8.8.1 have a MIME buffer overflow vulnerability that can be exploited to gain root access”, Sendmail has a buffer overflow vulnerability, thus there is a “has_vul” relation between “Sendmail” and “buffer overflow vulnerability”.
use_means_of: This relation connects an organization with a tool, indicating that the attacker uses a specific tool for network attacks. For example, in the sentence “The attacker caused a denial of service by sending excessive messages”, the means of attack by the “attacker” is “sending excessive messages”, thus there is a “use_means_of” relationship between them.
lead_to_consequence: This relation links a vulnerability with its consequence, indicating the consequence caused by the vulnerability. For example, in the sentence “There is a vulnerability in QMS CrownNet Unix Utilities version 2060, allowing root login without a password”, the vulnerability leads to the consequence of “allowing root login without a password”, thus there is a “lead_to_consequence” relation between them.
exploit: This relation connects an organization with a vulnerability, indicating that the organization exploits the vulnerability for network attacks. For example, in the sentence “There is a buffer overflow vulnerability in the krb425_conv_principal function of Kerberos 5, allowing remote attackers to gain root privileges”, the “remote attackers” exploit the “buffer overflow vulnerability”, thus there is an “exploit” relation between them.
develope: This relation links an organization with a tool, indicating that an organization has developed a certain tool. For example, in the sentence “Fluxwire was created by the CIA to enable mesh networking”, the “CIA” created “Fluxwire”, thus establishing a “develop” relationship between the two.
cooperate_with: This relation connects organizations or tools, indicating cooperation or association between organizations or tools. For example, in the sentence “We believe that this malicious file is associated with some APT attack organizations in India, including Patchwork, BITTER, and Confucius”, the “malicious file” is associated with the “APT attack organizations Patchwork, BITTER, and Confucius”, thus there is a “cooperate_with” relation between them.
target_at: This relation links organizations, indicating a hostile or targeting relation between them. For example, in the sentence “We believe that this malicious file is associated with some APT attack organizations in India, including Patchwork, BITTER, and Confucius”, the “malicious file” is associated with the “APT attack organizations Patchwork, BITTER, and Confucius”, thus there is a “target_at” relation between them.
no_relation: This relation indicates that there is no relationship between two entities.
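Gathering the 12 relation labels defined above into a label-to-index mapping used for classification (a minimal sketch; the label strings follow Table 4, while the index order is arbitrary):

```python
# The 12 relation labels from Table 4 mapped to class indices for training.
RELATION_LABELS = [
    "has_version", "has_element", "because_of", "is_product_of", "has_vul",
    "use_means_of", "lead_to_consequence", "exploit", "develope",
    "cooperate_with", "target_at", "no_relation",
]
LABEL2ID = {label: i for i, label in enumerate(RELATION_LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}
```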
5.5. Analysis of Comparison Experiment Results
This chapter conducts comparison experiments on three models, K-adapter, SpanBert, and KnowPrompt, which have achieved excellent results in relation extraction tasks over the past two years. Firstly, a comparison is made on the general relation datasets TACRED and ReTACRED. Subsequently, using the cybersecurity dataset constructed in this paper and the same experimental parameters, the performance of each model in the cybersecurity domain is evaluated.
TACRED is one of the largest and most widely used general-purpose relation classification datasets, containing 42 relation types, with 68,124, 22,631, and 15,509 samples in the training, validation, and test sets, respectively. ReTACRED is a revised version of TACRED that addresses some shortcomings of the original dataset; it contains 40 relation types, with 58,465, 19,584, and 13,418 instances in the training, development, and test sets, respectively.
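As background for the F1 numbers reported below, relation classification on TACRED-style datasets is conventionally scored with micro-averaged precision, recall, and F1, treating no_relation as the negative class; a minimal sketch of this metric follows.

```python
# Micro-averaged P/R/F1 in the TACRED style: "no_relation" counts as negative.
def micro_prf(gold, pred, negative="no_relation"):
    correct = sum(1 for g, p in zip(gold, pred) if g == p and p != negative)
    pred_pos = sum(1 for p in pred if p != negative)
    gold_pos = sum(1 for g in gold if g != negative)
    precision = correct / pred_pos if pred_pos else 0.0
    recall = correct / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```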
K-adapter: K-adapter, proposed by Wang et al. [22] in 2021, is a classic method for injecting knowledge into PLMs (Pre-trained Language Models). An adapter acts as a “plugin” attached to the outside of the PLM, and a pre-trained model can load multiple adapters, each carrying a different type of knowledge. K-adapter designs two adapters: one for factual knowledge obtained from Wikipedia, and one for linguistic knowledge obtained from dependency parsing of web text. This model adopts RoBERTa as its base model and demonstrates good performance in relation extraction tasks.
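The adapter idea described above can be sketched as a small bottleneck module with a residual connection that is trained while the backbone PLM stays frozen; this is a conceptual PyTorch sketch, not the exact K-adapter architecture.

```python
# Conceptual adapter block: a down-project/up-project bottleneck with a
# residual connection, trained while the backbone PLM is frozen.
# This is a simplified sketch, not the exact K-adapter layout.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the original PLM representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```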
SpanBert: SpanBert, proposed by Joshi et al. [23] in 2020, is an excellent extension of BERT. The authors argue that span segments carry semantic information, and therefore replace BERT's strategy of randomly masking individual tokens with masking contiguous spans. Additionally, they introduce the span boundary objective (SBO), which stores span information in the representations of the span's boundary tokens. This model achieves state-of-the-art (SOTA) results on both the SQuAD and OntoNotes datasets.
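The span-masking strategy can be illustrated with a toy function that masks one contiguous span of tokens instead of independently sampled tokens; this is a simplification, since SpanBert additionally samples span lengths from a geometric distribution.

```python
# Toy span masking: mask one contiguous span instead of random single tokens.
# SpanBert additionally samples span lengths from a geometric distribution,
# which is omitted here for brevity.
import random

def mask_one_span(tokens, span_len=3, mask_token="[MASK]"):
    start = random.randrange(0, max(1, len(tokens) - span_len))
    end = min(start + span_len, len(tokens))
    masked = list(tokens)
    for i in range(start, end):
        masked[i] = mask_token
    return masked, (start, end - 1)  # masked tokens and inclusive span boundaries
```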
KnowPrompt: KnowPrompt, proposed by Hu et al. [24] in 2022, is a relation extraction model based on prompt learning. The model constructs virtual entity words and virtual relation words and injects external knowledge to generate entities and relations, thereby mapping them to real labels. KnowPrompt adopts RoBERTa-large as its pre-trained model and achieves SOTA results in both fully supervised and few-shot scenarios.
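The prompt-learning formulation used by KnowPrompt (and by our model) can be sketched as wrapping the input sentence into a template with a [MASK] slot between the two entities and mapping the word predicted at that slot back to a relation label; the template wording and label words below are illustrative assumptions, not the exact templates used by either model.

```python
# Illustrative prompt template for relation extraction: the PLM fills the
# [MASK] slot and a verbalizer maps the predicted word back to a relation label.
# Template wording and label words are assumptions for illustration.
def build_prompt(sentence: str, head: str, tail: str) -> str:
    return f"{sentence} In this sentence, {head} [MASK] {tail}."

VERBALIZER = {          # predicted label word -> relation label
    "develops": "develope",
    "exploits": "exploit",
    "targets": "target_at",
}

prompt = build_prompt(
    "Fluxwire was created by the CIA to enable mesh networking.",
    "CIA", "Fluxwire",
)
```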
We first compare our model with the baseline models on two general relation extraction datasets, TACRED and ReTACRED. Since the cybersecurity-specific rules do not apply to general data, the rule construction module reconfigures rules for the 42 relations in the general datasets to enable the comparison. As shown in Table 8, the following can be observed:
(1) Our model is highly effective on general datasets. On the TACRED dataset, the F1-score is 74.4%, which is 2 percentage points higher than KnowPrompt. On the ReTACRED dataset, the F1-score is 92.9%, which is 1.6 percentage points higher than KnowPrompt. Moreover, the baseline model used in this paper is BERT-base, which has fewer parameters than the RoBERTa-large used by KnowPrompt and does not require external knowledge transfer.
(2) Prompt learning paradigms are superior. Fine-tuning methods based on pre-trained models such as BERT-base, BERT-large, ERNIE, and RoBERTa achieve F1-scores of only 66–70% on the TACRED dataset, which is inferior to KnowPrompt and our model, both of which are based on the prompt learning paradigm.
(3) Rule construction and knowledge injection are both effective. K-adapter utilizes the Wikipedia knowledge base and web text information, KnowPrompt references external knowledge bases, and our model constructs rules; all three achieve around 72% F1 on the TACRED dataset, demonstrating the effectiveness of injected knowledge. However, instead of introducing large knowledge bases, this paper only needs to construct rules for the 42 relations to achieve similar results.
Next, this paper selects the three models that perform well on the general datasets, K-adapter, SpanBert, and KnowPrompt, for experimental comparison on the multi-relation cybersecurity dataset constructed in this chapter, which is divided into ten partitions according to the proportion of the training set (a minimal sketch of generating such a partition is given after the conclusions below). It is worth noting that KnowPrompt and our model require only entity position information, whereas K-adapter and SpanBert also require entity class information; these two models therefore use a version of the cybersecurity dataset with added class information. The comparison results are shown in Table 9, and a visualization of the experimental results is shown in Figure 14, from which the following conclusions can be drawn.
(1) Our model is the most effective in the field of cybersecurity. Across almost all partitioned datasets, the multi-relation extraction model in this chapter achieves good results. Moreover, on the 0.1:1:8 partition, only this model remains effective, while the other models perform poorly with extremely few shots. This demonstrates that the rules and templates built specifically for cybersecurity data in this chapter enable efficient relation classification.
(2) Significant improvement is observed as the number of annotated samples increases. Overall, all models improve markedly as the training set grows, reaching a critical point at the 1:1:8 ratio, where every model exhibits a qualitative improvement. Therefore, this paper adopts 1:1:8 as the benchmark few-shot dataset partition.
(3) Prompt-based paradigms are highly effective in the few-shot setting. Overall, the models rank as follows: our model > KnowPrompt > SpanBert > K-adapter. In the field of cybersecurity, prompt learning paradigms are more effective than the pre-train-then-fine-tune paradigm; although K-adapter achieves excellent results on the general dataset TACRED, it performs the worst on the dataset in this paper, with an F1-score of only 0.02% on the 0.1:1:8 partition. As the number of annotated samples increases, its score gradually improves, reaching 35.4% on the 2:1:7 partition, but it still lags significantly behind the other three models. This indicates that pre-train-then-fine-tune models such as K-adapter are better suited to large datasets and perform poorly in the few-shot setting. Both prompt-learning-based models reach around 96% once the training-set proportion reaches 1.4, and further enlarging the training set does not significantly improve performance, demonstrating the effectiveness of the prompt-based paradigm in the few-shot setting. The improvement in SpanBert's effectiveness is also significant, indicating that semantic information deserves attention even with small samples.
(4) Fast experimental speed. In summary, the multi-relation extraction model in this chapter constructs cybersecurity rules and templates on top of the BERT-base model. With fewer parameters and no need for annotated entity class information, it can quickly, efficiently, and accurately identify cybersecurity relations.
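As referenced above, a minimal sketch of producing one train : validation : test partition of the annotated corpus from a given ratio is shown below; the concrete ten ratios (0.1:1:8, 1:1:8, 2:1:7, and so on) are those listed in Table 9.

```python
# Split the annotated sentences by an arbitrary train:validation:test ratio,
# e.g. ratio=(0.1, 1, 8), (1, 1, 8), or (2, 1, 7); the ten concrete ratios
# used in the experiments are those listed in Table 9.
import random

def split_by_ratio(samples, ratio, seed=42):
    train_r, val_r, test_r = ratio
    total = train_r + val_r + test_r
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_r / total)
    n_val = int(n * val_r / total)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```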