1. Introduction
Cyber threat intelligence (CTI) is an important source of knowledge for cybersecurity practitioners, and its timely acquisition and analysis play a vital role in cyber attack and defense. CTI contains detailed information about current or upcoming cybersecurity threats, such as strategies, technologies and assets [1]; it can help businesses or organizations implement active cyber defense against cybersecurity threats [2]. However, due to the diversity and concealment of cyber threat knowledge, it is difficult for most security agencies and personnel to efficiently obtain accurate threat knowledge, which prevents them from starting the corresponding defense mechanisms. In addition, as the self-replication, mutation and dissemination capabilities of attack software and malicious programs continue to strengthen, their destructiveness also increases, which greatly complicates later data recovery and information system reconstruction. Cybersecurity experts need CTI to broaden their knowledge boundaries and deal with new cybersecurity issues [3]. Generally speaking, any information related to a cyber threat can be called CTI, such as logs, network traffic, pictures and text. However, most threat intelligence knowledge is described and published in textual form. Therefore, effectively extracting threat intelligence knowledge from open-source Internet articles or reports with natural language processing technology, and transforming it into a standardized, structured knowledge organization form, is of great significance and practical value for cybersecurity research.
As one of the most active research hotspots in the field of cybersecurity, cyber threat intelligence knowledge extraction identifies the security entities in a text by analyzing and learning the contextual information and classifies the semantic relations it contains [4]. This task aims to extract knowledge from sources with different structures and deposit it in a knowledge graph. In a knowledge graph, knowledge is generally organized as entity-relation triples (h, r, t), where h represents the head entity, t is the tail entity, and r is the relation between the two. Data sources can be structured data (linked data, databases), semi-structured data (tables and lists in web pages), unstructured data (pure text), etc. The task mainly includes two parts: named entity recognition (NER) and relation extraction (RE). The former identifies the boundaries of entities and classifies them into predefined categories, while the latter determines whether a predefined relation type holds between entities in the input text. An important task of CTI knowledge extraction is to identify the key entities involved in a threat, such as attackers, network products and vulnerabilities. At present, the most commonly used entity extraction methods are based on sequence prediction models, which find the optimal label sequence for an input text sequence according to its context. While entity extraction concerns the explicit knowledge in the intelligence text, relation extraction concerns the implicit semantic relations in the sentence and the interactions between entities. Its goal is to extract possible correlations between entities, such as the "exploit" relation between an attack organization and a vulnerability, the "use" relation between an attack organization and malicious software, or the "has_version" relation between a software or hardware product and its version. According to the datasets used, relation extraction can be divided into three categories: template-based, supervised and weakly supervised. Common implementation approaches include the pipeline method and the joint extraction method.
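To make the two-stage pipeline concrete, the following sketch decodes a BIO tag sequence into entities and then pairs them into (h, r, t) triples. The tag set, the toy sentence and the trivial rule standing in for the relation classifier are illustrative assumptions, not the models evaluated later in this paper.

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) spans from a BIO tag sequence."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

# Character-level tokens for a toy mixed Chinese-English sentence
# ("Apache has a vulnerability"); "P" = Product, "V" = Vulnerability.
tokens = list("Apache存在漏洞")
tags = ["B-P", "I-P", "I-P", "I-P", "I-P", "I-P", "O", "O", "B-V", "I-V"]
entities = decode_bio(tokens, tags)

# A hand-written rule stands in for the relation classifier: a Product
# and a Vulnerability in one sentence yield an illustrative triple.
triples = [(h, "has_vulnerability", t)
           for h, ht in entities if ht == "P"
           for t, tt in entities if tt == "V"]
```

Here NER supplies the candidate entities and RE (approximated by the rule) links them, mirroring the pipeline method described above.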
For the field of cybersecurity, in addition to designing an effective information extraction model, it is also necessary to have enough field annotation data as a training corpus to train or fine-tune the model. CTI, usually compiled by experts in the field of cybersecurity, may be a barrier to the general reader because of the expertise needed. In addition, due to individual differences in writing habits, language use and analytical focus, the intelligence content shows significant diversity in consistency and accuracy. Due to the complexity of entities in the field of cybersecurity, the task of extracting cybersecurity knowledge from various channels requires an ontology model that can effectively organize cybersecurity entities and relations as a guide. The ontology can accurately cover the information of the domain entity type, entity relation type and so on.
Extracting information from a corpus presents three main challenges. First, in the field of Chinese cybersecurity, there is no open dataset that satisfies both entity and relation extraction tasks [5]. Existing datasets are either not open or only support general NER tasks, and do not adequately support NER, relation extraction, or joint entity-relation extraction in cybersecurity [6]. Second, the structure of cybersecurity text data is complex, with many descriptions mixing Chinese and English [7]. This includes software and vulnerabilities that have both Chinese and English names, as well as the extensive use of nested structures, abbreviations and obscure terms, which significantly increases the difficulty of entity and relation extraction. Finally, the Chinese CTI corpus is more complicated than the English CTI corpus, adding another layer of difficulty to the extraction process [8].
To address these challenges, we present the Bilingual (Chinese-English) Vulnerability Triple Extraction Dataset (BVTED), the first known dataset capable of supporting Chinese cybersecurity entity-relation triple extraction. We develop an ontology model for describing cybersecurity intelligence knowledge, encompassing all entity and relation types necessary for knowledge extraction. Given the dataset's characteristics, which include a substantial amount of mixed Chinese and English text, we train five deep learning-based named entity recognition models, two standalone relation extraction models, and two joint entity-relation extraction models. The experimental results validate the effectiveness of the BVTED dataset, demonstrating its potential for advancing research in cybersecurity intelligence.
This work provides a significant scientific contribution by addressing the lack of open, comprehensive datasets in Chinese cybersecurity knowledge triple extraction tasks. By offering a robust dataset and demonstrating its utility through extensive experiments, we enhance the ability to mine, analyze and utilize cybersecurity knowledge, thereby advancing the field of cyber threat intelligence. The remainder of this paper is organized as follows:
Section 2 reviews related work, including the existing datasets and information extraction techniques in the cybersecurity domain;
Section 3 introduces the construction, annotation and statistical analysis of the BVTED dataset;
Section 4 introduces the experimental settings and evaluation metrics in this paper and evaluates the final experimental results;
Section 5 summarizes the contributions and limitations of our research;
Section 6 provides a detailed outline of our future research directions.
3. Materials and Methods
To elucidate this dataset, this section presents an overview of the data collection, annotation, statistics and differences from existing datasets.
3.1. Data Collection Process
To support the tasks of cybersecurity triple extraction and cybersecurity knowledge graph construction, a cybersecurity domain dataset with entity and relation annotations is necessary. The corpus data can be sourced from a variety of open, heterogeneous threat intelligence sources, such as vulnerability databases, cybersecurity news or blogs, cybersecurity industrial technical reports, and hacking forums. Among them, vulnerability database records have a relatively standardized format, which facilitates data processing and annotation. A major contribution of our work is annotating the first Chinese vulnerability dataset at the sentence level. We build this dataset by crawling 137,625 vulnerability records from the China National Vulnerability Database of Information Security (CNNVD), covering the period from 1 October 1988 to 3 February 2020. Each vulnerability record in the CNNVD contains 11 items: vulnerability name, vulnerability type, product, vendor, threat level, CNNVD id, CVE id, recording time, vulnerability textual description, reference URLs, and official patches. Among them, items such as vendor, threat level and vulnerability type are often missing; with triple extraction technology, these missing entities can be replenished from the textual description. We acquire the textual description from each vulnerability record and split it into individual sentences. Finally, 27,311 non-repeated sentences are randomly selected from a total of 461,199 individual sentences.
3.2. Annotation Tool and Method
Figure 2 shows an annotation example for a vulnerability description sentence. The first line is the tag sequence produced with the "BIO" tagging method. The second line is a Chinese sentence example from "CNNVD-200212-237". We connect each character in the Chinese sentence with its label using a dashed arrow. For instance, the first character of the Chinese sentence, "C", is connected with its label "B-P" by a red dashed arrow, which means this "C" is the starting part of a "Product" entity. Similarly, the labels of the characters from the second one, "a", to the 12th one, "n", are all "I-P", meaning they are the middle part of the "Product" entity. The 13th character (the Chinese character meaning "is") does not belong to any entity; hence, its tag is "O", the abbreviation for "Outside". The English translation of the Chinese example is given in the last line. The relations between entities are represented by arrowed lines below the third line.
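The span-to-tag mapping illustrated in Figure 2 can be sketched as follows; the example sentence and span offsets below are hypothetical, and only the BIO convention itself is taken from the paper.

```python
def spans_to_bio(text, spans):
    """Convert character-level entity spans to BIO tags.

    `spans` is a list of (start, end, abbrev) with `end` exclusive;
    the abbreviations follow Table 1 (e.g. "P" for Product,
    "V" for Vulnerability).
    """
    tags = ["O"] * len(text)
    for start, end, abbrev in spans:
        tags[start] = "B-" + abbrev          # entity start tag
        for i in range(start + 1, end):
            tags[i] = "I-" + abbrev          # inside-entity tags
    return tags

# Hypothetical sentence: "Cayman系统存在漏洞" ("the Cayman system
# has a vulnerability"); "Cayman系统" is a Product, "漏洞" a Vulnerability.
sentence = "Cayman系统存在漏洞"
tags = spans_to_bio(sentence, [(0, 8, "P"), (10, 12, "V")])
```

Each character receives exactly one tag, so the tag sequence and the character sequence stay aligned, as in the figure.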
In the dataset, the average sentence length is 51.39 characters; the longest sentence has 627 characters and the shortest has just 6. Since open-source automatic information extraction models cannot fully identify most vulnerability entities and their relations in the cyber threat intelligence field, we use an open-source annotation tool named "Colabeler" to manually label the entities and relations in CNNVD vulnerability description sentences.
Figure 3 shows a labeling example for a CNNVD vulnerability sentence using Colabeler [http://www.colabeler.com/]. The entity words are labeled with different colors and connected by arrowed lines; the entity types are tagged above each entity, and the relation names are labeled on the lines.
As input for the target knowledge extraction models, we need to convert the output of the annotation tool into the standard CoNLL format.
Figure 4 shows two kinds of formats of annotation data.
Figure 4a represents the data exported from the annotation tool in JSON format, containing the entity annotations, relation annotations and the vulnerability text itself.
Figure 4b is the labeled data in BIO format converted from the annotation tool's output. It consists of five columns: the index, the characters of the vulnerability sentence, the entity tags in BIO format, the relation tags of the corresponding head entity, and the indices of the tail entities.
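A minimal sketch of this conversion is shown below. The JSON field names ("text", "entities", "relations") and the relation label are assumptions about the export schema, not Colabeler's documented format.

```python
import json

# Hypothetical export for one annotated sentence; spans are
# character offsets with exclusive ends.
raw = json.loads("""{
  "text": "Apache存在漏洞",
  "entities": [{"start": 0, "end": 6, "type": "P"},
               {"start": 8, "end": 10, "type": "V"}],
  "relations": [{"head": 0, "tail": 1, "type": "has_vulnerability"}]
}""")

# Column 3: entity tags in BIO format.
tags = ["O"] * len(raw["text"])
for ent in raw["entities"]:
    tags[ent["start"]] = "B-" + ent["type"]
    for i in range(ent["start"] + 1, ent["end"]):
        tags[i] = "I-" + ent["type"]

# Columns 4-5: only the first character of a head entity carries the
# relation label and the tail entity's start index; other rows get "O"/-1.
rel_tag = ["O"] * len(raw["text"])
tail_idx = [-1] * len(raw["text"])
for rel in raw["relations"]:
    head = raw["entities"][rel["head"]]
    tail = raw["entities"][rel["tail"]]
    rel_tag[head["start"]] = rel["type"]
    tail_idx[head["start"]] = tail["start"]

# One CoNLL-style row per character: index, character, BIO tag,
# head-entity relation tag, tail-entity index.
rows = [(i, ch, tags[i], rel_tag[i], tail_idx[i])
        for i, ch in enumerate(raw["text"])]
```

Writing `rows` out tab-separated, one row per line with a blank line between sentences, yields the five-column format shown in Figure 4b.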
3.3. Statistics of Entities and Relations
As mentioned in the proposed ontology, we use 13 kinds of entity types and 15 kinds of relationships to describe the vulnerability knowledge in CNNVD.
Table 1 shows the details of each entity type. The 13 entity types in this dataset are Product, Definition, Vendor, Location, Vulnerability, Version, Part, Consequence, Attacker, Method, Reason, Condition and Time. The columns of Table 1, from left to right, are the Chinese entity type, the English entity type and its meaning, the abbreviation for the English entity type, a Chinese example of the entity, and an English example of the entity. We use the industry-standard BIO labeling mode to label the CNNVD vulnerability text. We combine "B-" with the Chinese entity type or with the abbreviation of the English entity type to form the entity start tag, such as "B-Product" or "B-P". Similarly, the inside-entity tags combine "I-" with the Chinese entity type or the abbreviation of the English entity type, such as "I-Product" or "I-P".
We count and rank the frequency of each entity type in the dataset and show the results in Figure 5. There are a total of 97,391 entities, an average of 3.5 entities per sentence. As shown in Figure 5, the three most numerous entity types are "Product", "Definition" and "Vendor" (32,328, 19,887 and 13,279, respectively); on average, each sentence contains 1.2 "Product" entities. "Reason", "Condition" and "Time" are the three least frequent entity types, with frequencies of 318, 156 and 82, respectively. It is worth noting that the data are imbalanced: the last six entity types each occur fewer than 1200 times. As a result, data augmentation techniques need to be applied during model training to alleviate this imbalance.
Fifteen kinds of entity relationships are used to describe the organization of the vulnerability entities. Although the ontology in Figure 1 shows 26 entity relationships, there are 15 non-repetitive relationship types, since the head and tail entities of the same relationship can differ. These 15 relationships are chosen because they are the ones most frequently described in CNNVD vulnerability texts; other relationships are rarely involved. To facilitate understanding of the relations, Table 2 presents the details of each relationship.
Figure 6 is a frequency bar chart of the relationships in the dataset, ordered by frequency. We counted 69,614 relationships in the 27,311 non-repeated sentences, which means there are 2.55 triples per sentence on average. As shown in this figure, each of the last six relationships occurs fewer than 1000 times, which may become an important factor affecting the training and extraction of these relations. Nonetheless, these relationships remain non-negligible components in describing the relationships between vulnerability-related entities. As with the entity annotations, data augmentation techniques are also needed in the training phase for these relationship types.
3.4. Comparison with Existing Datasets
Unlike previous information extraction datasets, our dataset is a mixed Chinese and English dataset: almost every sentence contains both Chinese and English. Of the 27,311 non-repeated sentences, 27,279 (99.89%) mix Chinese and English. Because of this, the target entities are also likely to be composed of mixed Chinese and English text. The dataset contains 32,721 fully Chinese entities, 35,334 English entities, and 29,336 mixed Chinese and English entities; the mixed entities account for 30.12% of all 97,391 entities.
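A simple script-based classification like the one behind these counts can be sketched as follows; the treatment of digits and punctuation (ignored) is an assumption of this sketch rather than the paper's exact counting rule.

```python
def script_class(entity):
    """Classify an entity string as Chinese, English or mixed.

    A character in the CJK Unified Ideographs block counts as Chinese;
    ASCII letters count as English. Digits and punctuation are ignored,
    so an entity with neither script falls back to "English".
    """
    has_zh = any("\u4e00" <= ch <= "\u9fff" for ch in entity)
    has_en = any(ch.isascii() and ch.isalpha() for ch in entity)
    if has_zh and has_en:
        return "mixed"
    if has_zh:
        return "Chinese"
    return "English"
```

Applying this function to every annotated entity and tallying the three classes reproduces the kind of breakdown reported above.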
4. Experiments and Evaluation Results
In this section, we validate multiple baselines on the BVTED dataset. For the named entity recognition task, we select five baselines, including traditional deep learning models and pre-trained models; in particular, we use a multilingual pre-trained model to reduce out-of-vocabulary (OOV) problems caused by the mixture of Chinese and English in the dataset. As baselines for the relation classification task, we apply a BiLSTM-based model and a CNN-based model. In addition to the pipeline method, we also select two state-of-the-art baselines to test the effect of joint entity-relation extraction on the BVTED dataset.
To make the distribution of each entity type roughly the same across the training, validation and test sets, we separately split the entity types with few annotated samples. As shown in Figure 5, the "Reason", "Condition" and "Time" samples are scarce, with frequencies of 318, 156 and 82, respectively. Therefore, we separately divide each of these three entity types into the training, validation and test sets by an 8:1:1 random split. This guarantees that all three entity types appear in all three parts of the dataset with roughly equivalent distributions. Similarly, we separately split the nine relationship types with fewer than 3000 occurrences ('used_in', 'also_know_as', 'lead_to_consequence', 'use_means_of', 'equal_to', 'because_of', 'under_the_condition', 'exploit', 'includes'), dividing the remaining data at a unified 8:1:1 ratio. This division ensures that each part contains every relationship type, and that each relationship type has the same distribution in each part of the dataset.
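The stratified splitting procedure can be sketched as follows, assuming each sample carries a single class label; `stratified_split` and its parameters are hypothetical names, and the real split operates on annotated sentences rather than bare labels.

```python
import random
from collections import defaultdict

def stratified_split(samples, key, rare_keys, seed=42):
    """8:1:1 split that divides each rare class separately, so every
    rare entity/relation type appears in train, validation and test.

    `key(sample)` returns the class label; `rare_keys` lists the
    low-frequency labels to split independently. A sketch of the
    procedure described above, not the exact implementation.
    """
    groups = defaultdict(list)
    for s in samples:
        k = key(s)
        groups[k if k in rare_keys else "__rest__"].append(s)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for bucket in groups.values():
        rng.shuffle(bucket)
        n = len(bucket)
        n_train, n_dev = int(n * 0.8), int(n * 0.1)
        train += bucket[:n_train]
        dev += bucket[n_train:n_train + n_dev]
        test += bucket[n_train + n_dev:]
    return train, dev, test
```

Splitting each rare bucket on its own guarantees that even a type with only 82 samples contributes to all three partitions.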
4.1. Evaluation Metrics
The three commonly used measures in the named entity recognition and relationship extraction tasks are precision, recall and the F1-score. These metrics were chosen for their effectiveness in evaluating the performance of the NER and RE models, particularly in imbalanced datasets commonly encountered in cybersecurity applications.
Precision is the ratio of correctly predicted positive samples to all samples predicted to be positive; it reflects how well the model avoids false positives. Recall is the ratio of correctly predicted positive samples to all positive samples; it reflects how well the model avoids false negatives. The F1-score is the harmonic mean of precision and recall, which is particularly useful for imbalanced datasets. The formulas are as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
TP represents the number of positive cases that are correctly predicted; TN represents the number of negative cases that are correctly predicted; FP represents the number of negative cases that are incorrectly predicted as positive; and FN represents the number of positive cases that are incorrectly predicted as negative.
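These definitions translate directly into code; the helper below is a straightforward sketch that guards against empty denominators.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and
    false negative counts, matching the definitions above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that TN does not enter any of the three metrics, which is why they remain informative on imbalanced data where negatives dominate.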
4.2. Experimental Settings
The software and hardware configurations used in the experiments are shown in Table 3. All experiments are conducted on the BVTED dataset, and the parameter settings of the models are shown in Table 4 and Table 5. To meet the memory requirements, gradient accumulation is used in the two BERT-based models to reduce the number of parameter updates, computing a loss and accumulating gradients for every four steps. The training process uses the Adam optimization algorithm [34], which adjusts the learning rate and momentum parameters to optimize the model weights; with a low memory footprint, it converges quickly and adapts to different parameter characteristics.
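The gradient accumulation pattern can be illustrated with a toy one-parameter model: gradients are accumulated over four steps before a single weight update, so the effective batch size quadruples without extra memory. Plain gradient descent stands in for Adam in this sketch, and all numbers are purely illustrative.

```python
ACCUM_STEPS = 4
weight = 5.0           # toy 1-parameter model: loss = (weight - target)^2
lr = 0.1
accumulated_grad = 0.0
updates = 0

batches = [2.0] * 8    # 8 mini-batches, all with target value 2.0
for step, target in enumerate(batches, start=1):
    grad = 2 * (weight - target)             # d(loss)/d(weight)
    accumulated_grad += grad / ACCUM_STEPS   # average over the window
    if step % ACCUM_STEPS == 0:
        weight -= lr * accumulated_grad      # one update per 4 steps
        accumulated_grad = 0.0
        updates += 1
```

In a deep learning framework the same pattern appears as calling the backward pass every step but the optimizer step only every four steps, which is how the BERT-based models here fit into memory.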
4.3. Experimental Results
In this section, we introduce the experimental results of applying existing NER models, RE models and joint entity relation extraction models on the BVTED dataset to demonstrate the effectiveness of the newly established dataset.
4.3.1. Baseline Models of Cybersecurity Named Entity Recognition
In this section, as shown in Table 6, we train and test five named entity recognition baselines on the BVTED dataset and use the F1 values obtained on the test set to evaluate each model. In the first model (BERT + CRF) [35], we add a softmax layer and a CRF layer as a decoding module after the BERT encoder. To assess the impact of different embedding methods on the BiLSTM + CRF model [36], we individually employ three representation models (Word2Vec, BERT and ERNIE) in the embedding layer to capture the sequence information. The output of the embedding layer is encoded by the BiLSTM, which extracts contextual semantics; as in the first baseline, softmax and CRF layers decode the encoded information. The combinations of these three representation models with the BiLSTM-CRF encoder-decoder constitute the other baselines.
From Table 6, we can see that the overall F1 value of all five models exceeds 91%. Comparing the BERT + CRF and BERT + BiLSTM + CRF models, the F1 value of the latter is generally smaller than that of the former, probably because the BiLSTM layer increases model complexity and introduces redundant information. With the same encoder-decoder architecture, the underlying word embedding plays an important role in representing the text: the Word2Vec-based model achieves a higher F1 value than the Chinese BERT-based model. Therefore, static Word2Vec embeddings trained on CTI-domain text are better suited to the specific words and terms of this field. The knowledge-enhanced pre-trained language model ERNIE [37] and Multilingual BERT [38] both achieve better F1 values than the Chinese BERT model as the representation layer. This shows that the knowledge-enhanced pre-training model improves the representation of professional words and terms in threat intelligence texts. The Multilingual BERT-based model achieves the best F1 value for most entity types and the best overall extraction effect (96.67%), showing that Multilingual BERT can more effectively represent text with a mixture of Chinese and English.
For entity types such as "Definition", "Location" and "Attacker", which can be expressed with a limited dictionary or extracted with rules, all five models reach F1 values above 99%. This shows that entity types with relatively uniform expressions are easier to extract. The small number of annotated samples limits the performance of the Chinese BERT models: the "Reason", "Condition" and "Time" types each have fewer than 350 samples, so the Chinese BERT models obtain relatively low F1 values. In particular, for the "Time" entity type, with fewer than 100 annotated instances, the extraction effect is poor because the parameters cannot be effectively adjusted, with all models falling below 62%.
4.3.2. Baseline Models of Cybersecurity Relation Extraction
To train and test on BVTED, the experiments use two groups of common relation extraction models. The first group comprises classification-based relation extraction approaches: a BiLSTM-based classifier and a CNN-based classifier. In the second group, CasRel and OneRel are the state-of-the-art joint relation extraction baselines.
ERNIE + BiLSTM: Zhang et al. [39] proposed bidirectional long short-term memory networks (BiLSTM) to model the sentence and the two entities with complete, sequential information about all words, classifying relations with a softmax classifier. We utilize the knowledge-enhanced pre-trained model ERNIE as the embedding layer to provide rich semantic context for the subsequent feature extraction step. These two modules are combined to form the first baseline, ERNIE + BiLSTM.
PCNN: Zeng et al. [40] proposed the Piecewise Convolutional Neural Network (PCNN) with multi-instance learning for distant-supervised relation extraction. This baseline addresses the wrong-label and error propagation/accumulation problems, avoids feature engineering, and captures the structural information between two entities using a CNN architecture with piecewise max pooling.
CasRel: Wei et al. [41] proposed a novel cascade binary tagging framework (CasRel) that models relations as mappings from subjects to objects, instead of classifying discrete relation labels for entity pairs. The architecture addresses the overlapping-triple problem with three components. First, a BERT encoder module extracts the feature information of the input sentence. Second, a subject tagger in the cascade decoder detects all possible subjects in the input sentence. Third, a relation-specific object tagger, a high-level tagging module in the cascade decoder, identifies the objects and the involved relations with respect to the subjects obtained at the lower level.
OneRel: Shang et al. [42] proposed a novel model named OneRel for joint entity and relation extraction. It frames the joint extraction task as a fine-grained triple classification problem: OneRel uses a classifier to assess whether a token pair and a relation belong to a factual triple, together with a relation-specific horn tagging strategy that ensures a straightforward yet effective decoding process.
Table 7 shows the experimental results of the four baselines on the BVTED dataset. The experiments use precision, recall and the F1-score to evaluate the performance of each model. From the results, the three evaluation metrics of the PCNN model are better than those of the other models, while the ERNIE-based BiLSTM relation extraction model performs the worst. Therefore, the most important factor in relation extraction on this dataset is capturing the structural information between two entities, rather than long-distance context semantics. The results show that CasRel and ERNIE + BiLSTM have advantages in handling complex relations and long-range dependencies, but fail to fully exploit these strengths on this dataset and task. The reasons may include the following two aspects: (1) the texts describing threat intelligence in this dataset are roughly similar and have a clear segmented structure; (2) the relations in this dataset mainly depend on local context rather than long-distance dependencies. The reason OneRel performs worse than PCNN may be that its integrated architecture is less efficient than directly extracting local features in this specific scenario.
4.4. Limitations of Proposed Methodology
Despite achieving promising results on real-world datasets, the proposed pre-trained models for NER and RE exhibit several limitations in enhancing the efficiency and generalization performance of cybersecurity knowledge triplets. From a dataset perspective, the BVTED dataset is solely sourced from CNNVD vulnerability descriptions, which constrains its generalizability. A single data source may not capture the diverse spectrum of cybersecurity threats and vulnerability narratives, potentially diminishing the model’s performance on unseen or stylistically different data. Additionally, the dataset’s annotation process, which combines rule-based and manual methods, incurs high labor and time costs, limiting the scalability and efficiency of dataset construction. Furthermore, the imbalance in data distribution, where certain entities or relations, such as “Product” entities and “is_a_kind_of” relations, are overrepresented compared to others like “Condition” entities and “exploit” relations, can adversely affect the performance of knowledge triplet extraction models.
From the perspective of the NER and RE models, there is a significant dependency on large amounts of training data. As demonstrated in Table 6, the performance of these models declines markedly when the training data are insufficient, indicating a challenge in maintaining high performance under data-constrained conditions. Additionally, the models may not achieve the desired effectiveness when applied to security datasets with different linguistic structures and styles. This suggests a limitation in the cross-linguistic and cross-domain generalization capabilities of the trained models, necessitating further fine-tuning and adaptation to suit various languages and data formats. These limitations highlight the need for future research to incorporate more diverse data sources, develop more efficient annotation methods, address data imbalance, and enhance model adaptability to ensure the broader applicability and robustness of the proposed methodology.
5. Conclusions
In the realm of cyber threat intelligence, cybersecurity vulnerability intelligence plays a crucial role. We develop an ontology for vulnerability knowledge, encompassing 13 entity types and 15 unique relation types, which captures the common knowledge found in vulnerability description texts. This ontology serves as a foundational framework for extracting and structuring information, enabling more effective analysis and utilization of cybersecurity vulnerability data. Leveraging this ontology, we construct a comprehensive dataset for the cybersecurity domain, named BVTED, which supports the task of entity-relation triple extraction. The dataset comprises 27,311 unique vulnerability description sentences with mixed Chinese and English expressions. It contains 97,391 entities and 69,614 relations, including 32,721 Chinese entities, 35,334 English entities and 29,336 entities mixing Chinese and English.
To evaluate the dataset's effectiveness, we conduct performance tests using state-of-the-art baselines for named entity recognition and relation extraction. The experimental results confirm the validity and utility of the BVTED dataset. Specifically, our cybersecurity entity extraction model combining Multilingual BERT with BiLSTM + CRF achieves an overall F1-score of 0.9667. For relation extraction, the PCNN model attains a precision of 95.35%, a recall of 95.12% and an F1-score of 95.19%. These results substantiate the feasibility and efficacy of utilizing the BVTED dataset for cybersecurity applications.
This research contributes significantly to the field of CTI analysis by demonstrating the potential of advanced NER and RE techniques in enhancing the accuracy and efficiency of knowledge extraction processes. Despite these achievements, our methodology faces certain limitations. The BVTED dataset’s reliance on CNNVD data limits its generalizability, and the high labor and time costs associated with our combined rule-based and manual annotation methods restrict the scalability of our dataset construction. Additionally, the imbalance in data distribution poses challenges for knowledge triplet extraction models. Addressing these issues is crucial for advancing the practical applications of our proposed approach.