Natural Language Processing Model for Automatic Analysis of Cybersecurity-Related Documents

Georgescu, Tiberiu-Marian

doi:10.3390/sym12030354

by

Tiberiu-Marian Georgescu

Department of Economic Informatics and Cybernetics, The Bucharest University of Economic Studies, 6 Piata Romana, 010374 Bucharest, Romania

Symmetry 2020, 12(3), 354; https://doi.org/10.3390/sym12030354

Submission received: 15 January 2020 / Revised: 14 February 2020 / Accepted: 24 February 2020 / Published: 2 March 2020

Abstract

This paper describes the development and implementation of a natural language processing model based on machine learning which performs cognitive analysis for cybersecurity-related documents. A domain ontology was developed using a two-step approach: (1) the symmetry stage and (2) the machine adjustment. The first stage is based on the symmetry between the way humans represent a domain and the way machine learning solutions do. Therefore, the cybersecurity field was initially modeled based on the expertise of cybersecurity professionals. A dictionary of relevant entities was created; the entities were classified into 29 categories and later implemented as classes in a natural language processing model based on machine learning. After running successive performance tests, the ontology was remodeled from 29 to 18 classes. Using the ontology, a natural language processing model based on a supervised learning model was defined. We trained the model using sets of approximately 300,000 words. Remarkably, our model obtained an F1 score of 0.81 for named entity recognition and 0.58 for relation extraction, showing superior results compared to other similar models identified in the literature. Furthermore, in order to be easily used and tested, a web application that integrates our model as the core component was developed.

Keywords:

cybersecurity; machine learning; ontologies; named entity recognition; natural language processing; relation extraction

1. Introduction

In the last decade, great progress has been made in natural language processing (NLP) based on machine learning (ML). In this context, interest in developing solutions that automatically understand text using ML algorithms has increased. This paper describes an ML-based NLP model specialized in the cybersecurity field. The framework automatically recognizes the main entities and extracts the relations between them. Furthermore, it provides a solid baseline for future, more complex solutions such as (1) automatically extracting information from hackers’ discussions or (2) semantically indexing relevant documents.

In a recent paper [1], we presented a semantic indexing system designed to automatically monitor the latest information relevant to cybersecurity. The system filters the data and presents it in an organized and structured framework according to the users’ needs. The system automatically collects text data, analyzes it through NLP algorithms, stores only the relevant documents that are semantically indexed and makes the documents available on a platform where users can conduct semantic searches. The model described in this paper can be adapted to be the NLP component needed to implement the solution presented in [1].

1.1. Our Contribution

This article describes an original prototype that we developed for cognitive text analysis in the cybersecurity field. We present the architecture of the prototype and detail each component. Our main contribution consists in the design and development of the NLP model based on supervised learning. We describe the methods and techniques used to model the cybersecurity field. We used Watson Knowledge Studio to design, train and test the model and obtained better performance than the other similar models identified. Our model has an F1 score of 0.81 for named entity recognition (NER) and 0.58 for relation extraction, which places it ahead of the other supervised-learning NLP implementations we identified for cognitive analysis of cybersecurity-related text.

This paper also presents the development of a domain ontology, designed to be the structure of the NLP model. The ontology was developed based on a two-step approach, the symmetry stage followed by the machine adjustment.

(1) The symmetry stage. Pioneers of Artificial Intelligence, such as Alan Turing, promoted the importance of training a machine in a manner similar to the way children learn, this being the premise of trying to replicate the logical flow of human decision making through processing symbols [2]. This was the source of inspiration in the first stage of ontology development. Considering the symmetric way of representing knowledge between humans and machines, in the first stage, we developed the ontology based on how people represent the cybersecurity domain. After interviewing 14 cybersecurity experts, we created a dictionary that contained over 5000 words. Subsequently, the words were grouped into 29 categories, described in Supplement 1.

(2) The machine adjustment stage. Once the ontology was designed, it was implemented in Watson Knowledge Studio. We created an NLP model based on supervised learning, trained it and tested it periodically. Two sets of tests were used to check the validity of each class: the F1 score per class and the confusion matrix. When classes with low F1 scores and high confusion-matrix values were encountered, they were eliminated, and the model was retrained. In the end, the ontology used to develop the model contained only 18 categories, compared with the initial 29. The initial categories were either used as defined in stage 1, merged into other categories or split into more categories. Supplement 1 illustrates the initial categories, as well as their connection to the final structure of the ontology. Supplements 2 and 3 illustrate the confusion matrices between the classes of the ontology, in absolute and relative values, respectively.

1.2. Related Work

Developing domain ontologies for cybersecurity is a topical subject in the specialized literature. Various articles, such as [3,4], described the development process and the main components of such ontologies. Articles [5,6] proposed software solutions containing an ontology specialized for the cybersecurity field. Recent papers proposed domain ontologies for Internet of Things (IoT) security, such as [7] or [8]. In order to develop an ontology, paper [9] used ML to extract entities from the text and obtain a cybersecurity knowledge base. Paper [10] presented the state-of-the-art of the web observatory, provided insights and discussed the main challenges associated with this concept, including security and privacy.

Text mining in cybersecurity is another topical subject. Various papers such as [11] used text mining and information retrieval techniques in the cybersecurity field. Most of the projects identified preferred supervised learning. Paper [12] implemented the perceptron learning algorithm to automatically annotate data available in the US national vulnerability database (NVD) [13]. They managed to automate the training process by creating a set of heuristics for text annotation. The results consisted of a training corpus of circa 750,000 tokens. However, their corpus was very homogeneous, which affected the results. Due to the difficulty of manually annotating large collections of data, some researchers preferred semi-supervised learning algorithms. In [14], a bootstrapping algorithm was used to heuristically recognize cyber-entities and identify new entities through an iterative process of analyzing a large unannotated corpus. The results showed high precision, but very low recall. Paper [15] improved the approach and developed cyber-entity tagging with much better results. The recent evolution of deep learning attracted researchers to implement deep neural network models for NLP. One of the main advantages is that they reduce the effort of human annotators. The results of using the deep learning approach seem to be satisfying. Article [16] performed a comparative analysis of the main deep learning methods for NER and entity extraction. Survey [17] described the state-of-the-art of deep learning algorithms and applications for cybersecurity.

Various projects implemented NLP-based models for the cybersecurity field, such as [8] or [18]. In [8], the author developed a cybersecurity model for IoT, which was connected to a gateway in order to identify the potential vulnerabilities of an IoT environment. Similar methods and technologies were used in [19], which described a model that extracts relevant information about emails from the shipping industry. Papers [20,21] proposed NLP models based on ML for the medical field, extracting relevant data that can be used to make valuable inferences. A comparison of our model and other projects is conducted in Section 4. Our work is highlighted by better performance indicators. Moreover, our model is more complex than the other projects discussed, having the ability to recognize more types of entities and relations.

This article describes an original prototype that we developed for cognitive text analysis in the field of cybersecurity. The architecture, components, technologies used and implementation techniques of the solution are presented in detail. The prototype is referred to as Cybersecurity Analyzer and is available at: www.cybersecurityanalyzer.com. An older version of this prototype was presented at the national innovation contest PatriotFest 2018. The project was evaluated by the competition jury, winning the main prize at Gala PatriotFest 2018 [22]. The solution is available online starting from November 15, 2018. It has been uploaded to an open environment, where it can be tested by users interested in contributing to this project, by providing feedback. For the development of Cybersecurity Analyzer, open-source and trial license applications were used. The only costs were generated by purchasing the domain name.

Section 2 illustrates the architecture of Cybersecurity Analyzer and discusses the prototype components in detail. Section 3 describes the process of developing the NLP model based on ML. A domain ontology specialized for cybersecurity was developed and later used to define and train the model. Custom-made scraping solutions were used to automatically download data available online, which was used in the training process. Section 4 analyzes the main performance indicators of the model for both NER and relation extraction and compares our model to other projects. Section 5 illustrates the interface of Cybersecurity Analyzer, as well as a use-case example.

2. The Architecture of Cybersecurity Analyzer

Figure 1 illustrates the prototype’s architecture. It is structured on four levels: (1) document upload, (2) cognitive analysis, (3) data store and (4) presentation. The links between levels are made through representational state transfer application programming interfaces (REST APIs).

In order to facilitate the user’s access to the prototype, a web interface was developed. Figure 2 illustrates the main page of the web application. Within it, there is an upload form where text documents can be inserted in various text formats (e.g., .doc, .docx, .pdf, .txt).

Once uploaded, the document is sent via a REST API to the NLP model based on ML developed in the cloud, using the Watson Knowledge Studio service [23]. In order to access the model, the credentials required for API transmission are stored on the server side. The inclusion of a server-side component was mandatory to ensure the security of the API credentials. Once the document is sent to the Watson Knowledge Studio service, the data flow reaches level 2. At the core of the Cybersecurity Analyzer solution is the model adapted to the cybersecurity domain. The components of the model were implemented using tools available in IBM Cloud [24]. The uploaded documents are annotated and stored in IBM Cloud through the Watson Discovery [25] service (level 3). However, the Cybersecurity Analyzer prototype does not retain processed documents, for cost reasons: after a document is enriched with metadata and sent via REST API to the presentation level (4), the document is deleted. This option was preferred because the license used for the Watson Discovery service was limited to storing up to 2000 files. Level 4 manages the presentation of the annotated documents. The interface is implemented as a web application. Within it, users can observe the recognized entities as well as the relations between the entities. In the following sections, each component is described in detail.

3. The Development of the NLP Model Based on ML for the Cybersecurity Field

3.1. Developing a Domain Ontology

The ontology developed in our work was designed to be implemented in the NLP model based on ML. After performing a literature review, we recognized the ontology developed by Iannacone et al. as the closest to the one needed for the prototype [3]. Initially, we considered using the ontology proposed by them for the implementation of our prototype, but after we performed several tests we decided to develop a new ontology, custom-built to be easily integrated with the ML-based NLP component.

In order to develop the ontology, we used the two-step approach described in the introductory section. Besides reviewing the literature, we conducted interviews with 14 cybersecurity experts. The main purposes of the interviews were: (a) identifying the materials they use for documentation, (b) finding the most common cybersecurity terms used by them and (c) understanding the utility of a cognitive analysis solution in the cybersecurity field from their perspective.

Based on the interviews and on the most frequently used cybersecurity documents, a dictionary containing approximately 5000 words was created. Subsequently, we created 29 categories and each word in the dictionary was assigned to one or more categories. The categories were implemented as classes in Watson Knowledge Studio, as follows: Account, Action/Course of Action, Address, Antivirus, Assets, Attack, Attacker, Detection, Device, Event, Firewall, Hardware, Impact, Incident, Loss, Malware, Networking, Procedure, Protocol, Risk, Service, Software, Target, Threat, Tools, User, Victim, Vulnerability and Weakness.

In the second stage, we trained the NLP supervised learning model and performed periodic tests. We used F1-score, precision and recall indicators, the methodology of which is described in detail in Section 4. Besides the indicators, we used confusion matrices in order to identify the classes where the entities were not correctly labeled.
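The role of the confusion matrix in this stage can be illustrated with a minimal sketch. It assumes that gold and predicted labels are available as aligned per-token lists and that the label "O" marks a token outside any class; the function names and the sample labels are hypothetical and are not taken from Watson Knowledge Studio.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count (gold_label, predicted_label) pairs over aligned token labels."""
    return Counter(zip(gold, predicted))

def most_confused(matrix, top=3):
    """Return the off-diagonal cells with the highest counts (ties broken by label)."""
    off_diagonal = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return sorted(off_diagonal.items(), key=lambda item: (-item[1], item[0]))[:top]

# Invented labels for six tokens; "O" (assumption) marks a non-entity token.
gold = ["Attack", "Threat", "Vulnerability", "Threat", "O", "Software"]
pred = ["Attack", "Vulnerability", "Vulnerability", "Vulnerability", "O", "Software"]

matrix = confusion_matrix(gold, pred)
worst = most_confused(matrix)
```

In this toy input, the cell (Threat, Vulnerability) dominates the off-diagonal counts, which is the kind of signal that would prompt a closer look at how two classes are delimited.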

The results of the tests showed that the initial ontology was too complex to achieve satisfactory performance. Therefore, less relevant classes or classes with high redundancy were eliminated and several classes were merged. As an example, we removed the classes Antivirus, Firewall and Procedure and included their instances in the Defensive Means class. Although the variety of knowledge representation decreased, we considered that the current form of the ontology is suitable for its use in the designed model. Supplement 1 illustrates the initial 29 classes of the ontology, as well as the evolution of the classes after the tests were performed. An explanation is provided for each class which was changed and had the entities redistributed.
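Merging classes amounts to relabeling existing annotations. The following sketch shows one way to do this, assuming annotations are stored as (text, class) pairs; that representation, the function name and the sample entities are assumptions made for illustration, while the mapping itself follows the example given above.

```python
# Classes removed in the second stage and the class that absorbed their instances.
MERGED_INTO = {
    "Antivirus": "Defensive Means",
    "Firewall": "Defensive Means",
    "Procedure": "Defensive Means",
}

def remap_labels(annotations, mapping=MERGED_INTO):
    """Return a copy of the annotations with removed classes replaced by their target."""
    return [(text, mapping.get(label, label)) for text, label in annotations]

# Hypothetical annotations, used only to exercise the function.
annotations = [("packet filter", "Firewall"), ("ransomware", "Malware")]
remapped = remap_labels(annotations)
```

Classes that are absent from the mapping, such as Malware here, keep their original label.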

In order to develop the ontology, the tools Protégé 5.2 (desktop application) and WebProtégé (cloud-based application) were used. Protégé 5.2 is an open-source solution that offers a suite of tools to build domain models and knowledge-based applications using ontologies. Protégé toolkits are currently the most widely used solutions for developing ontologies. Protégé is developed by the Stanford Center for Biomedical Informatics Research at Stanford University [26].

Figure 3 illustrates the classes of the developed ontology, as well as the relationships between them.

The ontology consists of 18 classes: Account, Address, Attack, Attacker, Defensive Means, Event, Exploit, Host, Impact, Loss, Malware, Networking, Offensive Means, Risk, Software, Threat, User and Vulnerability.

Table 1 illustrates the relations between the classes. As can be observed, there are 12 types of relations and, in total, there are 33 relations between the 18 classes. Once the ontology was implemented in the NLP model based on ML, we identified that some relations were not optimally managed, leading to aberrant results. As in the case of the classes, the relations that increased the complexity and did not significantly improve the system’s performance were eliminated.

3.2. Implementing the Ontology Using IBM Watson

In order to implement the ontology for the cybersecurity domain, the Knowledge Studio service of IBM Watson was used. This service allows the development of both rule-based models and ML models. Knowledge Studio service is available in the cloud and can be accessed in two ways: from the web interface and by connecting to the REST APIs. In our work, both methods were used, the first one for the ease of working directly with the interface, and the second one when the process automation was suitable. In order to interact with the APIs, we used Postman [27] as well as tailor-made scripts.

Knowledge Studio offers the possibility to implement the ontology components, such as classes, relations between classes, entities, various lexical forms for the same entity, rules, properties, etc. The 18 classes were implemented, according to the previously described ontology. Knowledge Studio was initially designed as a service that used only dictionaries, not ontologies. Therefore, it does not provide facilities for integrating ontologies written into standard formats such as RDF, OWL and XML. As the solution was developed and improved, more functionalities were introduced, creating the possibility of using ontology-specific components.

3.3. The Development Process of the NER Model

The cognitive analysis model developed has two main functionalities: NER and relation extraction. In order to implement the NER functionality, we first defined the classes and then the types of relations by using the domain ontology presented above.

The source code of the Watson system is a trade secret; therefore, its techniques and algorithms are not known in detail. According to [28], in order to develop the system, over 100 different techniques have been implemented for “analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses” [28]. Paper [29] states that, for classification, logistic regression algorithms were chosen due to their simplicity, being preferred over other algorithms such as support-vector machines. Currently, Watson’s performance is mainly based on deep learning algorithms. The purpose of these algorithms is to understand the content, domain and context. This approach involves the use of multi-level neural networks to extract knowledge from the data.

3.4. The Training Process

The first stage of the training process consisted of the selection of training data relevant to the cybersecurity field. The selection of documents was made considering the literature review as well as the interviews conducted with the cybersecurity professionals. The training sets consisted of a total of about 300,000 words, grouped into documents of approximately 1000 words each. A custom-made script was developed to split each text file into 1000-word chunks. In this way, 300 documents were generated as follows: 210 Common Vulnerabilities and Exposures (CVEs) files [30], 30 documents consisting of research articles, 30 documents consisting of books, 15 files with news and 15 other relevant documents available online, as can be observed in Figure 4.
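The chunking step can be sketched as follows. This is not the script used in this work, which is not published here; it is a minimal illustration that assumes words are separated by whitespace, and the function name is hypothetical. The dictionary restates the corpus composition given above.

```python
def split_into_chunks(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]

# Corpus composition as reported in the text (number of ~1000-word documents per source).
CORPUS = {"CVE": 210, "research articles": 30, "books": 30, "news": 15, "other": 15}

# Synthetic 2500-word text, used only to exercise the function.
sample = " ".join(f"w{i}" for i in range(2500))
chunks = split_into_chunks(sample)
```

With this approach the last chunk of a file may be shorter than 1000 words, which is consistent with documents of approximately, rather than exactly, 1000 words.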

The CVE database is used by cybersecurity professionals to be up to date regarding the newest types of vulnerabilities discovered. All the cybersecurity experts interviewed considered the CVE database as the main source of information. Besides that, we noticed that other sources of documentation that they were commonly using contained numerous references to CVEs. About 100,000 CVEs were downloaded and grouped by the year of occurrence. Out of these, 10,000 CVEs from 1999 to 2019 were used to train the model. In order to automatically collect CVEs, a scraping solution was developed. Within the solution, special programs called spiders have been created for each website from which data was downloaded.
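Grouping the downloaded records by year can be sketched as below. The sketch assumes that the year embedded in the CVE identifier (the form CVE-YYYY-NNNN, with a sequence number of four or more digits) is an acceptable proxy for the year of occurrence; the text does not state which date was actually used, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# CVE identifier: "CVE", a four-digit year and a sequence number of four or more digits.
CVE_ID = re.compile(r"CVE-(\d{4})-\d{4,}")

def group_by_year(cve_ids):
    """Group well-formed CVE identifiers by the year embedded in the identifier."""
    groups = defaultdict(list)
    for cve_id in cve_ids:
        match = CVE_ID.fullmatch(cve_id)
        if match:  # silently skip strings that are not CVE identifiers
            groups[int(match.group(1))].append(cve_id)
    return dict(groups)

ids = ["CVE-2014-0160", "CVE-2017-0144", "CVE-2014-6271", "not-a-cve"]
by_year = group_by_year(ids)
```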

Once the relevant documents are available, the training process begins. The training stages we performed are:

Pre-annotation of documents using the rule-based model (based on the ontology): The rule-based model identifies predefined elements, such as instances or relations. Since the human annotation process is time-consuming, we used the rule-based model to automate and speed up the training process of the ML model. The advantage of this approach is the reduced time required for annotation. On the other hand, the annotators need to be extra careful. If the machine has annotation flaws and the mistakes are not corrected by the human annotators, the flaws become even more difficult to correct in the subsequent process;
Correction of the annotation made by the rule-based model: Very often, the entities automatically annotated by the rule-based model are incomplete and sometimes even wrong, therefore human intervention is required;
Quality examination of the annotation process: For this purpose periodic tests of the ML-based model’s performances are implemented, comparing the evolution of indicators over time;
Integration of the training sets into the model: Once the training sets are considered to be appropriately annotated, they are approved and integrated within the model, being marked as ground truth.
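The pre-annotation stage can be illustrated with a simple dictionary lookup. This is a stand-in for the rule-based model of Watson Knowledge Studio, not a reproduction of it: the function name, the case-insensitive whole-word matching and the sample sentence are assumptions, while the three term-to-class pairs follow the annotation example discussed later in this section.

```python
import re

def pre_annotate(text, dictionary):
    """Return (start, end, surface, label) spans for dictionary terms found in text."""
    spans = []
    for term, label in dictionary.items():
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(0), label))
    return sorted(spans)  # ordered by position in the text

dictionary = {
    "denial of service": "Attack",
    "database": "Software",
    "cleartext": "Vulnerability",
}
text = ("The database engine stores passwords in cleartext, "
        "and a crafted request causes a denial of service.")
spans = pre_annotate(text, dictionary)
```

A lookup of this kind cannot resolve context, which is why the second stage, human correction of the pre-annotations, remains necessary.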

The machine learning model reached an acceptable level of performance after approximately 40% of the training process, when the training corpus amounted to about 120,000 words. From that point, the pre-annotation of the documents was done using the already trained ML model. This change of process helped the author to annotate the rest of the training documents faster. The training process took place over a period of five months. After about 80% of the total training process (labeled documents that contained circa 240,000 words), the model reached a high level of maturity, with its performance improving at a steadily decreasing rate. The evolution of the model is described in the next section.

Figure 5 illustrates a screenshot taken during the NER training process. The author of this paper was the only annotator. Each relevant token was identified and labeled according to the classes it belonged to. Based on these labels, the model builds the ground truth, which is subsequently used in the automatic annotation process.

On the right side of the figure, the classes of the model can be observed, and on the left side, there is a fragment of a training document. The words in the text are labeled as entities of specific classes, by using tags with the same color as the classes to which they belong. For example, the token cleartext belongs to the class Vulnerability and the tokens database and engine are labeled to the class Software. An interesting aspect is that the token denial of service was labeled to both the Attack and Impact classes.

Besides that, the relations between entities are identified and annotated. Figure 6 illustrates a screenshot made during the process of relation extraction training and Table 2 shows the labeled relations.

Once the model is trained, it can be used to annotate new documents for NER and relation extraction functionalities. After a document is annotated, it is stored in the Watson Discovery database, from where it can be used. The documents are saved in JSON files, together with the enrichments. Figure 7 illustrates the representation of a relation. As it can be observed, the token Web is an entity that is part of the class Networking, and the token Application is part of the class Software. Between these words, the model identifies the relation uses.

4. Model Performances

This section analyzes the performance of the NLP model based on ML. The model’s performance indicators were evaluated persistently during the development, in order to extract regular feedback regarding its progress. We present in detail the performance indicators for the current version of the model and briefly discuss the model’s evolution in time.

4.1. Methodology and Metrics Used

The evaluation methodology used is based on the IBM Watson documentation [26]. Since our ML model is supervised, its evaluation is based on comparing the annotations automatically made by the model with the human annotations. Human labels are used as ground truth; thus, the more similar the model’s annotations are to those performed by humans, the better the model’s results.

In order to evaluate the model’s performances, the relevant documents were grouped into three categories:

Training sets: represent documents labeled by humans. Starting from these annotations, the model learns to properly recognize entities, relations and classes;
Test sets: represent documents used to test the model after it has been trained. The performances are evaluated based on the differences between the annotations made by the model and those made by humans;
Blind sets: represent documents that have not been previously viewed by the humans involved in the annotation process [31].

In order to validate the results, the main performance indicators specific to the ML domain were considered: F1 score, precision and recall [32]. The indicators were calculated for the two main functionalities of the solution: NER and relation extraction.

F1 score

The F1 score can be interpreted as a harmonic mean of the values of the precision and recall indicators, falling within the range [0,1]. Its formula is:

F_1\ \text{score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (1)

Precision

Precision is an indicator that measures the ratio of the number of correct annotations to the total number of annotations made by the ML model. Precision can be interpreted as a deviation of the results from the real values. A maximum precision score for an entity assumes that each time it was annotated by the machine, the annotation was correct (consistent with the ground truth).

\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \qquad (2)

Recall

Recall is the ratio between the true positive annotations and the total of the annotations that the machine should have identified. The maximum value of this indicator for an entity is met when the ML model correctly annotates each occurrence of that entity. A low recall score helps to identify contexts where the ML model fails to label objects that should have been annotated. The formula of the recall indicator is:

\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \qquad (3)
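Equations (1) to (3) translate directly into code. The sketch below is a plain implementation of the three formulas; returning 0.0 when a denominator is zero is a convention chosen here, not something specified by the methodology, and the counts are invented.

```python
def precision(true_positive, false_positive):
    """Equation (2): share of the model's annotations that are correct."""
    total = true_positive + false_positive
    return true_positive / total if total else 0.0

def recall(true_positive, false_negative):
    """Equation (3): share of the expected annotations that the model found."""
    total = true_positive + false_negative
    return true_positive / total if total else 0.0

def f1_score(precision_value, recall_value):
    """Equation (1): harmonic mean of precision and recall."""
    total = precision_value + recall_value
    return 2 * recall_value * precision_value / total if total else 0.0

# Invented counts: 6 correct annotations, 2 spurious ones, 4 missed ones.
p = precision(6, 2)
r = recall(6, 4)
f1 = f1_score(p, r)
```

With these counts, precision is 0.75 and recall is 0.6, and the harmonic mean (about 0.67) lies closer to the lower of the two values, which is why a weak recall pulls the F1 score down even when precision is high.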

According to the recommendations of [31], the documents were split as 70% training sets, 23% test sets and 7% blind sets. Several aspects were taken into account in order to ensure that the performance evaluation was properly conducted. We applied the methodology’s stipulations strictly. First of all, we made sure that the dataset was large enough. Although we obtained similar performances with less training data, we continued to annotate documents until we reached a corpus of 300,000 words. We also ensured that the datasets were not very uniform, since otherwise the test sets would have become predictable. Furthermore, we conducted many tests in order to show that the high performances are not just an isolated case but reflect the natural evolution of the model over time.

As a sampling technique, we preferred hold-out over cross-validation. We considered the hold-out method suitable due to the large volume of data. One of the main reasons we preferred the hold-out method was that it was easier to ensure that the training and test sets were properly separated, especially considering that our corpus contains 70% CVE documents. Since we conducted many tests at various time intervals, we considered the hold-out technique appropriate. In the future, we plan to use cross-validation as well and compare the results.
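A hold-out split with the 70/23/7 proportions can be sketched as follows. The random shuffling, the fixed seed and the function name are assumptions made for this illustration; in Watson Knowledge Studio the split is configured in the tool rather than coded by hand.

```python
import random

def hold_out_split(documents, train=0.70, test=0.23, seed=0):
    """Shuffle the documents once and cut them into training, test and blind sets."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training_set = shuffled[:n_train]
    test_set = shuffled[n_train:n_train + n_test]
    blind_set = shuffled[n_train + n_test:]  # whatever remains (about 7%)
    return training_set, test_set, blind_set

# 300 placeholder document names, matching the corpus size described earlier.
documents = [f"doc_{i:03d}" for i in range(300)]
training_set, test_set, blind_set = hold_out_split(documents)
```

For a corpus of 300 documents, these proportions correspond to 210 training, 69 test and 21 blind documents. Because each document is assigned to exactly one set, the separation between training and test data is easy to verify.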

4.2. The Values of Performance Indicators Obtained by Our Model

The F1 score, precision and recall indicators are calculated both at the aggregate level and for each individual entity and relation. During the development of the model, performance tests were performed regularly and the evolution of the indicators was taken into account. In order to validate the model, it is necessary to reach an aggregate value of the F1 score of at least 0.5. Figure 8 is a screenshot taken while using Watson Knowledge Studio that illustrates the evolution of F1 scores at the aggregate level, for both NER and relation extraction.

As can be observed, for the NER functionality the model has achieved a relatively high level of performance since the first tested version, where the value of the F1 score was about 0.7. This can be due to the fact that the first training, testing and blind sets were very homogeneous, all consisting of CVE documents. Subsequently, the documents used were diversified, and the training sets contained different structures and approaches. The volume of the training sets was increased significantly, and the performances of the model also increased, but at a lower rate.

Based on the evaluations conducted on the 15 versions of the model, the best performances were obtained by version 1.13, where the F1 score for NER was 0.81, and for relation extraction was 0.58. Subsequent training of the model performed after version 1.13 did not improve the performance indicators; on the contrary, it decreased them. The author considers that this decrease was caused by inconsistencies between the annotations. At the time of writing this paper, Cybersecurity Analyzer uses version 1.13. Below, the performances achieved by this version of the model are presented.

4.3. Analysis of F1 Score, Precision and Recall Indicators for NER

The values of the precision and recall indicators obtained for NER are 0.88 and 0.74, respectively. Taking into account the large number of classes, we consider that the value of 0.88 for precision is very good, indicating that most of the annotations made by the model are correct. The lower value obtained for the recall indicator usually indicates that the model can be improved by increasing the volume of training data.
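As a consistency check, the aggregate precision and recall reported for NER can be substituted into Equation (1):

```latex
F_1\ \text{score} = \frac{2 \times 0.74 \times 0.88}{0.74 + 0.88} = \frac{1.3024}{1.62} \approx 0.80
```

The result agrees with the reported F1 score of 0.81 to within 0.01. The small gap is plausibly a rounding effect, since precision and recall are themselves reported to two decimals; this explanation is an assumption and is not stated by the tool.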

The analysis of the indicators for NER is particularly useful during the development of the model. As classes with low performance indicators were identified, training sets were introduced to improve the model’s results for these classes in particular. Figure 9 is a screenshot taken while using Watson Knowledge Studio. It illustrates the values of the F1 score, precision and recall for each class, the frequency of occurrence of the classes, both from the total of annotations and from the total of the words, as well as the percentage of documents that contain entities from each class.

Out of all the classes, only three have an unsatisfactory F1 score. The low F1 scores for the classes Loss, Risk and Threat may be due to very low values of their recall indicators, perhaps caused by the small number of entities of those types that occurred in the documents. During the training process, for several entities, it was challenging to assess if they belonged to the Vulnerability class or the Threat class. Therefore, during the annotation process, we decided to include these entities in the Vulnerability class. We believe that this is the main reason why a low performance was recorded for the Threat class.

A low percentage of documents containing a given class often indicates that the set of training documents does not fully represent the field. In this case, the ontology structure and the training documents must be investigated to ensure that the training sets contain relevant entity types. It is recommended that, during the training process, the training data contain a minimum of 50 occurrences for each type of entity [31].
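A check against the 50-occurrence recommendation can be automated once the annotations are exported. The following is a minimal sketch, assuming the annotations are available as (entity text, class) pairs; this format, the function name and the sample counts are hypothetical and are not taken from Watson Knowledge Studio.

```python
from collections import Counter

MIN_OCCURRENCES = 50  # minimum per entity type recommended in [31]


def underrepresented_classes(annotations, classes, minimum=MIN_OCCURRENCES):
    """Return {class: count} for every class annotated fewer than `minimum` times."""
    counts = Counter(label for _, label in annotations)
    # Counter returns 0 for classes that never occur in the annotations.
    return {c: counts[c] for c in classes if counts[c] < minimum}


# Hypothetical annotations; the counts are invented for illustration only.
annotations = (
    [("SQL injection", "Attack")] * 120
    + [("data breach", "Loss")] * 12
    + [("phishing", "Threat")] * 49
)
result = underrepresented_classes(annotations, ["Attack", "Loss", "Threat", "Risk"])
print(result)
```

Classes reported by such a check are candidates for additional training documents.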

4.4. Analysis of F1 Score, Precision and Recall for Relation Extraction

The relation extraction functionality requires the model to identify three elements: the relation type, the parent entity and the child entity. This process is much more complex than NER, so the values of the performance indicators for relation extraction are lower. Figure 10 is a screenshot taken during the use of Watson Knowledge Studio. The aggregate F1 score of 0.58 supports the validity of the relation extraction functionality. However, four of the relations, namely data_flow, has, includes and runs_on, have unsatisfactory F1 scores, mainly due to low recall values. In the future, efforts will be made to improve the F1 score, precision and recall for relation extraction.
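To make the three-element requirement concrete, the sketch below scores predicted relations against reference annotations. It assumes that a prediction counts as correct only when the relation type, the parent entity and the child entity all match; this exact-match rule is an assumption for illustration and may differ from the scoring used by Watson Knowledge Studio. The `Relation` type and the sample data are hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class Relation(NamedTuple):
    relation_type: str
    parent: str
    child: str


def score(predicted: Set[Relation], gold: Set[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact matching of all three elements."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical reference annotations and model output.
gold = {
    Relation("performs", "Attacker", "Exploit"),
    Relation("performs", "Attacker", "denial of service"),
    Relation("generates", "Exploit", "denial of service"),
}
predicted = {
    Relation("performs", "Attacker", "Exploit"),           # fully correct
    Relation("exploits", "Attacker", "denial of service"),  # wrong relation type
}
precision, recall, f1 = score(predicted, gold)
print(precision, recall, f1)
```

Because a single wrong element invalidates the whole triple, relation extraction scores fall faster than NER scores as errors accumulate.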

4.5. The Comparison of Our Results with Other Similar Models

We identified five papers that use similar techniques and technologies for cognitive text analysis. Table 3 compares these alternative frameworks with the solution we propose, the Cybersecurity Analyzer. The main difference in the approaches is that our model contains more types of classes and relations than any of the other models, which significantly increases its complexity. The purpose of the developed model is to perform cognitive analysis of any relevant text in the field of cybersecurity, which implies the use of large and diversified data sets. Joshi et al. [18] developed a similar model, adapted for cybersecurity, using the Stanford NER solution [33]. The F1 score they obtained for NER is close to the F1 score we obtained. However, it is important to emphasize that our model is able to better understand the domain, as it identifies 18 different types of classes compared to the 10 of the model presented in [18]. Although Joshi et al. state that the model includes relation extraction components, they did not provide examples or performance indicators in this regard.

Projects [19,21] are designed for different domains than ours and have lower F1 scores for NER. The small number of their classes makes it easier to obtain satisfactory performance indicators. We consider that these projects are useful for finding certain aspects of interest in large quantities of data rather than for performing a cognitive analysis of a whole domain (as Cybersecurity Analyzer does). For relation extraction, the project [19] has an F1 score close to that of Cybersecurity Analyzer. However, the level of complexity of the model developed by Fritzner is much lower, and the small number of classes (two) led to good performance indicators.

The NLP model based on ML described in [8] was developed by the author of this paper, hence the approach is very similar. Unlike Cybersecurity Analyzer, the model presented in [8] is specialized in Internet of Things (IoT) security. Its training data volume was much lower and its training process was shorter, which explains why the performance of Cybersecurity Analyzer is clearly superior.

The performance indicators demonstrate the validity of the ML model developed, both for NER and relation extraction. The comparison with similar projects identified in the literature illustrates that Cybersecurity Analyzer shows superior performance, having the highest F1 score indicator for both NER and relation extraction.

5. Using Cybersecurity Analyzer

In order to illustrate the documents in a clear and user-friendly manner, a web interface that receives the data from the Watson Discovery service through REST API was developed. The main functionalities within the graphical interface of the Cybersecurity Analyzer application are:

Presentation of the entities and classes relevant to the cybersecurity field for documents uploaded by users;
Drawing up a chart that presents an overview of the most important entities and classes relevant to the cybersecurity field identified in the uploaded documents (Figure 11 and Figure 12);
Presentation of the relations between the identified entities (Figure 13).

Text documents can be uploaded within the graphical interface. Predefined documents can also be loaded to quickly view the functionalities of the application. Figure 11 illustrates a chart obtained by loading the predefined Web Application Security document. Within it, entities from 15 classes were identified, most of them belonging to the class Software (71). By selecting (left-click) a class represented in the chart, all the entities corresponding to the respective class can be viewed. Figure 12 illustrates the 16 entities identified for the class Attack.
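The aggregation behind such a chart can be sketched as follows. This is an illustration only: the entity records, their `text` and `type` keys and the function names are assumptions and are not taken from the application's code or from its actual Watson Discovery response.

```python
from collections import defaultdict


def group_entities_by_class(entities):
    """Map each class name to the list of entity texts found for it."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["type"]].append(entity["text"])
    return dict(groups)


def chart_data(groups):
    """Return (class, count) pairs, largest class first, ties broken by name."""
    pairs = ((cls, len(items)) for cls, items in groups.items())
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical entities extracted from an uploaded document.
entities = [
    {"text": "Apache", "type": "Software"},
    {"text": "SQL injection", "type": "Attack"},
    {"text": "MySQL", "type": "Software"},
    {"text": "XSS", "type": "Attack"},
    {"text": "browser", "type": "Software"},
]
groups = group_entities_by_class(entities)
print(chart_data(groups))   # data for the overview chart
print(groups["Attack"])     # entities listed when the Attack class is selected
```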

The relations between entities can be visualized individually for each sentence. Figure 13 shows the relations found in the sentence: “This vulnerability enables remote attackers to target other users of the application, potentially gaining access to their data, performing unauthorized actions on their behalf, or carrying out other attacks against them.” Numerous relations are identified, and each is represented as an arrow from the parent entity to the child entity. The entities are colored according to the class they belong to, consistent with the chart presented in Figure 11. An example of a relation illustrated in Figure 13 is: Remote attackers exploit vulnerability.

6. Conclusions and Future Work

This article described a prototype developed for cognitive text analysis of documents related to cybersecurity. The general architecture was presented and each component was discussed. The main contribution of this article is the NLP model based on ML specialized for the cybersecurity field. The process of developing the model was extensively discussed. The performance indicators were presented in order to emphasize the model’s validity. The comparison of our model’s performance with that of other projects identified in the literature review showed better results for our model. A web application that integrates our model was developed in order to facilitate access to it.

In the future, we consider using other technologies to develop NLP models based on ML. Although Watson Knowledge Studio is a top NLP tool, using a commercial service can be considered a limitation. Therefore, we will concentrate on using open-source solutions.

There is a growing interest in the development of semantic indexing systems and cognitive analysis solutions of cybersecurity documents written in different languages. The solutions described in this paper can also be adapted to other languages by adjusting ontologies and training the ML model with specific sets of documents. We consider developing models such as Cybersecurity Analyzer for other languages as well.

We also envisage the design and development of equivalent solutions for areas other than cybersecurity, using similar architectures, instruments and techniques, but with new customized NLP models.

Supplementary Materials

The following are available online at https://www.mdpi.com/2073-8994/12/3/354/s1, Table S1: The Evolution of the Domain Ontology’s Classes, Table S2: The Confusion Matrix for the Domain Ontology in Absolute Values, Table S3: The Confusion Matrix for the Domain Ontology in Relative Values.

Funding

This research presents results obtained within the PN-III-P1-1.2-PCCDI-2017-0272 ATLAS project ("Hub inovativ pentru tehnologii avansate de securitate cibernetică/Innovative Hub for Advanced Cybersecurity Technologies"), financed by UEFISCDI through the PN III “Dezvoltarea sistemului national de cercetare-dezvoltare”, PN-III-P1-1.2-PCCDI-2017-1 program.

Acknowledgments

This paper is based on the research the author conducted for his PhD thesis “Modelarea bazată pe volume mari de date. Securitate Cibernetică în contextul Big Data” (Eng. “Modelling based on large volumes of data. Cybersecurity in the context of Big Data”) [34].

Conflicts of Interest

The author declares no conflict of interest.

References

1. Georgescu, T.M. Machine learning based system for semantic indexing documents related to cybersecurity. Econ. Inform. 2019, 19.
2. Cockburn, I.M.; Henderson, R.; Stern, S. The impact of artificial intelligence on innovation. In The Economics of Artificial Intelligence: An Agenda, National Bureau of Economic Research; University of Chicago Press: Chicago, IL, USA, 2019; Volume w24449.
3. Iannacone, M.; Bohn, S.; Nakamura, G.; Gerth, J.; Huffer, K.; Bridges, R.; Goodall, J. Developing an ontology for cyber security knowledge graphs. In Proceedings of the 10th Annual Cyber and Information Security Research Conference, Oak Ridge, TN, USA, 7–9 April 2015; p. 12.
4. Scarpato, N.; Cilia, N.D.; Romano, M. Reachability Matrix Ontology: A Cybersecurity Ontology. Appl. Artif. Intell. 2019, 33, 643–655.
5. Narayanan, S.N.; Ganesan, A.; Joshi, K.; Oates, T.; Joshi, A.; Finin, T. Early Detection of Cybersecurity Threats Using Collaborative Cognition. In Proceedings of the 4th International Conference on Collaboration and Internet Computing (CIC), Philadelphia, PA, USA, 18–20 October 2018; pp. 354–363.
6. Georgescu, T.M.; Smeureanu, I. Using Ontologies in Cybersecurity Field. Inform. Econ. 2017, 21. Available online: http://revistaie.ase.ro/ (accessed on 16 November 2019).
7. Mozzaquatro, B.A.; Agostinho, C.; Goncalves, D.; Martins, J.; Jardim-Goncalves, R. An ontology-based cybersecurity framework for the internet of things. Sensors 2018, 18, 3053.
8. Georgescu, T.M.; Iancu, B.; Zurini, M. Named-Entity-Recognition-Based Automated System for Diagnosing Cybersecurity Situations in IoT Networks. Sensors 2019, 19, 3380.
9. Jia, Y.; Qi, Y.; Shang, H.; Jiang, R.; Li, A. A practical approach to constructing a knowledge graph for cybersecurity. Engineering 2018, 4, 53–60.
10. Radi, A.N.; Rabeeh, A.; Bawakid, F.M.; Farrukh, S.; Ullah, Z.; Daud, A.; Aslam, M.A.; Alowibdi, J.S.; Saeed-Ul, H. Web Observatory Insights: Past, Present, and Future. Int. J. Semant. Web Inf. Syst. 2019, 15, 52–68.
11. Bakker de Boer, M.H.; Bakker, B.J.; Boertjes, E.; Wilmer, M.; Raaijmakers, S.; van der Kleij, R. Text Mining in Cybersecurity: Exploring Threats and Opportunities. Multimodal Technol. Interact. 2019, 3, 62.
12. Bridges, R.A.; Jones, C.L.; Iannacone, M.D.; Testa, K.M.; Goodall, J.R. Automatic labeling for entity extraction. arXiv 2013, arXiv:1308.4941. Available online: https://arxiv.org/abs/1308.4941 (accessed on 8 February 2020).
13. National Vulnerability Database. Available online: https://nvd.nist.gov/ (accessed on 11 October 2019).
14. McNeil, N.; Bridges, R.A.; Iannacone, M.D.; Czejdo, B.; Perez, N.; Goodall, J.R. Pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts. In Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 4–7 December 2013; pp. 60–65.
15. Bridges, R.A.; Huffer, K.M.; Jones, C.L.; Iannacone, M.D.; Goodall, J.R. Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 437–442.
16. Gasmi, H.; Laval, J.; Bouras, A. Information Extraction of Cybersecurity Concepts: An LSTM Approach. Appl. Sci. 2019, 9, 3945.
17. Soman, K.P.; Alazab, M. A Comprehensive Tutorial and Survey of Applications of Deep Learning for Cyber Security. 2020. Available online: https://www.techrxiv.org/articles/A_Comprehensive_Tutorial_and_Survey_of_Applications_of_Deep_Learning_for_Cyber_Security/11473377/1 (accessed on 5 February 2020).
18. Joshi, A.; Lal, R.; Finin, T.; Joshi, A. Extracting cybersecurity related linked data from text. In Proceedings of the IEEE Seventh International Conference on Semantic Computing, Irvine, CA, USA, 16–18 September 2013; pp. 252–259.
19. Fritzner, J.E.H. Automated Information Extraction in Natural Language. Master’s Thesis, Norwegian University of Science and Technology, Trondheim, Norway, 2017.
20. NeCamp, T.; Sattigeri, P.; Wei, D.; Ray, E.; Drissi, Y.; Poddar, A.; Mahajan, D.; Bowden, S.; Han, B.A.; Mojsilovic, A.; et al. Data Science For Social Good; University of Chicago: Chicago, IL, USA, 2017; Available online: https://dssg.uchicago.edu/wp-content/uploads/2017/09/necamp.pdf (accessed on 12 November 2019).
21. Tonin, L. Digitala Vetenskapliga Arkivet. 2017. Available online: http://www.diva-portal.org/smash/get/diva2:1087619/FULLTEXT01.pdf (accessed on 14 November 2019).
22. PatriotFest. 2019. Available online: https://www.patriotfest.ro/ (accessed on 19 October 2019).
23. IBM (International Business Machines). IBM Cloud. 2019. Available online: https://console.bluemix.net/catalog/services/knowledge-studio (accessed on 11 October 2019).
24. International Business Machines Corporation. IBM Cloud. 2019. Available online: https://cloud.ibm.com/ (accessed on 8 October 2019).
25. IBM (International Business Machines). IBM Cloud. 2019. Available online: https://console.bluemix.net/docs/services/discovery/index.html#about (accessed on 11 October 2019).
26. Stanford University School of Medicine Stanford Center for Biomedical Informatics Research. Protégé. 2018. Available online: https://protege.stanford.edu/ (accessed on 16 November 2019).
27. Postman, H.Q. Postman. 2019. Available online: https://www.getpostman.com/ (accessed on 17 November 2019).
28. Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A.A.; Lally, A.; Murdock, J.W.; Nyberg, E.; Prager, J.; et al. The AI Magazine 110. 2010. Available online: http://www.aaai.org/Magazine/Watson/watson.php (accessed on 20 September 2019).
29. Kapoor, S.; Zhou, B.; Kantor, A. IBM Developer. 2017. Available online: https://developer.ibm.com/tv/deep-dive-watson-neural-networks/ (accessed on 19 October 2019).
30. MITRE Corporation. Common Vulnerabilities and Exposures. 2019. Available online: https://cve.mitre.org/ (accessed on 11 October 2019).
31. IBM (International Business Machines). IBM Cloud. 2019. Available online: https://cloud.ibm.com/docs/services/knowledge-studio?topic=knowledge-studio-evaluate-ml (accessed on 11 October 2019).
32. CoNLL. Shared Task Evaluation. 2018. Available online: https://universaldependencies.org/conll18/evaluation.html (accessed on 14 December 2019).
33. The Stanford Natural Language Processing Group. 2018. Available online: https://nlp.stanford.edu/software/CRF-NER.shtml (accessed on 19 October 2019).
34. Georgescu, T.M. Modelarea Bazată pe Volume Mari de Date. Securitate Cibernetică în Contextul Big Data (Eng. “Modelling Based on Large Volumes of Data. Cybersecurity in the Context of Big Data”). Ph.D. Thesis, The Bucharest University of Economic Studies, Bucharest, Romania, 2019.

Figure 1. The architecture of Cybersecurity Analyzer.

Figure 2. The starting page of the web application available at www.cybersecurityanalyzer.com.

Figure 3. The entities and relations of the ontology. Legend: The circles represent classes, the arrows symbolize relations between classes and the rectangles symbolize the types of relations

Figure 4. The structure of document sets used to train the model.

Figure 5. Named entity recognition (NER) training using Watson Knowledge Studio.

Figure 6. Relation extraction training using Watson Knowledge Studio.

Figure 7. The representation of the relation between two entities in JSON format.

Figure 8. The evolution of F1 score, precision and recall.

Figure 9. The NER performance indicators’ values for each class.

Figure 10. The values of the performance indicators for relation extraction.

Figure 11. Overview of the identified entities and classes.

Figure 12. The entities identified for the class Attack.

Figure 13. The representation of relations between entities.

Table 1. Relations between the Ontology’s Classes.

Relation Type No.	Relation Name	Relation No.	Parent Class	Child Class
1	attacks	1	Attacker	Account
1	attacks	2	Attacker	Software
2	data flow	3	Address	Networking
2	data flow	4	Networking	Address
3	exploits	5	Attacker	Vulnerability
3	exploits	6	Exploit	Vulnerability
3	exploits	7	Malware	Vulnerability
3	exploits	8	Offensive Means	Vulnerability
4	generates	9	Attack	Event
4	generates	10	Attack	Impact
4	generates	11	Exploit	Impact
4	generates	12	Threat	Risk
4	generates	13	Vulnerability	Risk
5	has	14	Host	Address
5	has	15	Software	Address
5	has	16	Software	Vulnerability
6	includes	17	Attack	Networking
7	involves	18	Attack	Malware
7	involves	19	Attack	Offensive Means
7	involves	20	Event	Loss
7	involves	21	Impact	Loss
7	involves	22	Risk	Loss
7	involves	23	Vulnerability	Threat
8	logs in	24	Account	Software
9	performs	25	Attacker	Attack
9	performs	26	Attacker	Exploit
10	protects	27	Defensive Means	Account
10	protects	28	Defensive Means	Software
11	runs on	29	Software	Host
12	uses	30	Attacker	Malware
12	uses	31	Attacker	Offensive Means
12	uses	32	Software	Networking
12	uses	33	User	Account
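
The relations of Table 1 can be encoded as a lookup structure and used to reject relation candidates that the ontology does not allow. The sketch below is a transcription of Table 1 as read here, so the grouping of rows under each relation name should be checked against the published table; the names `ONTOLOGY` and `is_valid_relation` are hypothetical and do not come from the author's implementation.

```python
# Relation name -> allowed (parent class, child class) pairs, per Table 1.
ONTOLOGY = {
    "attacks": [("Attacker", "Account"), ("Attacker", "Software")],
    "data flow": [("Address", "Networking"), ("Networking", "Address")],
    "exploits": [("Attacker", "Vulnerability"), ("Exploit", "Vulnerability"),
                 ("Malware", "Vulnerability"), ("Offensive Means", "Vulnerability")],
    "generates": [("Attack", "Event"), ("Attack", "Impact"), ("Exploit", "Impact"),
                  ("Threat", "Risk"), ("Vulnerability", "Risk")],
    "has": [("Host", "Address"), ("Software", "Address"),
            ("Software", "Vulnerability")],
    "includes": [("Attack", "Networking")],
    "involves": [("Attack", "Malware"), ("Attack", "Offensive Means"),
                 ("Event", "Loss"), ("Impact", "Loss"), ("Risk", "Loss"),
                 ("Vulnerability", "Threat")],
    "logs in": [("Account", "Software")],
    "performs": [("Attacker", "Attack"), ("Attacker", "Exploit")],
    "protects": [("Defensive Means", "Account"), ("Defensive Means", "Software")],
    "runs on": [("Software", "Host")],
    "uses": [("Attacker", "Malware"), ("Attacker", "Offensive Means"),
             ("Software", "Networking"), ("User", "Account")],
}


def is_valid_relation(relation: str, parent_class: str, child_class: str) -> bool:
    """True if the ontology allows `relation` from parent_class to child_class."""
    return (parent_class, child_class) in ONTOLOGY.get(relation, [])


print(is_valid_relation("exploits", "Attacker", "Vulnerability"))
print(is_valid_relation("exploits", "Vulnerability", "Attacker"))
```

Relations are directional, so swapping the parent and child classes of a valid relation generally yields an invalid one.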

Table 2. The Relations between Entities.

Relation Type	Parent Entity	Parent Class	Child Entity	Child Class
performs	Attacker	Attacker	Exploit	Exploit
performs	Attacker	Attacker	cause	Attack
performs	Attacker	Attacker	denial of service	Attack
generates	Exploit	Exploit	denial of service	Impact
generates	cause	Attack	denial of service	Impact
generates	denial of service	Attack	denial of service	Impact

Table 3. The Comparison of Performance Indicators for Cybersecurity Analyzer and Other Similar Projects.

Project	F1 Score for NER	Number of Classes	F1 Score for Relation Extraction	Number of Relations
Cybersecurity Analyzer	0.81	18	0.58	33
[18]	0.8	10	N/A	N/A
[8]	0.68	17	0.46	30
[19]	0.67	4	0.55	2
[20]	0.49	5	0.19	N/A
[21]	0.73	5	N/A	N/A

Legend: N/A—not available.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
