Article

Inference-Based Information Relevance Reasoning Method in Situation Assessment

by Shan Lu * and Mieczyslaw Kokar
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA
* Author to whom correspondence should be addressed.
Information 2024, 15(10), 651; https://doi.org/10.3390/info15100651
Submission received: 2 August 2024 / Revised: 9 September 2024 / Accepted: 10 October 2024 / Published: 17 October 2024
(This article belongs to the Special Issue Feature Papers in Information in 2023)

Abstract

The growing volume of information available to decision-makers makes it increasingly challenging to process all data during decision-making. As a result, a method for selecting only relevant information is highly desirable. Moreover, since the meaning of information depends on its context, the decision-making process requires mechanisms to identify the context of specific scenarios. In this paper, we propose a conceptual framework that utilizes Situation Theory to formalize the concept of context and analyze information relevance. Building on this framework, we introduce an inference-based reasoning process that automatically identifies the information necessary to characterize a given situation. We evaluate our approach in a cybersecurity scenario where computer agents respond to queries by utilizing available information and sharing relevant facts with other agents. The results show that our method significantly reduces the time required to infer answers to situation-specific queries. Additionally, we demonstrate that using only relevant information provides the same answers as using the entire knowledge base. Finally, we show that the method can be applied to a limited set of training queries, allowing the reuse of relevant facts to address new queries effectively.

Graphical Abstract

1. Introduction

To make optimal decisions, decision-makers need access to information relevant to their decisions. The information is either stored locally, e.g., in a database, or obtained through communication with other agents (human or computer-based). The information must be understandable, i.e., the agents must understand its meaning. However, the meaning of messages is context-dependent; the same sentence may have different meanings in different contexts. Thus, decision-making must include an accurate assessment of the context. In this paper, we use Situation Theory [1,2,3,4], which introduces the notion of situation as a formal representation of context.
Additionally, the amount of information must be manageable for the decision-maker. Typically, we believe that having too little information can lead to poor decisions and that more information improves the quality of assessment and decision-making. However, today's decision-makers face a new challenge: information overload. Beyond a certain threshold, additional information leads to overload, reducing the efficiency of decision-making [5,6].
Eppler and Mengis have identified five main causes of information overload in their work [7]. As this paper focuses on developing a method to help human decision-makers manage large amounts of information, we consider only the causes related to the information itself. Based on the analysis in their study, we identify the following four main causes of information overload:
1.
Information Quantity: The rise of the World Wide Web and social networking tools has made vast amounts of information available to decision-makers.
2.
Information Diversity: Due to various information technology applications, decision-makers must handle different types of information sources and integrate them into a consistent representation.
3.
Information Complexity: To make correct decisions, decision-makers need to interpret the complex semantic relationships among different information items and between information sources.
4.
Irrelevant Information: While large volumes of data are continuously provided, only a small portion is necessary for decision-makers’ queries; the majority of irrelevant information acts as “noise”.
To improve decision-making efficiency, effective countermeasures are required to address information overload. Various information processing approaches have been developed, such as information retrieval [8,9], information filtering [10], and information salience ranking [11], which partially mitigate some of the aforementioned issues. However, challenges related to information diversity, information complexity, and irrelevant information persist.
One way to address the complexity and diversity of information is through abstraction—transforming specific cases into instances of a high-level conceptual pattern known as an information model. An information model [12] represents concepts, relationships, constraints, rules, and operations to specify data semantics within a chosen domain of discourse. When an information model is expressed in a language with formal syntax and semantics, it enables computers to parse, interpret, integrate heterogeneous information, and automatically derive additional information from the provided input.
Ontologies can serve as explicit, formal representations of such models [13,14]. Specifically, a situation model represented by an ontology can perform the following:
1.
Automatically infer important relationships among collected information entities.
2.
Reduce the number of possible relationships by including only domain-specific ones.
3.
Automatically identify relevant information for describing situations.
The approach presented in this paper is based on ontology-based modeling. This choice in itself is not novel, as many other approaches also utilize ontologies.
This paper addresses the problem of relevance reasoning in situation assessment, specifically the identification of facts pertinent to a specific information analysis “goal”. To achieve this, the concept of “relevance” must be formally defined. While relevance is intuitively understood by humans, it needs to be translated into an algorithmic form to enable the automatic identification of relevant information in decision-making contexts.
When decision-makers handle specific situations and access informational resources, they often receive vast volumes of data, only a fraction of which is truly valuable. It is impossible for humans to process such large amounts of information quickly, and the surplus of irrelevant data imposes a cognitive burden on analysts. Moreover, when decision-makers need to exchange information about situations, whether the situations they are currently in or other circumstances, it is unclear what information is relevant and should be transmitted over potentially overloaded communication links. Sending excessive information over links with limited bandwidth is time-consuming. Thus, it would be more efficient to exchange the minimum necessary information while ensuring all relevant data are included. Therefore, a method is needed to help decision-makers identify and exchange relevant information efficiently.
The primary objective of this paper is to develop an inference-based information relevance reasoning method to automatically identify relevant information for characterizing the situations that decision-makers face. This paper makes the following contributions:
1.
A conceptual framework of information relevance is developed, interconnecting key concepts from Situation Theory (ST) [1,15], situation awareness [16], and ST ontology (STO-L) to define relevance.
2.
The conceptual framework of information relevance is implemented as an inference-based process for reasoning about information.
3.
The introduced information relevance reasoning process is applied to the cybersecurity domain.
Experimental results show that the proposed method significantly reduces the inference time needed to answer a query. Additionally, it is possible to apply the proposed method to a limited number of training queries and reuse the resulting relevant facts to answer new queries (this applies specifically to cases with a high degree of regularity in the dataset, as was the case in the data used in our experiment).
The rest of this paper is organized as follows: Section 2 reviews related work, providing motivation for our framework and highlighting approaches developed by others that serve as a starting point for our research. In Section 3, we discuss the conceptual framework of information relevance. A cybersecurity use case is introduced in Section 4, followed by a discussion on the approach for information relevance reasoning in Section 5. The method is then evaluated in a cybersecurity scenario in Section 6. Finally, the paper is summarized in Section 7.

2. Related Work

The main ideas presented in this paper were initially formulated by us in [17,18], and culminated in the PhD dissertation [19]. The convergence of the ideas presented in these three publications occurred upon realizing that the notion of “situation” is closely tied to the notion of “relevance”. Specifically, descriptions of situations should only capture the facts pertinent to an agent’s “goal”, rather than everything known. Therefore, this section begins with a review of the concept of relevance, followed by a discussion on the concept of situation. Lastly, we address the concept of “context”. Research on these concepts spans various fields, including information retrieval, Situation Theory, AI, information fusion, and the semantic web. The subsequent subsections relate our research to these communities.

2.1. Relevance in Information Science

Information scientists have explored and debated definitions of relevance for decades. Generally, relevance is understood as a relation among various information entities. Saracevic published a series of papers providing a framework to relate various interpretations of relevance [20,21,22,23]. In Saracevic’s framework [20], relevance is an n-tuple with several dimensions along which parts can be related, connected, and interpreted. These dimensions are shown in Table 1. Different cases of relevance can be derived by traversing the table and selecting one label from each column A through E to instantiate the placeholders in the top row. For example, for the case of retrieving relevant documents: relevance is the 〈measure〉 of a 〈correspondence〉 existing between a 〈document〉 and a 〈query〉 as determined by 〈user〉.
Mizzaro [24] presented the history of relevance through an exhaustive review of 160 papers, aligning his approach with Saracevic’s conceptual framework.
The conceptual framework of relevance outlined in [20,24] offers broad definitions and classifications of relevance. In this paper, we adopt this fundamental understanding of relevance, tailoring the placeholders to suit the extraction of collections of relevant facts rather than merely relevant documents. Thus, the pattern will be as follows:
Relevance is the measure of a correspondence existing between a set of facts and a query as determined by the user.
In our case, the measure value is binary: a piece of information (a fact) is either relevant or not. The facts are part of descriptions of a world where a situation is occurring. The query is referred to as the goal, and the facts deemed relevant constitute what is later referred to as an “abstract situation” in the paper.

2.2. Relevance in Information Retrieval

In information retrieval (IR), the objective is to identify documents relevant to a specific information request. Information retrieval methods can be classified into syntactic (e.g., by keywords or word frequency in the document) and semantic (e.g., based on the formalization of contexts) [25]. For this paper, only context-based methods (with context being similar to situation) are relevant, particularly those representing queries and documents as logical formulas. For example, in [26,27], the information content of a document is represented by a formula d, and the query (information need) by a formula q. A document d is selected if the implication $d \rightarrow q$ can be proven, meaning that the query formula can be inferred from the document formula. This can be viewed as the document satisfying (or being relevant to) the query.
Levy [28] offered a slightly different perspective on relevance reasoning by focusing on identifying irrelevant information. His main idea is that some facts are irrelevant if they are not part of any derivation of an answer to the query, while others are irrelevant if all answers can still be derived without those facts. Levy’s research has significantly influenced our approach.

2.3. Context and Situation Theory

The concept of context has been explored and described in numerous papers. Given our interest in the formal derivation of answers, we limit our review to papers that describe formal approaches, often referred to as the logicist approach.
We begin our analysis with the approach proposed by McCarthy [29], which was later expanded by Guha [30] and others. McCarthy observed that a proposition is true only within a specific context. Guha formalized this by introducing the ist(c, p) predicate to capture the fact that p is true in context c. This approach was further formalized by Buvač and Mason [31]. These approaches address the problem of context being less than the whole world, but they do not resolve the issue of the "intentionality" of specific contexts. In other words, contexts are merely collections of sentences provided to the system, without a methodology for extracting user-intended specific contexts.
Therefore, the next logical steps are to consider the aspect of the “partiality” of knowledge and provide a way to focus the context based on the user’s intent. The first issue has been extensively investigated by Barwise, Perry, Devlin, and others under the name of Situation Theory (ST) [1,2,3,4]. ST addresses partiality by introducing the notion of “situation” and a new concept of logical truth that holds in a specific situation, as opposed to the traditional notion of truth based on possible worlds. In this paper, we address the issue of intentionality by using the notion of “goal” (or “query”), following Saracevic’s structure of relevance [20]. ST deals with the semantics of linguistic expressions and will be discussed in more detail in Section 3.1.
Akman and Surav were the first to treat contexts as situation types defined by Situation Theory (ST) and provide logical inference [32]. Tin and Akman initially analyzed Lisp-based languages—PROSIT and ASTL—and subsequently developed the BABY-SIT system [33] to represent and reason with situations. This line of research has been largely forgotten, possibly due to the rise of object-oriented programming over procedural programming. Another potential reason is the difficulty in providing a unified, low-complexity treatment of n-ary predicates in the “infons” of ST.
Instead, we propose an approach to implementing Situation Theory (ST) using ontological representations of both ST and domain knowledge, which offers a more human-oriented conceptualization. In our approach, the ontology allows for the addition of parameters beyond those introduced by Barwise (such as location, time, etc.). Furthermore, formalizing ontologies in OWL (Web Ontology Language [34]) provides decidable inference mechanisms (PTIME Complete). The issue of n-ary relations is tackled using rules provided by the OWL 2 RL profile. Also, adopting the OWL2 RL profile has implications for expressiveness, aligning with the approach advocated by the semantic web community.
Davidson [35] proposed a theory of the semantics of linguistic expressions that competes with Situation Theory (ST). He considered "events" as the fundamental metaphysical construct irreducible to other categories, associated with verbs in natural language expressions. This paper does not aim to delve into metaphysical debates, as providing a definitive proof of one approach's superiority over another is beyond its scope. The literature lacks consensus on which metaphysical framework is preferable, with arguments supporting both perspectives. According to Zucchi [36], "Davidsonian events may be considered minimal situations of certain types". Following this line of reasoning, we focus on ST and treat events as a special kind of situation.
Dapoigny and Barlatier [37] presented an alternative approach to representing and reasoning about situations using dependent type theory [38]. This approach leverages the connection between logic and computation via the Curry–Howard isomorphism, transferring logical inference into the realm of computation performed on functions (the λ -calculus). Their method also employs ontologies, though no specific Situation Theory ontology is provided. Their calculus utilizes currying to manage n-ary predicates.
Overall, their approach addresses the same conceptual issues as our paper, differing primarily in the use of functional programming versus logical inference. Another distinction is the complexity of the input required from the user: our approach requires the user to provide a query, while theirs requires a definition of a situation. While a conceptual comparison of these approaches is possible, there is no specific implementation or quantitative evaluation available for a more detailed comparison. Additionally, as noted by the authors, their approach suffers from a lack of tool availability. They suggest that combining the two formalisms could potentially extend ST with additional features beneficial for certain applications. It would be interesting to see the results of such an integration.
To complete the review of ST-related work, we include the paper by Stocker et al. [39], which demonstrates the application of an ST-based approach to vehicle detection and classification using road-pavement vibration measurements. In this paper, situational knowledge is encoded using the STO [40] to represent knowledge about real-world situations acquired from sensory observations. Specifically, the situational knowledge represents the traffic observed by a pavement vibration sensor network. While this paper does not propose a new generic solution for situational awareness methodology, it serves as an excellent example of using ST in a practical application.

2.4. Situations in the Semantic Web

Gangemi and Mika [41] have described an ontology called D&S (Descriptions and Situations) for representing contextual information. Their conceptual framework includes the following: (1) a state of affairs (SoA) represented by a set of assertions, (2) a description that partly represents a theory T conceived by an agent, and (3) a situation, which is a unitarian entity constituted by the entities and relations mentioned in the SoA assertions.
This is a Semantic Web-focused view of the notion of situation, strongly influenced by the DOLCE ontology. It differs from other conceptualizations of situations by McCarthy, Barwise, Akman/Surav, and Dapoigny. In fact, ref. [41] explicitly states that “Situation in D&S is not related to situations in situation calculus”. We believe one might draw parallels between ST and D&S by relating descriptors to abstract situations and viewing the relation between SoAs and Barwise’s situation semantics as different approaches to capturing the partiality of situations with respect to whole-world semantics. However, proving this conjecture would require an in-depth study of this conceptualization to establish a more complete view of its relationship to other approaches in the literature.

2.5. Situation Assessment in Information Fusion

The most commonly used definition of situation awareness in the information fusion community was provided by Endsley [16]. In her framework, situation awareness is considered the state of knowledge of an agent, implicitly or explicitly represented in a “mental model”. This model captures what is relevant from the perspective of a goal. The process of achieving situation awareness, called situation assessment, consists of three phases: perception, comprehension, and projection. While this paper aligns with Endsley’s framework, our focus is on the process of determining what is relevant, specifically establishing the contents and boundaries of a model.
The information fusion research community employs a multi-level hierarchical reference model known as the Joint Director of Labs (JDL) Data Fusion Model [42]. Situation estimation is at Level 2 of the information processing hierarchy, involving the estimation of relations among objects. Blasch and Plano [43] proposed adding Level 5—user refinement—to capture aspects of user attention control, among other things. This paper aligns with attention control and proposes a viable solution for it.
Endsley’s and the JDL models have been used in many situation assessment applications, such as [44,45,46], where the notion of relevance is crucial. However, none of these models provide a formal definition of a situation or algorithms to determine what is relevant. The conceptual framework of relevance reasoning presented in this paper extends Endsley’s and the JDL models. An overview of the relationship between our approach and the information fusion approaches is described in [47].
In summary, this section outlines the core findings of our literature review. We began with the concept of relevance, moved to research focused on identifying entire documents as relevant information, and then narrowed our focus to contexts and situations.

3. Conceptual Framework of Information Relevance Reasoning

This section provides a brief overview of the main concepts of Situation Theory (ST) and the ST ontology. Following this, we describe the formalization of information relevance using the ST ontology.

3.1. Situation Theory

Situation Theory (ST) aims to define the meaning of “information” and “information content”. Its application to natural language is known as situation semantics [4]. The principles of ST are outlined as follows:
1.
Knowledge about situations is represented by objects called “infons”.
Infons are written as follows:
$\sigma = \langle\langle R, a_1, \ldots, a_n, 0/1 \rangle\rangle$
where $R$ is an $n$-place relation and $a_1, \ldots, a_n$ are objects appropriate for $R$. The last item in an infon is the polarity of the infon. Its value is either 1 (if the objects stand in the relation $R$) or 0 (if the objects do not stand in the relation $R$).
2.
Information is in relations between situations and infons, denoted as the "supports" relation, represented as $\models$.
The existence of the supports relation between a situation and a collection of infons makes the infons "factual". Given an infon $\sigma$ and a situation $s$, the proposition "$s$ supports $\sigma$" is written as follows:
$s \models \sigma$
3.
The information about the world is relative to an agent who perceives situations via its “scheme of individuation”, aka “agent connection”.
The role of agent connection is shown graphically in Figure 1. Agents use their individuation scheme to recognize objects in the world, their types, properties and relations with other objects. The results of individuation are represented as infons.
4.
Situations can be classified by situation types.
ST provides a number of basic types, e.g., SIT for situation types, TIM for the type of time, LOC for location, and more, as well as an infinite number of parameters of each type. Given a situation parameter $\dot{s}$ and an infon $\sigma$, there is a corresponding situation type $S_\sigma$, the type of situations in which $\sigma$ obtains:
$S_\sigma = [\dot{s} \mid \dot{s} \models \sigma]$
The process of obtaining a situation type is called situation type abstraction.
5.
Abstract situation is a collection of infons that are supported by a given situation.
For a given situation $s$, the abstract situation $I_s$ is defined as the set of infons $\iota$ that are supported by $s$:
$I_s = \{\iota \mid s \models \iota\}$
6.
Since situations do not expose themselves as objects, individuation schemes involve (logical) inference based on the infons, which derives the existence of a situation.
Inference is carried out within theories that are part of classifications, which include types (as mentioned in point (v) above) and an "entailment" relation (⊢). In our approach, as a first step, we use inference to derive from the available knowledge, $A$, whether a situation $s$ of a specific type $S_\sigma$ exists:
$A \vdash_s s : S_\sigma$
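To make these notions concrete, the following minimal Python sketch shows one way infons, situations, and the supports relation could be represented in code. The class and field names are illustrative only; they are not part of ST or of our implementation.
```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass(frozen=True)
class Infon:
    """An infon <<R, a1, ..., an, polarity>>: a relation, its arguments, and a polarity."""
    relation: str
    args: Tuple[str, ...]
    polarity: int = 1  # 1 if the objects stand in the relation, 0 otherwise

@dataclass
class Situation:
    """A situation together with the infons it supports (its abstract situation I_s)."""
    name: str
    supported: set = field(default_factory=set)

    def supports(self, infon: Infon) -> bool:
        """s |= sigma: does this situation support the given infon?"""
        return infon in self.supported

# Example: a situation supporting <<isRelativeOf, Alice, Bob, 1>>
s = Situation("s1", {Infon("isRelativeOf", ("Alice", "Bob"), 1)})
print(s.supports(Infon("isRelativeOf", ("Alice", "Bob"), 1)))  # True
```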

3.2. Situation Theory Ontology (STO-L)

STO-L is a formalized approximation of Barwise’s situation semantics represented as an ontology using OWL. Figure 2 shows the top classes of STO-L:
Situation: It is the central class of STO-L. Specific situation types in ST are subclasses of this class. Instances of the Situation class represent particular situations.
Infon: It is a subclass of the Situation class. Infons are used to define situation types. Instances of this class are objects that correspond to queries about situations. The properties of the instances of Infon (relation and arg1, arg2, …) capture the semantics of agents’ queries in STO-L.
In ST notation, the connection between a situation and an infon is expressed using the supports construct (⊧). STO-L models this relation using the subclass relation of OWL. This relation is verified using entailment. In order for this relation to hold, the reasoner needs to prove that the Infon class can be satisfied, i.e., it is not empty, as represented by Equation (5). In such a case, the value of the polarity property for this infon is set to '1'. While in ST all knowledge is represented by infons, in STO-L only the infons that are related to a query about a situation are instances of the class Infon. The rest of the facts in STO-L are represented by RDF triples. An RDF triple consists of three parts: <subject, predicate, object>.
Individual: Individuals are instances of classes from a specific domain involved in a particular situation. They connect to a situation via the relevantIndividual property. Individuals are also the components of the relations in some infons supported by the situation. They are connected to the relation via anchor1, anchor2…properties. Individuals can have some attributes, which consist of values and dimensionalities.
Relation: Relation instances are components of the infons supported by some situations. They are connected to a particular infon via the relation property and to situations via the relevantRelation property. The types of the individuals that are the arguments of a relation can be connected to the relation via arg1Type, arg2Type…properties. Whether a relation holds in a situation can be derived by either an OWL reasoner, or by some rules—instances of the class Rule.
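As an illustration of how these STO-L classes and properties might appear as concrete RDF triples, the following sketch uses the rdflib library; the namespace IRI is a placeholder, and the property names simply follow the description above.
```python
from rdflib import Graph, Namespace, RDF, RDFS

STO = Namespace("http://example.org/sto-l#")  # placeholder IRI for STO-L
g = Graph()

# A situation s of type Sq, where Sq is a query-specific subclass of Infon
g.add((STO.Sq, RDFS.subClassOf, STO.Infon))
g.add((STO.s, RDF.type, STO.Sq))

# The situation's relevant relation and relevant individuals
g.add((STO.s, STO.relevantRelation, STO.isRelativeOf))
g.add((STO.s, STO.relevantIndividual, STO.Alice))

# Anchoring the relation's arguments to concrete individuals
g.add((STO.isRelativeOf, STO.anchor1, STO.Alice))
g.add((STO.isRelativeOf, STO.anchor2, STO.Bob))

print(g.serialize(format="turtle"))
```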

3.3. Formalization of Relevance Using Situation Theory

3.3.1. Information Flow in Situation Assessment

Figure 1 shows the relation between the real world and an agent. The circle labeled s is a situation happening in the real world. According to ST, the agent acquires information about s via Agent Connection and asserts it in its knowledge base (the rectangle labeled A). The agent's automatic inference capability is used to infer additional facts and assert them into A. Thus, A contains all of the agent's knowledge: the domain knowledge (facts and rules) acquired via Agent Connection, as well as the inferred knowledge.
A query is a perspective that gives focus to an agent, indicating what should be considered for understanding a situation. In Figure 1, the arrow from a query to a situation in the world means "query about a situation". Q denotes the query expression. To answer the query Q, the agent needs to use some of the knowledge in A, i.e., some of the terms in the triples, to substitute for the variables in the query. This knowledge is depicted by the rectangle labeled $I_s$, which is the abstract situation. In ST, $I_s$ is a set of infons supported by a situation (Equation (4)). In our formalization, $I_s$ is a set of facts (RDF triples) that characterize a situation. The arrow from $I_s$ to the query Q means "answering a query". We will explain the formalization details in the following sections.

3.3.2. Mapping Queries in ST to STO-L

Queries are initially represented as expressions in natural language and then translated into a query language. For example, consider the natural language query: Is Alice a relative of some person? In ST, such a query would have to be formulated as Is there a situation, in which Alice is relative of someone? and formally expressed using the ST notation as follows:
$s \models (\exists \dot{p} \in Person)\ \langle\langle isRelativeOf, Alice, \dot{p} \rangle\rangle$
where $s$ represents a situation and variables are represented using parameters, e.g., $\dot{p}$ is a variable of the type Person. The query whether "$s$ supports $\sigma$" is represented formally using an infon with an unknown polarity.
The above ST expression can be formalized as an INSERT query in SPARQL [48] (Listing 1). In SPARQL, the WHERE clause is the main part of a query. It is a graph pattern, which is a set of RDF triple patterns—triples where some of the elements (subject, predicate, object) may be variables. Note that the SPARQL query is more detailed than Equation (6). This is because Equation (6) is intended for human interpretation and does not need to make everything explicit. Since the SPARQL query is processed by a SPARQL interpreter, the various facts need to be explicitly stated, following the SPARQL syntax and the STO-L ontology.
Listing 1. A SPARQL representation of the query in Equation (6).
INSERT
    {Sq subClassOf Infon.
     s rdf:type Sq.
     s relevantRelation isRelativeOf.
     s relevantIndividual Alice.
     s relevantIndividual ?p.}
WHERE
    {isRelativeOf rdf:type BinaryRelation.
     isRelativeOf anchor1 Alice.
     isRelativeOf anchor2 ?p.}
As shown in Listing 1, if all the statements in the WHERE clause are inferred and matched by a SPARQL engine, the statements in the INSERT clause are added to A.
In other cases, the relation in a query may not be binary. For example, consider a query that involves a ternary relation: Is there a situation in which Alice hires someone as the manager? This query can be expressed in ST notation as follows:
$s \models (\exists \dot{p} \in Person)\ \langle\langle hires, Alice, \dot{p}, manager \rangle\rangle$
It will be formalized in SPARQL as shown in Listing 2:
Listing 2. A SPARQL representation of the query in Equation (7).
INSERT
    {Sq subClassOf Infon.
     s rdf:type Sq.
     s relevantRelation hires.
     s relevantIndividual Alice.
     s relevantIndividual ?p.
     s relevantIndividual manager.}
WHERE
    {hires rdf:type TernaryRelation.
     hires anchor1 Alice.
     hires anchor2 ?p.
     hires anchor3 manager.}
The two query examples shown above are simple queries, each containing only one infon. However, in some cases, the query may include more information that cannot be expressed with a single infon. In these instances, the query needs to be represented in ST using a compound infon [15]. Consider the complex query: Is there a situation in which Alice hires someone as a manager who is also her relative? This query can be represented in SPARQL as a combination of the clauses from Listings 1 and 2.
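As an illustration, the following Python sketch assembles one possible SPARQL form of this compound query by merging the INSERT and WHERE clauses of Listings 1 and 2; the exact textual combination shown here is our own and is not taken verbatim from the implementation.
```python
# One possible combined query for: "Is there a situation in which Alice hires
# someone as a manager who is also her relative?" (merging Listings 1 and 2).
compound_query = """
INSERT
    {Sq subClassOf Infon.
     s rdf:type Sq.
     s relevantRelation isRelativeOf.
     s relevantRelation hires.
     s relevantIndividual Alice.
     s relevantIndividual ?p.
     s relevantIndividual manager.}
WHERE
    {isRelativeOf rdf:type BinaryRelation.
     isRelativeOf anchor1 Alice.
     isRelativeOf anchor2 ?p.
     hires rdf:type TernaryRelation.
     hires anchor1 Alice.
     hires anchor2 ?p.
     hires anchor3 manager.}
"""
print(compound_query)
```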

3.3.3. Inference in Query Answering

Expressing queries in SPARQL is the first step in relevance reasoning. Following this, a series of inference steps are invoked to answer the query. For the query example in Listing 1, the triples in the WHERE clause will be inferred using the generic inference rule shown in Listing 3. After these inferences, the query in Listing 1 will be answered by the SPARQL engine.
Listing 3. A pre-processing rule to infer the anchors of a binary relation.
INSERT
    {?r anchor1 ?a1.
     ?r anchor2 ?a2.}
WHERE
    {?r rdf:type BinaryRelation.
     ?r arg1Type ?A1.
     ?r arg2Type ?A2.
     ?a1 rdf:type ?A1.
     ?a2 rdf:type ?A2.
     ?a1 ?r ?a2.}
For the query in Listing 2, to infer that hires holds and to determine the anchors of this relation, an additional pre-processing rule for the ternary relation is required (Listing 4).
Listing 4. A pre-processing rule to infer the anchors of a ternary relation.
INSERT
    {?r anchor1 ?a1.
     ?r anchor2 ?a2.
     ?r anchor3 ?a3.}
WHERE
    {?r rdf:type TernaryRelation.
     ?r arg1Type ?A1.
     ?r arg2Type ?A2.
     ?r arg3Type ?A3.
     ?a1 rdf:type ?A1.
     ?a2 rdf:type ?A2.
     ?a3 rdf:type ?A3.}
This rule differs from the binary relation rule in Listing 3. For a binary relation, the fact that ? r holds for two specific arguments ? a 1 and ? a 2 can be represented as an RDF triple <?a1, ?r, ?a2> and inferred using OWL inference. Higher arity relations cannot be represented this way in OWL, as it only supports binary relations. Therefore, higher arity relations need to be “reified”, and inference must use rules. Consequently, the triple <?r, rdf:type, TernaryRelation> representing a class of (in this case ternary) relations must exist in the knowledge base. To represent an instance of a ternary relation (a quadruple), its three arguments need to be in a (binary) relation with an instance of TernaryRelation, which in turn must be related to three attributes via arg1Type, arg2Type, arg3Type. All these triples must be asserted into the knowledge base by a domain-specific rule (not shown here) so that OWL can infer that a respective situation exists. The issue of representing n-ary relations in OWL has been recognized by the semantic web community, which has developed patterns for such representations [49], and various approaches have been studied, e.g., [50].
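To make the reification pattern concrete, the sketch below (using rdflib, with placeholder IRIs) shows the kinds of triples involved in representing an instance of the ternary hires relation; the anchor triples are the ones that the pre-processing rule in Listing 4 would insert.
```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/domain#")  # placeholder domain namespace
g = Graph()

# The reified relation is an instance of TernaryRelation...
g.add((EX.hires, RDF.type, EX.TernaryRelation))

# ...its argument types are declared via arg1Type/arg2Type/arg3Type...
g.add((EX.hires, EX.arg1Type, EX.Person))
g.add((EX.hires, EX.arg2Type, EX.Person))
g.add((EX.hires, EX.arg3Type, EX.Role))

# ...and its concrete arguments are attached via anchor1/anchor2/anchor3.
g.add((EX.hires, EX.anchor1, EX.Alice))
g.add((EX.hires, EX.anchor2, EX.Bob))
g.add((EX.hires, EX.anchor3, EX.manager))
```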

3.3.4. Relevant Information

The previous subsections illustrate how a query Q about a situation s, as depicted in Figure 1, is formalized first in ST and then in SPARQL. The answer to a query is derived logically, as demonstrated in the listings. Since this derivation process requires special inference rules, as described above, we represent these rules as $\vdash_s$ and the derivation process as
$A \vdash_s Q$
The above derivation does not necessarily require all the facts in A. As indicated in Figure 1, $I_s$ represents an abstract situation corresponding to the real-world situation s which is sufficient to answer Q. Our next step is to determine which facts from A should thus be included in $I_s$. To formalize the informal definition of relevance provided in Equation (1), we define $I_s$ as the information relevant to a situation s with respect to the query Q and the set of facts A, according to the following equation:
$I_s = \operatorname*{arg\,min}_{I \subseteq A,\; I \vdash_s Q} |I|$
In other words, this definition states that I s is the smallest subset of the set of facts A that is sufficient for answering the query Q. It does not guarantee uniqueness, meaning there may be multiple subsets of A of the same size that are sufficient to derive the answer to Q. The algorithms for identifying relevant information are detailed in Section 5.

4. A Cybersecurity Use Case

We consider a use case from the cybersecurity domain (Figure 3). In a cyber situation assessment scenario, a team of cybersecurity analysts collaborates to protect an organization’s network and systems from cyberattacks. The primary task of the analysts is to answer queries related to attack situations based on network traffic analysis. During this process, they may need to exchange information about the situations they are investigating. Additionally, they might send queries to their collaborators, expecting both answers to their queries and the supporting data. Once they receive the data, they can use their analysis tools to verify the answers.
In our experiments, we used the MACCDC 2012 dataset [51], which contains over 100 million records of raw network traffic data collected over two days of monitoring. Analysts utilize Snort [52], a signature-based Intrusion Detection System (IDS), to monitor the network traffic for malicious activity or policy violations. The Snort alert dataset generated from the MACCDC 2012 dataset has been published on the SecRepo site [53]. However, it remains challenging for human analysts to comprehend the threat situations by manually analyzing these Snort alerts. Firstly, the volume of Snort alerts is still immense, with over 3 million alert records in the SecRepo dataset. Secondly, these alerts represent low-level attacks or anomalies that are independent of each other, although there may be logical connections between them [54]. Analysts need to construct attack situations from these low-level alerts. To achieve this, they must employ logical reasoning, which can be supported by logical inference tools like the one used in our experiments.
Sharing situational information is often necessary in various scenarios, such as when two analysts are monitoring different sub-networks of an organization. The Snort alerts monitored by each analyst are generated from different local sub-networks. If Analyst 2 sends a query to Analyst 1 asking whether a specific situation type has been observed, what information should Analyst 1 provide in response? We assume that replies to queries should include only the relevant situational information necessary to characterize the given situation and answer the query.
To achieve this, an automatic information relevance reasoning process is needed to help human analysts construct attack situations and identify the relevant information for characterizing specific attack situations.
The scenario described above can be generalized as a UML-style use case diagram (see Figure 4). As shown in this figure, the Analyst actor performs two tasks: querying a data store and sharing information. The «includes» relation indicates that the functionality of sharing must involve the identification of relevant information, which in turn requires the formulation of queries.
The cybersecurity scenario used in this paper is just a special case of this use case. We propose that the approach to identifying relevant information described in this paper is applicable to any instance of the use case where a user needs to identify information relevant to a query based on logical inference and then share this information with another agent, as depicted in Figure 4. An example of such a scenario could be medical specialists sharing their assessments, diagnoses and/or treatment approaches, with a Primary Care Physician (PCP) or with other specialists. Following the approach presented in this paper, they would first make their own assessments based on the data provided to them, formulate queries, invoke a query processor to answer the query, use a relevance reasoning system to identify the relevant facts, analyze the steps of the inference process, and if the inference is acceptable, send the relevant facts to the other party, who in turn could use a local inference engine to show how the query answer was derived.

5. Relevance Reasoning Process

In this section, we refine the conceptual framework of information relevance presented in Section 3. First, we provide an overview of the proposed relevance reasoning process (see Figure 5). Then, we explain the algorithmic details of each step in the process.

5.1. Process Overview

Input pre-processing: This step annotates Snort alerts as instances of a domain ontology, utilizing the STO-L ontology as the top-level ontology and the STIX ontology [55] for the cybersecurity use case.
The STIX ontology was developed based on the Structured Threat Information eXpression (STIX) standards [56] and other standards, including IODEF (The Incident Object Description Exchange Format), MAEC (Malware Attribute Enumeration and Characterization), CAPEC (Common Attack Pattern Enumeration and Characterization), and CWE (Common Weakness Enumeration). The ontology includes 1303 classes, 161 properties, and 8808 axioms. Only some classes of the STIX ontology are shown in Figure 6.
The Event and Observable classes capture low-level cyber events, such as a Snort alert, which is an instance of Event, with an observable instance created for each alert. Events and related entities are treated as objects in STO-L. The TTP, AttackPattern, and ExploitTarget classes capture high-level cyber attack scenarios. Instances of the Indicator class are fundamental elements of intelligence that connect low-level observables with high-level cyber intelligence concepts. For instance, an indicator for each observable is created, and some indicators can be combined and connected with TTP instances or KillChainPhase instances. The KillChain class represents the multiple steps in an attack, which is a specific situation type. After input preprocessing, the STIX ontology with Snort alert instances becomes a domain knowledge base.
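For illustration, the sketch below annotates a single parsed Snort alert as instances of the Event and Observable classes using rdflib; the alert fields, property names, and IRIs are assumptions made for this example rather than the actual annotation code.
```python
from rdflib import Graph, Literal, Namespace, RDF

STIX = Namespace("http://example.org/stix#")  # placeholder IRI for the STIX ontology
g = Graph()

# A parsed Snort alert (field names are illustrative)
alert = {"id": "alert42", "sig": "ET SCAN Potential SSH Scan",
         "src": "192.168.202.79", "dst": "192.168.229.251"}

event = STIX["event_" + alert["id"]]
observable = STIX["observable_" + alert["id"]]

g.add((event, RDF.type, STIX.Event))
g.add((observable, RDF.type, STIX.Observable))
g.add((event, STIX.hasObservable, observable))              # assumed property name
g.add((observable, STIX.signature, Literal(alert["sig"])))  # assumed property names
g.add((observable, STIX.sourceAddress, Literal(alert["src"])))
g.add((observable, STIX.destinationAddress, Literal(alert["dst"])))
```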
Query Generation: For the purpose of evaluating the approach, a large volume of diverse queries is needed. To achieve diversification, queries are generated automatically based on predefined query pattern templates and the Snort alerts annotated in the knowledge base. The algorithmic details will be explained in Section 5.2.
Domain Inference and Query Answering: The process of applying domain rules to the domain knowledge base to deduce new information using an inference engine is known as domain inference. The process of answering a SPARQL query by matching the statements in the WHERE part of the query using a SPARQL engine is referred to as query answering. The algorithmic details will be explained in Section 5.3.
Extracting Relevant Information: In this step, two algorithms are executed to extract relevant information: the derivation base construction algorithm, which incrementally builds the derivations of query answers and extracts the facts used in these derivations, and the algorithm for finding minimal subgraphs sufficient to answer the query. The algorithmic details will be explained in Section 5.4.

5.2. Query Generator

The Query Generator takes two inputs: query templates and a domain knowledge base. A query template is a pseudo-query containing placeholders. The Query Generator algorithm constructs queries by replacing these placeholders with terms extracted from the user’s questions expressed in natural language. The selected terms are aligned with the types of terms represented in the domain’s vocabulary (ontology) [57]. Specifically, the placeholders are replaced by concrete IRIs from the knowledge base, which encodes the domain terms used in the queries.

5.2.1. Basic Situation Query Template

The basic situation query templates were constructed by extracting the common structures of the situation queries described in Section 3.3.2. Listing 5 shows a basic SPARQL query template. It contains three placeholders: 〈Relation〉, 〈Object1〉, and 〈Object2〉.
Listing 5. The basic query template.
INSERT
    {Sq subClassOf Infon.
     s rdf:type Sq.
     s relevantRelation 〈Relation〉.
     s relevantIndividual 〈Object1〉.
     s relevantIndividual 〈Object2〉.}
WHERE
    {〈Relation〉 rdf:type BinaryRelation.
     〈Relation〉 anchor1 〈Object1〉.
     〈Relation〉 anchor2 〈Object2〉.}
For instance, consider the following query: Is there a situation where indicator1 is associated with scanObservable1? In this case, the placeholders in Listing 5 can be replaced with specific terms from the STIX ontology, such as a relation (e.g., stix:observable) and objects (e.g., stix:indicator1, stix:scanObservable1). This substitution generates a specific query as shown in Listing 6.
We have developed several types of query templates to cover various scenarios: (1) templates for binary relations with three placeholders (〈Relation〉, 〈Object1〉, 〈Object2〉) and no variables, utilizing only types (classes and relations); (2) templates with placeholders for object types and binary relations that include variables; (3) templates with placeholders and variables for n-ary relations; and (4) templates for complex queries involving multiple relations. Together, these templates enable the generation of a wide range of queries to address diverse needs.
Listing 6. SPARQL query example with no variables.
INSERT
    {Sq subClassOf Infon.
     s rdf:type Sq.
     s relevantRelation stix:observable.
     s relevantIndividual stix:indicator1.
     s relevantIndividual stix:scanObservable1.}
WHERE
    {stix:observable rdf:type BinaryRelation.
     stix:observable anchor1 stix:indicator1.
     stix:observable anchor2 stix:scanObservable1.}

5.2.2. Query-Generating Algorithm

There are two methods for using the query templates. Users can manually retrieve relevant IRIs for relation and object terms from the knowledge base and insert them into the appropriate query templates to generate specific SPARQL queries. Alternatively, for testing purposes, a query generation algorithm has been developed to automate this process. This algorithm extracts IRIs for classes and individuals from the knowledge base and randomly replaces placeholders in the query templates with these IRIs, thereby generating all possible SPARQL situation queries for a given knowledge base. It is important to note that not all automatically generated queries align with the WHERE patterns and some may not yield answers. In our analysis, only those queries that matched the patterns were considered.
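A minimal sketch of this query-generating step is shown below: IRIs of candidate relations and objects are collected from the knowledge base and randomly substituted for the template placeholders (written here as <Relation>, <Object1>, <Object2>). The candidate-selection logic is deliberately simplified and is an assumption of this sketch, not the exact algorithm used in our implementation.
```python
import random
from rdflib import Graph

def generate_queries(kb: Graph, template: str, n: int):
    """Randomly fill the <Relation>, <Object1>, <Object2> placeholders of a query
    template with IRIs drawn from the knowledge base (simplified candidate selection)."""
    relations = sorted({str(p) for p in kb.predicates()})  # candidate relation IRIs
    objects = sorted({str(s) for s in kb.subjects()})      # candidate object IRIs
    queries = []
    for _ in range(n):
        query = (template
                 .replace("<Relation>", random.choice(relations))
                 .replace("<Object1>", random.choice(objects))
                 .replace("<Object2>", random.choice(objects)))
        queries.append(query)
    return queries
```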

5.3. Domain Inference and Query Answering

As discussed earlier in this paper, domain knowledge is represented using OWL 2 RL, a syntactic subset of OWL 2. This knowledge is stored natively as RDF triples, and its entailment is governed by OWL 2 RL/RDF entailment rules, which are first-order implications [58,59]. OWL 2 RL ontologies can be extended with additional user-defined axioms through these rules. An inference engine applies these rules to determine whether ontology $A_1$ entails ontology $A_2$. We utilize the BaseVISor inference engine [60] for implementing OWL 2 RL/RDF entailment.
Ontologies are represented as RDF graphs. An RDF graph is a collection of RDF triples. The subjects, objects, and predicates in an RDF graph $A$ are known as RDF terms, denoted by $T(A)$. A triple pattern extends the concept of an RDF triple by including variables in the subject, predicate, and object positions. A graph pattern is a set of such triple patterns. The set of variables in a graph pattern $Q$ is denoted by $var(Q)$.
A rule, $r$, is represented as a pair of graph patterns $(\rho_l, \rho_r)$, where $\rho_l \vdash_r \rho_r$. In this context, $\rho_l$ is referred to as the body or left-hand side of the rule, $\rho_r$ as the head or right-hand side, and $\vdash_r$ denotes the entailment relation that specifies when $\rho_l$ implies $\rho_r$. Both the OWL 2 RL/RDF semantics rules and user-defined rules are represented in this format, and the BaseVISor inference engine applies the same rule execution mechanism to all of them.
In the W3C documentation of OWL 2 Web Ontology Language Profiles [59], there are seventy-eight OWL 2 RL/RDF entailment rules, which are expressed in the form of "if (body) then (head)" statements. Table 2 shows two rules: (1) rdfs:subClassOf axiom: if class $c_1$ is a subclass of class $c_2$, and $x$ is an instance of $c_1$, then it implies that $x$ is an instance of $c_2$; (2) owl:inverseOf axiom: if $p_1$ is inverseOf $p_2$, then $p_2$ is inverseOf $p_1$.
In the STIX ontology, ScanObservable is a subclass of Observable, and observable1 is an instance of ScanObservable. Using the first rule, the OWL reasoner will infer that observable1 is also an instance of the Observable class.
In the STIX ontology, the properties isConsequenceOf and isPrerequisiteOf are inverses of each other, and indicator1 is a consequence of indicator2. The OWL reasoner in Protégé (or the BaseVISor inference engine, used via Java) will infer the fact <stix:indicator2, stix:isPrerequisiteOf, stix:indicator1>. This inference is based on the existing base facts <stix:isConsequenceOf, owl:inverseOf, stix:isPrerequisiteOf> and <stix:indicator1, stix:isConsequenceOf, stix:indicator2> in the knowledge base, which are prerequisites for applying the second rule.
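The inverseOf inference above can also be reproduced with an off-the-shelf OWL 2 RL reasoner. The following sketch uses the owlrl package on top of rdflib; the namespace IRI is a placeholder, and the sketch only illustrates the rule's effect, not our actual inference setup.
```python
import owlrl
from rdflib import Graph, Namespace, OWL

STIX = Namespace("http://example.org/stix#")  # placeholder IRI
g = Graph()
g.add((STIX.isConsequenceOf, OWL.inverseOf, STIX.isPrerequisiteOf))
g.add((STIX.indicator1, STIX.isConsequenceOf, STIX.indicator2))

# Compute the OWL 2 RL closure; derived triples are added to the same graph.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# The inverseOf rule yields the derived fact:
print((STIX.indicator2, STIX.isPrerequisiteOf, STIX.indicator1) in g)  # True
```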
Inferences such as the ones described above are small components of the broader domain inference process. When combined, however, they enable the reasoner to address complex queries and determine which facts are relevant to a specific query.
We differentiate between two types of facts in the knowledge base: base facts (denoted as $A$) and derived facts (denoted as $A^+$), where $A \subseteq A^+$. Base facts are those that were present in the knowledge base prior to domain inference, often imported from external sources such as Snort alerts annotated with the STIX ontology. Derived facts are those that are introduced during the domain inference process.
A query $Q$ is represented as a graph pattern. The results returned by the SPARQL engine for $Q$ are a set of mappings $\{\mu \mid \mu : var(Q) \rightarrow T(A^+)\}$, where each $\mu$ is a partial function. Each query answer $\alpha_n$ is an RDF graph where all variables in $var(Q)$ are replaced by elements from $T(A^+)$ according to a mapping $\mu$. Some queries, particularly those with variables, may yield multiple answers. In such cases, the set of query answers is $\{\alpha_n^1, \ldots, \alpha_n^M\}$.

5.4. Extracting Relevant Information

As indicated in Equation (9), the facts that are used in deriving answers to situational queries are considered the relevant information. Therefore, the primary goals of the relevant information extraction algorithms are the following: (1) to construct the query derivations (see Definition 1) for query answers using the derivation rules (denoted as $\vdash_s$ in Equation (9)), and (2) to extract the facts involved in these derivations. The first objective is achieved by Algorithm 1, which builds a derivation base. This base is then used by Algorithm 2 to identify the minimal derivations. The logic of Algorithm 1 is captured in Definition 1 below.
Definition 1 (Query Derivation).
Given a knowledge base $A$, a set of domain and entailment rules $R$, the knowledge base with derived facts $A^+$, and a query $Q$, a derivation of a query answer is a sequence $\alpha_1, \ldots, \alpha_n$, where $\alpha_n$ is the query answer; $\alpha_1 \subseteq A$; and each $\alpha_i$ $(1 < i < n)$ is a graph pattern such that $\alpha_i \vdash_R \alpha_{i+1}$ for some rules from $R$. A derivation base for the query $Q$ is then $D^Q = \bigcup_{i=1}^{n} \alpha_i \cap A$.
In some instances, a query answer may have multiple derivations from a given knowledge base. We denote the set of derivations for a query answer by $\Sigma^Q$, where each $D_k^Q \in \Sigma^Q$, for $1 \le k \le m$, represents one derivation for $Q$. For queries that yield multiple answers, $\Sigma_j^Q$ represents the set of derivations for $\alpha_n^j$, where $1 \le j \le M$. Using Definition 1, we can refine the definition of relevant information introduced in Equation (9) as follows:
$I_s^Q = \bigcup_{j=1}^{M} \operatorname*{arg\,min}_{D_k^Q \in \Sigma_j^Q} |D_k^Q|$
The relevant information extraction algorithm (Algorithm 1) incrementally constructs a derivation base $D^Q$ for each query answer $\alpha_n^j$ (line 2) and traces the derivation path from $\alpha_n^j$ back to the base facts $\alpha_1^j$. It does this by iterating over the triples in $\alpha_n^j$ (line 5). If a triple is found in $A$, it is removed from $\alpha_n^j$ and added to $D^Q$. The remaining triples in $\alpha_n^j$ are then sent to Algorithm 2 to find the minimal subgraph of $A^+$ that can derive $\alpha_n^j$ (line 9). This process continues (line 4) until all triples in the subgraph are sourced from $A$.
Algorithm 1: Construct-Derivation-Base($A$, $R$, $A^+$, $\{\alpha_n^1, \ldots, \alpha_n^M\}$)
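Since the pseudocode of Algorithm 1 appears only as a figure in the original layout, the following Python sketch reconstructs its logic from the description above; the data structures and the find_minimal_subgraph helper (given after Algorithm 2 below) are our own naming, not the exact implementation.
```python
def construct_derivation_base(base_facts, rules, derived_facts, answers):
    """Sketch of Algorithm 1: for each query answer, trace its derivation back to
    the base facts and collect the base facts used (the derivation base D_Q)."""
    derivation_base = set()
    for answer in answers:                 # one pass per query answer alpha_n^j
        frontier = set(answer)             # triples still to be explained
        while frontier:
            explained = {t for t in frontier if t in base_facts}
            derivation_base |= explained   # base facts used in the derivation are relevant
            frontier -= explained
            if not frontier:
                break
            # Replace the remaining derived triples with a minimal subgraph of A+
            # that entails them (Algorithm 2), and keep tracing backwards.
            frontier = find_minimal_subgraph(rules, derived_facts, frontier)
    return derivation_base
```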
The problem of finding a minimal subgraph of $A^+$, denoted as $\Lambda$, that entails $\alpha_n^j \subseteq A^+$ can be described as follows (Algorithm 2). For each triple $t \in \alpha_n^j$, start by assigning the RDF graph $A^+$ to the RDF graph $\lambda$, where $\lambda$ entails $t$. Then, iterate over the rule set $R$. For each rule $r \in R$, if there is an RDF graph $\alpha_r^l$ that entails $t$ according to $r$, then compare the sizes of the RDF graphs $\alpha_r^l$ and the current $\lambda$ and assign the smaller one to $\lambda$. Accumulate the results of iterations, $\lambda$, in $\Lambda$, and return it as the result of this algorithm.
Algorithm 2: Find-Minimal-Subgraph($R$, $A^+$, $\alpha_n^j$)
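Again, because Algorithm 2 is given as a figure, the sketch below reconstructs its logic in Python; instantiations is an assumed helper that returns, for a rule and a triple, the rule-body subgraphs of A+ whose head instantiation produces that triple.
```python
def find_minimal_subgraph(rules, derived_facts, triples):
    """Sketch of Algorithm 2: for each triple, find the smallest subgraph of A+
    that entails it according to some rule, and accumulate these subgraphs (Lambda)."""
    accumulated = set()                      # Lambda
    for t in triples:
        smallest = set(derived_facts)        # lambda starts as all of A+
        for rule in rules:
            # instantiations(rule, t, derived_facts) is an assumed helper: it yields the
            # rule-body graphs (subsets of A+) that entail t according to this rule.
            for body_graph in instantiations(rule, t, derived_facts):
                if len(body_graph) < len(smallest):
                    smallest = body_graph
        accumulated |= smallest
    return accumulated
```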

6. Evaluation

This section evaluates the relevance reasoning method using the use case and MACCDC 2012 dataset described in Section 4, focusing on three aspects:
1.
Effectiveness: When Analyst 1 provides Analyst 2 with relevant facts derived from the method for certain queries, can Analyst 2 obtain the same answers based on these relevant facts?
2.
Efficiency: Does using the relevant facts to answer queries significantly reduce the time compared to using the entire knowledge base?
3.
Results Reusability: Can an analyst extract a set of relevant facts by applying the relevance reasoning method to a limited number of queries, and subsequently use these facts to answer new queries?

6.1. Evaluation Metrics

The following notations will be used to formalize the evaluation metrics.
  • R—the relevance reasoner;
  • S—the SPARQL query engine;
  • K—the facts base;
  • $\mathcal{Q}$—set of queries;
  • $\bar{\mathcal{Q}}$—training query set;
  • $R(Q, K)$—set of relevant facts based on query $Q$ and fact base $K$;
  • $S(Q, K)$—set of answers to $Q$ returned by SPARQL based on the fact base $K$;
  • $S(Q, R(Q', K))$—set of answers to $Q$ returned by SPARQL based on the relevant facts $R(Q', K) \subseteq K$ for query $Q'$;
  • $T(R(Q, K))$—time of deriving relevant facts based on query $Q$ and fact base $K$;
  • $T(S(Q, K))$—time of answering $Q$ for fact base $K$.
First, we define the evaluation function, $\epsilon(\cdot)$, which will subsequently be used to establish the metrics for precision and recall:
$\epsilon(Q, Q') = \begin{cases} 1, & \text{if } S(Q, R(Q', K)) = S(Q, K) \\ 0, & \text{if } S(Q, R(Q', K)) \neq S(Q, K) \end{cases}$
The evaluation space involves a fact base, $K$, and a set of queries, $\mathcal{Q}$, for this fact base. Additionally, for each $Q \in \mathcal{Q}$ and $K$, there corresponds a set of query answers $S(Q, K)$ returned by the SPARQL query engine $S$ based on the fact base $K$. The query answers $S(Q, K)$ are considered the ground truth. In this equation, $Q$ denotes the query for which ground truth is evaluated, while $Q'$ is the query used by $R$ to extract the relevant facts.
In the evaluation process, the answers to queries $Q$ derived by $S(Q, R(Q', K))$ are compared against the ground truth using the evaluation function in Equation (11). First, we demonstrated that when the relevant facts derived by $R(Q, K)$ are used to derive answers for $Q$ by $S$, the value of $\epsilon(Q, Q)$ is 1, even though $R(Q, K)$ is significantly smaller in size than $K$. We are not representing this fact graphically here.
Then, we assessed the usability of our relevance reasoning algorithm, $R$, in scenarios where the relevant facts were derived from a set of training queries, $\bar{\mathcal{Q}}$, and then applied to other queries $Q$ that were not part of that set. For this purpose, we utilized the metrics of precision and recall, which are defined in terms of true positives, false positives, true negatives, and false negatives. These four sets are determined based on two assessments: (1) answers to $Q$ derived from relevant facts specific to $Q$, and (2) answers to $Q$ derived from relevant facts specific to $Q'$. These assessments were conducted using the evaluation function defined in Equation (11). True positives (tp) are those query results for which $\epsilon(Q, Q) = 1$ and $\epsilon(Q, Q') = 1$. False positives (fp) are those for which $\epsilon(Q, Q) = 1$ and $\epsilon(Q, Q') = 0$. True negatives (tn) are those for which $\epsilon(Q, Q) = 0$ and $\epsilon(Q, Q') = 0$. False negatives (fn) are those for which $\epsilon(Q, Q) = 0$ and $\epsilon(Q, Q') = 1$. Then, precision and recall are defined as follows:
$precision = |tp| / (|tp| + |fp|)$
$recall = |tp| / (|tp| + |fn|)$
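A minimal sketch of how the evaluation function and the precision/recall computation in Equations (11)-(13) could be implemented is shown below; answers_with and relevant_facts stand for the SPARQL engine S and the relevance reasoner R, respectively, and are assumed to be provided (relevant_facts is assumed to accept either a single query or a set of training queries).
```python
def epsilon(q_eval, q_relevance, kb, answers_with, relevant_facts):
    """Evaluation function (Equation (11)): 1 if answering q_eval over the facts
    deemed relevant for q_relevance gives the same answers as over the full fact base."""
    ground_truth = answers_with(q_eval, kb)
    restricted = answers_with(q_eval, relevant_facts(q_relevance, kb))
    return 1 if restricted == ground_truth else 0

def precision_recall(queries, kb, answers_with, relevant_facts, training_queries):
    """Precision and recall (Equations (12) and (13)) when facts relevant to the
    training queries are reused to answer the queries in `queries`."""
    tp = fp = fn = 0
    for q in queries:
        own = epsilon(q, q, kb, answers_with, relevant_facts)                    # eps(Q, Q)
        reused = epsilon(q, training_queries, kb, answers_with, relevant_facts)  # eps(Q, Q')
        if own == 1 and reused == 1:
            tp += 1
        elif own == 1 and reused == 0:
            fp += 1
        elif own == 0 and reused == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```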

6.2. Effectiveness and Efficiency

In the first experiment, a subset of data was selected from the MACCDC 2012 dataset and annotated as instances of the STIX ontology, resulting in K. K contained 3,000,000 triples, covering more than 50% of the data in the MACCDC 2012 dataset. Additionally, 5000 queries, Q , were generated by the query generator. For each Q Q , the reasoner R derived the relevant facts R ( Q , K ) . Then, the SPARQL engine S was run on these relevant facts to obtain the answers S ( Q , R ( Q , K ) ) .
To evaluate the effectiveness of the relevance reasoning method, the value of ϵ ( Q , Q ) was calculated for each Q Q . The results showed that ϵ ( Q , Q ) was 1 for every query, indicating that the query answers obtained using only the relevant information were identical to those derived using the entire knowledge base. This experiment demonstrates the high effectiveness of the method. In other words, when Analyst 1 (in the scenario shown in Figure 3) sends the relevant facts derived by the method for some queries to Analyst 2, we can guarantee that Analyst 2 will get the same answers based on these relevant facts.
To evaluate the efficiency of the relevance reasoning algorithm, the time T(S(Q, R(Q, K))) was measured for each query. T(S(Q, K)) was considered as the baseline. The ratio, as defined in Equation (14), was then calculated for each Q ∈ 𝒬:
$$\frac{T(S(Q, R(Q, K)))}{T(S(Q, K))} \cdot 100\% \tag{14}$$
These ratio values are plotted in the histogram shown in Figure 7. The x-axis represents thirteen value ranges of the ratio, and the y-axis represents the number of queries falling within each range. For most queries, the ratio lies between 0.4% and 0.5%. The histogram thus shows that answering queries using only the relevant facts derived by the relevance reasoning algorithm takes roughly 1/200 of the time needed to answer the same queries over the entire knowledge base K. In other words, it is significantly more efficient for Analyst 2 to derive the answers to the queries using only the relevant facts rather than the entire knowledge base.
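As an illustration of how such ratios can be measured, the following sketch times the two query runs for a single query using Python's standard time.perf_counter; answer_query and relevant_facts are the same hypothetical placeholders as before and are not part of the authors' implementation.

```python
import time

def timing_ratio(q, K, answer_query, relevant_facts):
    """Return the percentage ratio of Equation (14):
    T(S(Q, R(Q, K))) / T(S(Q, K)) * 100%."""
    rel = relevant_facts(q, K)        # R(Q, K); its cost is not part of Eq. (14)

    start = time.perf_counter()
    answer_query(q, rel)              # timed: S(Q, R(Q, K))
    t_relevant = time.perf_counter() - start

    start = time.perf_counter()
    answer_query(q, K)                # timed baseline: S(Q, K)
    t_full = time.perf_counter() - start

    return 100.0 * t_relevant / t_full
```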

6.3. Results Reusability

In the previous experiment, only the query-answering time, T(S(Q, K)), was considered for efficiency evaluation. However, the time taken by R, T(R(Q, K)), is significant and cannot be ignored. It may be more efficient to run R on a set of queries Q̄ (a training set), accumulate the relevant facts, and then use these facts to answer all queries. This approach can be termed predictive situation assessment. It aligns with the projection phase of situation awareness in Endsley's model [16], the phase decision-makers often rely on.

6.3.1. Precision and Recall

In this experiment, in addition to the queries 𝒬 and the knowledge base K, 500 additional queries, Q̄, were generated as training queries. These training queries were based on a subset of K different from the one used for 𝒬. The relevance reasoner R was run on K for subsets Q̄′ ⊆ Q̄ with an increasing number of queries. The relevant facts returned, R(Q̄′, K), were then used by the SPARQL engine S to answer each Q ∈ 𝒬. The answers, S(Q, R(Q̄′, K)), were evaluated using the precision and recall metrics, calculated according to Equations (11)–(13).
To demonstrate how the size of the training subset Q̄′ affects the precision and recall for answering 𝒬, the experiment was run 500 times, with the size of Q̄′ increasing from 1 to 500 across runs. Figure 8 shows the results, where the x-axis represents the number of queries in Q̄′. The green line indicates precision, and the orange line indicates recall, both varying with the number of training queries. The results reveal that with more than 220 training queries, the precision for answering 𝒬 is close to 100%, and with more than 300 training queries, the recall is close to 100%. This experiment demonstrates that it is feasible to run our method on a limited number of queries to obtain a set of relevant facts and subsequently answer other, new queries effectively. This conclusion applies specifically to cases where there is a high degree of regularity in the dataset, as was the case in our experiment.
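The sweep just described can be outlined as follows. This is a schematic sketch of the experimental loop, not the authors' code: relevant facts are modeled as Python sets of triples, and answer_query and relevant_facts remain hypothetical placeholders.

```python
def training_sweep(training_queries, eval_queries, K,
                   answer_query, relevant_facts):
    """For n = 1 .. |training_queries|, accumulate the relevant facts of the
    first n training queries and record precision/recall over the evaluation
    queries (cf. Figure 8)."""
    ground_truth = {q: answer_query(q, K) for q in eval_queries}
    # epsilon(Q, Q): does Q's own relevant-fact set reproduce the ground truth?
    own_ok = {q: answer_query(q, relevant_facts(q, K)) == ground_truth[q]
              for q in eval_queries}

    results, accumulated = [], set()          # accumulated plays the role of R(Q-bar', K)
    for n, tq in enumerate(training_queries, start=1):
        accumulated |= relevant_facts(tq, K)
        tp = fp = fn = 0
        for q in eval_queries:
            reused_ok = answer_query(q, accumulated) == ground_truth[q]
            if reused_ok and own_ok[q]:
                tp += 1
            elif reused_ok and not own_ok[q]:
                fp += 1
            elif not reused_ok and own_ok[q]:
                fn += 1
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        results.append((n, precision, recall))
    return results
```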

6.3.2. Trade-Off between Accuracy and Efficiency

Reusing relevant facts has both advantages and disadvantages. On one hand, increasing the number of training queries in Q̄′ enhances accuracy. On the other hand, it also increases the time required by the relevance reasoner, thereby reducing efficiency. Taking the use of the entire fact base K to answer 𝒬 as the baseline, we can demonstrate the trade-off between accuracy and efficiency in query answering.
In this experiment, the relevance reasoner R was run on six different training subsets Q̄′, containing 50, 100, 150, 200, 250, and 300 training queries, respectively, resulting in six different sets of relevant facts. For each set R(Q̄′, K), the SPARQL engine S was used to answer 𝒬. The total time for both deriving the relevant facts and answering 𝒬 is defined in Equation (15):
$$T(R(\bar{Q}', K)) + T(S(\mathcal{Q}, R(\bar{Q}', K))) \tag{15}$$
The total time, together with the precision and recall metrics, is presented in Figure 9. The left y-axis indicates the total time used in each case, while the right y-axis represents the precision and recall. As shown in the figure, using the relevant facts derived from a set of 250 training queries achieved precision and recall close to 100%, while requiring less time.
In practical applications, security analysts can select or adjust the size of the training query set based on the analysis above. For instance, in scenarios where real-time processing is crucial, the analyst might opt to reduce the size of the training query set to lower processing latency, even if it results in reduced accuracy.
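A schematic sketch of this trade-off measurement is given below. It evaluates Equation (15) for a few training-set sizes, again using the placeholder answer_query and relevant_facts functions rather than the authors' implementation; the sizes shown are those used in the experiment.

```python
import time

def accuracy_efficiency_tradeoff(training_queries, eval_queries, K,
                                 sizes, answer_query, relevant_facts):
    """For each training-set size, measure the total time of Equation (15),
    T(R(Q-bar', K)) + T(S(Q, R(Q-bar', K))), as plotted in Figure 9."""
    rows = []
    for n in sizes:                               # e.g., (50, 100, 150, 200, 250, 300)
        subset = training_queries[:n]             # Q-bar' of size n

        start = time.perf_counter()
        facts = set()
        for tq in subset:                         # T(R(Q-bar', K))
            facts |= relevant_facts(tq, K)
        t_reasoning = time.perf_counter() - start

        start = time.perf_counter()
        for q in eval_queries:                    # T(S(Q, R(Q-bar', K)))
            answer_query(q, facts)
        t_answering = time.perf_counter() - start

        rows.append((n, t_reasoning + t_answering))
    return rows
```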

6.3.3. Discussion

In this section, we highlight several aspects of our method’s implementation and identify issues that could be explored in future research.
  • All experiments were conducted on a MacBook Air 2020 with the following specifications: (1) Processor: Apple M1 chip with an 8-core CPU (4 performance cores and 4 efficiency cores), a 7-core GPU, and a 16-core Neural Engine. (2) Memory: 8 GB unified memory. (3) Storage: 256 GB SSD. The implementation code can be accessed at: https://github.com/shanlu/Dissertation-Evaluation (accessed on 9 September 2024).
  • The relevance reasoning algorithms are implemented in Java using the OWL API. Retrieving and writing RDF triples from or to ontologies in .rdf or .owl file formats through the OWL API can be quite time-consuming. Future improvements could include using more efficient data structures to enhance algorithm performance; for instance, employing a triplestore could be a better option for representing domain knowledge (see the sketch after this list).
  • As discussed in Section 3.3.4, our formal definition of relevance does not ensure that solutions are unique; multiple sets of facts with the same cardinality can be sufficient to answer a query. Further research is needed to explore this issue, such as analyzing theoretical limits on the number of relevant information sets of the same size, evaluating trade-offs between computing complete sets of relevant information and their associated computational costs, or considering additional constraints on the criteria for relevance.
  • Ontology reasoning methods can benefit from collaboration with machine learning techniques. Formal reasoners often face scalability challenges and may struggle with large-scale, incomplete, conflicting, or uncertain information. In contrast, machine learning techniques are typically more scalable and resilient to data disturbances. Integrating machine learning with our relevance reasoning approach could enhance performance in managing incomplete, conflicting, or uncertain information and address scalability issues.
  • The complexity of our solution is influenced by several factors: the complexity of executing a query, the complexity of Algorithms 1 and 2, and the complexity of the reasoning problem handled by the OWL reasoner. The primary contributor to overall complexity is OWL 2 RL/RDF inference, which is PTIME-complete and thus classified as tractable.
  • In information collection, a phenomenon known as information scattering [61] occurs, where a few sources contain a significant amount of relevant information on a topic, while most sources have only a small amount. Bradford’s law [62] mathematically describes this pattern. Our experiments have shown that this phenomenon also applies to information relevance reasoning. Each R(Q, K) for a specific query Q can be considered an information source. As illustrated in Figure 8, the first fifty information sources account for over 50% of the relevant information needed to answer all queries in the domain, and the first hundred sources account for over 75%. This suggests that running the relevance reasoner on a set of training queries and then reusing the results for new queries could be particularly beneficial for security analysts.
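To illustrate the triplestore option mentioned in the second bullet above, the sketch below loads RDF triples into an in-memory rdflib graph and runs a SPARQL query against it. rdflib is only one possible choice (a persistent store such as Apache Jena TDB would serve the same purpose), and the file name, namespace, and query shown are hypothetical rather than taken from the STIX ontology used in the experiments.

```python
from rdflib import Graph

# Load annotated network events (hypothetical file name) into an in-memory store.
g = Graph()
g.parse("maccdc_annotated.rdf")  # rdflib infers RDF/XML from the file extension

# A hypothetical situation-dependent query: which source addresses are related
# to an indicator observed in the monitored network? (names are illustrative)
QUERY = """
PREFIX ex: <http://example.org/cyber#>
SELECT ?src ?indicator
WHERE {
  ?event ex:hasSourceAddress ?src .
  ?event ex:relatedIndicator ?indicator .
}
"""

for row in g.query(QUERY):
    print(row.src, row.indicator)
```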

7. Conclusions

In this paper, we introduced a conceptual framework for information relevance, formalized through Situation Theory as developed by Barwise, Perry, and Devlin. Building on this framework, we created an inference-based method for information relevance reasoning that automatically identifies pertinent information to characterize and communicate the situations faced by decision-makers. This method aids decision-makers by focusing solely on relevant information, thereby reducing both the cognitive load on individuals interpreting situations and the demands on communication channels transferring situation descriptions.
The primary innovation of the approach presented in this paper lies in the ontological interpretation of Situation Theory (ST). This involves mapping concepts such as situations, infons, relations, the supports relation between situations and infons, situation types, and abstract situations to a Situation Theory ontology. We then formalize this framework using an inferential structure based on OWL and rule-based inference. Practically, this approach enables the formal, logical derivation of the facts (abstract situations) sufficient for inferring the satisfaction of a goal (formalized as a SPARQL query), which provides the “boundary” of a situation.
We evaluated our method using a cybersecurity domain dataset. The results demonstrated that answering queries with the relevant information provided by our method yields the same results as using the entire knowledge base, but with significantly reduced processing time. Additionally, we calculated the precision and recall for answering new queries based on relevant facts derived from a limited number of training queries. Our investigation into the trade-off between accuracy and efficiency revealed that reusing the relevant information derived from a limited training set is an efficient way to answer new queries.
The cybersecurity scenario examined in this paper serves as a specific use case. We believe that the approach to identifying relevant information outlined here is broadly applicable to various decision-making scenarios where situation-specific information needs to be exchanged.
The work presented in this paper has several avenues for future extension:
1. Reducing Redundancy: While our method ensures that only facts necessary to answer specific situational queries are included, some of these facts may still be redundant. Further research is needed to refine the size of the relevant information and minimize redundancy.
2. Exploring Additional Scenarios: Extending this research to other scenarios and datasets within the cybersecurity domain could provide additional insights and validate the approach in varied contexts.
3. Applying to Other Use Cases: Investigating the application of our relevance reasoning method to different use cases and domains, such as air traffic control or military command and control, would be valuable. These scenarios would require the development of suitable domain ontologies.
4. Integrating Machine Learning: Using machine learning techniques in concert with our approach could enhance the capabilities of the relevance reasoning method, particularly in handling incomplete, conflicting, or uncertain information.
5. Scalability and Performance Optimization: Optimizing the approach for better scalability and performance, especially in handling complex cybersecurity data, would improve its practical application.
6. Developing a User-Friendly Interface: Creating an intuitive interface for practical application in cybersecurity contexts would facilitate the use of this method by practitioners.

Author Contributions

Conceptualization, S.L. and M.K.; methodology, S.L. and M.K.; validation, S.L.; formal analysis, S.L.; writing—original draft preparation, S.L.; writing—review and editing, M.K.; supervision, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
JDL: Joint Directors of Laboratories
MACCDC: U.S. National CyberWatch Mid-Atlantic Collegiate Cyber Defense Competition
OWL: Web Ontology Language
RDF: Resource Description Framework
RDFS: Resource Description Framework Schema
SA: Situation Assessment
SPARQL: SPARQL Protocol and RDF Query Language
ST: Situation Theory
STIX: Structured Threat Information eXpression
STO: Situation Theory Ontology
STO-L: Situation Theory Ontology-Lite
TTP: Tactics, Techniques, and Procedures
UML: Unified Modeling Language
URI: Uniform Resource Identifier
W3C: World Wide Web Consortium
XML: Extensible Markup Language

References

  1. Barwise, J. The Situation in Logic; Center for the Study of Language (CSLI): Stanford, CA, USA, 1989; Volume 4. [Google Scholar]
  2. Barwise, J.; Seligman, J. Information Flow: The Logic of Distributed Systems; Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
  3. Barwise, J.; Perry, J. Situations and Attitudes; MIT Press: Cambridge, MA, USA, 1983. [Google Scholar]
  4. Devlin, K. Logic and Information; Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]
  5. Bawden, D.; Robinson, L. Information overload: An overview. In Oxford Encyclopedia of Political Decision Making; Oxford University Press: Oxford, UK, 2020. [Google Scholar]
  6. Rathore, F.A.; Farooq, F. Information overload and infodemic in the COVID-19 pandemic. J. Pak. Med. Assoc. 2020, 70, S162–S165. [Google Scholar] [CrossRef] [PubMed]
  7. Eppler, M.J.; Mengis, J. The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines. Inf. Soc. 2004, 30, 325–344. [Google Scholar] [CrossRef]
  8. Qiu, S.; Gadiraju, U.; Bozzon, A. Towards memorable information retrieval. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, Virtual Event, 14–17 September 2020; pp. 69–76. [Google Scholar]
  9. Zammit, O.; Smith, S.; Windridge, D.; De Raffaele, C. Reducing the dependency of having prior domain knowledge for effective online information retrieval. Expert Syst. 2023, 40, e13014. [Google Scholar] [CrossRef]
  10. Feng, S.; Meng, J.; Zhang, J. News recommendation systems in the era of information overload. J. Web Eng. 2021, 20, 459–470. [Google Scholar] [CrossRef]
  11. Bedekar, M.; Deodhar, A.; Sakhare, K. Context-Based Email Ranking System for Enterprise. In Proceedings of the 2021 IEEE 2nd International Conference on Technology, Engineering, Management for Societal Impact Using Marketing, Entrepreneurship and Talent (TEMSMET), Pune, India, 2–3 December 2021; pp. 1–5. [Google Scholar]
  12. Lee, Y.T. Information modeling: From design to implementation. In Proceedings of the Second World Manufacturing Congress; ICSC Academic Press: Millet, AB, Canada, 1999; pp. 315–321. [Google Scholar]
  13. Sievers, M. Semantics, Metamodels, and Ontologies. In Handbook of Model-Based Systems Engineering; Madni, A.M., Augustine, N., Sievers, M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 1–32. [Google Scholar] [CrossRef]
  14. Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; Melo, G.D.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S.; et al. Knowledge graphs. ACM Comput. Surv. (CSUR) 2021, 54, 1–37. [Google Scholar] [CrossRef]
  15. Devlin, K. Situation Theory and Situation Semantics. Handb. Hist. Log. 2006, 7, 601–664. [Google Scholar]
  16. Endsley, M.R. Toward a Theory of Situation Awareness in Dynamic Systems. Hum. Factors 1995, 37, 32–64. [Google Scholar] [CrossRef]
  17. Lu, S.; Kokar, M.M. A method to identify relevant information sufficient to answer situation dependent queries. In Proceedings of the 2018 IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA), Boston, MA, USA, 11–14 June 2018; pp. 22–28. [Google Scholar]
  18. Lu, S.; Kokar, M.M. Degrees of Information Relevance in Situation Assessment. In Proceedings of the 2020 IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA), Victoria, BC, Canada, 24–29 August 2020; pp. 180–187. [Google Scholar]
  19. Lu, S. A Method for Identifying Relevant Information Sufficient to Answer Situation Dependent Queries. Ph.D. Thesis, Northeastern University, Boston, MA, USA, 2022. Available online: https://repository.library.northeastern.edu/files/neu:4f186p89t/fulltext.pdf (accessed on 9 September 2024).
  20. Saracevic, T. Relevance: A Review of and a framework for the thinking on the notion in information science. J. Am. Soc. Inf. Sci. 1975, 26, 321–343. [Google Scholar] [CrossRef]
  21. Saracevic, T. Relevance: A Review of and a framework for the thinking on the notion in information science. Part II: Nature and manifestations of relevance. J. Am. Soc. Inf. Sci. 2007, 58, 1915–1933. [Google Scholar] [CrossRef]
  22. Saracevic, T. Relevance: A Review of and a framework for the thinking on the notion in information science. Part III: Behavior and effects of relevance. J. Am. Soc. Inf. Sci. 2007, 58, 2126–2144. [Google Scholar] [CrossRef]
  23. Saracevic, T. The Notion of Relevance in Information Science: Everybody Knows What Relevance Is. But, What Is It Really? Springer: Cham, Switzerland, 2016; Volume 8, 109p. [CrossRef]
  24. Mizzaro, S. Relevance: The whole history. J. Assoc. Inf. Sci. Technol. 1997, 48, 810–832. [Google Scholar] [CrossRef]
  25. Manning, C.D. Introduction to Information Retrieval; Syngress Publishing: Rockland, MA, USA, 2008. [Google Scholar]
  26. Huibers, T.W.; Lalmas, M.; Van Rijsbergen, C.J. Information retrieval and situation theory. In Proceedings of the ACM SIGIR Forum; Association for Computing Machinery: New York, NY, USA, 1996; Volume 30, pp. 11–25. [Google Scholar]
  27. da Costa Pereira, C.; Cholvy, L. Usefulness of Information for Achieving Goals with Disjunctive Premises. Knowl. Eng. Rev. 2024, 39, e3. [Google Scholar] [CrossRef]
  28. Levy, A.Y.; Fikes, R.E.; Sagiv, Y. Speeding up inferences using relevance reasoning: A formalism and algorithms. Artif. Intell. 1997, 97, 83–136. [Google Scholar] [CrossRef]
  29. McCarthy, J. Generality in artificial intelligence. Commun. ACM 1987, 30, 1030–1035. [Google Scholar] [CrossRef]
  30. Guha, R.V. Contexts: A Formalization and Some Applications; Stanford University: Stanford, CA, USA, 1992. [Google Scholar]
  31. Buvac, S.; Mason, I.A. Propositional logic of context. In Proceedings of the Association for the Advancement of Artificial Intelligence; AAAI Press: Washington, DC, USA, 1993; pp. 412–419. [Google Scholar]
  32. Akman, V.; Surav, M. The Use of Situation Theory in Context Modeling. Comput. Intell. 1997, 13, 427–438. [Google Scholar] [CrossRef]
  33. Tin, E.; Akman, V. Situated nonmonotonic temporal reasoning with BABY-SIT. AI Commun. 1997, 10, 93–109. [Google Scholar]
  34. OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax. Available online: https://www.w3.org/TR/owl2-syntax/ (accessed on 9 September 2024).
  35. Davidson, D. Essays on Actions and Events; Clarendon Press: Oxford, UK, 1980. [Google Scholar]
  36. Zucchi, S. Events and Situations. Annu. Rev. Linguist. 2015, 1, 85–106. [Google Scholar] [CrossRef]
  37. Dapoigny, R.; Barlatier, P. Formal foundations for situation awareness based on dependent type theory. Inf. Fusion 2013, 14, 87–107. [Google Scholar] [CrossRef]
  38. Martin-Löf, P.; Sambin, G. Intuitionistic Type Theory; Bibliopolis: Naples, Italy, 1984; Volume 9. [Google Scholar]
  39. Stocker, M.; Rönkkö, M.; Kolehmainen, M. Situational Knowledge Representation for Traffic Observed by a Pavement Vibration Sensor Network. IEEE Trans. Intell. Transp. Syst. 2014, 15, 1441–1450. [Google Scholar] [CrossRef]
  40. Kokar, M.M.; Matheus, C.J.; Baclawski, K. Ontology-based situation awareness. Inf. Fusion 2009, 10, 83–98. [Google Scholar] [CrossRef]
  41. Gangemi, A.; Mika, P. Understanding the semantic web through descriptions and situations. In Proceedings of the OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”; Springer: Berlin/Heidelberg, Germany, 2003; pp. 689–706. [Google Scholar]
  42. Steinberg, A.N.; Bowman, C.L. Revisions to the JDL data fusion model. In Handbook of Multisensor Data Fusion; CRC Press: Boca Raton, FL, USA, 2017; pp. 65–88. [Google Scholar]
  43. Blasch, E.; Plano, S. JDL Level 5 fusion model: User refinement issues and applications in group tracking. In Proceedings of the SPIE 4729, Signal Processing, Sensor Fusion, and Target Recognition XI, Orlando, FL, USA, 31 July 2002; Volume 4729, pp. 270–279. [Google Scholar]
  44. Sanneman, L.; Shah, J.A. The situation awareness framework for explainable AI (SAFE-AI) and human factors considerations for XAI systems. Int. J. Hum.-Comput. Interact. 2022, 38, 1772–1788. [Google Scholar] [CrossRef]
  45. Blasch, E.; Sullivan, N.; Chen, G.; Chen, Y.; Shen, D.; Yu, W.; Chen, H.M. Data fusion information group (DFIG) model meets AI+ML. In Proceedings of the Signal Processing, Sensor/Information Fusion, and Target Recognition XXXI; SPIE: Bellingham, WA, USA, 2022; Volume 12122, pp. 162–171. [Google Scholar]
  46. Gao, H.; Zhu, J.; Zhang, T.; Xie, G.; Kan, Z.; Hao, Z.; Liu, K. Situational assessment for intelligent vehicles based on Stochastic model and Gaussian distributions in typical traffic scenarios. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1426–1436. [Google Scholar] [CrossRef]
  47. Kokar, M.M.; Moskal, J.J. Interoperability of Information Fusion Systems. ISIF Perspect. Inf. Fusion 2023, 6, 53–54. [Google Scholar]
  48. SPARQL Query Language for RDF. Available online: https://www.w3.org/TR/rdf-sparql-query/ (accessed on 9 September 2024).
  49. Defining N-Ary Relations on the Semantic Web. 2006. Available online: https://www.w3.org/TR/2006/NOTE-swbp-n-aryRelations-20060412/ (accessed on 29 June 2024).
  50. Giunti, M.; Sergioli, G.; Vivanet, G.; Pinna, S. Representing n-ary relations in the Semantic Web. Log. J. IGPL 2021, 29, 697–717. [Google Scholar] [CrossRef]
  51. MACCDC 2012 Dataset. Available online: https://www.netresec.com/index.ashx?page=MACCDC (accessed on 9 September 2024).
  52. Snort. Available online: https://www.snort.org/ (accessed on 9 September 2024).
  53. Security Repo—Security Data Sample Repository. Available online: http://www.secrepo.com/ (accessed on 9 September 2024).
  54. Saad, S.; Traore, I. Extracting attack scenarios using intrusion semantics. In Foundations and Practice of Security; Springer: Berlin/Heidelberg, Germany, 2012; pp. 278–292. [Google Scholar]
  55. Ulicny, B.E.; Moskal, J.J.; Kokar, M.M.; Abe, K.; Smith, J.K. Inference and Ontologies. In Cyber Defense and Situational Awareness; Springer: Berlin/Heidelberg, Germany, 2014; pp. 167–199. [Google Scholar]
  56. STIX. Available online: https://stix.mitre.org/ (accessed on 9 September 2024).
  57. Unger, C.; Freitas, A.; Cimiano, P. An introduction to question answering over linked data. In Proceedings of the Reasoning Web International Summer School; Springer: Berlin/Heidelberg, Germany, 2014; pp. 100–140. [Google Scholar]
  58. Ter Horst, H.J. Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. J. Web Semant. 2005, 3, 79–115. [Google Scholar] [CrossRef]
  59. OWL 2 Web Ontology Language Profiles (Second Edition). Available online: https://www.w3.org/TR/owl2-profiles/ (accessed on 9 September 2024).
  60. Matheus, C.J.; Baclawski, K.; Kokar, M.M. Basevisor: A triples-based inference engine outfitted to process RuleML and R-entailment rules. In Proceedings of the International Conference on Rules and Rule Markup Languages for the Semantic Web, Athens, GA, USA, 10–11 November 2006; pp. 67–74. [Google Scholar]
  61. Bhavnani, S.K.; Wilson, C.S. Information scattering. In Encyclopedia of Library and Information Sciences; CRC Press: Boca Raton, FL, USA, 2009; pp. 2564–2569. [Google Scholar]
  62. Morse, P.M.; Leimkuhler, F.F. Exact solution for the Bradford distribution and its use in modeling informational data. Oper. Res. 1979, 27, 187–198. [Google Scholar] [CrossRef]
Figure 1. Information flow in situation assessment.
Figure 2. Top-level classes of STO-L.
Figure 3. Situational information sharing case.
Figure 4. Use case diagram of the relevance reasoning process.
Figure 5. Relevance reasoning process overview.
Figure 6. STIX ontology.
Figure 7. Distribution of query-answering time.
Figure 8. Precision and recall for Q based on training queries.
Figure 9. Trade-off between accuracy and efficiency.
Table 1. Saracevic’s pattern for relevance definitions [20].
Relevance is the A of a B existing between a C and a D as determined by an E.
A | B | C | D | E
measure | correspondence | document | query | person
degree | utility | article | request | judge
dimension | connection | textual form | information used | user
estimate | satisfaction | reference | point of view | requestor
appraisal | fit | information provided | information requirement statement | information specialist
relation | bearing | fact | |
Table 2. OWL 2 RL/RDF entailment rule example [59].
Axiom | If (Body): ρ_l | Then (Head): ρ_r
rdfs:subClassOf Axiom | <?c1, rdfs:subClassOf, ?c2>, <?x, rdf:type, ?c1> | <?x, rdf:type, ?c2>
owl:inverseOf Axiom | <?p1, owl:inverseOf, ?p2>, <?x, ?p1, ?y> | <?y, ?p2, ?x>
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

