*2.2. Probabilistic DeLP: DeLP3E Framework*

We now provide a brief introduction to DeLP3E; for full details, we refer the reader to [18]. A DeLP3E KB *P* = (*AM*, *EM*, *af*) consists of three parts that correspond to *two separate models of the world*, and a function linking the two; these components are illustrated in Figure 1.

**Figure 1.** Overview of the DeLP3E framework.

The *environmental model* (EM) is used to describe background knowledge that is probabilistic in nature, while the *analytical model* (AM) is used to analyze competing hypotheses that can account for a given phenomenon. The EM *must be consistent*, while the AM allows for contradictory information as the system must have the capability to reason about competing explanations for a given event. In general, the EM contains knowledge such as evidence, intelligence reporting, or uncertain knowledge about actors, software, and systems, while the AM contains elements that the analyst can leverage on the basis of information in the EM. AMs correspond to DeLP programs, while EMs in this paper are abstracted away, assuming that the well-known Bayesian network model is used.

Finally, the third component is the *annotation function*, which links components in the AM with conditions over the EM (the conditions under which statements in the AM can potentially be true). We use *G*<sub>*EM*</sub> to denote the set of all ground atoms for the EM; here, we concentrate on subsets of ground atoms from *G*<sub>*EM*</sub>, called *worlds*. Atoms that belong to the set are *true* in the world, while those that do not are *false*; therefore, there are 2<sup>|*G*<sub>*EM*</sub>|</sup> possible worlds in the EM. The set of all worlds is denoted with W<sub>*EM*</sub>. Logical formulas arise from the combination of atoms using the traditional connectives (∧, ∨, and ¬); we use *form*<sub>*EM*</sub> to denote the set of all possible (ground) formulas in the EM. Annotation functions then assign formulas in *form*<sub>*EM*</sub> to components in the AM to indicate the conditions (probabilistic events) under which they hold. In this way, each world *λ* ∈ W<sub>*EM*</sub> *induces* a subset of the AM, composed of all elements whose annotations are satisfied by *λ*; for DeLP3E program *P*, we denote the subset of the AM induced by *λ* with *P*<sub>*AM*</sub>(*λ*) (cf. Figure 1). Exact probabilistic query answering is carried out via Algorithm 1.


Since the number of worlds in W<sub>*EM*</sub> is exponential in the number of EM random variables, this procedure quickly becomes intractable. However, a *sound approximation* of the exact interval can be obtained by simply selecting a subset of W<sub>*EM*</sub> and executing the same procedure. We refer to this algorithm as approximate query answering via *world sampling*. It is easy to see that this approximation scheme is sound: both bounds are built from sums of world probabilities, so restricting the sums to a subset of worlds can only lower the lower bound and raise the upper bound. The interval [ℓ′, *u*′] computed from the sample therefore always satisfies [ℓ, *u*] ⊆ [ℓ′, *u*′], where [ℓ, *u*] is the exact interval. Section 5 is dedicated to studying the effectiveness and efficiency of this approach.
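The world-by-world procedure and its sampled variant can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the names `prob_interval`, `induce`, and `naive_warranted` are hypothetical, the EM is replaced by an explicit table of world probabilities instead of a Bayesian network, the toy KB (atoms `x`, `y`, literal `q`) is invented for the example, and the warrant check is a deliberately naive stand-in for DeLP's dialectical analysis.

```python
import itertools
import random
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

World = FrozenSet[str]  # the EM atoms that are true in a world


def complement(literal: str) -> str:
    """Complement of a literal, writing strong negation as a '~' prefix."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def naive_warranted(subprogram: FrozenSet[str], literal: str) -> bool:
    """Stand-in warrant check (assumption): the literal is present and its
    complement is absent. Real DeLP decides warrant via dialectical trees."""
    return literal in subprogram and complement(literal) not in subprogram


def prob_interval(
    worlds: Iterable[World],
    pr: Dict[World, float],
    induce: Callable[[World], FrozenSet[str]],
    query: str,
) -> Tuple[float, float]:
    """Accumulate the mass of worlds warranting the query (lower bound) and
    of worlds warranting its complement (subtracted from 1 for the upper)."""
    lower, against = 0.0, 0.0
    for world in worlds:
        subprogram = induce(world)
        if naive_warranted(subprogram, query):
            lower += pr[world]
        elif naive_warranted(subprogram, complement(query)):
            against += pr[world]
    return lower, 1.0 - against


# Hypothetical toy KB: two EM atoms and a uniform distribution over 4 worlds.
ATOMS = ("x", "y")
ALL_WORLDS = [
    frozenset(atom for atom, bit in zip(ATOMS, bits) if bit)
    for bits in itertools.product((True, False), repeat=len(ATOMS))
]
PR = {world: 0.25 for world in ALL_WORLDS}
# Annotation function: each AM fact holds in the worlds satisfying its condition.
AF = {"q": lambda w: "x" in w, "~q": lambda w: "y" in w}


def induce(world: World) -> FrozenSet[str]:
    """Subset of the AM whose annotations are satisfied by the world."""
    return frozenset(lit for lit, cond in AF.items() if cond(world))


exact = prob_interval(ALL_WORLDS, PR, induce, "q")
random.seed(0)
# World sampling: run the same procedure on a subset, drawn without replacement.
sampled = prob_interval(random.sample(ALL_WORLDS, 2), PR, induce, "q")
print("exact:", exact, "sampled:", sampled)
```

Drawing the subset without replacement matters in this sketch: counting a world twice would overstate its probability mass and could break the containment of the exact interval in the sampled one.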

#### A Simple Illustrative Example

In order to clearly illustrate the model and query-answering procedure in DeLP3E, we present the following simple example of knowledge base *P* = (*AM*, *EM*, *af*):

*Analytical Model*

- *θ*<sub>1</sub> : *L*<sub>1</sub>
- *θ*<sub>2</sub> : *L*<sub>2</sub>
- *θ*<sub>3</sub> : ∼*L*<sub>1</sub>

*Annotation Function*

- *af*(*θ*<sub>1</sub>) : *a* ∧ ¬*b*
- *af*(*θ*<sub>2</sub>) : *b*
- *af*(*θ*<sub>3</sub>) : *b*

*Environmental Model*


We have an AM consisting of three literals, an EM consisting of two variables, and an annotation function that relates these two models; suppose we query for the literal *L*<sub>1</sub>. To compute the exact probability interval, we go world by world as described above, generating the corresponding subprogram and querying each one of them for the status of the query. Lastly, in order to arrive at the probability interval with which *L*<sub>1</sub> is warranted in *P*, we keep track of the probability of the worlds where the query is warranted (for the lower limit of the interval) and the probability of the worlds where the *complement* of the query is warranted (which is subtracted from one to obtain the upper limit). In our example, the result for query *L*<sub>1</sub> is [0.20, 0.70]; the details of this calculation are as follows:

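- *P*<sub>*AM*</sub>(*λ*<sub>1</sub>) = {*L*<sub>2</sub>, ∼*L*<sub>1</sub>}
- *P*<sub>*AM*</sub>(*λ*<sub>2</sub>) = {*L*<sub>1</sub>}
- *P*<sub>*AM*</sub>(*λ*<sub>3</sub>) = {*L*<sub>2</sub>, ∼*L*<sub>1</sub>}
- *P*<sub>*AM*</sub>(*λ*<sub>4</sub>) = ∅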

Query *L*<sub>1</sub> is thus clearly warranted only in world *λ*<sub>2</sub>, while its complement (∼*L*<sub>1</sub>) is warranted in *λ*<sub>1</sub> and *λ*<sub>3</sub>.
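The induced subprograms listed above can be reproduced mechanically from the annotation function. The following is a minimal sketch; the function name `induced_subprogram` and the `~` prefix for strong negation are illustrative choices. The order in which worlds are enumerated is an assumption, chosen so that it lines up with the four subprograms in the listing; the text does not state which truth assignment each world index corresponds to.

```python
from itertools import product

# Annotation function of the example: AM element -> condition over EM atoms a, b.
AF = {
    "L1": lambda a, b: a and not b,  # af(theta_1) = a AND NOT b
    "L2": lambda a, b: b,            # af(theta_2) = b
    "~L1": lambda a, b: b,           # af(theta_3) = b
}


def induced_subprogram(a: bool, b: bool) -> set:
    """AM elements whose annotation is satisfied by the world (a, b)."""
    return {literal for literal, cond in AF.items() if cond(a, b)}


# Enumerate the 2^2 worlds and show the subset of the AM each one induces.
for a, b in product((True, False), repeat=2):
    print(f"a={a}, b={b}: {sorted(induced_subprogram(a, b))}")
```

Under this enumeration the worlds where *b* holds induce {*L*<sub>2</sub>, ∼*L*<sub>1</sub>}, the world where only *a* holds induces {*L*<sub>1</sub>}, and the world where neither holds induces the empty subprogram.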

• **Probability interval calculation:**

$$\left[\ell = \mathit{Pr}(\lambda_2), \quad u = 1 - \sum_{i \in \{1,3\}} \mathit{Pr}(\lambda_i)\right]$$

• **Result:** 0.20 ≤ *Pr*(*L*1) ≤ 0.70

The resulting probability interval represents *two kinds of uncertainty*: the first, called *probabilistic* uncertainty, arises from the environmental model since we have a probability distribution over possible worlds; the second, *epistemic* uncertainty, arises from the fact that we generally have a probability interval instead of a point probability, which happens when there are worlds in which neither the query nor its complement is warranted (as is the case of world *λ*<sub>4</sub> above).
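As a consistency check that uses only the reported bounds ℓ = 0.20 and *u* = 0.70 (it is a derivation from those two numbers, not additional EM data), the width of the interval should equal the probability of the one world in which neither the query nor its complement is warranted:

$$\mathit{Pr}(\lambda_1) + \mathit{Pr}(\lambda_3) = 1 - u = 0.30, \qquad \mathit{Pr}(\lambda_4) = 1 - \mathit{Pr}(\lambda_2) - \bigl(\mathit{Pr}(\lambda_1) + \mathit{Pr}(\lambda_3)\bigr) = 1 - 0.20 - 0.30 = 0.50 = u - \ell$$

In other words, half of the probability mass in this example lies in a world that supports no conclusion about the query, and this mass is exactly the epistemic part of the uncertainty.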

Having presented the preliminary concepts, in the next section, we illustrate the application of DeLP3E in a cybersecurity domain.

#### **3. Cyberthreat Analysis with DeLP3E**

We now present a use case leveraging several datasets developed and maintained by the MITRE Corporation (a not-for-profit organization that works with governments, industry, and academia) and the National Institute of Standards and Technology (NIST). The MITRE datasets are ATT&CK (https://attack.mitre.org, accessed on 21 August 2022), CAPEC (https://capec.mitre.org, accessed on 21 August 2022), and CWE (https://cwe.mitre.org, accessed on 21 August 2022); NIST manages the National Vulnerability Database (NVD) (https://nvd.nist.gov, accessed on 21 August 2022), which includes CVE and CPE. Figure 2 shows an overview of our approach. We first describe the basic components and then show how the DeLP3E components are specified, along with two queries for addressing specific problems in the CTA domain.

The ATT&CK model is a curated knowledge base and model geared towards adversarial behavior in cybersecurity settings; it contains information on the various phases of an attack and the platforms that are most commonly targeted. The behavioral model consists of several core components:


The supporting datasets provide information on *attack patterns* (Common Attack Pattern Enumeration and Classification, CAPEC) and on software and hardware *weakness types* (Common Weakness Enumeration, CWE), complemented by the *National Vulnerability Database* (NVD). The latter is a rich repository of data; here, we distinguish two subsets, which contain data about *vulnerabilities* (Common Vulnerabilities and Exposures, CVE) and *platforms* (Common Platform Enumeration, CPE).

**Figure 2.** Designing a DeLP3E KB for cyberthreat analysis from a variety of publicly available cyber security datasets.

Figure 2 shows the information provided by each dataset, and how they are related to each other via foreign keys. For instance, attack techniques included in ATT&CK link to entries in CAPEC, which in turn link to CWE and NVD. We augmented this structure with two features towards deriving a DeLP3E KB. First, we labeled connections between datasets (and components within ATT&CK) with either "[*strict*]" or "[*defeasible*]", indicating the type of knowledge being encoded. For instance, observed examples of a weakness included in CWE are linked to CVEs included in the NVD as strict, since this is well-established knowledge. On the other hand, mitigation strategies are linked to techniques as defeasible knowledge, since the relationship between the two is tentative in nature. The second feature, which appears in the figure as a small icon depicting a pair of dice, indicates relationships that are subject to *probabilistic events*. For the purposes of this use case, we label all defeasible relations in this way.
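One way to picture this labeling is as a small record attached to each connection between datasets. The sketch below is purely illustrative: the `Link` class, its field names, and the string labels for the endpoints are hypothetical and do not reflect the schema of the actual datasets. It encodes only the two example links mentioned above and the convention that every defeasible relation is treated as subject to a probabilistic event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A labeled connection between two dataset components (hypothetical)."""
    source: str
    target: str
    kind: str  # either "strict" or "defeasible"

    @property
    def probabilistic(self) -> bool:
        # Convention of this use case: all defeasible relations are
        # subject to probabilistic events (the "dice" icon).
        return self.kind == "defeasible"


LINKS = [
    # Well-established knowledge is encoded as strict.
    Link("CWE observed example", "CVE entry (NVD)", "strict"),
    # Tentative knowledge is encoded as defeasible.
    Link("ATT&CK mitigation", "ATT&CK technique", "defeasible"),
]

# Links that would receive a probabilistic annotation in the KB.
annotated = [link for link in LINKS if link.probabilistic]
print([f"{link.source} -> {link.target}" for link in annotated])
```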

We used all this information to create the AM, EM, and annotation function, and create a DeLP3E KB; an introductory example is shown in Listing 1. On the left-hand side, we have the elements of the AM that can be used to create arguments for and against conclusions; for instance:

- A<sub>1</sub> = {*δ*<sub>3</sub>, *θ*<sub>1</sub>(*adv\_group*(*apt29*))}
- A<sub>2</sub> = {*δ*<sub>6</sub>, *δ*<sub>1</sub>(*prev\_techsub(os\_credential\_dumping)*), *φ*<sub>1</sub>(*mitigation(credential\_access\_protection)*)}.



The former indicates that *account discovery* is used as an attack technique, since the advanced persistent threat group 29 (APT29, also known as Cozy Bear) is active and uses it. The latter refers to the use of *credential access protection* as a mitigation technique to prevent the use of *OS credential dumping*. This is a clear example of an argument that involves uncertainty, since credential access protection is not a foolproof endeavor. An example of this is the well-known *Heartbleed* vulnerability (CVE-2014-0160) that affected OpenSSL implementations, leaving them open to credential dumping. For reasons of space, in this simple example, we only label AM components with probabilistic events (*e*<sub>1</sub>–*e*<sub>9</sub>; elements with no annotation are simply labeled with *true*) and do not describe how they are related in the EM. One option is to simply assume pairwise independence (as in many probabilistic database models [26]); another is to use a Bayesian network [27], as described in Section 5.
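To make the independence option concrete, the sketch below computes world probabilities as products of per-event marginals. It rests on assumptions that go beyond the text: the events are taken to be mutually independent (stronger than pairwise independence, and needed for the product formula to hold), the marginal values are invented for illustration, and only three of the nine events are included to keep the enumeration short.

```python
from itertools import product
from math import prod

# Hypothetical marginal probabilities for three events; not taken from the paper.
MARGINALS = {"e1": 0.9, "e2": 0.5, "e3": 0.2}


def world_probability(world: dict, marginals: dict) -> float:
    """Probability of a world (event -> truth value) under mutual independence:
    multiply p for each true event and (1 - p) for each false one."""
    return prod(p if world[event] else 1.0 - p for event, p in marginals.items())


events = list(MARGINALS)
worlds = [
    dict(zip(events, bits))
    for bits in product((True, False), repeat=len(events))
]

# Sanity check: the probabilities of all 2^3 worlds should sum to 1.
total = sum(world_probability(w, MARGINALS) for w in worlds)
print(len(worlds), "worlds, total probability:", round(total, 10))
```

With a Bayesian network, `world_probability` would instead multiply conditional probabilities along the network structure, which allows dependencies between events to be represented.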

**Queries.** We lastly present two queries that we revisit in the next section:

• *pos\_threat*(*T*1134, *S*0344):

What is the probability that *access token manipulation* (technique T1134) is used, leveraging the *Azorult* malware (software id S0344), to attack our systems?

• *intensify\_mit*(*M*1026):

What is the probability that *privileged account management* (mitigation strategy M1026) should be deployed? M1026 mitigates T1134.

In the next two sections, we discuss the design of a software system that implements this kind of functionality based on DeLP3E, and a preliminary evaluation of query answering in DeLP3E via sampling techniques.
