*5.2. Results*

Figure 4a shows the average running time taken *per sample* over all configurations, based on a set of 100 runs. We calculated the running time in this manner to adequately compare the times for the different EM sizes. Even though the impact of this dimension on the per-sample running time is small, it may become significant when sampling hundreds of thousands of worlds. For example, consider the difference in running time per sample between 1 billion worlds and 1 million worlds: 0.0289420 − 0.0289165 = 0.0000255 s; for a sample size of 100,000 worlds, this difference amounts to 2.55 s. In the third column, we include an estimate of the running time of the brute-force algorithm based on these values. Both running times are worst case, since optimizations are possible (for instance, our system avoids recomputing warrant statuses of induced subprograms for which these values have already been computed).
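As a sanity check, the arithmetic above can be reproduced directly. The per-sample times are the values quoted in the text, and the brute-force estimate is a back-of-the-envelope sketch that simply multiplies the per-sample time by the 2^30 worlds of the largest configuration; it is not meant to reproduce the exact figures reported in Figure 4a:

```python
# Per-sample running times quoted in the text (variable names are ours).
per_sample_1e9 = 0.0289420   # s per sampled world, 30 EM variables (~1B worlds)
per_sample_1e6 = 0.0289165   # s per sampled world, 20 EM variables (~1M worlds)

# Difference per sample, and its accumulated effect over 100,000 samples.
diff = per_sample_1e9 - per_sample_1e6
total_diff_s = diff * 100_000            # ≈ 2.55 s

# Worst-case brute force: visiting all 2**30 worlds at the per-sample rate.
brute_force_hours = per_sample_1e9 * 2**30 / 3600
```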

Figure 4b shows results concerning approximation quality; the metric was calculated with respect to the exact result for up to 20 EM variables (≈1M worlds). For the case of 30 EM variables (≈1B worlds), we approximated the metric using 250,000 worlds (which amounts to approximately 0.023% of the set of possible worlds), since the exact algorithm becomes intractable for instances of this size.
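The sampled fraction quoted above can be verified with a short computation, assuming (as in the text) that 30 binary EM variables induce 2^30 possible worlds:

```python
# Fraction of the ~2**30 possible worlds covered by the 250,000-world
# reference sample used to approximate the quality metric.
n_worlds = 2**30
coverage_pct = 250_000 / n_worlds * 100
print(f"{coverage_pct:.3f}%")  # ≈ 0.023%
```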

The following general observations arise from these results:

	- 1. For the 20 EM variable case, the difference in quality between 5000 and 10,000 samples was not statistically significant (two-tailed two-sample unequal variance Student's t-tests yielded p-values greater than 0.08 for *α* = 0.7 and greater than 0.16 for *α* = 0.9), which means that 5000 samples sufficed to obtain a good approximation.
	- 2. The proportion of repeated samples (i.e., wasted effort) was quite high for both entropy levels; for *α* = 0.7 (higher entropy), on average 52% of samples were repeated, while for *α* = 0.9 (lower entropy), an average of 87% were not unique. For the 20 EM variable case, the quality levels were achieved with only 2293 and 469 unique samples, respectively. Larger sample sizes also led to lower variation in quality (shorter error bars).
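To illustrate why lower entropy translates into more repeated samples, the following sketch draws worlds of 20 independent Bernoulli EM variables and counts the unique fraction. The Bernoulli model and the use of *α* as a per-variable probability are our simplifying assumptions, not the paper's actual EM distribution, so the resulting fractions differ from those reported above; only the qualitative effect (more concentration, more repeats) carries over:

```python
import random

def sample_worlds(n_vars, alpha, n_samples, seed=0):
    # Hypothetical EM model: each variable is true with probability alpha;
    # the farther alpha is from 0.5, the lower the entropy of the world
    # distribution and the more probability mass concentrates on few worlds.
    rng = random.Random(seed)
    return [tuple(rng.random() < alpha for _ in range(n_vars))
            for _ in range(n_samples)]

def unique_fraction(worlds):
    # Fraction of drawn worlds that are distinct (1.0 means no repeats).
    return len(set(worlds)) / len(worlds)

high_entropy = unique_fraction(sample_worlds(20, 0.7, 5000))
low_entropy = unique_fraction(sample_worlds(20, 0.9, 5000))
print(high_entropy, low_entropy)  # lower entropy -> fewer unique samples
```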

These results shed light on the applicability of P-DAQAP to real-world problems such as the CTA use case, given that relatively low numbers of effective (i.e., nonrepeated) samples yield good approximations of the exact values.


**Figure 4.** (**a**) Average running times per world sampled (*n* = 100 runs). For each case, we estimate the running time (in hours) required to run the exact (brute force) algorithm. (**b**) Average solution quality varying #EM variables (log of #worlds), #samples, and the parameter that controls the entropy (*H*) of the probability distribution. For 30 EM variables (1B worlds, bottom right), quality is *approximated* on the basis of a sample of 250,000 worlds. Error bars correspond to standard deviation (*n* > 50 for the top charts, *n* > 15 for the bottom charts).

#### *5.3. Results in the Context of Practical Applications*

We now analyze the results obtained in these experiments in the context of the MITRE ATT&CK data on which we focused for our use case in Section 3. For the purposes of this brief analysis, consider the Enterprise segment of the dataset, which contains 191 techniques and 385 subtechniques; this translates into a large number of constants that would certainly lead to an intractable probabilistic model if tackled directly. Fortunately, there is a well-understood independence relation among these techniques, and they can thus be effectively pruned depending on the tactics to which they are associated. For instance, the *Privilege Escalation* tactic (TA0004) referred to in the use case has 13 associated techniques, while the remaining tactics in the dataset have at most 30 (with the exception of *Defense Evasion* (TA0005), which has 42, though additional filtering according to the specific operating system in question brings this number down significantly). Our preliminary results therefore show that the capacity to scale to 30 EM variables is within the realm of this kind of application, though further efforts are required to effectively derive submodels from the general one that can be used to solve specific query answering tasks. In this same vein, there are multiple research and development efforts to manipulate, adapt, and export data and knowledge from the ATT&CK dataset [32–35].
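As a purely hypothetical illustration of tactic-based pruning, the following sketch filters a toy list of technique records by tactic identifier; the record format and technique names are invented for illustration and do not come from the actual ATT&CK dataset:

```python
# Toy technique records in the spirit of the ATT&CK Enterprise matrix
# (illustrative only; not actual dataset entries).
techniques = [
    {"id": "T-A", "tactics": {"TA0004"}},            # Privilege Escalation
    {"id": "T-B", "tactics": {"TA0005"}},            # Defense Evasion
    {"id": "T-C", "tactics": {"TA0004", "TA0005"}},  # appears under both
]

def prune_by_tactic(techniques, tactic_id):
    # Independence across tactics lets us drop every technique not
    # associated with the tactic under analysis, shrinking the EM model
    # before any sampling takes place.
    return [t["id"] for t in techniques if tactic_id in t["tactics"]]

print(prune_by_tactic(techniques, "TA0004"))  # ['T-A', 'T-C']
```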
