Computing the Commonalities of Clusters in Resource Description Framework: Computational Aspects

Colucci, Simona; Donini, Francesco Maria; Di Sciascio, Eugenio

doi:10.3390/data9100121


by Simona Colucci ^1,†, Francesco Maria Donini ^2,*,† and Eugenio Di Sciascio

^1 Dipartimento di Ingegneria Elettrica e dell’Informazione (DEI), Politecnico di Bari, Via Orabona 4, 70125 Bari, Italy

^2 Dipartimento di Scienze Umanistiche, della Comunicazione e del Turismo (DISUCOM), Università della Tuscia, Via Santa Maria in Gradi, 4, 01100 Viterbo, Italy

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Data 2024, 9(10), 121; https://doi.org/10.3390/data9100121

Submission received: 10 August 2024 / Revised: 3 October 2024 / Accepted: 18 October 2024 / Published: 20 October 2024

(This article belongs to the Section Information Systems and Data Management)


Abstract:

Clustering is a very common means of analysis of the data present in large datasets, with the aims of understanding and summarizing the data and discovering similarities, among other goals. However, despite the present success of the use of subsymbolic methods for data clustering, a description of the obtained clusters cannot rely on the intricacies of the subsymbolic processing. For clusters of data expressed in a Resource Description Framework (RDF), we extend and implement an optimized, previously proposed, logic-based methodology that computes an RDF structure—called a Common Subsumer—describing the commonalities among all resources. We tested our implementation with two open, and very different, RDF datasets: one devoted to public procurement, and the other devoted to drugs in pharmacology. For both datasets, we were able to provide reasonably concise and readable descriptions of clusters with up to 1800 resources. Our analysis shows the viability of our methodology and computation, and paves the way for general cluster explanations to be provided to lay users.

Keywords:

Clusterization; Explanation in Artificial Intelligence (XAI); Least Common Subsumer (LCS); Resource Description Framework (RDF)

1. Introduction and Motivation

Data clusterization is the following problem: given a set of samples (objects, instances, points, etc.), “group samples with similar feature structures or patterns into the same group (cluster) and samples with dissimilar ones into different groups” [1]. Since it can be used without supervision, clusterization is one of the main forms of analysis that can be conducted on big datasets, and it has been successfully applied in various fields, including fraud detection in student assignments [2] and finance [3], collaborative filtering in recommendation systems [4], image and pattern recognition (e.g., facial recognition systems [5]), document classification and topic modeling [6], gene expression analysis in bioinformatics [7], market basket analysis for product bundling and promotion [8], customer segmentation [9], identifying communities in social network analyses [10], and healthcare analytics about patients’ conditions, treatment responses, or genetic profiles [11].

In addition to the above areas, clustering has been also applied to collections of nodes in knowledge graphs [12], and in the more general setting of RDF graphs, with the aim of optimizing the storage and retrieval of RDF resources [13,14,15].

However, depending on the problem, the output of a clusterization algorithm may need to be presented to a human user, either for parameter tuning and debugging, or as a plain result at the end of a workflow [16]. In this situation, clusterization methods are also confronted with the general problem of eXplanation in Artificial Intelligence (XAI) [17] and machine learning; the problem is that very effective subsymbolic processing lends itself poorly to providing explanations for lay users, and even for expert users.

1.1. Previous Work on Cluster Explanations

With the aim of overcoming the explanatory limitations of the subsymbolic clustering of resources in RDF datasets, a post-hoc methodology was previously proposed [18]. Using this methodology, once the cluster has been subsymbolically computed, the following steps are carried out:

The logic substrate that RDF relies on is exploited to compute the most specific RDF graph that is common to all resources in the cluster, known as the Least Common Subsumer (LCS) [19]; this phase makes use of blank nodes in RDF, which are existential variables that can abstract—like placeholders, but with a logical semantics—the single values on which the resources differ. However, although the LCS is logically complete, it is full of irrelevant details [20];
A Common Subsumer (CS) is computed, an RDF structure which is a generalization of the LCS—so, logically, a CS is not the most specific description of the cluster—but still specific enough to capture the relevant features common to all resources;
This structure is used to generate a phrase in constrained, but plain English, with the original idea of using English pronouns (that, which) to verbalize blank nodes in relative sentences. As an example, the cluster of contracting processes addressed in Figure 1 in Section 4 may be explained by the following sentence, generated by a verbalizer module [18], fed by a CS composed of six RDF triples:

The resources in analysis present the following properties in common:

1) They all have a release referencing some resource

which has publisher web page “https://openopps.com”

and has publisher name “Open Opps”

and has publisher schema “Companies House”

and has release publisher “TICON UK LIMITED”

and has type of release initiation “tender”

1.2. Contribution of This Paper

This paper enhances the cluster explanation methodology in several ways:

We propose an optimized algorithm for the computation of the CS of a cluster, which scales to cluster sizes that had not been reached before;
We validate this computation through extensive experimentation with two, very different, real datasets;
We collect and analyze data on the experiments and discuss the computational properties of the implementation, such as the convergence, expected runtime, and possible heuristics.

To prove the effectiveness and generality of our methodology and its implementation, we analyzed two open RDF datasets devoted to very different subjects:

TheyBuyForYou ^1, dealing with public procurement;
Drugbank ^2, dealing with drugs used in pharmacology.

Our results prove the soundness and viability of the modified methodology for real datasets. Our implementation provides a human-oriented tool that could be used in interactive clusterization, a recent evolution of clusterization in which the subsymbolic process can be tuned by a human-in-the-loop feedback [21].

With respect to the classification of the semantic web applications that was recently proposed by Colucci et al. [22], this application lies at level II for the blank nodes dimension (“consider blank nodes with no denotation”), at level I for deductive capabilities (“no capability”), and at level IV for explanation (“human-readable format”).

1.3. Outline of the Paper

The paper continues as follows: in the next section, some preliminary knowledge is provided to make the paper self-contained. Section 3 describes our computation in terms of its methodology and analysis. The experimental results following the computation are given in Section 4, before the paper is closed with a final discussion. Further details are deferred to Appendix A.

2. RDF and Common Subsumers: Background Notions

In this section, we first present the definitions of RDF and common subsumers in RDF that are necessary for understanding the rest of the paper. Readers familiar with RDF may skip the first subsection.

2.1. Background of RDF Syntax and Simple Entailment

Information in RDF datasets is structured in a so-called RDF-graph, in which resources are connected through RDF triples, which we denote in the text with the form $≪ s\ p\ o ≫$ (a $\underline{s}$ubject, a $\underline{p}$redicate, an $\underline{o}$bject) [23] ^3. These terms can be IRIs, XSD-typed literals, and blank nodes, which we discuss below. In the above triple, s is an IRI or a blank node, p is an IRI, and o is either an IRI, a blank node, or a literal. Triples can be serialized in one of several text-based, machine-readable formats, such as RDF/XML, N-Triples, and Turtle, so that files/streams of triples can be easily exchanged through HTTP or other text-based Internet protocols. For instance, in Berners-Lee’s Turtle format ^4, the above triple would be written as

s p o .

(with a full stop at the end). When most of the IRIs share the same prefix, the Turtle format allows this prefix to be declared, as in the following example, taken from Drugbank:

@prefix ns2: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
ns2:Humans-and-other-mammals rdf:type ns2:Affected-organism .

whose triple (third line) can almost be read as plain English, but with prefixes that provide a machine with the Web reference showing where such terms are defined (intuitively, using a context that defines what they mean).

In addition to being a widely used data interchange format, RDF is also equipped with a formal semantics that is clearly defined and explained by the W3C [25]. A set of RDF triples is a set of formulas. Triples involving only IRIs and literals correspond to atomic ground sentences whose First-Order Logic (FOL) notation would be $p(s, o)$. Following a well-known (non-normative) convention [25], we denote blank nodes with an underscore and a colon, followed by an identifier, e.g., _:x. Blank nodes are existentially quantified variables, whose scope spans the entire set of triples, which is usually a file. Hence, a triple containing a blank node corresponds to an open formula, and if a blank node _:x occurs in several triples of the same file, e.g., if _:x occurs in the two triples

ex:a ex:r _:x . and _:x ex:q ex:b .

it is considered the same variable and existentially quantified once, equivalent to the FOL formula $\exists x . [r(a, x) \land q(x, b)]$ ^5. Sets of RDF triples are usually called RDF graphs, although they might not be graphs in the ordinary sense, since the following pair of triples is legal:

ex:a ex:p ex:b . and ex:p ex:q ex:d .

where the subject of the second triple is the predicate of the first one.

An instance $M(G)$ of an RDF graph G maps IRIs and literals to themselves, while a blank node is mapped to a term (another blank node, or IRI, or literal).

Thanks to its higher-order semantics (which we do not discuss here), RDF is equipped with a simple entailment relation [25], denoted by ⊧: an RDF graph G entails another RDF graph H, denoted by $G ⊧ H$, if and only if there exists an instance $M(H)$ such that $M(H) \subseteq G$, i.e., $M(H)$ is a subset of G. By logical equivalence we mean mutual simple entailment, i.e., G and H are equivalent if both $G ⊧ H$ and $H ⊧ G$ hold. Note also the following fact:

\text{if } G = \emptyset, \text{ it can entail only other empty graphs}

(1)
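
To make the definition concrete, the following is a minimal Python sketch of simple entailment as a backtracking search for an instance $M(H) \subseteq G$. It is an illustration under stated assumptions, not code from the paper: triples are plain 3-tuples of strings, blank nodes are recognized by the `_:` prefix, and the names `is_blank` and `entails` are hypothetical.

```python
def is_blank(term):
    """Assumption of this sketch: blank nodes are strings starting with '_:'."""
    return term.startswith("_:")


def entails(g, h):
    """Return True if G simply entails H, i.e. some instance M(H) is a subset of G.

    Blank nodes of H are treated as variables; IRIs and literals of H must
    match a triple of G exactly. The search is exponential in the worst case.
    """
    h = sorted(h)

    def search(i, mapping):
        if i == len(h):
            return True  # every triple of H has been mapped into G
        for g_triple in g:
            trial = dict(mapping)
            matches = all(
                trial.setdefault(ht, gt) == gt if is_blank(ht) else ht == gt
                for ht, gt in zip(h[i], g_triple)
            )
            if matches and search(i + 1, trial):
                return True
        return False

    return search(0, {})


G = {("ex:a", "ex:r", "ex:c"), ("ex:c", "ex:q", "ex:b")}
H = {("ex:a", "ex:r", "_:x"), ("_:x", "ex:q", "ex:b")}
print(entails(G, H))          # True: _:x can be instantiated with ex:c
print(entails(set(), H))      # False: Fact (1)
print(entails(set(), set()))  # True
```

The last two calls correspond to Fact (1): with no triples in G, the only H for which an instance can be a subset of G is the empty graph.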

Since RDF graphs may expand to billions of triples through linked open data repositories, for each resource r, it is common to isolate only a connected portion $T_{r}$ of an RDF graph centered around the resource [26,27], and to compare such structures, in order to decide how to group resources. To select the triples that form the relevant properties of r, we apply both a notion of distance and a filter to stop patterns [28], denoted by a boolean predicate $ϕ$. For example, stop patterns include labels of the resource in all but one chosen language, or the date/authorship of the last edit of a resource’s properties. We call such portions of an RDF dataset, centered around a root r, a rooted RDF-graph (r-graph), denoted by $〈 r, T_{r} 〉$. Simple entailment in RDF can be extended to r-graphs, denoted similarly as $〈 a, T_{a} 〉 ⊧ 〈 b, T_{b} 〉$, by carefully mapping roots onto roots whenever possible (and failing otherwise) [19].
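
The selection of $T_{r}$ can be pictured with the following Python sketch. It rests on assumptions that the text does not fix: distance is taken to be the number of subject-to-object hops from the root, `phi` is assumed to return True for triples to keep, and the function name `extract_r_graph` and the toy data are hypothetical.

```python
from collections import deque


def extract_r_graph(graph, root, max_distance, phi):
    """Return (root, T_r): triples within max_distance hops of root that pass phi."""
    by_subject = {}
    for triple in graph:
        by_subject.setdefault(triple[0], []).append(triple)

    selected, visited = set(), {root}
    queue = deque([(root, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist >= max_distance:
            continue  # do not expand nodes at the distance bound
        for triple in by_subject.get(node, []):
            if not phi(triple):
                continue  # stop pattern: drop the triple
            selected.add(triple)
            obj = triple[2]
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, dist + 1))
    return root, selected


# Toy graph; the stop pattern drops labels in any language other than English.
graph = [
    ("ex:r", "ex:p", "ex:a"),
    ("ex:a", "ex:q", "ex:b"),
    ("ex:b", "ex:q", "ex:c"),
    ("ex:r", "rdfs:label", '"risorsa"@it'),
    ("ex:r", "rdfs:label", '"resource"@en'),
]


def phi(triple):
    return not (triple[1] == "rdfs:label" and not triple[2].endswith("@en"))


root, t_r = extract_r_graph(graph, "ex:r", 2, phi)
print(sorted(t_r))
```

With a bound of 2, the triple about `ex:b` is left out because `ex:b` sits at the distance bound and is not expanded.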

2.2. Background of Common Subsumers in RDF

To describe the commonalities of the RDF resources, the notion of Least Common Subsumer (LCS) was proposed [19,26], borrowing its name from an analogous notion in description logics [29]. The LCS of a set of r-graphs $〈 r_{1}, T_{r_{1}} 〉, \dots, 〈 r_{n}, T_{r_{n}} 〉$ is a new r-graph $〈 x, T_{x} 〉$, represented by a new blank node x (one that does not occur in the previous graphs), along with a set of connected triples $T_{x}$ whose features are only the ones shared by all resources. This can be written more formally as follows:

P1:: $〈 r_{i}, T_{r_{i}} 〉 ⊧ 〈 x, T_{x} 〉$ for all $i = 1, \dots, n$ ;
P2:: Any other r-graph with this Property (P1) is logically equivalent to $〈 x, T_{x} 〉$ .

Intuitively, since blank nodes in RDF represent existential variables, it is natural to use a new one as an abstraction of several resources $r_{1}, \dots, r_{n}$, the inverse operation being to instantiate the blank node with each resource. Moreover, when the resources share only part of the information in a triple, the LCS reports the common information and abstracts the different characteristics with (another) blank node, say y. We illustrate this intuition below with an example excerpted from a real dataset.

Consider two drugs, Piroxicam and Tolterodine, which are modeled in Drugbank as RDF resources (accessed on 17 October 2024):

Piroxicam: http://bio2rdf.org/drugbank:DB00554;
Tolterodine: http://bio2rdf.org/drugbank:DB01036.

Let us suppose that the RDF graphs $T_{Piroxicam}$ and $T_{Tolterodine}$ below (simplified from the original ones to increase the readability of the example) are used for the computation of their LCS. The RDF graphs are serialized in Turtle syntax as follows (we declare common prefixes for subsequent readings):

@prefix ns1: <http://bio2rdf.org/drugbank:> .
@prefix ns2: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

$T_{Piroxicam}$ =

ns1:DB00554 ns2:affected-organism ns2:Humans-and-other-mammals ;
    ns2:category ns2:Anti-Inflammatory-Agents,-Non-Steroidal ;
    ns2:enzyme ns1:BE0002793 .
ns2:Humans-and-other-mammals rdf:type ns2:Affected-organism .
ns2:Anti-Inflammatory-Agents,-Non-Steroidal rdf:type ns2:Category .
ns1:BE0002793 rdf:type ns2:Enzyme ;
    ns2:cellular-location "Endoplasmic reticulum" .

$T_{Tolterodine}$ =

ns1:DB01036 ns2:affected-organism ns2:Humans-and-other-mammals ;
    ns2:category ns2:Muscarinic-Antagonists ;
    ns2:enzyme ns1:BE0002793 .
ns2:Humans-and-other-mammals rdf:type ns2:Affected-organism .
ns2:Muscarinic-Antagonists rdf:type ns2:Category .
ns1:BE0002793 rdf:type ns2:Enzyme ;
    ns2:cellular-location "Endoplasmic reticulum" .

Then, an RDF graph that is logically implied by both $T_{Piroxicam}$ and $T_{Tolterodine}$ can be presented as follows:

$T_{x}$ =

_:x ns2:affected-organism ns2:Humans-and-other-mammals ;
    ns2:category _:y ;
    ns2:enzyme ns1:BE0002793 .
ns2:Humans-and-other-mammals rdf:type ns2:Affected-organism .
_:y rdf:type ns2:Category .
ns1:BE0002793 rdf:type ns2:Enzyme ;
    ns2:cellular-location "Endoplasmic reticulum" .

It was proved [19] that $T_{x}$ is unique up to logical equivalence, and that the operation of computing an LCS is idempotent, commutative and associative. However, when used in real contexts [28], the LCS proved to be too specific and too full of details to be useful as the basis for an explanation of commonalities. Moreover, theoretical results [27,30] show that there are sets of resources whose LCS has an inherently exponential size.

Therefore, we resorted to computing a Common Subsumer (CS) of a set of resources, which is more general (less specific) but retains enough useful information to be described. Formally, a common subsumer shares the LCS Property (P1) presented above, but not Property (P2). However, once the method for computing the CS is fixed and uniform, it still enjoys the three following properties:

CSP1:: idempotent: $C S (r_{1}, r_{1}) \equiv r_{1}$ ;
CSP2:: commutative: $C S (r_{1}, r_{2}) \equiv C S (r_{2}, r_{1})$ ;
CSP3:: associative: $C S (r_{1}, C S (r_{2}, r_{3})) \equiv C S (C S (r_{1}, r_{2}), r_{3})$ .

We omitted the triples associated with the r-graph for readability. Properties CSP2 and CSP3 ensure that a CS can be computed incrementally by starting from any “seed pair” of resources, combining the result with the next resource, and repeating this iteratively until the last one is reached, since each possible ordering of the resources used in this iteration leads to an RDF graph that is logically equivalent to the one obtained by any other ordering. However, the size of the final (all equivalent) CSs may vary, ranging from the most succinct to a very redundant one (for instance, when several blank nodes are used to represent a structure that could be represented by a single one). Moreover, the time needed to compute such an (equivalent form of a) CS may vary considerably.
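
As a worked instance of this order-independence argument, the chain below shows that swapping the second and third resources yields an equivalent result. It is a sketch that additionally assumes that CS preserves logical equivalence of its arguments, which is needed to apply CSP2 inside the outer CS.

```latex
\begin{align*}
CS(CS(r_{1}, r_{2}), r_{3})
  &\equiv CS(r_{1}, CS(r_{2}, r_{3})) && \text{(CSP3)} \\
  &\equiv CS(r_{1}, CS(r_{3}, r_{2})) && \text{(CSP2, applied to the inner argument)} \\
  &\equiv CS(CS(r_{1}, r_{3}), r_{2}) && \text{(CSP3)}
\end{align*}
```

Since any reordering can be obtained by a sequence of such swaps (together with CSP2 on the seed pair), the same reasoning extends to $n$ resources.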

Fact (1) implies Fact (2), presented below, which we will use later to optimize the CS of a cluster:

\text{if either } \left\{ \begin{matrix} T_{a} = \emptyset, \text{ or} \\ T_{b} = \emptyset \end{matrix} \right. \text{ then } CS(〈 a, T_{a} 〉, 〈 b, T_{b} 〉) = 〈 x, \emptyset 〉

(2)

Being the only possible one, this Common Subsumer is also the LCS. Intuitively, this fact says that if one of the two resources has no property at all, then there are no commonalities with any other resource.

3. Computation Methodology and Analysis

In this section, we present the methodology at the basis of the proposed computation and raise some research questions related to the analysis of such a computation.

The discussion follows the steps below:

We present an algorithm computing the CS of two resources, which builds on a previously published one, but with a crucial optimization that reduces the size of the resulting CS—it is still an RDF graph, with blank nodes and far fewer triples than the output of the original algorithm.
Exploiting the associativity, we iterate the above algorithm, computing the CS of a “running” CS (starting with the pair $r_{1}, r_{2}$) and the next resource $r_{i}$, $i = 3, \dots, n$, as the expression $CS(\dots CS(CS(r_{1}, r_{2}), r_{3}), \dots, r_{n})$ suggests.
Since the time (and the size) needed to compute the (equivalent form of) CS of a complete cluster may vary depending on the ordering in which $r_{1}, \dots, r_{n}$ are given, in order to estimate the expected size of a CS of an entire cluster and the time needed to compute it, we set up a Monte Carlo method, which probes only a random fraction of all the $n!$ possible orderings, one of which could be used to incrementally compute the CS. We consider the increasing-size heuristic (see Section 3.3 below) as a special trial, and compare its size and time with those of the other trials.

In the above context, we address the following experimental research questions:

RQ1:: Does the computation of the CS always converge to one size when changing the order of resources incrementally added to the CS?
RQ2:: How quickly (depending on the number of resources added to the CS) does the incremental computation of a CS of a given cluster converge?
RQ3:: How much do the different choices for the next resource to include influence the convergence, and are there simple heuristics that can be used to choose the initial pair and the next resource?

3.1. An Improved Algorithm for the CS of Two Resources (Algorithm 1)

The algorithm performs a joint post-order search in a pair of n-ary trees. It incrementally computes $〈 x, T_{x} 〉$ by enumerating triples that directly originate from each of the two resources $r_{1}, r_{2}$, recursively calling itself on both the pair of predicates and the pair of objects of these triples. Triples not relevant to the application are filtered away by a predicate $ϕ()$. Specifically, for each pair of triples $t_{1} = ≪ a\ p\ c ≫ \in T_{r_{1}}$, $t_{2} = ≪ b\ q\ e ≫ \in T_{r_{2}}$, Algorithm 1 determines with a recursive call a CS $〈 y, T_{y} 〉$ for the two predicates $p, q$ (Line 13) and a CS $〈 z, T_{z} 〉$ for the objects $c, e$ (Line 14). It then constructs a provisional r-graph $〈 x, T_{s} 〉$ with a support variable $T_{s} = \{≪ x\ y\ z ≫\} \cup T_{y} \cup T_{z}$. Then, the triples in $T_{s}$ are added to $T_{x}$ only if $〈 x, T_{x} 〉$ does not already entail $〈 x, T_{s} 〉$ (Line 19, boxed for the reader’s convenience). The conditional addition overcomes a major drawback of the previously known algorithm, which always added a new subgraph to the result, even when that subgraph was already entailed by the CS being built.

Regarding the time complexity, Algorithm 1 has the same worst-case complexity as the original algorithm, namely $O(| T_{a} | \cdot | T_{b} |)$, and is hence quadratic when the two r-graphs have the same size [19]. This is ensured by Line 1, which checks that each resource in $T_{a}$ cannot be examined more than once in conjunction with each resource in $T_{b}$. Hence, from a worst-case complexity perspective, Line 19 is merely an optimization step.

Nonetheless, in the real RDF datasets we tested, our optimization significantly reduces the CS size, i.e., the output of the computation, dramatically changing the applicability of the algorithm in real datasets. This optimization is in fact crucial for scaling up the computation methodology to hundreds or thousands of resources, as illustrated in the next section.

Algorithm 1: Computing a CS of $〈 a, T_{a} 〉$ and $〈 b, T_{b} 〉$. The optimization discussed in the text is highlighted in a box.
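
As a rough illustration of the description in this subsection, the following Python sketch computes a common subsumer of two resources with the conditional addition. It is a simplified reading of the prose, not the authors' listing: line numbers do not correspond to Algorithm 1, the $ϕ$ filter is omitted, the input is assumed to contain no blank nodes, the test for uninformative triples is a simplification, and all function names and the `store` layout (a dict from each term to its set of (predicate, object) pairs) are hypothetical.

```python
def is_blank(term):
    return term.startswith("_:")


def reachable(root, store):
    """Triples reachable from root by following predicates and objects."""
    found, todo = set(), [root]
    while todo:
        s = todo.pop()
        for p, o in store.get(s, ()):
            if (s, p, o) not in found:
                found.add((s, p, o))
                todo += [p, o]
    return found


def entails(g, h, fixed):
    """Rooted simple entailment: blank nodes of h are variables, except that
    the mapping in `fixed` (root onto root) is imposed from the start."""
    h = sorted(h)

    def search(i, mapping):
        if i == len(h):
            return True
        for g_triple in g:
            trial = dict(mapping)
            if all(
                trial.setdefault(ht, gt) == gt if is_blank(ht) else ht == gt
                for ht, gt in zip(h[i], g_triple)
            ) and search(i + 1, trial):
                return True
        return False

    return search(0, dict(fixed))


def find_cs(a, b, store, seen=None):
    """Return (root, triples) of a common subsumer of resources a and b."""
    seen = {} if seen is None else seen
    if a == b:
        return a, reachable(a, store)
    if (a, b) in seen:  # pair already examined (or being examined)
        return seen[(a, b)]
    x, t_x = f"_:x{len(seen)}", set()
    seen[(a, b)] = (x, t_x)
    for p, c in sorted(store.get(a, ())):
        for q, e in sorted(store.get(b, ())):
            y, t_y = find_cs(p, q, store, seen)
            z, t_z = find_cs(c, e, store, seen)
            if is_blank(y) and is_blank(z) and not t_y and not t_z:
                continue  # uninformative triple: nothing is shared
            t_s = {(x, y, z)} | t_y | t_z
            if not entails(t_x, t_s, fixed={x: x}):  # the conditional addition
                t_x |= t_s
    return x, t_x


# The simplified Piroxicam / Tolterodine graphs of Section 2.2.
store = {
    "ns1:DB00554": {("ns2:affected-organism", "ns2:Humans-and-other-mammals"),
                    ("ns2:category", "ns2:Anti-Inflammatory-Agents,-Non-Steroidal"),
                    ("ns2:enzyme", "ns1:BE0002793")},
    "ns1:DB01036": {("ns2:affected-organism", "ns2:Humans-and-other-mammals"),
                    ("ns2:category", "ns2:Muscarinic-Antagonists"),
                    ("ns2:enzyme", "ns1:BE0002793")},
    "ns2:Humans-and-other-mammals": {("rdf:type", "ns2:Affected-organism")},
    "ns2:Anti-Inflammatory-Agents,-Non-Steroidal": {("rdf:type", "ns2:Category")},
    "ns2:Muscarinic-Antagonists": {("rdf:type", "ns2:Category")},
    "ns1:BE0002793": {("rdf:type", "ns2:Enzyme"),
                      ("ns2:cellular-location", '"Endoplasmic reticulum"')},
}
root, t_x = find_cs("ns1:DB00554", "ns1:DB01036", store)
print(len(t_x))  # 7
```

On this input, the candidate subgraphs that pair unrelated triples (for example, an affected organism with a category) are generated but then dropped by the entailment test, because the triples already in the result entail them. Without the test they would all be kept, which is the kind of redundancy the conditional addition is meant to avoid.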

3.2. Computing the CS of a Cluster of Resources (Algorithm 2)

Let $[r_{1}, \dots, r_{n}]$ be an enumeration of the resources of the cluster in a given order, where each resource $r_{j}$ is equipped with the triples $T_{r_{j}}$ describing its relevant characteristics. We can compute the CS of a cluster of resources as $CS(\dots CS(CS(〈 r_{1}, T_{r_{1}} 〉, 〈 r_{2}, T_{r_{2}} 〉), 〈 r_{3}, T_{r_{3}} 〉) \dots, 〈 r_{n}, T_{r_{n}} 〉)$; that is, by first computing with Algorithm 1 the CS of $〈 r_{1}, T_{r_{1}} 〉$, $〈 r_{2}, T_{r_{2}} 〉$, then computing the CS of the previous result and resource $r_{3}$, etc., until the CS with the last resource $r_{n}$ is computed. Associativity and commutativity ensure that any order in which the resources are taken leads to the same result, up to logical equivalence.

Algorithm 2: Computing a CS of $〈 r_{1}, T_{r_{1}} 〉, \dots, 〈 r_{n}, T_{r_{n}} 〉$ by iteratively using Algorithm 1.
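
The iteration scheme can be sketched in Python as a left fold with an early exit justified by Fact (2). To keep the example short, the pairwise operation is passed in as a parameter and the usage below plugs in set intersection over toy feature sets as a stand-in; this stand-in is an assumption of the sketch (it is idempotent, commutative and associative, like CS) and is not the CS of RDF graphs. The names `cluster_cs`, `pairwise_cs` and `is_empty` are hypothetical.

```python
from itertools import permutations


def cluster_cs(resources, pairwise_cs, is_empty):
    """Fold pairwise_cs over the resources in the given order."""
    running = resources[0]
    for nxt in resources[1:]:
        if is_empty(running):
            break  # Fact (2): an empty description has no commonalities left
        running = pairwise_cs(running, nxt)
    return running


# Toy cluster: each resource is a set of (property, value) features.
cluster = [
    frozenset({("type", "tender"), ("publisher", "Open Opps"), ("year", "2019")}),
    frozenset({("type", "tender"), ("publisher", "Open Opps")}),
    frozenset({("type", "tender"), ("year", "2019")}),
]


def intersect(a, b):
    return a & b


def no_features(description):
    return not description


results = {
    cluster_cs(list(order), intersect, no_features)
    for order in permutations(cluster)
}
print(len(results))  # 1: every ordering yields the same result
```

With the stand-in, equality plays the role that logical equivalence plays for r-graphs.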

We can observe that the optimization introduced at Line 19 in Algorithm 1 is crucial for the scalability of Algorithm 2. In fact, thanks to this optimization, we are able to compute a CS of clusters of up to thousands of resources. In contrast, when Algorithm 2 iteratively calls the previously known algorithm $FindLCS$ [19] (instead of Algorithm 1) to compute the CS of a cluster, it may require very long execution times even for very small clusters. As an example, Table 1 presents a comparison between the two solutions in terms of their performance at each iterative step in the computation of a CS of a cluster of 13 items (contracting processes in TheyBuyForYou). Both solutions were run on a Windows desktop with an 11th Gen Intel(R) Core(TM) i7-11700K CPU running at 3.60 GHz and 32 GB of RAM. Performance is measured in terms of the following:

CS size: size of the triple set returned as CS;
#Blank Nodes: number of blank nodes in the CS;
#Uninf. Triples: number of triples computed and discarded from the results (the so-called uninformative triples), according to Lines 16–18 in Algorithm 1 and Formula (9) in Algorithm $FindLCS$ [19];
Exec. Time: execution time of the iteration step.

The values in Table 1 show that the adoption of Algorithm 1 ($Find\_ReducedCS$) leads, at each iterative step (one per table row), to the computation of a CS of smaller size that is logically equivalent to the one returned by $FindLCS$. In fact, $FindLCS$ also returns many redundant RDF-paths that are logically implied by the already computed CS (these paths are discarded in Line 19 by $Find\_ReducedCS$). Redundant paths involve several blank nodes, which cause $FindLCS$ to obtain a larger value in terms of #Blank Nodes. The execution time is much longer when $FindLCS$ is used at each iteration step and reaches two hours for more than 13 resources. Such results make any analysis of large clusters with $FindLCS$ infeasible.

Regarding the worst-case CS size, it has been demonstrated in the related area of Description Logics that the size of the least common subsumer can increase exponentially with the number of resources [30] for the LCS of several tree-shaped concepts. More recently, an exponential worst case was also achieved for the CS of the knowledge graphs [27] with a cyclic structure. However, such worst cases did not show up for the real datasets we tested, a fact that may be significant in itself.

3.3. Expected Size of the Final CS, and Overall Runtime of Algorithm 2

With the real datasets we experimented with, the size of a cluster can reach around 1800 resources. Hence, the incremental computation of the CS of all resources may vary significantly, depending on the order in which the resources are added to the CS. With Big Data, one cannot even expect to have all resources available at once in order to decide which is the best one to combine next in the CS; resources can arrive as a stream and might need to be collected as they arrive. Hence, we could only estimate the runtime needed and the size of the resulting CS for the clusters in the datasets we analyzed. We achieved this with the Monte Carlo method [31], averaging over 100 trials with random orderings of the resources.

When all the resources are available at once, an intuitive heuristic is the increasing-size heuristic. This begins by computing the CS of the resources $r_{1}, r_{2}$ whose attached sets of triples $T_{r_{1}}$, $T_{r_{2}}$ are the smallest, and proceeds with the resources with an increasing number of triples. After all, the presence of few characteristics should mean there are few commonalities to start with in Line 1, and one may imagine that, while examining other resources, only such few established commonalities will be confirmed as the common ones. To confirm or deny this hypothesis, we treated this as a special, non-random, case, with an increasing-size order, and marked it in the experiments to see if its runtime placed it in the lower half (possibly the minimum) of the distribution of runtimes or not. We show in the next section that this intuition quite often failed.
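
The experimental setup described in this subsection can be outlined with the Python sketch below: a fixed number of random orderings, plus the increasing-size ordering as a special trial, each recording the sequence of running sizes and the elapsed time. As before, the pairwise operation is a parameter, and the usage example plugs in set intersection on toy data purely as a stand-in; the function names and the default seed are assumptions of the sketch.

```python
import random
import time


def run_trial(ordering, pairwise_cs, size):
    """Fold pairwise_cs over one ordering; return (final, sizes, seconds)."""
    start = time.perf_counter()
    running, sizes = ordering[0], []
    for resource in ordering[1:]:
        running = pairwise_cs(running, resource)
        sizes.append(size(running))
    return running, sizes, time.perf_counter() - start


def monte_carlo(resources, pairwise_cs, size, trials=100, seed=0):
    """Random-ordering trials plus the increasing-size heuristic as a special trial."""
    rng = random.Random(seed)
    random_trials = []
    for _ in range(trials):
        ordering = list(resources)
        rng.shuffle(ordering)
        random_trials.append(run_trial(ordering, pairwise_cs, size))
    heuristic = run_trial(sorted(resources, key=size), pairwise_cs, size)
    return random_trials, heuristic


# Toy stand-in data: feature sets that all share the features 0 and 1.
toy_rng = random.Random(42)
cluster = [frozenset({0, 1} | set(toy_rng.sample(range(2, 30), 8))) for _ in range(20)]

random_trials, heuristic = monte_carlo(cluster, lambda a, b: a & b, len)
mean_runtime = sum(t for _, _, t in random_trials) / len(random_trials)
print(len(random_trials), len(heuristic[1]))  # 100 19
```

Note that with set intersection the running size can never grow, so this stand-in cannot reproduce size increases along the iteration; it only illustrates the bookkeeping of the trials.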

4. Results

Our experiments aimed to answer the above research questions by computing the CS of several clusters of RDF resources, obtained by applying the well-known clustering algorithm k-means [32] to the resources in the two datasets, TheyBuyForYou and Drugbank, mentioned earlier in this paper. For the sake of brevity, we only discuss the full experiment with reference to TheyBuyForYou. Details and results about the same experiment in Drugbank may be found in Appendix A.

The knowledge graph TheyBuyForYou [33] includes an ontology for procurement data, based on the Open Contracting Data Standard (OCDS) [34]. The OCDS data model is built around the concept of a contracting process, whose main phases are planning, tender, award, contract, and implementation. Our experiment starts by clustering the contracting processes emitted on 30 January 2019 (3198 resources) with the k-means algorithm. To achieve this aim, such resources need to undergo a vector embedding process that exploits the embedding strategies proposed by Ristoski et al. [35], implemented in the pyRDF2Vec ^6 library. The optimal number k of clusters to be returned by the k-means algorithm is determined by applying two different optimization methods, the Elbow method [36] and Silhouette analysis [37], to the feature vectors resulting from the embedding. Both of these methods suggest $k = 10$. Thus, 10 different clusters were returned by the k-means algorithm applied to the set of 3198 contracting processes.

For each cluster of resources, we randomly chose an ordering $[r_{1}, r_{2}, \dots, r_{n}]$ (i.e., an ordering of indices); then, we started by computing the CS $CS_{12}$ of the first pair of resources (Algorithm 2, Line 1), before computing $CS_{3}$, i.e., the CS of $CS_{12}$ and the third resource in the ordering, and so on (Algorithm 2, Line 4), until we reached the common subsumer of all resources $CS_{n}$. We repeated this process one hundred times for each cluster, recording the following for each trial:

The sequence of Common Subsumers $CS_{12}, CS_{3}, \dots, CS_{n} = 〈 cs, T_{cs} 〉$ that was progressively computed;
The sequence of sizes $| T_{CS_{12}} |, | T_{CS_{3}} |, \dots, | T_{CS_{n}} |$ of the CSs that were progressively computed;
The overall runtime of Algorithm 2 in that trial.

We discuss how we used the recorded information in separate subsections (below), and finally compare the research questions presented in Section 3 with the results obtained.

4.1. Logical Convergence

At the end of each trial, we checked the logical equivalence of the computed r-graph $〈 cs, T_{cs} 〉$ (rooted at the blank node $cs$) with that previously obtained with a different ordering. In all cases, we obtained a positive result; that is, all the 100 final r-graphs were logically equivalent, confirming Property CSP3 of the Common Subsumers. Then, any one of these r-graphs can be passed on to our verbalizer [18], yielding a plain English explanation of the cluster. For example, the CS computed for Cluster 1, when passed to our verbalizer module, generates the explanation shown in the Introduction (Section 1.1), which explains the commonalities of Cluster 1 to the end user.

4.2. Dependency of the Rate of Convergence on the Order of Added Resources

We analyzed the size sequence $| T_{CS_{12}} |, | T_{CS_{3}} |, \dots, | T_{CS_{n}} |$ of the Common Subsumers and observed that while the sizes generally decrease, the rate of this decrease may vary considerably.

Figure 1 shows how the CS size varies with the number of resources compared in the computation of the CS of Cluster 1. The chart includes 100 lines, with each line corresponding to a test with a different random ordering of resources.

Figure 1. Convergence of Cluster 1 in TheyBuyForYou: the chart includes 100 lines, one for each tested random ordering of resources. In every one of the 100 tests, the number of triples of the CS (vertical axis) converges to a single size (six triples) as the number of resources added to the CS (horizontal axis) approaches the size of the cluster.

Figure 2 collects the convergence results for all 10 clusters returned by the clustering process.

For the sake of visibility, in Figure 3, we show only one line of Figure 1 (referring to Cluster 1); all other lines showed a similar behavior. The reader may notice that the initial size of the CS is 13: the set describing the commonalities of the first two resources includes 13 triples. By adding resources to the computation, the size of the CS may decrease, increase, or remain constant. Generally speaking, the final number of triples (six in the case of Cluster 1) is reached by adding resources (convergence to six is reached at 23 resources in the test shown in Figure 3): the larger the collection, the smaller the number of shared features, in most cases. Nevertheless, the CS size may also increase with the number of resources: sometimes the addition of one more resource $r_{\ell+1}$ to the computation causes $|CS_{\ell+1}| > |CS_{\ell}|$. In Figure 3, this happens twice: $|CS_{14}| > |CS_{13}|$ and $|CS_{19}| > |CS_{18}|$. This phenomenon highlights a failure in the idea that the more resources are used, the fewer commonalities they will have, hence decreasing the size of the CS. For resources represented in RDF, a commonality also arises when a set of $\ell$ resources reaches the same literal value, say $v$, through the same path, so that $CS_{\ell}$ contains that path. If the next resource reaches $v$ too, but through a different path, then $CS_{\ell+1}$ might require more blank nodes (and more triples) to represent the different ways in which $v$ is reached.
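The non-monotonic behaviour described above can be detected mechanically from a recorded sequence of CS sizes. The following is a minimal sketch; the function names and the sample sequence are hypothetical (the sequence is invented for illustration and is not the data plotted in Figure 3).

```python
from typing import List


def increase_steps(sizes: List[int]) -> List[int]:
    """Return the 0-based positions i at which sizes[i] > sizes[i - 1]."""
    return [i for i in range(1, len(sizes)) if sizes[i] > sizes[i - 1]]


def convergence_step(sizes: List[int]) -> int:
    """Return the first position from which the size never changes again."""
    final = sizes[-1]
    step = len(sizes) - 1
    while step > 0 and sizes[step - 1] == final:
        step -= 1
    return step


# Hypothetical size sequence: mostly decreasing, with two local increases.
sample_sizes = [13, 13, 11, 9, 10, 8, 6, 7, 6, 6, 6]
```

Applied to one ordering, `increase_steps` lists the points where adding a resource enlarged the CS, and `convergence_step` gives the position after which the size stays at its final value.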

4.3. Analysis of Computation Time

Figure 4 plots the runtime required to compute the CS of Cluster 3 in all 100 orderings considered in the experiments, shown in blue. We show Cluster 3 as a worst-case scenario, because it is the one with the highest mean runtime value (around 935 s). We treated the test in which resources are ordered in increasing size order (in orange) as a special case. Figure 4 also reports the lines for the mean value $\mu$ (red line) plus/minus the standard deviation $\sigma$ (green lines), showing that only a small number of tests fall outside the range $[\mu - \sigma, \mu + \sigma]$. The distribution is heavy-tailed, in statistical terminology, in the sense that most of the outliers (i.e., runtimes outside $[\mu - \sigma, \mu + \sigma]$) are much bigger than $\mu + \sigma$.
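The band $[\mu - \sigma, \mu + \sigma]$ and the runtimes falling outside it can be computed as in the following minimal sketch. The function names and the sample runtimes are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics
from typing import List, Tuple


def sigma_band(runtimes: List[float]) -> Tuple[float, float]:
    """Return (mu - sigma, mu + sigma) for the given runtimes."""
    mu = statistics.mean(runtimes)
    # Assumption: population standard deviation; use statistics.stdev
    # instead for the sample standard deviation.
    sigma = statistics.pstdev(runtimes)
    return mu - sigma, mu + sigma


def split_outliers(runtimes: List[float]) -> Tuple[List[float], List[float]]:
    """Return the runtimes below and above the one-sigma band."""
    low, high = sigma_band(runtimes)
    below = [t for t in runtimes if t < low]
    above = [t for t in runtimes if t > high]
    return below, above


# Hypothetical runtimes in seconds (not measured values from the paper).
sample_runtimes = [900.0, 910.0, 920.0, 930.0, 940.0, 950.0, 2400.0]
below, above = split_outliers(sample_runtimes)
```

With a heavy right tail, as in this invented sample, the outliers land in `above` rather than in `below`.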

Surprisingly, the idea that starting the computation from the resources $r_{1}, r_{2}$ with the smallest $T_{r_{1}}, T_{r_{2}}$, and then adding resources to the CS in increasing size order, would keep the number of triples in the running CS low and, consequently, reduce the overall execution time does not hold in any of the clusters analyzed. In Cluster 3 (see Figure 4), the orange point is under the mean value $\mu$ for Cluster 3, but it is not the minimal one; in other clusters (not shown), instead, the orange point is close to $\mu + \sigma$ or even higher (in some cases, it corresponds to the maximum computation time). Thus, our experiments did not show any heuristic that could reduce the execution time.
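The orderings compared in this experiment (one increasing-size ordering plus a batch of random ones) can be generated as in the sketch below. The names are hypothetical: `triple_counts` stands for a mapping from each resource to $|T_r|$, and the fixed seed is only there to make the illustration reproducible.

```python
import random
from typing import Dict, List


def increasing_size_order(triple_counts: Dict[str, int]) -> List[str]:
    """Order resources by the number of triples describing them.
    Ties are broken by resource name to keep the result deterministic."""
    return sorted(triple_counts, key=lambda r: (triple_counts[r], r))


def random_orderings(resources: List[str], how_many: int, seed: int = 0) -> List[List[str]]:
    """Return how_many random permutations of the given resources."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(how_many):
        order = list(resources)
        rng.shuffle(order)
        orderings.append(order)
    return orderings


# Hypothetical resources and triple counts.
counts = {"ex:r_a": 40, "ex:r_b": 12, "ex:r_c": 25}
special_case = increasing_size_order(counts)
baseline = random_orderings(list(counts), how_many=100)
```

Timing the CS computation on `special_case` and on each element of `baseline` gives the kind of comparison plotted in Figure 4.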

4.4. Final Answers to Research Questions

Given the above results, we can answer the research questions posed in Section 2 (repeated in italic below for convenience) as follows:

RQ1: Does the computation of the CS always converge when changing the sequence of resources incrementally added to the CS? Yes, independently of the sequence in which resources are added to the CS, the CS converges to the same information, which is represented as logically equivalent, although possibly syntactically different, RDF graphs.
RQ2: How quickly (depending on the number of resources added to the CS) does the incremental computation of a CS of a given cluster converge? The rate of convergence to a final CS may vary widely; the experiments reveal that the size of the CS generally decreases as resources are added, but not monotonically.
RQ3: How much do the different choices regarding the next included resource influence the convergence, and are there simple heuristics that can be used to choose the initial pair and the next resource? It appears that the heuristic of choosing the resource with the minimum number of triples as the next one does not pay off in real datasets. For two real datasets, we showed that the patterns underlying the theoretically proven exponential worst cases were not present.

5. Discussion

We proposed an optimized algorithm for the computation of a common subsumer of clusters of RDF resources. This optimization allows for clusters with up to about 1800 resources to be processed.

The performed experiments show that, independently of the chosen ordering of resources, the computed CS always converges to a final set of triples of the same size, which is logically equivalent to the CS corresponding to any other ordering. The incremental addition of resources to the computation causes the number of CS triples to decrease, although not monotonically.

The rate of convergence to the final CS varies a lot depending on the ordering of resources, and the experiments do not reveal any heuristics that could make this convergence faster. Future work will investigate possible heuristics that could be used to speed up the CS computation by choosing the order of resources.

We also notice that the execution times are much higher in Drugbank than in TheyBuyForYou, despite the emptiness of the CSs of all Drugbank clusters. In other words, the algorithm counterintuitively takes a longer time to collect less information. By analyzing the resources involved in the computation, we found that drugs do not follow a modeling schema in Drugbank and are modelled according to different patterns. On the contrary, every contracting process in TheyBuyForYou follows the same description schema, which seems to ease the common subsumer calculations.

In our experiments, we computed the Common Subsumers of clusters returned by k-means clustering, applied to a numerical representation of RDF resources that uses vector embedding techniques. By applying state-of-the-art methods (e.g., the elbow method) to determine the optimal number of clusters k, we obtained a small number (10 in TheyBuyForYou and 6 in Drugbank) of large clusters (up to 400 resources in TheyBuyForYou and up to 1800 in Drugbank). As a consequence, their CSs contain very few triples (with Drugbank even yielding empty CSs): the larger the cluster, the lower the number of features shared by all cluster items. Such a result suggests different possible corrective solutions. First, one can assume bigger values for the optimal number of clusters k by questioning the optimality of the elbow method, as was recently proposed by other researchers [38]. Also, the poorness of the shared information that is returned may depend on the loss of information in the embedding process; thus, techniques to make embedding more conservative in terms of information may be investigated. In particular, in an interactive clustering mechanism [21], new walking strategies in the original RDF graph may be proposed based on the end users’ evaluation of the importance of returned commonalities. In other words, users can evaluate the human-readable explanation derived from the computed CS, and this feedback is used to train the walking strategy at the basis of the embeddings.

Author Contributions

Conceptualization, S.C., E.D.S. and F.M.D.; methodology, S.C., E.D.S. and F.M.D.; software, S.C. and F.M.D.; validation, S.C. and F.M.D.; formal analysis, S.C. and F.M.D.; investigation, S.C., E.D.S. and F.M.D.; resources, S.C., E.D.S. and F.M.D.; data curation, S.C. and F.M.D.; writing—original draft preparation, S.C., E.D.S. and F.M.D.; writing—review and editing, S.C., E.D.S. and F.M.D.; visualization, S.C. and F.M.D.; supervision, S.C., E.D.S. and F.M.D.; project administration, S.C., E.D.S. and F.M.D.; funding acquisition, E.D.S. and F.M.D. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge support by project “LIFE: the itaLian system wIde Frailty nEtwork” funded by the Ministry of Health (CUP D93C22000640001).

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets we used in our experiment are publicly available (links checked on 17 October 2024): 1. Drugbank: https://download.bio2rdf.org/files/current/drugbank/drugbank.html; 2. TheyBuyForYou: https://tbfy.github.io/data/.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RDF	Resource Description Framework
LCS	Least Common Subsumer
CS	Common Subsumer

Appendix A

Here, we report the results of the experiments run in the Drugbank dataset [39], which are useful to the discussion in this paper. DrugBank is a bioinformatics and chemoinformatics resource that combines detailed drug data (i.e., chemical, pharmacological, and pharmaceutical) with comprehensive drug target information (i.e., sequence, structure, and pathway).

In our experiments, we used the RDF representation of Drugbank (https://download.bio2rdf.org/files/current/drugbank/drugbank.html, accessed on 17 October 2024) as a data source. We applied the k-means clustering algorithm to all drugs in Drugbank (7670 resources) after converting their RDF knowledge graph descriptions into embeddings using the pyRDF2Vec library. The analysis of the optimal k suggests $k = 6$. However, our experiments reveal that most clusters have an empty CS; thus, the clustered resources seem to share no features. Such a result suggests a re-thinking of the embedding process, focused on preserving the informative content held by resources and modeled in the original knowledge graph. Interactive clustering, suggested as a future research direction, may contribute to this focus.

Independently of the quality of clustering results, we used this experiment to check the behaviour of Algorithm 2. In Figure A1, we show the number of triples in the CS of Cluster 1 (1438 resources) as a function of the number of resources considered in the computation.

Figure A1. Drugbank Dataset. Graphical representation of the convergence of the number of triples in the CS with respect to the number of resources considered in the computation (maximum cluster dimension 1438). The chart refers to Cluster 1 and includes 100 lines, one for each tested random permutation of resources. For each line, once the number of triples in the CS decreases to 0, it does not increase anymore (all lines collapse on the x axis).

The chart includes 100 lines, which all converge to a number of triples in the CS that is equal to 0. Intuitively, no logical convergence needs to be proved here. In this experiment, the convergence rate varies a lot in the 100 tests: some tests rapidly reach the final empty set, while others analyze almost all resources in the set (see the orange line in Figure A1, converging at resource 1323, as an example).

In Figure A2, we report the execution times of each of the 100 tests performed in our experiment. The reader may notice that the mean computation time is much higher in the Drugbank dataset than in TheyBuyForYou, despite the emptiness of the returned set of triples. In other words, computing a CS, although empty, in Drugbank takes much more time than computing a CS populated by some triples in TheyBuyForYou. This is probably due to the lack of a schema for data description in Drugbank, as discussed in Section 5.

Figure A2. Drugbank Dataset. Time needed to compute the CS of resources in Cluster 1. The orange point plots the computation time for the permutation of resources in increasing size order.

Notes

1	https://tbfy.github.io/data/ (accessed on 17 October 2024).
2	https://download.bio2rdf.org/files/current/drugbank/drugbank.html (accessed on 17 October 2024).
3	In this presentation, we follow the W3C recommendation of RDF 1.1 of 2014 [23]. A new version for RDF 1.2 is on the way [24], but it is still a working draft as of July 2024.
4	https://www.w3.org/TR/2014/REC-turtle-20140225/.
5	Note that this interpretation requires that when RDF files are merged, name conflicts in blank nodes must be standardized and separated. This aspect is carefully discussed in the W3C recommendations.
6	https://pyrdf2vec.readthedocs.io/en/latest/index.html, accessed on 17 October 2024.

References

1	Zhou, L.; Du, G.; Lü, K.; Wang, L.; Du, J. A Survey and an Empirical Evaluation of Multi-View Clustering Approaches. ACM Comput. Surv. 2024, 56, 1–38.
2	Demidova, L.A.; Sovietov, P.N.; Andrianova, E.G.; Demidova, A.A. Anomaly Detection in Student Activity in Solving Unique Programming Exercises: Motivated Students against Suspicious Ones. Data 2023, 8, 129.
3	Hilal, W.; Gadsden, S.A.; Yawney, J. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances. Expert Syst. Appl. 2022, 193, 116429.
4	He, X.; Liu, S.; Keung, J.; He, J. Co-clustering for Federated Recommender System. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; Association for Computing Machinery: New York, NY, USA, 2024. WWW ’24. pp. 3821–3832.
5	Kim, M.; Liu, F.; Jain, A.K.; Liu, X. Cluster and Aggregate: Face Recognition with Large Probe Set. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 36054–36066.
6	Cozzolino, I.; Ferraro, M.B. Document clustering. Wiley Interdiscip. Rev. Comput. Stat. 2022, 14, e1588.
7	Oyelade, J.; Isewon, I.; Oladipupo, F.; Aromolaran, O.; Uwoghiren, E.; Ameh, F.; Achas, M.; Adebiyi, E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform. Biol. Insights 2016, 10, BBI.S38316.
8	Valle, M.A.; Ruz, G.A. Finding Hierarchical Structures of Disordered Systems: An Application for Market Basket Analysis. IEEE Access 2021, 9, 1626–1641.
9	Tabianan, K.; Velu, S.; Ravi, V. K-Means Clustering Approach for Intelligent Customer Segmentation Using Customer Purchase Behavior Data. Sustainability 2022, 14, 7243.
10	Škrjanc, I.; Andonovski, G.; Iglesias, J.A.; Sesmero, M.P.; Sanchis, A. Evolving Gaussian on-line clustering in social network analysis. Expert Syst. Appl. 2022, 207, 117881.
11	Das, S.; Nayak, S.P.; Sahoo, B.; Nayak, S.C. Machine Learning in Healthcare Analytics: A State-of-the-Art Review. Arch. Comput. Methods Eng. 2024, 31, 3923–3962.
12	Xiao, H.; Chen, Y.; Shi, X. Knowledge Graph Embedding Based on Multi-View Clustering Framework. IEEE Trans. Knowl. Data Eng. 2021, 33, 585–596.
13	Bamatraf, S.A.; BinThalab, R.A. Clustering RDF data using K-medoids. In Proceedings of the 2019 First International Conference of Intelligent Computing and Engineering (ICOICE), Hadhramout, Yemen, 15–16 December 2019; pp. 1–8.
14	Aluç, G.; Özsu, M.T.; Daudjee, K. Building self-clustering RDF databases using Tunable-LSH. VLDB J. 2019, 28, 173–195.
15	Guo, X.; Gao, H.; Zou, Z. WISE: Workload-Aware Partitioning for RDF Systems. Big Data Res. 2020, 22, 100161.
16	Bandyapadhyay, S.; Fomin, F.V.; Golovach, P.A.; Lochet, W.; Purohit, N.; Simonov, K. How to find a good explanation for clustering? Artif. Intell. 2023, 322, 103948.
17	Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38.
18	Colucci, S.; Donini, F.M.; Di Sciascio, E. Explaining Commonalities of Clusters of RDF Resources in Natural Language. In Foundations of Intelligent Systems, Proceedings of the 27th International Symposium, ISMIS 2024, Poitiers, France, 17–19 June 2024; Lecture Notes in Computer Science; Appice, A., Azzag, H., Hacid, M., Hadjali, A., Ras, Z.W., Eds.; Springer: Berlin/Heidelberg, Germany, 2024; Volume 14670, pp. 160–169.
19	Colucci, S.; Donini, F.M.; Giannini, S.; Di Sciascio, E. Defining and computing Least Common Subsumers in RDF. Web Semant. Sci. Serv. Agents World Wide Web 2016, 39, 62–80.
20	Colucci, S.; Donini, F.M.; Di Sciascio, E. On the Relevance of Explanation for RDF Resources Similarity. In Model-Driven Organizational and Business Agility, Proceedings of the Third International Workshop, MOBA 2023, Zaragoza, Spain, 12–13 June 2023; Springer: Berlin/Heidelberg, Germany, 2023; Volume 488, pp. 96–107.
21	Bae, J.; Helldin, T.; Riveiro, M.; Nowaczyk, S.; Bouguelia, M.R.; Falkman, G. Interactive clustering: A comprehensive review. ACM Comput. Surv. (CSUR) 2020, 53, 1–39.
22	Colucci, S.; Donini, F.M.; Di Sciascio, E. A review of reasoning characteristics of RDF-based Semantic Web systems. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2024, 14, e1537.
23	Cyganiak, R.; Wood, D.; Lanthaler, M. RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, 2014.
24	Hartig, O.; Champin, P.A.; Kellogg, G.; Seaborne, A. RDF 1.2 Concepts and Abstract Syntax, W3C Working Draft, 2024.
25	Patel-Schneider, P.; Arndt, D.; Haudebourg, T. RDF 1.2 Semantics, W3C Recommendation, 2023.
26	Colucci, S.; Donini, F.M.; Di Sciascio, E. Common Subsumers in RDF. In AI*IA-2013: Advances in Artificial Intelligence, Proceedings of the XIIIth International Conference of the Italian Association for Artificial Intelligence, Turin, Italy, 4–6 December 2013; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8249, pp. 348–359.
27	Amendola, G.; Manna, M.; Ricioppo, A. A logic-based framework for characterizing nexus of similarity within knowledge bases. Inf. Sci. 2024, 664, 120331.
28	Colucci, S.; Donini, F.M.; Sciascio, E.D. Logical comparison over RDF resources in bio-informatics. J. Biomed. Inform. 2017, 76, 87–101.
29	Cohen, W.W.; Borgida, A.; Hirsh, H. Computing Least Common Subsumers in Description Logics. In Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; Swartout, W.R., Ed.; AAAI Press/The MIT Press: Cambridge, MA, USA, 1992; pp. 754–760.
30	Baader, F.; Küsters, R.; Molitor, R. Computing least common subsumers in description logics with existential restrictions. IJCAI 1999, 99, 96–101.
31	Rubinstein, R.Y. Simulation and the Monte Carlo Method, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1981.
32	Jain, A.K.; Dubes, R.C. Algorithms for Clustering Data; Prentice-Hall, Inc.: Upper Saddle River, NJ, USA, 1988.
33	Soylu, A.; Corcho, O.; Elvesater, B.; Badenes-Olmedo, C.; Blount, T.; Yedro Martinez, F.; Kovacic, M.; Posinkovic, M.; Makgill, I.; Taggart, C.; et al. TheyBuyForYou platform and knowledge graph: Expanding horizons in public procurement with open linked data. Semant. Web 2022, 13, 265–291.
34	Soylu, A.; Elvesæter, B.; Turk, P.; Roman, D.; Corcho, O.; Simperl, E.; Konstantinidis, G.; Lech, T.C. Towards an Ontology for Public Procurement Based on the Open Contracting Data Standard. In Digital Transformation for a Sustainable Society in the 21st Century, Proceedings of the 18th IFIP WG 6.11 Conference on e-Business, e-Services, and e-Society, I3E 2019, Trondheim, Norway, 18–20 September 2019; Springer: Berlin/Heidelberg, Germany, 2019.
35	Ristoski, P.; Rosati, J.; Noia, T.D.; Leone, R.D.; Paulheim, H. RDF2Vec: RDF graph embeddings and their applications. Semant. Web 2019, 10, 721–752.
36	Marutho, D.; Hendra Handaka, S.; Wijaya, E.; Muljono. The Determination of Cluster Number at k-Mean Using Elbow Method and Purity Evaluation on Headline News. In Proceedings of the 2018 International Seminar on Application for Technology of Information and Communication, Semarang, Indonesia, 21–22 September 2018; pp. 533–538.
37	Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
38	Schubert, E. Stop using the elbow criterion for k-means and how to choose the number of clusters instead. SIGKDD Explor. Newsl. 2023, 25, 36–42.
39	Wishart, D.S.; Knox, C.; Guo, A.C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–D906.

Figure 2. Each chart refers to one cluster in TheyBuyForYou and includes 100 lines, one for each tested random ordering of resources. The first chart (upper left) is enlarged in Figure 1. The other nine charts show similar convergence behaviour.

Figure 3. Convergence of the number of triples in the CS of Cluster 1 with respect to the number of resources considered in the computation. The chart refers to one random ordering of resources.

Figure 4. Time needed to compute the CS of the resources in Cluster 3, the cluster with the maximum mean computation time, in 100 different orderings. The orange point plots the computation time for ordering resources according to increasing size.

Table 1. Comparison of the performance when computing a CS using Algorithm 1, presented here, and the previously known algorithm $FindLCS$ [19]. Table data refer to a cluster of 13 items (all contracting processes in TheyBuyForYou) and show how the two algorithms perform by progressively adding resources (one per row) to the CS. Performance is measured in terms of the size of the triple set returned as CS (CS size); the number of blank nodes in the CS (#Blank Nodes); the number of triples computed and discarded from the results, the so-called uninformative triples (#Uninf. Triples); and the execution time (Exec. Time). With more than 13 resources, the previous algorithm $FindLCS$ [19] reaches a 2 h timeout.

	Algorithm $Find_ReducedCS$ (1)				Algorithm $FindLCS$ [19]
#Resources	CS Size	#Blank Nodes	#Uninf. Triples	Exec. Time (s)	CS Size	#Blank Nodes	#Uninf. Triples	Exec. Time (s)
2	17	9	639	12.79	37	19	639	12.63
3	14	6	169	4.79	36	19	324	9.19
4	14	6	138	3.83	36	19	302	8.40
5	14	6	276	6.00	232	119	589	13.30
6	14	6	138	3.88	232	119	1773	64.56
7	16	9	278	6.18	904	463	4330	112.75
8	14	6	153	4.46	904	463	6813	327.37
9	14	7	176	5.01	960	499	10,222	389.53
10	16	9	263	5.86	3816	1951	17,427	682
11	14	7	190	5.34	4264	2207	40,658	4035.52
12	14	7	168	4.74	5160	2719	44,242	3272.60
13	16	9	175	4.09	18,600	9455	48,168	5538.85

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
