Computing the Commonalities of Clusters in Resource Description Framework: Computational Aspects
Abstract
1. Introduction and Motivation
1.1. Previous Work on Cluster Explanations
- The logic substrate that RDF relies on is exploited to compute the most specific RDF graph that is common to all resources in the cluster, known as the Least Common Subsumer (LCS) [19]; this phase makes use of blank nodes in RDF, which are existential variables that can abstract—like placeholders, but with a logical semantics—the single values on which the resources differ. However, although the LCS is logically complete, it is full of irrelevant details [20];
- A Common Subsumer (CS) is computed, an RDF structure which is a generalization of the LCS—so, logically, a CS is not the most specific description of the cluster—but still specific enough to capture the relevant features common to all resources;
- This structure is used to generate a phrase in constrained, but plain English, with the original idea of using English pronouns (that, which) to verbalize blank nodes in relative sentences. As an example, the cluster of contracting processes addressed in Figure 1 in Section 4 may be explained by the following sentence, generated by a verbalizer module [18], fed by a CS composed of six RDF triples:
1.2. Contribution of This Paper
- We propose an optimized algorithm for computing the CS of a cluster, which scales to cluster sizes that had not been reached before;
- We validate this computation through extensive experimentation on two very different real datasets;
- We collect and analyze data on the experiments and discuss the computational properties of the implementation, such as the convergence, expected runtime, and possible heuristics.
1.3. Outline of the Paper
2. RDF and Common Subsumers: Background Notions
2.1. Background of RDF Syntax and Simple Entailment
- @prefix ns2: <http://bio2rdf.org/drugbank_vocabulary:> .
- @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
- ns2:Humans-and-other-mammals rdf:type ns2:Affected-organism .
ex:a ex:r _:x . | and | _:x ex:q ex:b . |
ex:a ex:p ex:b . | and | ex:p ex:q ex:d . |
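Simple entailment in the presence of blank nodes can be illustrated with a small brute-force check. The sketch below is not taken from the paper: it assumes triples are plain Python tuples of strings, that blank nodes are the terms starting with `_:`, and it relies on the standard characterization that a graph G simply entails a graph H when some assignment of H's blank nodes to terms of G turns H into a subset of G. The function name `simply_entails` and the resource `ex:c` are hypothetical.

```python
from itertools import product


def is_blank(term):
    """Blank nodes are written with the '_:' prefix, as in Turtle."""
    return term.startswith("_:")


def simply_entails(g, h):
    """Brute-force test: does graph g simply entail graph h?

    Tries every assignment of h's blank nodes to terms occurring in g
    and checks whether the instantiated h is a subset of g.
    Exponential in the number of blank nodes of h; for illustration only.
    """
    g = set(g)
    blanks = sorted({t for triple in h for t in triple if is_blank(t)})
    terms = sorted({t for triple in g for t in triple})
    for values in product(terms, repeat=len(blanks)):
        mapping = dict(zip(blanks, values))
        instance = {tuple(mapping.get(t, t) for t in triple) for triple in h}
        if instance <= g:
            return True
    return False


# Hypothetical ground graph: ex:c is an illustrative resource.
g = {("ex:a", "ex:r", "ex:c"), ("ex:c", "ex:q", "ex:b")}
# Generalization in which the blank node _:x abstracts ex:c.
h = {("ex:a", "ex:r", "_:x"), ("_:x", "ex:q", "ex:b")}

print(simply_entails(g, h))
print(simply_entails(h, g))
```

The first call succeeds because `_:x` can be mapped to `ex:c`; the second fails because the ground graph mentions `ex:c`, which the generalized graph does not contain.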
2.2. Background of Common Subsumers in RDF
- P1:
- for all ;
- P2:
- Any other r-graph with this Property (P1) is logically equivalent to .
- Piroxicam: http://bio2rdf.org/drugbank:DB00554;
- Tolterodine: http://bio2rdf.org/drugbank:DB01036.
- @prefix ns1: <http://bio2rdf.org/drugbank:> .
- @prefix ns2: <http://bio2rdf.org/drugbank_vocabulary:> .
- @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
- CSP1:
- idempotent: ;
- CSP2:
- commutative: ;
- CSP3:
- associative: .
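For orientation, the three properties can be written in a generic form. The notation below is an assumption and not necessarily the one used in the paper: $a$, $b$, $c$ stand for the descriptions of three resources, $CS(\cdot,\cdot)$ for the Common Subsumer computed for two of them, and $\equiv$ for logical equivalence of the resulting RDF graphs.

```latex
\begin{align*}
\text{(CSP1) idempotent:}  \quad & CS(a, a) \equiv a \\
\text{(CSP2) commutative:} \quad & CS(a, b) \equiv CS(b, a) \\
\text{(CSP3) associative:} \quad & CS(a, CS(b, c)) \equiv CS(CS(a, b), c)
\end{align*}
```

Read this way, commutativity and associativity together are what allow the CS of a whole cluster to be computed pairwise, in any order of the resources.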
3. Computation Methodology and Analysis
- We present an algorithm computing the CS of two resources, which builds on a previously published one, but with a crucial optimization that reduces the size of the resulting CS—it is still an RDF graph, with blank nodes and far fewer triples than the output of the original algorithm.
- Exploiting associativity, we iterate the above algorithm: at each step, we compute the CS of a “running” CS (initially the CS of a pair of resources) and the next resource, as the associativity property (CSP3) suggests.
- The time needed to compute the CS of a complete cluster, and the size of the (equivalent form of the) resulting CS, may vary with the order in which the resources are given. To estimate the expected size of a CS of an entire cluster and the time needed to compute it, we set up a Monte Carlo method, which probes only a random fraction of all the possible orderings, each of which could be used to compute the CS incrementally. We treat the increasing-size heuristic (see Section 3.3 below) as a special trial, and compare its size and time with those of the other trials.
- RQ1:
- Does the computation of the CS always converge to one size when changing the order of resources incrementally added to the CS?
- RQ2:
- How quickly (depending on the number of resources added to the CS) does the incremental computation of a CS of a given cluster converge?
- RQ3:
- How much do the different choices for the next resource to include influence the convergence, and are there simple heuristics that can be used to choose the initial pair and the next resource?
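The incremental scheme and the Monte Carlo probing of orderings described above can be sketched as follows. This is a minimal illustration under strong assumptions, not the authors' implementation: each resource is reduced to a `frozenset` of (property, value) pairs, and the hypothetical function `toy_cs` (plain set intersection) stands in for Algorithm 1. Unlike a real CS, this stand-in introduces no blank nodes, so its running size can only shrink and it cannot reproduce non-monotonic behavior. All names and data are illustrative.

```python
import random
import time


def toy_cs(a, b):
    """Hypothetical stand-in for Algorithm 1: keep the shared (property, value) pairs."""
    return a & b


def incremental_cs(ordered):
    """Fold toy_cs over the resources in the given order, recording running sizes."""
    running = toy_cs(ordered[0], ordered[1])
    sizes = [len(running)]
    for nxt in ordered[2:]:
        running = toy_cs(running, nxt)
        sizes.append(len(running))
    return running, sizes


def monte_carlo(resources, trials, seed=0):
    """Probe `trials` random orderings; return (final CS, sizes, seconds) per trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        order = list(resources)
        rng.shuffle(order)
        start = time.perf_counter()
        final, sizes = incremental_cs(order)
        results.append((final, sizes, time.perf_counter() - start))
    return results


# Illustrative cluster of four resources (made-up data).
resources = [
    frozenset({("p", "1"), ("q", "2"), ("r", "3")}),
    frozenset({("p", "1"), ("q", "2")}),
    frozenset({("p", "1"), ("q", "9"), ("s", "4")}),
    frozenset({("p", "1"), ("t", "5")}),
]

results = monte_carlo(resources, trials=5)

# Increasing-size heuristic as a special trial: smallest descriptions first.
heuristic_final, heuristic_sizes = incremental_cs(sorted(resources, key=len))

for final, sizes, seconds in results:
    print(sorted(final), sizes)
print(sorted(heuristic_final), heuristic_sizes)
```

With an associative and commutative operator, every trial ends in the same final result, while the sequence of intermediate sizes depends on the ordering; the per-trial times and size sequences are the quantities a Monte Carlo estimate would average.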
3.1. An Improved Algorithm for the CS of Two Resources (Algorithm 1)
Algorithm 1: Computing a CS of two resources. The optimization discussed in the text is highlighted in a box.
3.2. Computing the CS of a Cluster of Resources (Algorithm 2)
Algorithm 2: Computing a CS of a cluster of resources by iteratively using Algorithm 1.
- CS size: size of the triple set returned as CS;
- #Blank Nodes: number of blank nodes in the CS;
- #Uninf. Triples: number of triples computed and discarded from the results, according to Lines 16–18 in Algorithm 1 and Formula (9) in Algorithm [19]—the so-called uninformative triples;
- Exec. Time: execution time of the iteration step.
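The four indicators above could be collected per iteration step along the following lines. This is only a sketch under assumptions: triples are tuples of strings, blank nodes are the terms prefixed with `_:`, and `step` is a hypothetical callable that returns the CS triples together with the number of triples it discarded as uninformative. None of these names come from the paper.

```python
import time
from dataclasses import dataclass


@dataclass
class StepStats:
    cs_size: int         # number of triples returned as CS
    blank_nodes: int     # number of distinct blank nodes in the CS
    uninformative: int   # triples computed and then discarded
    exec_time: float     # wall-clock seconds for this iteration step


def count_blank_nodes(triples):
    """Count distinct terms written with the '_:' prefix."""
    return len({t for triple in triples for t in triple if t.startswith("_:")})


def timed_step(step, *args):
    """Run one iteration step and gather the four indicators.

    `step` is a hypothetical callable returning (cs_triples, n_discarded).
    """
    start = time.perf_counter()
    cs, discarded = step(*args)
    elapsed = time.perf_counter() - start
    return cs, StepStats(len(cs), count_blank_nodes(cs), discarded, elapsed)


def dummy_step():
    """Illustrative step: a two-triple CS with one blank node, three discarded triples."""
    return {("_:x", "ex:p", "ex:a"), ("_:x", "ex:q", "ex:b")}, 3


cs, stats = timed_step(dummy_step)
print(stats)
```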
3.3. Expected Size of the Final CS, and Overall Runtime of Algorithm 2
4. Results
- The sequence of Common Subsumers that was progressively computed;
- The sequence of sizes of the CS that was progressively computed;
- The overall runtime of Algorithm 2 in that trial.
4.1. Logical Convergence
4.2. Dependency of the Rate of Convergence on the Order of Added Resources
4.3. Analysis of Computation Time
4.4. Final Answers to Research Questions
- RQ1:
- Does the computation of the CS always converge when changing the sequence of resources incrementally added to the CS? Yes, independently of the sequence in which resources are added to the CS, the CS converges to the same information, which is represented as logically equivalent, although possibly syntactically different, RDF graphs.
- RQ2:
- How quickly (depending on the number of resources added to the CS) does the incremental computation of a CS of a given cluster converge? The rate of convergence to a final CS may vary widely; the experiments reveal that the size of the CS generally decreases as resources are added, but not monotonically.
- RQ3:
- How much do the different choices for the next resource to include influence the convergence, and are there simple heuristics that can be used to choose the initial pair and the next resource? It appears that the heuristic of choosing the resource with the minimum number of triples as the next one does not pay off in real datasets. For the two real datasets, we verified that the patterns of the cases theoretically proved to be the exponential worst cases were not present.
5. Discussion
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
RDF | Resource Description Framework |
LCS | Least Common Subsumer |
CS | Common Subsumer |
Appendix A
1 | https://tbfy.github.io/data/ (accessed on 17 October 2024). |
2 | https://download.bio2rdf.org/files/current/drugbank/drugbank.html (accessed on 17 October 2024). |
3 | |
4 | |
5 | Note that this interpretation requires that when RDF files are merged, name conflicts in blank nodes must be standardized and separated. This aspect is carefully discussed in the W3C recommendations. |
6 | https://pyrdf2vec.readthedocs.io/en/latest/index.html (accessed on 17 October 2024). |
References
- Zhou, L.; Du, G.; Lü, K.; Wang, L.; Du, J. A Survey and an Empirical Evaluation of Multi-View Clustering Approaches. ACM Comput. Surv. 2024, 56, 1–38.
- Demidova, L.A.; Sovietov, P.N.; Andrianova, E.G.; Demidova, A.A. Anomaly Detection in Student Activity in Solving Unique Programming Exercises: Motivated Students against Suspicious Ones. Data 2023, 8, 129.
- Hilal, W.; Gadsden, S.A.; Yawney, J. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances. Expert Syst. Appl. 2022, 193, 116429.
- He, X.; Liu, S.; Keung, J.; He, J. Co-clustering for Federated Recommender System. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; Association for Computing Machinery: New York, NY, USA, 2024. WWW ’24. pp. 3821–3832.
- Kim, M.; Liu, F.; Jain, A.K.; Liu, X. Cluster and Aggregate: Face Recognition with Large Probe Set. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 36054–36066.
- Cozzolino, I.; Ferraro, M.B. Document clustering. Wiley Interdiscip. Rev. Comput. Stat. 2022, 14, e1588.
- Oyelade, J.; Isewon, I.; Oladipupo, F.; Aromolaran, O.; Uwoghiren, E.; Ameh, F.; Achas, M.; Adebiyi, E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform. Biol. Insights 2016, 10, BBI.S38316.
- Valle, M.A.; Ruz, G.A. Finding Hierarchical Structures of Disordered Systems: An Application for Market Basket Analysis. IEEE Access 2021, 9, 1626–1641.
- Tabianan, K.; Velu, S.; Ravi, V. K-Means Clustering Approach for Intelligent Customer Segmentation Using Customer Purchase Behavior Data. Sustainability 2022, 14, 7243.
- Škrjanc, I.; Andonovski, G.; Iglesias, J.A.; Sesmero, M.P.; Sanchis, A. Evolving Gaussian on-line clustering in social network analysis. Expert Syst. Appl. 2022, 207, 117881.
- Das, S.; Nayak, S.P.; Sahoo, B.; Nayak, S.C. Machine Learning in Healthcare Analytics: A State-of-the-Art Review. Arch. Comput. Methods Eng. 2024, 31, 3923–3962.
- Xiao, H.; Chen, Y.; Shi, X. Knowledge Graph Embedding Based on Multi-View Clustering Framework. IEEE Trans. Knowl. Data Eng. 2021, 33, 585–596.
- Bamatraf, S.A.; BinThalab, R.A. Clustering RDF data using K-medoids. In Proceedings of the 2019 First International Conference of Intelligent Computing and Engineering (ICOICE), Hadhramout, Yemen, 15–16 December 2019; pp. 1–8.
- Aluç, G.; Özsu, M.T.; Daudjee, K. Building self-clustering RDF databases using Tunable-LSH. VLDB J. 2019, 28, 173–195.
- Guo, X.; Gao, H.; Zou, Z. WISE: Workload-Aware Partitioning for RDF Systems. Big Data Res. 2020, 22, 100161.
- Bandyapadhyay, S.; Fomin, F.V.; Golovach, P.A.; Lochet, W.; Purohit, N.; Simonov, K. How to find a good explanation for clustering? Artif. Intell. 2023, 322, 103948.
- Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38.
- Colucci, S.; Donini, F.M.; Di Sciascio, E. Explaining Commonalities of Clusters of RDF Resources in Natural Language. In Foundations of Intelligent Systems, Proceedings of the 27th International Symposium, ISMIS 2024, Poitiers, France, 17–19 June 2024; Lecture Notes in Computer Science; Appice, A., Azzag, H., Hacid, M., Hadjali, A., Ras, Z.W., Eds.; Springer: Berlin/Heidelberg, Germany, 2024; Volume 14670, pp. 160–169.
- Colucci, S.; Donini, F.M.; Giannini, S.; Di Sciascio, E. Defining and computing Least Common Subsumers in RDF. Web Semant. Sci. Serv. Agents World Wide Web 2016, 39, 62–80.
- Colucci, S.; Donini, F.M.; Di Sciascio, E. On the Relevance of Explanation for RDF Resources Similarity. In Model-Driven Organizational and Business Agility, Proceedings of the Third International Workshop, MOBA 2023, Zaragoza, Spain, 12–13 June 2023; Springer: Berlin/Heidelberg, Germany, 2023; Volume 488, pp. 96–107.
- Bae, J.; Helldin, T.; Riveiro, M.; Nowaczyk, S.; Bouguelia, M.R.; Falkman, G. Interactive clustering: A comprehensive review. ACM Comput. Surv. (CSUR) 2020, 53, 1–39.
- Colucci, S.; Donini, F.M.; Di Sciascio, E. A review of reasoning characteristics of RDF-based Semantic Web systems. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2024, 14, e1537.
- Cyganiak, R.; Wood, D.; Lanthaler, M. RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, 2014.
- Hartig, O.; Champin, P.A.; Kellogg, G.; Seaborne, A. RDF 1.2 Concepts and Abstract Syntax, W3C Working Draft, 2024.
- Patel-Schneider, P.; Arndt, D.; Haudebourg, T. RDF 1.2 Semantics, W3C Recommendation, 2023.
- Colucci, S.; Donini, F.M.; Di Sciascio, E. Common Subsumbers in RDF. In AI*IA-2013: Advances in Artificial Intelligence, Proceedings of the XIIIth International Conference of the Italian Association for Artificial Intelligence, Turin, Italy, 4–6 December 2013; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8249, pp. 348–359.
- Amendola, G.; Manna, M.; Ricioppo, A. A logic-based framework for characterizing nexus of similarity within knowledge bases. Inf. Sci. 2024, 664, 120331.
- Colucci, S.; Donini, F.M.; Sciascio, E.D. Logical comparison over RDF resources in bio-informatics. J. Biomed. Inform. 2017, 76, 87–101.
- Cohen, W.W.; Borgida, A.; Hirsh, H. Computing Least Common Subsumers in Description Logics. In Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; Swartout, W.R., Ed.; AAAI Press/The MIT Press: Cambridge, MA, USA, 1992; pp. 754–760.
- Baader, F.; Küsters, R.; Molitor, R. Computing least common subsumers in description logics with existential restrictions. IJCAI 1999, 99, 96–101.
- Rubinstein, R.Y. Simulation and the Monte Carlo Method, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1981.
- Jain, A.K.; Dubes, R.C. Algorithms for Clustering Data; Prentice-Hall, Inc.: Upper Saddle River, NJ, USA, 1988.
- Soylu, A.; Corcho, O.; Elvesater, B.; Badenes-Olmedo, C.; Blount, T.; Yedro Martinez, F.; Kovacic, M.; Posinkovic, M.; Makgill, I.; Taggart, C.; et al. TheyBuyForYou platform and knowledge graph: Expanding horizons in public procurement with open linked data. Semant. Web 2022, 13, 265–291.
- Soylu, A.; Elvesæter, B.; Turk, P.; Roman, D.; Corcho, O.; Simperl, E.; Konstantinidis, G.; Lech, T.C. Towards an Ontology for Public Procurement Based on the Open Contracting Data Standard. In Digital Transformation for a Sustainable Society in the 21st Century, Proceedings of the 18th IFIP WG 6.11 Conference on e-Business, e-Services, and e-Society, I3E 2019, Trondheim, Norway, 18–20 September 2019; Springer: Berlin/Heidelberg, Germany, 2019.
- Ristoski, P.; Rosati, J.; Noia, T.D.; Leone, R.D.; Paulheim, H. RDF2Vec: RDF graph embeddings and their applications. Semant. Web 2019, 10, 721–752.
- Marutho, D.; Hendra Handaka, S.; Wijaya, E.; Muljono. The Determination of Cluster Number at k-Mean Using Elbow Method and Purity Evaluation on Headline News. In Proceedings of the 2018 International Seminar on Application for Technology of Information and Communication, Semarang, Indonesia, 21–22 September 2018; pp. 533–538.
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
- Schubert, E. Stop using the elbow criterion for k-means and how to choose the number of clusters instead. SIGKDD Explor. Newsl. 2023, 25, 36–42.
- Wishart, D.S.; Knox, C.; Guo, A.C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–D906.
#Resources | Alg. 1: CS Size | Alg. 1: #Blank Nodes | Alg. 1: #Uninf. Triples | Alg. 1: Exec. Time (s) | Alg. [19]: CS Size | Alg. [19]: #Blank Nodes | Alg. [19]: #Uninf. Triples | Alg. [19]: Exec. Time (s) |
---|---|---|---|---|---|---|---|---|
2 | 17 | 9 | 639 | 12.79 | 37 | 19 | 639 | 12.63 |
3 | 14 | 6 | 169 | 4.79 | 36 | 19 | 324 | 9.19 |
4 | 14 | 6 | 138 | 3.83 | 36 | 19 | 302 | 8.40 |
5 | 14 | 6 | 276 | 6.00 | 232 | 119 | 589 | 13.30 |
6 | 14 | 6 | 138 | 3.88 | 232 | 119 | 1773 | 64.56 |
7 | 16 | 9 | 278 | 6.18 | 904 | 463 | 4330 | 112.75 |
8 | 14 | 6 | 153 | 4.46 | 904 | 463 | 6813 | 327.37 |
9 | 14 | 7 | 176 | 5.01 | 960 | 499 | 10,222 | 389.53 |
10 | 16 | 9 | 263 | 5.86 | 3816 | 1951 | 17,427 | 682 |
11 | 14 | 7 | 190 | 5.34 | 4264 | 2207 | 40,658 | 4035.52 |
12 | 14 | 7 | 168 | 4.74 | 5160 | 2719 | 44,242 | 3272.60 |
13 | 16 | 9 | 175 | 4.09 | 18,600 | 9455 | 48,168 | 5538.85 |
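The gap between the two algorithms in the table can be made concrete with a few lines of arithmetic. The sketch below copies the CS sizes and per-step execution times from the rows above; it assumes the rows are consecutive iteration steps of a single run, so that summing the per-step times gives a cumulative runtime. Variable names are illustrative.

```python
# (resources, Alg. 1 CS size, Alg. 1 time in s, Alg. [19] CS size, Alg. [19] time in s)
rows = [
    (2, 17, 12.79, 37, 12.63),
    (3, 14, 4.79, 36, 9.19),
    (4, 14, 3.83, 36, 8.40),
    (5, 14, 6.00, 232, 13.30),
    (6, 14, 3.88, 232, 64.56),
    (7, 16, 6.18, 904, 112.75),
    (8, 14, 4.46, 904, 327.37),
    (9, 14, 5.01, 960, 389.53),
    (10, 16, 5.86, 3816, 682.0),
    (11, 14, 5.34, 4264, 4035.52),
    (12, 14, 4.74, 5160, 3272.60),
    (13, 16, 4.09, 18600, 5538.85),
]

# Cumulative time over all twelve steps for each algorithm.
total_alg1 = sum(r[2] for r in rows)
total_ref = sum(r[4] for r in rows)

# Ratio of the CS sizes at the last step (13 resources).
size_ratio_last = rows[-1][3] / rows[-1][1]

print(f"Alg. 1 total: {total_alg1:.2f} s")
print(f"Alg. [19] total: {total_ref:.2f} s")
print(f"CS size ratio at 13 resources: {size_ratio_last:.1f}")
```

On these rows the step times add up to about 66.97 s for Algorithm 1 and about 14,466.70 s for the algorithm in [19], and at 13 resources the CS of [19] is 1162.5 times larger (18,600 triples against 16).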
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Colucci, S.; Donini, F.M.; Di Sciascio, E. Computing the Commonalities of Clusters in Resource Description Framework: Computational Aspects. Data 2024, 9, 121. https://doi.org/10.3390/data9100121