From Pixels to Insights: Unsupervised Knowledge Graph Generation with Large Language Model
Abstract
1. Introduction
2. Related Work
2.1. Knowledge Graph Construction
2.2. LLMs for Image Understanding
3. Problem Formulation
- Different data source: Unlike existing methods that primarily utilize text or multimodal data, our approach focuses exclusively on leveraging image data for knowledge graph construction.
- Different training method: While many existing methods rely on supervised or semi-supervised approaches, we employ an unsupervised methodology tailored to extracting knowledge directly from images.
- Different construction results: Rather than building one comprehensive knowledge graph from the entire input dataset, our method generates a separate knowledge graph for each individual input, i.e., for a single image.
- Different implementation method: Traditional methods typically involve multistep operations such as entity recognition and relation extraction to piece together a knowledge graph. In contrast, we develop an end-to-end model that directly produces a knowledge graph from image data, streamlining the process.
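The end-to-end formulation above can be made concrete with a minimal sketch: the model consumes a single image and decodes a linearized sequence of (head, relation, tail) triples, which is then parsed back into a graph. The serialization tokens, function names, and example triples below are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of the single-image, end-to-end interface described above.
# A knowledge graph is represented as a list of (head, relation, tail) triples;
# the model is assumed to emit them as one linearized string, parsed back here.

from typing import List, Tuple

Triple = Tuple[str, str, str]

def linearize(triples: List[Triple]) -> str:
    """Serialize triples into a flat text sequence a decoder can emit."""
    return " ".join(f"<S> {h} <P> {r} <O> {t}" for h, r, t in triples)

def parse(output: str) -> List[Triple]:
    """Recover (head, relation, tail) triples from the linearized output."""
    triples = []
    for chunk in output.split("<S>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        head, _, rest = chunk.partition("<P>")
        rel, _, tail = rest.partition("<O>")
        triples.append((head.strip(), rel.strip(), tail.strip()))
    return triples

# Hypothetical graph for one power-grid image (entity names are made up).
graph = [("transformer", "located_in", "substation_A"),
         ("substation_A", "voltage_level", "110kV")]
assert parse(linearize(graph)) == graph  # serialization round-trips
```

Linearizing the graph as plain text is what lets a single sequence-to-sequence pass replace the usual entity-recognition and relation-extraction pipeline.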
4. Method
4.1. Training Data Generation
4.2. Iterative Fine-Tuning
4.3. Filtering Strategies
4.4. Model Discussion
5. Experiments
5.1. Experimental Setup
5.2. Results and Analysis
5.2.1. Overall Performance
5.2.2. Ablation Study
5.2.3. Case Study
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chen, Y.; Ge, X.; Yang, S.; Hu, L.; Li, J.; Zhang, J. A survey on multimodal knowledge graphs: Construction, completion and applications. Mathematics 2023, 11, 1815. [Google Scholar] [CrossRef]
- Zhu, X.; Li, Z.; Wang, X.; Jiang, X.; Sun, P.; Wang, X.; Xiao, Y.; Yuan, N.J. Multi-modal knowledge graph construction and application: A survey. IEEE Trans. Knowl. Data Eng. 2022, 36, 715–735. [Google Scholar] [CrossRef]
- Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Philip, S.Y. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514. [Google Scholar] [CrossRef] [PubMed]
- Zhong, L.; Wu, J.; Li, Q.; Peng, H.; Wu, X. A comprehensive survey on automatic knowledge graph construction. ACM Comput. Surv. 2023, 56, 1–62. [Google Scholar] [CrossRef]
- Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowl. Data Eng. 2024, 36, 3580–3599. [Google Scholar] [CrossRef]
- Xu, Z.; Cruz, M.J.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-augmented generation with knowledge graphs for customer service question answering. In Proceedings of the SIGIR, Washington, DC, USA, 14–18 July 2024; pp. 2905–2909. [Google Scholar]
- Gaur, M.; Gunaratna, K.; Srinivasan, V.; Jin, H. Iseeq: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs. AAAI 2022, 36, 10672–10680. [Google Scholar] [CrossRef]
- Ma, X. Knowledge graph construction and application in geosciences: A review. Comput. Geosci. 2022, 161, 105082. [Google Scholar] [CrossRef]
- Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
- Al-Khatib, K.; Hou, Y.; Wachsmuth, H.; Jochim, C.; Bonin, F.; Stein, B. End-to-end argumentation knowledge graph construction. AAAI 2020, 34, 7367–7374. [Google Scholar] [CrossRef]
- Asprino, L.; Daga, E.; Gangemi, A.; Mulholland, P. Knowledge graph construction with a façade: A unified method to access heterogeneous data sources on the web. TOIT 2023, 23, 1–31. [Google Scholar] [CrossRef]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
- Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv 2023, arXiv:2304.15010. [Google Scholar]
- Zhang, B.; Soh, H. Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction. arXiv 2024, arXiv:2404.03868. [Google Scholar]
- Agarwal, O.; Kale, M.; Ge, H.; Shakeri, S.; Al-Rfou, R. Machine translation aided bilingual data-to-text generation and semantic parsing. In Proceedings of the WebNLG+, Dublin, Ireland, 18 December 2020; pp. 125–130. [Google Scholar]
- Dognin, P.L.; Padhi, I.; Melnyk, I.; Das, P. ReGen: Reinforcement learning for text and knowledge base generation using pretrained language models. arXiv 2021, arXiv:2108.12472. [Google Scholar]
- Melnyk, I.; Dognin, P.; Das, P. Grapher: Multi-stage knowledge graph construction using pretrained language models. In Proceedings of the 2021 NeurIPS Workshop, Online, 14 December 2021. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Wang, M.; Qi, G.; Wang, H.; Zheng, Q. Richpedia: A comprehensive multi-modal knowledge graph. In JIST; Springer: Cham, Switzerland, 2020; pp. 130–145. [Google Scholar]
- Liu, Y.; Li, H.; Garcia-Duran, A.; Niepert, M.; Onoro-Rubio, D.; Rosenblum, D.S. MMKG: Multi-modal knowledge graphs. In Proceedings of the ESWC, Portorož, Slovenia, 2–6 June 2019; pp. 459–474. [Google Scholar]
- Alberts, H.; Huang, N.; Deshpande, Y.; Liu, Y.; Cho, K.; Vania, C.; Calixto, I. VisualSem: A high-quality knowledge graph for vision and language. In Proceedings of the MRL Workshop, Online, 26–27 May 2021; pp. 138–152. [Google Scholar]
- Wang, X.; Meng, B.; Chen, H.; Meng, Y.; Lv, K.; Zhu, W. TIVA-KG: A multimodal knowledge graph with text, image, video and audio. In Proceedings of the MM, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2391–2399. [Google Scholar]
- Huang, S.; Dong, L.; Wang, W.; Hao, Y.; Singhal, S.; Ma, S.; Lv, T.; Cui, L.; Mohammed, O.K.; Patra, B.; et al. Language is not all you need: Aligning perception with language models. NeurIPS 2024, 36, 72096–72109. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the ICML, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. NeurIPS 2024, 36, 61501–61513. [Google Scholar]
- Chen, P.; Liu, S.; Jia, J. Jigsaw clustering for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11526–11535. [Google Scholar]
- Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 69–84. [Google Scholar]
- Ferreira, T.C.; Gardent, C.; Ilinykh, N.; Van Der Lee, C.; Mille, S.; Moussallem, D.; Shimorina, A. The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (WebNLG+ 2020). In Proceedings of the WebNLG+, Dublin, Ireland, 18 December 2020. [Google Scholar]
- Li, R.; Zhang, S.; Wan, B.; He, X. Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 11109–11119. [Google Scholar]
- Yoon, K.; Kim, K.; Moon, J.; Park, C. Unbiased heterogeneous scene graph generation with relation-aware message passing neural network. AAAI Conf. Artif. Intell. 2023, 37, 3285–3294. [Google Scholar] [CrossRef]
- Zellers, R.; Yatskar, M.; Thomson, S.; Choi, Y. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5831–5840. [Google Scholar]
- Yang, X.; Wu, Y.; Yang, M.; Chen, H.; Geng, X. Exploring diverse in-context configurations for image captioning. NeurIPS 2024, 36, 40924–40943. [Google Scholar]
| Datasets | Train | Test |
|---|---|---|
| StateGrid_S | 7000 | 1000 |
| StateGrid_L | 3500 | 500 |
| Model | Match | F1 | Precision | Recall |
|---|---|---|---|---|
| ImageCaption | Exact | 0.678 | 0.678 | 0.680 |
| | Partial | 0.684 | 0.686 | 0.685 |
| | Strict | 0.673 | 0.674 | 0.674 |
| ChatGPT-4o | Exact | 0.442 | 0.438 | 0.449 |
| | Partial | 0.463 | 0.456 | 0.472 |
| | Strict | 0.409 | 0.406 | 0.415 |
| ImgGraph-LLM | Exact | 0.808 | 0.801 | 0.822 |
| | Partial | 0.834 | 0.823 | 0.850 |
| | Strict | 0.807 | 0.801 | 0.815 |
| Model | Match | F1 | Precision | Recall |
|---|---|---|---|---|
| ImageCaption | Exact | 0.675 | 0.676 | 0.679 |
| | Partial | 0.684 | 0.686 | 0.685 |
| | Strict | 0.671 | 0.670 | 0.671 |
| ChatGPT-4o | Exact | 0.440 | 0.437 | 0.448 |
| | Partial | 0.463 | 0.455 | 0.472 |
| | Strict | 0.408 | 0.407 | 0.413 |
| ImgGraph-LLM | Exact | 0.731 | 0.727 | 0.745 |
| | Partial | 0.778 | 0.764 | 0.793 |
| | Strict | 0.732 | 0.726 | 0.749 |
| Model | Match | F1 | Precision | Recall |
|---|---|---|---|---|
| Our | Exact | 0.808 | 0.801 | 0.822 |
| | Partial | 0.834 | 0.823 | 0.850 |
| | Strict | 0.807 | 0.801 | 0.815 |
| Our w/o | Exact | 0.784 | 0.776 | 0.796 |
| | Partial | 0.808 | 0.793 | 0.827 |
| | Strict | 0.779 | 0.769 | 0.785 |
| Our w/o | Exact | 0.754 | 0.741 | 0.764 |
| | Partial | 0.779 | 0.763 | 0.795 |
| | Strict | 0.748 | 0.740 | 0.742 |
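The Exact rows in the tables above correspond to a micro-averaged triple-level precision/recall/F1, where a predicted triple scores only if head, relation, and tail all match a gold triple verbatim; Partial and Strict differ in how element-level credit is assigned. The sketch below implements only the exact-match variant, with made-up predictions; it is an assumption about the scoring, not the paper's evaluation script.

```python
# Sketch of an exact-match triple metric consistent with the tables above:
# micro-averaged precision, recall, and F1 over (head, relation, tail) triples.
from typing import List, Tuple

Triple = Tuple[str, str, str]

def triple_prf(pred: List[Triple], gold: List[Triple]) -> Tuple[float, float, float]:
    """A predicted triple counts as correct only if all three elements
    match a gold triple exactly."""
    tp = len(set(pred) & set(gold))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted graphs for one image.
gold = [("transformer", "located_in", "substation_A"),
        ("substation_A", "voltage_level", "110kV")]
pred = [("transformer", "located_in", "substation_A"),
        ("transformer", "voltage_level", "110kV")]  # second triple is wrong

assert triple_prf(pred, gold) == (0.5, 0.5, 0.5)
```

A partial-match variant would instead give fractional credit per matching element, which is why the Partial rows are consistently the highest of the three.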
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, L.; Chen, Z.; Yang, W.; Liu, S.; Li, Y. From Pixels to Insights: Unsupervised Knowledge Graph Generation with Large Language Model. Information 2025, 16, 335. https://doi.org/10.3390/info16050335