Article

From Pixels to Insights: Unsupervised Knowledge Graph Generation with Large Language Model

1 Department of Electrical Engineering, Tsinghua University, Beijing 100084, China
2 Big Data Center, State Grid Corporation of China, Beijing 100052, China
* Authors to whom correspondence should be addressed.
Information 2025, 16(5), 335; https://doi.org/10.3390/info16050335
Submission received: 27 February 2025 / Revised: 13 April 2025 / Accepted: 15 April 2025 / Published: 22 April 2025

Abstract

The role of image data in knowledge extraction and representation has become increasingly significant. This study introduces a novel methodology, termed Image to Graph via Large Language Model (ImgGraph-LLM), which constructs a knowledge graph for each image in a dataset. Unlike existing methods that rely on text descriptions or multimodal data to build a comprehensive knowledge graph, our approach focuses solely on unlabeled individual image data, representing a distinct form of unsupervised knowledge graph construction. To tackle the challenge of generating a knowledge graph from individual images in an unsupervised manner, we first design two self-supervised operations to generate training data from unlabeled images. We then propose an iterative fine-tuning process that uses this self-supervised information, enabling the fine-tuned LLM to recognize the triplets needed to construct the knowledge graph. To improve the accuracy of triplet extraction, we introduce filtering strategies that effectively remove low-confidence training data. Finally, experiments on two large-scale real-world datasets demonstrate the superiority of our proposed model.

1. Introduction

With the growing popularity of mobile devices, the State Grid Corporation of China has started using them to photograph and document operations in scenarios such as maintenance and emergency repairs. These photos are numerous and lack label information. Currently, the State Grid Corporation of China relies on manual comparison and identification to assess risks in actual operations. For example, Figure 1d reveals that an operator on a power pole is not wearing a safety belt. To identify risks in operational behavior, it is usually necessary for individuals with expert knowledge to compare the complex operational specifications described in business standards and the operational behavior depicted in captured images. To bridge the semantic gap between operational risks in images and business standards, knowledge graphs have become a feasible solution [1,2]. By constructing a knowledge graph of the entities and their relationships depicted in an image, it is possible to compare these graphs with those of images that meet business standards, thereby identifying potential operational risks. For instance, we constructed a knowledge graph for the actual image Figure 1d. By comparing it with the standard operation images in Figure 1c,e, we observe that the entities in Figure 1d are the same as those in the standard operation image in Figure 1c, but the relationship between the entity of the safety belt and the entity of the telegraph pole is missing. Therefore, we can identify the risk of not wearing a safety belt. Potential operational risks include issues such as missing entities or incorrect relationships (e.g., not wearing a safety belt). Thus, we propose an unsupervised task of building a knowledge graph for each unlabeled image in order to improve the efficiency of identifying operational risks, eliminating the need for experts to manually compare images.
Knowledge graphs [3,4,5], as an important tool for describing semantic structures including entities and their relationships, have been widely used in various intelligent applications, such as search engines and question-answering systems. By constructing a network of relationships between entities, knowledge graphs provide support for complex queries and decisions, making information retrieval and processing more efficient and accurate [6,7]. Existing methods of knowledge graph construction (KGC) mainly rely on textual data [4,8], extracting entities and relationships from corpora through text mining techniques, as in Wikidata [9,10]. With the rise of multimodal data, including non-textual data such as images, attention to its information content and expressive power is increasing. Multimodal knowledge graph construction methods [1,2] utilize multiple data types to provide more comprehensive information support for understanding entities and their relationships, for example by attributing a set of images to each entity. Nevertheless, as shown in Figure 1a,b, existing knowledge graph construction approaches [2,10,11] often focus on using a large amount of textual or multimodal data to build a single large and comprehensive knowledge graph, and these methods concentrate on how to model the relationships among multiple data sources. In contrast, as shown in Figure 1c–e, we focus on constructing a knowledge graph for each unlabeled individual datum (each image), which necessitates a detailed examination of the information within a single image. Therefore, existing methods are not suitable for the unsupervised task of constructing an independent knowledge graph for a single data item.
With the development of large language models (LLMs) for image understanding, such as MiniGPT-4 and LLaMA-VID [12,13], which have demonstrated the ability to describe images accurately and in detail, using LLMs to understand images has become a feasible direction for research and application. At the same time, recent research has begun to explore the use of LLMs to assist in building knowledge graphs [5,14], particularly for modeling text data, and has made some progress. However, most mainstream methods still use serialized sequences as inputs and have not fully explored the non-sequential structure of images. This limitation poses challenges for using LLMs to build a knowledge graph from a single image, especially in effectively combining the non-sequential information in the image with the modeling of the knowledge graph. This paper aims to explore how to utilize LLMs’ ability to understand images and their potential to construct knowledge graphs, in order to improve the efficiency and accuracy of single-image-based KGC.
In this paper, we propose an innovative approach for constructing an image-to-graph knowledge representation (ImgGraph-LLM) with an LLM, which builds an independent knowledge graph for each unlabeled image in the dataset. The method first uses a self-supervised learning strategy to compile the training set and then fine-tunes the LLM to recognize the key information related to knowledge graph construction. The fine-tuned LLM can then directly extract relation triplets, namely entities and their relationships, from images and use them to construct a knowledge graph. To improve the accuracy and confidence of triplet extraction, we adopt filtering strategies that remove low-confidence triplets, thereby enhancing the accuracy of knowledge graph construction. This image-based knowledge graph construction method provides a new, image-centered perspective for intelligent systems, expanding the application of knowledge graphs.

2. Related Work

2.1. Knowledge Graph Construction

Knowledge graph construction (KGC) involves creating a structured representation of knowledge, which includes identifying entities and their interrelationships [4,5,10]. Existing methods can be divided into two categories. The first category constructs knowledge graphs from textual data. With the advancement of language models, recent works have employed language models to process text data and generate knowledge graphs, such as BT5 [15], ReGen [16], and Grapher [17]. For example, Grapher [17] uses a pre-trained T5 model [18] to build a multi-stage framework that efficiently extracts knowledge graphs from text descriptions. The second category focuses on using multimodal data to construct knowledge graphs [1,2], as seen in methods like Richpedia [19], MMKG [20], VisualSem [21], and TIVA-KG [22]. For example, TIVA-KG [22] produces universal knowledge graphs that encompass text, images, videos, and audio. Our approach differs from these existing solutions in that it is the first attempt to construct a knowledge graph from a single image, rather than integrating all data into one comprehensive knowledge graph.

2.2. LLMs for Image Understanding

The integration of LLMs as decoders in vision–language tasks has surged in popularity, leveraging cross-modal learning to facilitate knowledge exchange between the language and visual domains [23,24,25]. With the release of GPT-4 [26] by OpenAI—demonstrating enhanced visual comprehension and reasoning skills following pre-training on extensive image–text datasets—other models like VisionLLM [27], LLaMA-VID [13], and MiniGPT-4 [12] have also significantly advanced the field. These open-source models not only excel in performance but also serve as crucial tools for tailored fine-tuning. Inspired by these works, we employ an LLM to model images and complete knowledge graph construction.

3. Problem Formulation

Now, we define the knowledge graph construction problem. Let the images $I = \{i_1, \ldots, i_N\}$ denote the data sources for knowledge graph construction, where $N$ is the number of images. Knowledge graph construction ($KGC$) is a procedure that maps each image $i \in I$ onto a knowledge graph $G_i \in G$, denoted as $KGC(i) = G_i$. The knowledge graph set $G$ is defined as $G = \{G_{i_1}, \ldots, G_{i_N}\}$. Notably, the input data contain only the images, without any other data such as labels or descriptions. Hence, the knowledge graph construction problem is an unsupervised structure prediction task.
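To make the input/output contract concrete, the following minimal Python sketch fixes one possible representation: a knowledge graph as a list of (head, relation, tail) triples and KGC as a function from a single image to such a list. The class and function names are illustrative and not part of the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Triple:
    head: str       # e.g., "person"
    relation: str   # e.g., "wears"
    tail: str       # e.g., "safety belt"

# The knowledge graph G_i of image i is the set of triples extracted from it.
KnowledgeGraph = List[Triple]

def kgc(image_path: str) -> KnowledgeGraph:
    """Map a single unlabeled image i onto its knowledge graph G_i = KGC(i).

    Placeholder for the fine-tuned LLM of Section 4; only the input/output
    contract is fixed here.
    """
    raise NotImplementedError
```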
Although there has been considerable research on knowledge graph construction, there are distinct differences between our task definition and existing work, specifically manifested in the following aspects:
  • Different data source: Unlike existing methods that primarily utilize text or multimodal data, our approach focuses exclusively on leveraging image data for knowledge graph construction.
  • Different training method: While many existing methods rely on supervised or semi-supervised approaches, we employ an unsupervised methodology tailored to extracting knowledge directly from images.
  • Different construction results: Rather than constructing a comprehensive knowledge graph from an all-input dataset, our method generates a knowledge graph from each individual input datum, specifically from a single image.
  • Different implementation method: Traditional methods typically involve multistep operations such as entity recognition and relation extraction to piece together a knowledge graph. In contrast, we develop an end-to-end model that directly produces a knowledge graph from image data, streamlining the process.

4. Method

Our proposed ImgGraph-LLM comprises three components. Firstly, we extract self-supervised information from unlabeled image data to construct the training dataset. Secondly, we design an iterative fine-tuning method to use the constructed training data to fine-tune the LLM. Thirdly, we employ mutual information to implement filtering strategies. Finally, as shown in Figure 2, the fine-tuned model is capable of generating a knowledge graph for each image.

4.1. Training Data Generation

The goal of training data generation is to use self-supervised information to generate data for fine-tuning the LLM. When processing image data from the State Grid, we propose two key assumptions to ensure that the generated data can be effectively used for model fine-tuning and can help the LLM remove irrelevant information from images and focus on information related to the knowledge graph.
Assumption 1.
In general, the photos place the main operational behavior in a central position, while the content around the edges of the photo (such as background information and unrelated objects) is considered irrelevant. We should enable the LLM to effectively discriminate and filter out this irrelevant information.
Assumption 2.
This assumption emphasizes the importance of the main entities and their relationships in photos. Because of hallucinations, the LLM may miss or incorrectly model entities and relationships that are actually present in the images. Therefore, our method needs strong recognition and reinforcement capabilities to ensure that the model can accurately capture and express the key entities and their relationships in photos.
Guided by the above two assumptions, our aim is to promote the application of self-supervised operations in image understanding and knowledge graph construction, providing efficient and accurate training data for fine-tuning the LLM.
We design two self-supervised operations for generating training data to meet Assumption 1 and Assumption 2, focusing on the layout of different information in the image. Given an image $i \in I$, the center region $i_c$ and the edge region $i_e$ are divided according to the image size and an edge width parameter. Let the size of the image be $H \times W$ with edge width $X$; the central region $i_c$ then has height $H - 2X$ and width $W - 2X$, so its size is $(H - 2X) \times (W - 2X)$. The edge region $i_e$ consists of four bands of width $X$: the upper and lower bands have height $X$, and the left and right bands have width $X$, so the size of $i_e$ is $(2X \times W) + (2X \times (H - 2X))$. Finally, the image can be split as $i = \{i_c, i_e\}$.
Since our data lack label information, there are no labels to indicate center or edge regions. To leverage the self-supervised information within the image itself, we set the edge width $X$ to a random value within the range of 0.05–0.15. The minimum value of 0.05 is chosen to ensure that blank space and irrelevant information are excluded, while the maximum value of 0.15 maintains the integrity of the central area. During experiments, we generated multiple random values for the edge width $X$ for each image, creating multiple center and edge regions. For simplicity, we used the unified notation $i = \{i_c, i_e\}$ to denote the center and edge regions derived from the image, without distinguishing cases with varying edge widths.
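The region split can be made concrete with a short sketch. Assuming the sampled edge-width value in [0.05, 0.15] is interpreted relative to the shorter image side (an assumption on our part) and that images are NumPy arrays, the center crop and edge mask can be computed as follows; the function name is illustrative.

```python
import random
from typing import Optional

import numpy as np

def split_center_edge(image: np.ndarray, x_frac: Optional[float] = None):
    """Split an H x W image into a center region i_c and an edge region i_e.

    x_frac is the edge-width ratio sampled uniformly from [0.05, 0.15];
    interpreting it relative to the shorter image side is our assumption.
    Returns the center crop and a boolean mask marking the edge bands.
    """
    h, w = image.shape[:2]
    if x_frac is None:
        x_frac = random.uniform(0.05, 0.15)
    x = int(round(x_frac * min(h, w)))            # edge width X in pixels

    center = image[x:h - x, x:w - x]              # i_c: (H - 2X) x (W - 2X)
    edge_mask = np.ones((h, w), dtype=bool)
    edge_mask[x:h - x, x:w - x] = False           # i_e: the four edge bands
    # Sanity check: |i_e| = 2XW + 2X(H - 2X), as in the text.
    assert edge_mask.sum() == 2 * x * w + 2 * x * (h - 2 * x)
    return center, edge_mask
```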
Self-supervised Operation 1: Irrelevant information in the edge region $i_e$ is removed. Given the original image $i$, irrelevant information around it (such as background and extra objects) is removed, and only the main operational behavior and key entities in the center region $i_c$ are retained. During training, the model learns to focus only on the important information in the center region $i_c$ while ignoring the unrelated surrounding information, thereby improving its ability to accurately recognize key entities and their relationships.
We design a set of image transformations, $T_e$, which includes operations such as cutout. Our goal is to remove irrelevant information from the edge region $i_e$ using these transformations. Within $T_e$, random cutout and color transformation are the two key operations. Random cutout removes a randomly chosen portion of the image, which often contains background or irrelevant details, thereby increasing the model’s focus on the core area of the image. Color transformation fills the cutout areas with blocks of different colors, ensuring that the overall structure and content of the image remain unchanged even after the color change. Through these methods, we aim to reduce interference from irrelevant information in the edge regions, ensuring the effectiveness and accuracy of model training.
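As a concrete illustration of $T_e$, the sketch below places random colored cutout blocks anchored in the edge band of width $X$ (in pixels, $X \geq 1$), leaving the center region intact. The patch count and sizes are illustrative choices on our part, not values specified in the paper, and a 3-channel uint8 image is assumed.

```python
import random

import numpy as np

def transform_edge(image: np.ndarray, x: int, n_cutouts: int = 8,
                   max_patch: int = 40) -> np.ndarray:
    """T_e sketch: cut out random patches anchored in the edge band of width x
    and fill them with random solid colors; the center region is left intact.
    """
    out = image.copy()
    h, w = out.shape[:2]
    for _ in range(n_cutouts):
        ph, pw = random.randint(5, max_patch), random.randint(5, max_patch)
        if random.random() < 0.5:                       # top or bottom band
            top = random.choice([random.randint(0, x - 1),
                                 random.randint(h - x, h - 1)])
            left = random.randint(0, w - 1)
        else:                                           # left or right band
            top = random.randint(0, h - 1)
            left = random.choice([random.randint(0, x - 1),
                                  random.randint(w - x, w - 1)])
        color = np.random.randint(0, 256, size=3)       # random fill color
        out[top:top + ph, left:left + pw] = color       # colored cutout block
    return out
```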
Self-supervised Operation 2: Relevant information in the center region $i_c$ is disrupted. Given the original image $i$, entities and relationships in the center region $i_c$ are disturbed, for example by removing some entities. As a result, the entities or relationships differ while the background and operational behaviors remain the same. Using these constructed data can guide the LLM to correctly identify entities and relationships, thereby avoiding recognition errors caused by hallucinations and other common-sense mistakes.
We design a set of image transformations, $T_c$, which includes jigsaw puzzles [28,29] and geometric transformations. A jigsaw puzzle [28,29] is a classical self-supervised learning task that involves rearranging shuffled image patches into their correct spatial order to reconstruct the original image. Our goal is to disturb the relevant information in the center region $i_c$ using these transformations. Specifically, we employ jigsaw puzzles to displace scrambled patches within the central region, as shown in Figure 3. This scrambling not only disrupts entity relationships but can also hinder entity recognition, depending on the size of the scrambled patches. For instance, splitting the central region into 32 scrambled blocks divides the entities in the image into multiple scrambled segments, making the recognition of entities and their relationships infeasible. Because geometric transformations can effectively disrupt the structure of an image without changing the actual pixel information, we choose operations that have a significant impact on the original image, such as random cropping and flipping. The original central area can be viewed as a global view, while the transformed image forms a local view. Through these transformation operations, we can significantly change the layout and content distribution of the image, making it difficult to recognize the original entities and their relationships after transformation. For example, random cropping may split an entity into different local segments, while flipping can change the relations between entities.
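A possible implementation of the jigsaw shuffling in $T_c$ is sketched below: the center region is split into a grid of patches (4 x 8 = 32 blocks, matching the block count mentioned above; the exact grid shape is our assumption) and reassembled in a random order.

```python
import random

import numpy as np

def jigsaw_shuffle_center(center: np.ndarray, grid=(4, 8)) -> np.ndarray:
    """T_c sketch: split the center region i_c into grid[0] x grid[1] patches
    and reassemble them in a random spatial order, destroying the entity layout.
    """
    gh, gw = grid
    h, w = center.shape[:2]
    h, w = h - h % gh, w - w % gw                  # crop so the grid divides evenly
    ph, pw = h // gh, w // gw                      # patch height and width
    patches = [center[i:i + ph, j:j + pw]
               for i in range(0, h, ph)
               for j in range(0, w, pw)]
    random.shuffle(patches)                        # scramble the spatial order
    rows = [np.concatenate(patches[r * gw:(r + 1) * gw], axis=1)
            for r in range(gh)]
    return np.concatenate(rows, axis=0)
```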
Pseudo-label Generator: This is an unsupervised image labeling technique. We assume that each image $i$ has a corresponding label $l_i$, represented as $l_i = F(i)$. In the process of generating training data, we perform various transformation operations on the image $i$. For the edge transformations $T_e$, the key information is not destroyed and only information irrelevant to the construction of the knowledge graph is removed; after these transformations, the label of the image should therefore remain unchanged, that is, $l_{T_e(i)} = F(T_e(i))$. For the center transformations $T_c$, the information about entities and their relationships is completely destroyed, so no meaningful knowledge graph can be constructed from the transformed images; the labels of these images are therefore set to empty, represented as $l_{T_c(i)} = F(T_c(i))$. Through this approach, we can label transformed images in a way that reflects their utility as training data for constructing knowledge graphs.
We generate a dataset $D$ using the self-supervised operations and pseudo-labels. For any image $i$, we apply the transformations $T_c$ and $T_e$ to generate multiple images. Due to the randomness of $T_c$ and $T_e$, repeated applications of the same transformation produce different results. For simplicity, we randomly generate $M$ instances for each transformation operation, denoted as $T_c(i) = \{T_c^1(i), \ldots, T_c^m(i), \ldots, T_c^M(i)\}$ and $T_e(i) = \{T_e^1(i), \ldots, T_e^m(i), \ldots, T_e^M(i)\}$. The corresponding pseudo-labels are $L_{T_c(i)} = \{l_{T_c^1(i)} = F(T_c^1(i)), \ldots, l_{T_c^m(i)} = F(T_c^m(i)), \ldots, l_{T_c^M(i)} = F(T_c^M(i))\}$ and $L_{T_e(i)} = \{l_{T_e^1(i)} = F(T_e^1(i)), \ldots, l_{T_e^m(i)} = F(T_e^m(i)), \ldots, l_{T_e^M(i)} = F(T_e^M(i))\}$.
The generated dataset is based on assumed labels $L = \{l_{i_1}, \ldots, l_{i_N}\}$ and is represented as $D(L)$:

$$D(L) = \{(i_1, l_{i_1};\; T_c(i_1), L_{T_c(i_1)};\; T_e(i_1), L_{T_e(i_1)}),\ \ldots,\ (i_N, l_{i_N};\; T_c(i_N), L_{T_c(i_N)};\; T_e(i_N), L_{T_e(i_N)})\}.$$
Please note that the original dataset does not contain labels. The labels in the generated dataset $D(L)$ are pseudo-labels created based on the assumed labels $L = \{l_{i_1}, \ldots, l_{i_N}\}$ and the self-supervised operations (such as $L_{T_c(i_1)}$ and $L_{T_e(i_1)}$). During the model training phase, these pseudo-labels dynamically change to fully utilize the self-supervised information.
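Putting the pieces of this subsection together, the sketch below builds the entries of $D(L)$ for a set of images. The pseudo-labeler, the two transformations, and the instance count are passed in as parameters; the names are illustrative, and $M$ is kept small here (it is 32 in the experiments).

```python
from typing import Callable, Dict, List

def build_training_set(images: List[str],
                       label_fn: Callable[[str], str],
                       t_c: Callable[[str], str],
                       t_e: Callable[[str], str],
                       m: int = 4) -> List[Dict[str, str]]:
    """Build the entries of D(L) for one fine-tuning round (sketch).

    label_fn plays the role of F (the current model used as a pseudo-labeler);
    t_c and t_e are the two transformation operations, passed in as callables
    that return a (path to a) transformed image; m corresponds to M.
    """
    dataset: List[Dict[str, str]] = []
    for img in images:
        l_i = label_fn(img)                                # pseudo-label l_i = F(i)
        dataset.append({"image": img, "label": l_i})
        for _ in range(m):
            # T_c destroys entities/relations -> empty pseudo-label.
            dataset.append({"image": t_c(img), "label": ""})
            # T_e only removes irrelevant edge content -> the label is kept.
            dataset.append({"image": t_e(img), "label": l_i})
    return dataset
```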

4.2. Iterative Fine-Tuning

We fine-tune the large language model (LLM) on the generated training data $D(L)$. Because each piece of training data (each image) lacks a real label, the data cannot be used directly by supervised fine-tuning methods. For this purpose, we design an iterative fine-tuning method. By iteratively utilizing the self-supervised information in $D(L)$, the model can learn effective representations and features even without explicit labels, gradually optimizing its ability to construct knowledge graphs and alleviating the limitations that the lack of real labels imposes on supervised learning.
Iterative Pseudo-label Supervision: In our dataset $D(L)$, each image $i$ lacks a corresponding label $l_i \in L$. To address this issue, we design an iterative self-supervised training method. First, we use a pretrained LLM to generate a preliminary knowledge graph for each image as its initial label $l_i^1$ (first epoch). Although the initial labels $L^1 = \{l_{i_1}^1, \ldots, l_{i_N}^1\}$ are not entirely accurate, the self-supervised information in the generated training dataset $D(L^1)$, such as the generated pseudo-labels $L_{T_c(i)}^1$, allows the fine-tuned LLM to better distinguish between background information and entity information. During fine-tuning, the model gradually optimizes and adjusts the generated knowledge graph by learning the intrinsic structure and patterns in the data. Therefore, as the number of iterations increases, the knowledge graph constructed by the fine-tuned LLM becomes progressively more accurate and reliable. Through multiple epochs of this iterative self-supervised training, we can ensure that the final knowledge graph accurately expresses and captures entities and their relationships. This iterative training strategy fully exploits the idea of self-supervised learning, gradually improving the model’s understanding of complex information in images through its own feedback and correction, thereby effectively improving performance.
Lightweight Fine-tuning: Many recent studies propose lightweight fine-tuning methods for LLMs to reduce expensive computational resource consumption. LoRA introduces an efficient fine-tuning method for LLMs by approximating weight updates with low-rank matrices, effectively reducing the number of parameters that need to be adjusted. We therefore employ LoRA for the lightweight fine-tuning of the LLM. The LoRA objective is given in Equation (1):
$$\max_{\Theta_{LoRA}} \sum_{(x, y) \in D(L)} \sum_{t=1}^{|y|} \log\left(P_{\Theta_{LLM} + \Theta_{LoRA}}(y_t \mid x, y_{<t})\right), \qquad (1)$$
where $\Theta_{LLM}$ represents the original LLM parameters, which are frozen, and $\Theta_{LoRA}$ denotes the LoRA parameters. $D(L)$ is the constructed training data, and $x$ and $y$ represent the image data and the pseudo-label, respectively. For instance, for a given image $i$ and its corresponding initial label $l_i^1 \in L^1$, $(x, y)$ would be $(i, l_i^1)$.
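As an illustration of the lightweight fine-tuning setup, the sketch below injects LoRA adapters with the Hugging Face peft library; the base checkpoint, rank, and target modules are assumptions for illustration, not the paper's exact configuration (the paper fine-tunes MiniGPT-4). The objective in Equation (1) is the standard next-token log-likelihood over the pseudo-label tokens, which the usual causal-LM training loop computes when labels are supplied.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Any causal LM checkpoint stands in for the multimodal backbone here.
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)    # Theta_LLM frozen, Theta_LoRA trainable
model.print_trainable_parameters()
```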
Iterative Training Processing: The training process of our model involves multiple iterations, as shown in Figure 4. To ensure the sufficient optimization of the LLM, we set the total number of iterations to K rounds. This means that the proposed model undergoes K complete fine-tuning training cycles. In each fine-tuning iteration, the LLM learns and adjusts parameters based on the current training data to continuously improve its performance and accuracy. Through such multiple iterations, we expect the LLM to gradually converge to an ideal state, achieving the expected accuracy and effectiveness. Setting the total number of iterations to K ensures that the model has sufficient time and opportunity to optimize and adjust, thereby improving the final performance.
In the initial training phase, we construct a training dataset containing images and corresponding pseudo-labels, denoted as $D(L^1 = F_{\Theta_{LLM}}(I))$, where the function $F_{\Theta_{LLM}}$ is our model with parameters $\Theta_{LLM}$. Next, we use this initial training dataset $D(L^1)$ for the first iteration to optimize the loss function shown in Equation (2):
$$\max_{\Theta_{LoRA}^1} \sum_{(x, y) \in D(L^1)} \sum_{t=1}^{|y|} \log\left(P_{\Theta_{LLM} + \Theta_{LoRA}^1}(y_t \mid x, y_{<t})\right). \qquad (2)$$
Although the pseudo-labels in the training dataset are derived from the LLM, introducing self-supervised information into the dataset and using it as the input for the first fine-tuning training can enable the model to extract additional useful information from it, thereby enhancing its performance and generalization ability.
In the $k$-th iteration, we use the labels generated by the model with parameters $\Theta_{LLM} + \Theta_{LoRA}^{k-1}$, obtained from the $(k-1)$-th iteration, to construct a new dataset, $D(L^k = F_{\Theta_{LLM} + \Theta_{LoRA}^{k-1}}(I))$. Specifically, we employ the model parameters $\Theta_{LLM} + \Theta_{LoRA}^{k-1}$ obtained from the $(k-1)$-th iteration to annotate all images in the training data, generating a new training dataset $D(L^k)$ used in the $k$-th iteration. This process aims to improve model performance by optimizing the loss function shown in Equation (3):
$$\max_{\Theta_{LoRA}^k} \sum_{(x, y) \in D(L^k)} \sum_{t=1}^{|y|} \log\left(P_{\Theta_{LLM} + \Theta_{LoRA}^k}(y_t \mid x, y_{<t})\right). \qquad (3)$$
By incorporating the optimization results from the $(k-1)$-th iteration as supervisory information, the labels of the training data $D(L^k)$ become more accurate. This iterative approach gradually adjusts and refines the model through fine-tuning on new data distributions, thereby enhancing its predictive capability. The iterative fine-tuning process can be viewed as incremental learning, which proves particularly effective in adapting the model to the continuously changing data distributions and label variations throughout the iteration process. This adaptability ultimately leads to improved prediction accuracy over time.
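The overall training loop of this subsection can be summarized in a short sketch. The helper callables are placeholders for the pseudo-labeling step, the dataset construction of Section 4.1, the MI-based filtering of Section 4.3, and one LoRA fine-tuning pass; none of them is prescribed in this exact form by the paper.

```python
def iterative_finetune(images, model, k_rounds,
                       annotate, build_dataset, passes_filters, finetune_lora):
    """Sketch of the K-round iterative fine-tuning of Section 4.2.

    annotate(model, img) produces a pseudo-label (initially from the pretrained
    LLM), build_dataset assembles D(L^k), passes_filters applies Equations
    (4)-(6), and finetune_lora optimizes Theta_LoRA^k as in Equation (3).
    """
    for k in range(1, k_rounds + 1):
        labels = {img: annotate(model, img) for img in images}       # L^k
        dataset = build_dataset(images, labels)                      # D(L^k)
        dataset = [s for s in dataset if passes_filters(model, s)]   # Section 4.3
        model = finetune_lora(model, dataset)                        # Eq. (3)
    return model
```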

4.3. Filtering Strategies

Because our iterative fine-tuning approach relies on self-supervised information introduced through the constructed training dataset, we design a filtering strategy to help the LLM remove low-confidence constructed training data during the fine-tuning phase, as shown in Figure 4. The core intuition is derived from information bottleneck theory, which suggests that when information is compressed through transformations, some signals useful for prediction may be lost.
We assume that the original image $i$ contains a richer set of semantic and structural signals than its transformed counterpart $T_c(i)$, especially when $T_c$ deliberately removes or distorts salient parts of the image. Therefore, when these images are processed by the LLM, the output derived from the original image should retain higher mutual information (MI) with the original semantic content than the output derived from the transformed image.
To quantify this, we use mutual information, $MI(X; Y)$, which measures the amount of shared information between two variables. In our case, $X$ is the input image (either original or transformed) and $Y$ is the LLM's modeling result (e.g., predicted entities or relations). A higher MI indicates that the model's output preserves more of the original input's informative content. We propose the following three MI-based filtering strategies.
Strategy 1: After being modeled by the LLM, the images $T_c(i)$ generated by the transformation $T_c$ should contain less information than the original image $i$, because the crucial information required for identifying entities and their relationships has been removed from $T_c(i)$, which reduces the information content of the LLM's output.
Strategy 2: The operation $T_e$ is used to produce multiple images $T_e(i)$; these images are input into the LLM to obtain multiple results, whose consistency is then compared. Since $T_e$ is designed to filter out only irrelevant information, each transformed image should yield similar information content. Significant discrepancies among these results may indicate lower confidence in certain transformed images $T_e(i)$, which should then be filtered out.
Strategy 3: This strategy compares the two types of transformation operations, $T_c$ and $T_e$. $T_e$ removes unnecessary information from the original image, while $T_c$ removes useful information. Consequently, in the LLM's output, the information content derived from $T_e$ should surpass that derived from $T_c$.
These filtering strategies enable the effective identification and removal of generated training data with low confidence, ensuring that only high-confidence data are used for model fine-tuning. Mutual information ($MI$) provides an effective way to measure such changes in information content. Next, we explain how mutual information is used to implement the three filtering strategies for the $k$-th iteration model $F_{\Theta_{LLM} + \Theta_{LoRA}^k}$, as shown in Equations (4)–(6):
$$MI\left(F_{\Theta_{LLM} + \Theta_{LoRA}^k}(i),\ F_{\Theta_{LLM} + \Theta_{LoRA}^k}(T_c^m(i))\right) > \alpha_1 \qquad (4)$$
$$MI\left(F_{\Theta_{LLM} + \Theta_{LoRA}^k}(T_e(i)),\ F_{\Theta_{LLM} + \Theta_{LoRA}^k}(T_e^m(i))\right) < \alpha_2 \qquad (5)$$
$$MI\left(F_{\Theta_{LLM} + \Theta_{LoRA}^k}(T_e^m(i)),\ F_{\Theta_{LLM} + \Theta_{LoRA}^k}(T_c^m(i))\right) > \alpha_3, \qquad (6)$$
where $T_c^m(i) \in T_c(i)$ and $T_e^m(i) \in T_e(i)$. Here, $\alpha_1$, $\alpha_2$, and $\alpha_3$ are hyperparameters. If Equation (4), Equation (5), or Equation (6) is not satisfied, the corresponding training samples are deleted. For example, Equation (4) measures the mutual information between $i$ and $T_c^m(i)$. If this value is less than $\alpha_1$, it indicates that $i$ is similar to $T_c^m(i)$; consequently, the image $T_c^m(i)$ is deemed to retain significant entities and relationships that are at risk of being overlooked. As a result, $T_c^m(i)$ fails to fulfill the necessary criteria and should be removed from the dataset. This exclusion ensures that $T_c^m(i)$ will not be part of the $k$-th iteration of the training process.
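The three checks can be applied per transformed instance as sketched below. The mutual-information estimator over two LLM outputs is left abstract because the paper does not prescribe one, and pooling the $T_e$ outputs by concatenation for Equation (5) is our assumption; thresholds default to the values reported in Section 5.1.

```python
from typing import Callable, List, Sequence, Tuple

def filter_instances(mi: Callable[[str, str], float],
                     out_i: str,
                     outs_tc: Sequence[str],
                     outs_te: Sequence[str],
                     a1: float = 0.2, a2: float = 0.2, a3: float = 0.3
                     ) -> Tuple[List[int], List[int]]:
    """Apply the checks of Equations (4)-(6) to one image's transformed instances.

    mi(a, b) is an assumed mutual-information estimator over two LLM outputs;
    out_i is the output for the original image i, and outs_tc / outs_te are the
    outputs for the T_c^m / T_e^m variants. Returns indices of surviving instances.
    """
    pooled_te = " ".join(outs_te)                       # stand-in for F(T_e(i)) in Eq. (5)
    keep_tc = [m for m, o in enumerate(outs_tc) if mi(out_i, o) > a1]        # Eq. (4)
    keep_te = [m for m, o in enumerate(outs_te) if mi(pooled_te, o) < a2]    # Eq. (5)
    # Eq. (6): a surviving T_c instance must carry less information than each kept T_e one.
    keep_tc = [m for m in keep_tc
               if all(mi(outs_te[n], outs_tc[m]) > a3 for n in keep_te)]
    return keep_tc, keep_te
```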

4.4. Model Discussion

1. Comprehensive Comparison with Existing Solutions: Unlike most current approaches that rely heavily on textual or multimodal data, our method exclusively utilizes image data as the sole input source for knowledge graph construction. This unique focus on visual information allows us to explore underutilized semantic cues embedded in images, providing a fresh perspective on visual knowledge representation. Moreover, while conventional methods often depend on supervised or semi-supervised training, our approach adopts an unsupervised learning framework, enabling the model to automatically extract structured knowledge from raw images without requiring annotated labels or human-defined schemas.
2. Theoretical Background and Innovation: The core innovation of our method lies in its image-centric, end-to-end KG construction pipeline. Traditional pipelines typically involve multi-stage operations such as entity recognition, attribute detection, and relation extraction, often designed separately and optimized independently. In contrast, our model learns a direct mapping from a visual input to a structured knowledge representation, enabling a more cohesive and scalable solution. This end-to-end paradigm not only simplifies the workflow but also opens new possibilities for learning richer visual semantics in a unified architecture. Theoretically, our work builds on advances in contrastive learning, visual representation learning, and graph construction, integrating them into a pipeline that pushes the boundaries of visual knowledge extraction.
3. Novel Methodology and Field Impact: At the heart of our approach is an instance-level KG construction mechanism, where each individual image is treated as a unique knowledge source. Instead of building a single, unified graph across the dataset, our model generates a mini-knowledge graph per image, capturing localized semantics and supporting fine-grained reasoning. By introducing this methodology, we provide a framework that can potentially reshape how knowledge graphs are constructed and applied in image-centric applications.
In summary, our proposed ImgGraph-LLM introduces a different approach to visual knowledge extraction, characterized by unique data sources, unsupervised training, instance-specific graph outputs, and an end-to-end architecture. These innovations not only distinguish our work from existing methods but also contribute novel tools and perspectives to the broader field of knowledge representation and visual understanding.

5. Experiments

5.1. Experimental Setup

Dataset: Due to the lack of an existing real image dataset from the State Grid Corporation of China, we collected and cleaned data from real scenarios, including transmission, distribution, substation, and safety supervision scenes. Our dataset was unlabeled, aligning with the objective of unsupervised knowledge graph construction. To evaluate the performance and generalization capability of the proposed model, we implemented the following strategies: (a) Manual Annotation for Testing: We randomly selected a subset of the data and created a test set with manually annotated labels. This allowed for a direct assessment of the model’s performance. (b) Dataset Variations for Performance Testing: We prepared two datasets of different sizes, as detailed in Table 1, to investigate the model’s performance across varying data scales. These strategies helped in comprehensively understanding the model’s effectiveness and its ability to generalize across different data sizes.
Evaluation Metrics: To evaluate the constructed knowledge graph, we employed three standard metrics: precision, recall, and F1 scores [16,17,30]. These metrics were calculated by comparing the generated triples with the gold-standard triples from the test data. Notably, since the order of triples did not influence the results, the evaluation script identified the best possible match between the generated and gold triples by examining all potential permutations. We then utilized Named Entity Evaluation metrics to assess three levels of match: Exact, Partial, and Strict [14,30]. These categories reflected varying degrees of tolerance for how closely an entity’s match needed to align with the actual entity in both content and its placement within a triple.
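For the Exact level, the evaluation essentially reduces to a set comparison of (head, relation, tail) triples, as in the hedged sketch below; the Partial and Strict variants, which relax or tighten how closely an entity must match, are not reproduced here. The example triples are illustrative, inspired by the Figure 1 risk case.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

def exact_match_scores(pred: List[Triple], gold: List[Triple]):
    """Exact-match sketch: a predicted triple counts as correct only if its
    head, relation, and tail all equal a gold triple (order-insensitive)."""
    pred_set: Set[Triple] = set(pred)
    gold_set: Set[Triple] = set(gold)
    tp = len(pred_set & gold_set)                       # correctly recovered triples
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative example: one of two gold triples is recovered.
gold = [("person", "wears", "safety belt"),
        ("safety belt", "attached to", "telegraph pole")]
pred = [("person", "wears", "safety belt")]
print(exact_match_scores(pred, gold))                   # (1.0, 0.5, 0.666...)
```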
Baselines: Since our task is unsupervised knowledge graph generation, whereas existing related work typically relies on supervised signals, whether for knowledge graph construction [3,4,5] or scene graph generation [31,32,33], such methods cannot serve as direct baselines. To assess the efficacy of ImgGraph-LLM, we compared it against two established baselines: ChatGPT-4o and ImageCaption. ChatGPT-4o is capable of real-time visual reasoning, taking an image as input and directly producing a knowledge graph; access to ChatGPT-4o was granted on 1 June 2024, via https://chatgpt.com. For the image-to-knowledge-graph process, we also employed an image captioning model [34], which provides comprehensive descriptions of visual content, and used Grapher [17] to construct the knowledge graph from the resulting captions. We refer to this pipeline as ImageCaption.
Parameter Settings: Our model was fine-tuned based on an LLM. Given the availability of several LLMs, we selected MiniGPT-4 as our backbone for fine-tuning after considering two key requirements: the ability to process image data and access to open-source code for easy training. It is important to note that other models meeting these criteria could also be used. We set $M$ to 32, which ensured training efficiency while maximizing the use of self-supervised information. We searched for the parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ within the set {0.1, 0.2, 0.3, 0.4, 0.5} and set $\alpha_1$ to 0.2, $\alpha_2$ to 0.2, and $\alpha_3$ to 0.3. All experiments were conducted on one NVIDIA A100 GPU.

5.2. Results and Analysis

5.2.1. Overall Performance

Table 2 and Table 3 present the comprehensive experimental results on the two datasets, evaluated under different metrics. We make the following observations:
1. Superior Performance of ImgGraph-LLM: Our ImgGraph-LLM model achieved the best performance across all metrics. Specifically, on the StateGrid_S and StateGrid_L datasets, ImgGraph-LLM demonstrated an average improvement of 10.6% and 20.4%, respectively, compared to the best baseline model. Although the performance improvement differed between the two datasets, it was consistently superior to that of the baseline model, indicating that our model is robust and reliable across datasets of varying scales. These experimental results confirm the effectiveness of our method in unsupervised image-based knowledge graph construction (KGC) tasks.
2. Baseline Model Performance: The performance of the baseline model was relatively consistent across both datasets. Since our task involved constructing an unsupervised knowledge graph without labeled data to train ImageCaption, we used an existing pre-trained model for testing, resulting in similar performance on both datasets. Similarly, for ChatGPT-4o, we directly used OpenAI’s API for testing, and its performance remained relatively stable.
3. Comparison of ChatGPT-4o and ImageCaption: The performance of the ChatGPT-4o model was significantly lower than that of the ImageCaption model. This discrepancy arose because ChatGPT-4o not only recognizes the entities in an image but also models more implicit information using its logical reasoning capabilities, such as inferring an electric installation entity. This additional modeling introduced interference from irrelevant information, leading to decreased performance. In contrast, the ImageCaption method focused solely on recognizing and describing the image content without additional inference, resulting in less irrelevant information and, consequently, better performance than ChatGPT-4o.

5.2.2. Ablation Study

To investigate the effects of the self-supervised operations, we separately removed the operation $T_c$ and the operation $T_e$, referred to as “Our w/o $T_c$” and “Our w/o $T_e$”, respectively. According to the results in Table 4, we observe that removing either operation led to a decrease in performance; however, both variations still outperformed all baselines. This result validates the effectiveness of the two self-supervised operations: they provide self-supervised signals that enhance the fine-tuning of LLMs, enabling the better identification of entities and relationships. Additionally, the performance of “Our w/o $T_e$” was superior to that of “Our w/o $T_c$”, possibly because the transformation operation $T_c$ directly involves the modification of entities and relationships, which is more closely related to the core information required for constructing knowledge graphs. Please note that the same trend was also present for the StateGrid_S dataset. Overall, by integrating these two operations, which provide distinct self-supervised signals for LLM fine-tuning, optimal performance was achieved.

5.2.3. Case Study

We conducted a case study to visually demonstrate the efficacy of our proposed model. As depicted in Figure 5, our model precisely established the relationship between the entities “seatbelt”, “telegraph pole”, and “person”, whereas baseline models failed to accurately model the relationship between “seatbelt” and “telegraph pole”. Additionally, our model effectively filtered out irrelevant background information, a feature that was lacking in other baseline models. As a result, the knowledge graph constructed by our model exhibits superior accuracy.

6. Conclusions and Future Work

ImgGraph-LLM represents a significant advancement in the domain of image-based knowledge graph construction. It not only enhances the automation and precision of knowledge extraction from images but also paves the way for downstream applications in multimodal reasoning, visual question-answering, and cross-modal retrieval. We evaluated ImgGraph-LLM on two challenging real-world datasets. The experimental results demonstrate consistent and significant performance gains, with average improvements of 10.6% and 20.4%, respectively, over existing baselines in relational extraction accuracy. These results validate the effectiveness of our method in capturing complex visual-semantic relationships.
While ImgGraph-LLM demonstrates strong performance in constructing knowledge graphs from image data, it currently relies on domain-specific priors to generate pseudo-labels, particularly tailored for power grid imagery, where structured components and visual patterns are well defined. This reliance on prior knowledge may limit the generalizability of our approach when applied to other domains where such visual regularities are absent or less pronounced, such as in natural scenes or consumer photography. In future work, we plan to explore the adaptability of ImgGraph-LLM to a broader range of application scenarios. Specifically, we aim to investigate domain-agnostic strategies for pseudo-label generation to enhance the robustness and transferability of our model across diverse visual contexts.

Author Contributions

Conceptualization, L.C., Z.C., W.Y. and Y.L.; Software, Z.C. and W.Y.; Validation, L.C.; Resources, S.L.; Data curation, Z.C., W.Y. and S.L.; Writing—original draft, L.C. and Y.L.; Supervision, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project from State Grid Corporation of China (No. 5700-202390301A-1-1-ZN).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions related to the sensitivity of power grid information.

Acknowledgments

We gratefully acknowledge the significant support and insightful contributions of Ming Li and Hailu Wang from the Information & Telecommunication Branch, State Grid Anhui Electric Power Company. Their assistance during the final revision stage was crucial to the completion of this study. Furthermore, their support on behalf of the funding institution was instrumental in enabling this research.

Conflicts of Interest

Authors Zhenyu Chen, Wei Yang, and Shi Liu were employed by State Grid Corporation of China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chen, Y.; Ge, X.; Yang, S.; Hu, L.; Li, J.; Zhang, J. A survey on multimodal knowledge graphs: Construction, completion and applications. Mathematics 2023, 11, 1815. [Google Scholar] [CrossRef]
  2. Zhu, X.; Li, Z.; Wang, X.; Jiang, X.; Sun, P.; Wang, X.; Xiao, Y.; Yuan, N.J. Multi-modal knowledge graph construction and application: A survey. IEEE Trans. Knowl. Data Eng. 2022, 36, 715–735. [Google Scholar] [CrossRef]
  3. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Philip, S.Y. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514. [Google Scholar] [CrossRef] [PubMed]
  4. Zhong, L.; Wu, J.; Li, Q.; Peng, H.; Wu, X. A comprehensive survey on automatic knowledge graph construction. ACM Comput. Surv. 2023, 56, 1–62. [Google Scholar] [CrossRef]
  5. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowl. Data Eng. 2024, 36, 3580–3599. [Google Scholar] [CrossRef]
  6. Xu, Z.; Cruz, M.J.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-augmented generation with knowledge graphs for customer service question answering. In Proceedings of the SIGIR, Washington, DC, USA, 14–18 July 2024; pp. 2905–2909. [Google Scholar]
  7. Gaur, M.; Gunaratna, K.; Srinivasan, V.; Jin, H. Iseeq: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs. AAAI 2022, 36, 10672–10680. [Google Scholar] [CrossRef]
  8. Ma, X. Knowledge graph construction and application in geosciences: A review. Comput. Geosci. 2022, 161, 105082. [Google Scholar] [CrossRef]
  9. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  10. Al-Khatib, K.; Hou, Y.; Wachsmuth, H.; Jochim, C.; Bonin, F.; Stein, B. End-to-end argumentation knowledge graph construction. AAAI 2020, 34, 7367–7374. [Google Scholar] [CrossRef]
  11. Asprino, L.; Daga, E.; Gangemi, A.; Mulholland, P. Knowledge graph construction with a façade: A unified method to access heterogeneous data sources on the web. TOIT 2023, 23, 1–31. [Google Scholar] [CrossRef]
  12. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
  13. Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv 2023, arXiv:2304.15010. [Google Scholar]
  14. Zhang, B.; Soh, H. Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction. arXiv 2024, arXiv:2404.03868. [Google Scholar]
  15. Agarwal, O.; Kale, M.; Ge, H.; Shakeri, S.; Al-Rfou, R. Machine translation aided bilingual data-to-text generation and semantic parsing. In Proceedings of the WebNLG+, Dublin, Ireland, 18 December 2020; pp. 125–130. [Google Scholar]
  16. Dognin, P.L.; Padhi, I.; Melnyk, I.; Das, P. ReGen: Reinforcement learning for text and knowledge base generation using pretrained language models. arXiv 2021, arXiv:2108.12472. [Google Scholar]
  17. Melnyk, I.; Dognin, P.; Das, P. Grapher: Multi-stage knowledge graph construction using pretrained language models. In Proceedings of the 2021 NeurIPS Workshop, Online, 14 December 2021. [Google Scholar]
  18. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  19. Wang, M.; Qi, G.; Wang, H.; Zheng, Q. Richpedia: A comprehensive multi-modal knowledge graph. In JIST; Springer: Cham, Switzerland, 2020; pp. 130–145. [Google Scholar]
  20. Liu, Y.; Li, H.; Garcia-Duran, A.; Niepert, M.; Onoro-Rubio, D.; Rosenblum, D.S. MMKG: Multi-modal knowledge graphs. In Proceedings of the ESWC, Portorož, Slovenia, 2–6 June 2019; pp. 459–474. [Google Scholar]
  21. Alberts, H.; Huang, N.; Deshpande, Y.; Liu, Y.; Cho, K.; Vania, C.; Calixto, I. VisualSem: A high-quality knowledge graph for vision and language. In Proceedings of the MRL Workshop, Online, 26–27 May 2021; pp. 138–152. [Google Scholar]
  22. Wang, X.; Meng, B.; Chen, H.; Meng, Y.; Lv, K.; Zhu, W. TIVA-KG: A multimodal knowledge graph with text, image, video and audio. In Proceedings of the MM, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2391–2399. [Google Scholar]
  23. Huang, S.; Dong, L.; Wang, W.; Hao, Y.; Singhal, S.; Ma, S.; Lv, T.; Cui, L.; Mohammed, O.K.; Patra, B.; et al. Language is not all you need: Aligning perception with language models. NeurIPS 2024, 36, 72096–72109. [Google Scholar]
  24. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the ICML, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  25. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef]
  26. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  27. Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. NeurIPS 2024, 36, 61501–61513. [Google Scholar]
  28. Chen, P.; Liu, S.; Jia, J. Jigsaw clustering for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11526–11535. [Google Scholar]
  29. Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Las Vegas, NV, USA, 27–30 June 2016; pp. 69–84. [Google Scholar]
  30. Ferreira, T.C.; Gardent, C.; Ilinykh, N.; Van Der Lee, C.; Mille, S.; Moussallem, D.; Shimorina, A. The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (WebNLG+ 2020). In Proceedings of the WebNLG+, Dublin, Ireland, 18 December 2020. [Google Scholar]
  31. Li, R.; Zhang, S.; Wan, B.; He, X. Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 11109–11119. [Google Scholar]
  32. Yoon, K.; Kim, K.; Moon, J.; Park, C. Unbiased heterogeneous scene graph generation with relation-aware message passing neural network. AAAI Conf. Artif. Intell. 2023, 37, 3285–3294. [Google Scholar] [CrossRef]
  33. Zellers, R.; Yatskar, M.; Thomson, S.; Choi, Y. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5831–5840. [Google Scholar]
  34. Yang, X.; Wu, Y.; Yang, M.; Chen, H.; Geng, X. Exploring diverse in-context configurations for image captioning. NeurIPS 2024, 36, 40924–40943. [Google Scholar]
Figure 1. An example comparing different approaches to knowledge graph construction (KGC).
Figure 2. The overall architecture of our proposed model.
Figure 3. An example of the jigsaw shuffling operation. This operation guides the LLM to accurately identify entities and their relationships, thereby mitigating recognition errors caused by hallucinations and common-sense mistakes. Since the jigsaw shuffling operation is applied to the center region of the original image, the absence of this component would result in a lack of training data, making the effective training of our proposed model impossible.
Figure 4. An example of the training process. We present a sample from the training data and illustrate the outputs of the LLM in the initial iteration (iteration 0) and after K iterations. In iteration 0, both node and edge predictions are inaccurate. After K iterations, the predictions become more accurate. We also apply filtering operations to the LLM outputs to remove incorrect results. In the figure, the elements marked with gray blocks represent the parts that are filtered out.
Figure 5. The results of the case study obtained by different methods. Red indicates a lack of a match with the true value in the test.
Table 1. Statistics for the two datasets. The test set was manually annotated.

Datasets       Train   Test
StateGrid_S    7000    1000
StateGrid_L    3500    500
Table 2. Knowledge graph construction accuracy on StateGrid_L. The best result is highlighted in bold.

Model          Match     F1      Precision   Recall
ImageCaption   Exact     0.678   0.678       0.680
               Partial   0.684   0.686       0.685
               Strict    0.673   0.674       0.674
ChatGPT-4o     Exact     0.442   0.438       0.449
               Partial   0.463   0.456       0.472
               Strict    0.409   0.406       0.415
ImgGraph-LLM   Exact     0.808   0.801       0.822
               Partial   0.834   0.823       0.850
               Strict    0.807   0.801       0.815
Table 3. Knowledge graph construction accuracy on StateGrid_S. The best result is highlighted in bold.

Model          Match     F1      Precision   Recall
ImageCaption   Exact     0.675   0.676       0.679
               Partial   0.684   0.686       0.685
               Strict    0.671   0.670       0.671
ChatGPT-4o     Exact     0.440   0.437       0.448
               Partial   0.463   0.455       0.472
               Strict    0.408   0.407       0.413
ImgGraph-LLM   Exact     0.731   0.727       0.745
               Partial   0.778   0.764       0.793
               Strict    0.732   0.726       0.749
Table 4. Ablation study of self-supervised operations using StateGrid_L. The best results are highlighted in bold.

Model          Match     F1      Precision   Recall
Our            Exact     0.808   0.801       0.822
               Partial   0.834   0.823       0.850
               Strict    0.807   0.801       0.815
Our w/o T_e    Exact     0.784   0.776       0.796
               Partial   0.808   0.793       0.827
               Strict    0.779   0.769       0.785
Our w/o T_c    Exact     0.754   0.741       0.764
               Partial   0.779   0.763       0.795
               Strict    0.748   0.740       0.742
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
