Large Language Models for Structured and Semi-Structured Data, Recommender Systems and Knowledge Base Engineering: A Survey of Recent Techniques and Architectures
Abstract
1. Introduction
1.1. Large Language Models: Definition and Emergence
1.2. Global Adoption Trends of LLMs
1.3. Application of LLMs in Recommendation Systems
1.4. Objectives and Structure
2. Materials and Methods
2.1. Search Strategy
- Data category: Queries aimed at capturing studies on LLMs processing structured/semi-structured data, using combinations like the following:
- (LLM OR "large language model") AND ("structured data" OR "semi-structured data") AND ("vectorization" OR "knowledge graph" OR "embedding" OR "data transformation")
- Recommendation category: Queries focused on systems integrating LLMs with auxiliary knowledge components for recommendation purposes, such as the following:
- (LLM OR "large language model") AND ("recommendation system" OR "recommender system" OR "retrieval augmented generation" OR RAG) AND ("vector store" OR "vector database" OR "graph database" OR "knowledge base")
- Technical identification category: Searches targeting architectural and methodological studies on knowledge base construction and LLM integration, such as the following:
- LLM AND "knowledge base"
- (LLM OR "language model") AND ("knowledge base" OR "knowledge graph") AND (integration OR architecture OR evaluation OR design)

An illustrative sketch of how such query strings can be assembled programmatically follows this list.
2.2. Inclusion and Exclusion Criteria
Studies were included if they:
- Were published between 1 January 2023 and 15 July 2025;
- Were written in English;
- Addressed at least one of the review’s three thematic areas:
- the use of LLMs with structured or semi-structured data;
- the integration of LLMs into recommendation systems, incorporating auxiliary knowledge structures; or
- the design, evaluation or integration techniques for knowledge bases tailored for LLMs;
- Provided original empirical results, novel system implementations, architectural frameworks, or quantitative evaluations;
- Were available as peer-reviewed journal articles, conference papers, or technical reports with substantive technical contributions (including preprints from trusted repositories such as arXiv or OpenReview).
Studies were excluded if they:
- Focused solely on theoretical discussions, conceptual frameworks or opinion pieces without accompanying technical implementation or empirical validation;
- Addressed general-purpose natural language processing (NLP) or LLM capabilities without specific relevance to structured/semi-structured data processing, knowledge integration, or recommendation systems as defined by the thematic areas;
- Were published before 1 January 2023;
- Were not written in English;
- Were duplicates of already identified records;
- Were inaccessible in full-text format after reasonable attempts (e.g., checking institutional access, public repositories); or
- Were non-academic sources such as blog posts, editorials, news articles, or promotional materials.
2.3. Screening and Eligibility—PRISMA
- Automated Filtering: An initial batch of records was retrieved across all databases. The 61,451 records referenced in Figure 1 were not manually reviewed at the title, abstract, or full-text screening level; they were excluded via automated filtering based on metadata issues (e.g., missing abstracts or titles), duplicate entries, non-English language, or thematic irrelevance (e.g., unrelated domains such as biology or pure linguistics). Only entries with valid metadata that matched the initial inclusion keywords were retained (a minimal sketch of such a filter follows this list).
- Keyword-Based Thematic Filtering: Remaining records were filtered using structured Boolean queries aligned with the three review categories (structured/semi-structured data processing, LLM-based recommendation systems, and technical architectures for knowledge augmentation). This step reduced the pool to 227 potentially relevant studies.
- Title and Abstract Screening: Two independent reviewers (Reviewer A and Reviewer B) evaluated the titles and abstracts of the 227 remaining records. Inclusion required explicit alignment with at least one of the review’s three thematic categories. Studies containing relevant keywords but lacking substantive focus on system-level implementation or integration (e.g., generic NLP tasks) were excluded. Disagreements were resolved through discussion; a third reviewer (Reviewer C) was designated as arbiter but was not needed in practice.
- Full-Text Review: A full-text evaluation was conducted on 102 records. Three could not be retrieved due to access limitations. From the remaining 99, 11 studies were excluded—7 were not primary research (e.g., opinion pieces or conceptual overviews), and 4 were duplicate versions of previously assessed work. This led to a final inclusion of 88 studies.
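To make the automated pre-screening stage concrete, the sketch below shows, under stated assumptions, how records with missing metadata, non-English entries, and duplicates could be removed. The record field names (title, abstract, language, doi) and the deduplication key are illustrative assumptions; the exact implementation of the filtering scripts is not reported here.

```python
# Illustrative sketch of the automated pre-screening stage. Field names
# and the deduplication key are assumptions; the thematic-relevance check
# (matching the Section 2.1 keywords) is omitted for brevity.

def automated_filter(records: list[dict]) -> list[dict]:
    seen: set[str] = set()
    kept: list[dict] = []
    for rec in records:
        title = (rec.get("title") or "").strip()
        abstract = (rec.get("abstract") or "").strip()
        if not title or not abstract:            # missing metadata
            continue
        if rec.get("language", "en") != "en":    # non-English record
            continue
        key = rec.get("doi") or title.lower()    # crude duplicate key
        if key in seen:                          # duplicate entry
            continue
        seen.add(key)
        kept.append(rec)
    return kept

records = [
    {"title": "FastRAG", "abstract": "RAG for semi-structured data.", "doi": "10.0/x", "language": "en"},
    {"title": "FastRAG", "abstract": "RAG for semi-structured data.", "doi": "10.0/x", "language": "en"},
    {"title": "No abstract here", "abstract": "", "language": "en"},
]
print(len(automated_filter(records)))  # -> 1
```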
2.4. Data Collection and Thematic Categorization
2.5. Quality Considerations and Risk of Bias
3. Results
3.1. Characteristics of the Identified Literature
3.2. Temporal and Thematic Trends
3.3. Publication Venues and Thematic Focus
4. Discussion and Limitations
- First, although notable advancements have been achieved, handling structured and semi-structured data remains a persistent challenge. Studies reveal that LLMs are highly sensitive to input formatting and schema variability, and that they lack robustness in symbolic reasoning tasks. While approaches such as reinforcement learning-based context reduction and the construction of vectorized knowledge bases show promise, they introduce considerable complexity and demand careful domain-specific adaptation.
- Second, within technical integration, methods like RAG and KG enhancement significantly improve factual accuracy and reduce hallucinations (a minimal sketch of the retrieval step appears after this list). However, critical issues such as knowledge drift, retrieval noise, and representational complexity remain unresolved. Although domain-specific fine-tuning and human-in-the-loop validation strategies improve reliability, they often require substantial resource investments, raising questions about scalability and broader industrial applicability.
- Third, in the domain of LLM-based recommendation systems, research has made strides in enhancing personalization, fairness, and robustness. Nevertheless, challenges persist, including item popularity bias, vulnerability to adversarial inputs, and the efficiency of prompting strategies. Moreover, there is a delicate trade-off between increasing model expressiveness and maintaining strict alignment with explicit user preferences, particularly when integrating user histories or handling ambiguous queries.
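As a minimal, self-contained sketch of the retrieval step referenced above: a query is embedded, the most similar knowledge snippets are retrieved, and they are prepended to the prompt so the model can ground its answer. The bag-of-words embedding and the sample snippets are toy assumptions for illustration; production RAG systems use learned dense embeddings, a vector database, and an actual LLM call.

```python
# Toy RAG retrieval sketch. The embed() function and knowledge_base
# contents are illustrative assumptions, not any surveyed system's design.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())          # bag-of-words "embedding"

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

knowledge_base = [
    "Dubrovnik's Old Town is a UNESCO World Heritage Site.",
    "The city walls of Dubrovnik are about 1,940 meters long.",
    "Retrieval noise degrades answer quality when irrelevant text is injected.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

query = "How long are the city walls of Dubrovnik?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)  # this grounded prompt would then be sent to the LLM
```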
4.1. Limitations of Current Research
- Bias and Fairness: Despite advances such as IFairLRS, most recommender systems continue to display biases towards popular or semantically favored items. Comprehensive strategies for systematic bias mitigation are still underdeveloped.
- Robustness to Adversarial Attacks: Few studies adequately address the vulnerability of LLM-augmented systems to adversarial perturbations, though methods like RETURN highlight promising directions.
- Data Dependency and Limited Generalization: Many proposed solutions heavily rely on curated datasets or domain-specific knowledge bases, which restricts their generalizability across diverse application areas.
- Computational Efficiency: Several frameworks introduce significant computational overhead, with trade-offs between model accuracy, interpretability, and scalability often insufficiently explored.
- Evaluation Metrics: A lack of standardized, domain-specific evaluation protocols persists, particularly for multi-modal and conversational recommender systems.
4.2. Broader Implications and Future Research Directions
- Scalable hybrid architectures: There is a need for modular systems that integrate Retrieval-Augmented Generation (RAG) with knowledge graphs (KGs) and domain-specific ontologies. While models like K-RAGRec [10] and FastRAG [11] demonstrate initial success, future architectures should emphasize transferability across domains without sacrificing retrieval precision or inflating inference latency.
- Fairness metrics tailored to recommendation settings: Studies such as Jiang et al. (2024) [12] introduced item-side fairness via the IFairLRS framework, but broader adoption of such metrics is lacking. Future work should develop fairness benchmarks that explicitly account for long-tail item exposure, semantic group fairness, and user-group fairness trade-offs.
- Robustness under adversarial and noisy conditions: Despite promising frameworks like RETURN (Ning et al., 2025) [13], few works systematically test LLM-based recommendation systems under perturbed or adversarial conditions. Simulation-based adversarial testing environments should be adopted to evaluate real-world system resilience.
- Cross-domain generalization: Many current solutions rely on heavily curated or domain-specific datasets. Techniques such as iterative reasoning (StructGPT) and schema linking show promise but remain underexplored outside of narrow contexts. Research should investigate prompt adaptation, meta-learning, and modular representation learning to reduce overfitting to specific domains.
- Evaluation frameworks: There is a lack of unified, multimodal evaluation pipelines. Benchmarks such as RecBench+ (Huang et al., 2025) [14] offer a start but are limited in scope. A comprehensive framework should combine trustworthiness, transparency, hallucination resistance, response latency, and alignment with user intent.
5. LLMs for Structured and Semi-Structured Data
6. Architectures and Trends in LLM Recommenders
7. Enhancing LLMs with RAG, Knowledge Graphs, and Prompts
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AI | Artificial intelligence
GPT | Generative Pre-trained Transformer
KG | Knowledge Graph
LLM | Large Language Model
NLP | Natural Language Processing
PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RAG | Retrieval-Augmented Generation
RL | Reinforcement Learning
RS, RecSys | Recommendation System
SAR | Sustainability Augmented Reranking
References
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, M.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Annual Conference on Neural Information Processing Systems 2020, Virtual, 6–12 December 2020. [Google Scholar]
- Deloitte. Deloitte’s State of Generative AI in the Enterprise Quarter Four Report. 2024. Available online: https://www2.deloitte.com/content/dam/Deloitte/us/Documents/consulting/us-state-of-gen-ai-q4.pdf (accessed on 2 May 2025).
- McKinsey. The Economic Potential of Generative AI: The Next Productivity Frontier. 2023. Available online: https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/the%20economic%20potential%20of%20generative%20ai%20the%20next%20productivity%20frontier/the-economic-potential-of-generative-ai-the-next-productivity-frontier.pdf (accessed on 2 May 2025).
- Li, H.; Xi, J.; Hsu, C.H.; Yu, B.X.; Zheng, X.K. Generative artificial intelligence in tourism management: An integrative review and roadmap for future research. Tour. Manag. 2025, 110, 105179. [Google Scholar] [CrossRef]
- Xiao, X. MMAgentRec, a personalized multi-modal recommendation agent with large language model. Sci. Rep. 2025, 15, 12062. [Google Scholar] [CrossRef]
- Zakarija, I.; Škopljanac Mačina, F.; Marušić, H.; Blašković, B. A Sentiment Analysis Model Based on User Experiences of Dubrovnik on the Tripadvisor Platform. Appl. Sci. 2024, 14, 8304. [Google Scholar] [CrossRef]
- Brynjolfsson, E.; Li, D.; Raymond, L. Generative AI at work. Q. J. Econ. 2025, 140, 889–942. [Google Scholar] [CrossRef]
- Acemoglu, D.; Restrepo, P. Artificial intelligence, automation, and work. In The Economics of Artificial Intelligence: An Agenda; University of Chicago Press: Chicago, IL, USA, 2018; pp. 197–236. [Google Scholar]
- Wang, S.; Fan, W.; Feng, Y.; Ma, X.; Wang, S.; Yin, D. Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation. arXiv 2025, arXiv:2501.02226. [Google Scholar]
- Abane, A.; Bekri, A.; Battou, A. FastRAG: Retrieval Augmented Generation for Semi-structured Data. arXiv 2024, arXiv:2411.13773. [Google Scholar]
- Jiang, M.; Bao, K.; Zhang, J.; Wang, W.; Yang, Z.; Feng, F.; He, X. Item-side Fairness of Large Language Model-based Recommendation System. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar]
- Ning, L.-b.; Fan, W.; Li, Q. Retrieval-Augmented Purifier for Robust LLM-Empowered Recommendation. arXiv 2025, arXiv:2504.02458. [Google Scholar]
- Huang, J.; Wang, S.; Ning, L.-b.; Fan, W.; Wang, S.; Yin, D.; Li, Q. Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs. arXiv 2025, arXiv:2503.09382. [Google Scholar]
- Li, B.; Jiang, G.; Li, N.; Song, C. Research on large-scale structured and unstructured data processing based on large language model. In Proceedings of the International Conference on Machine Learning, Pattern Recognition and Automation Engineering, Singapore, 7–9 August 2024; pp. 111–116. [Google Scholar]
- Ko, H.; Yang, H.; Han, S.; Kim, S.; Lim, S.; Hormazabal, R. Filling in the Gaps: LLM-Based Structured Data Generation from Semi-Structured Scientific Data. In Proceedings of the ICML 2024 AI for Science Workshop, Vienna, Austria, 21–27 July 2024. [Google Scholar]
- Moundas, M.; White, J.; Schmidt, D.C. Prompt patterns for structured data extraction from unstructured text. In Proceedings of the 31st Pattern Languages of Programming (PLoP) Conference, Columbia River Gorge, WA, USA, 13–16 October 2024. [Google Scholar]
- Wu, X.; Tsioutsiouliklis, K. Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data. arXiv 2024, arXiv:2412.10654. [Google Scholar] [CrossRef]
- Zhong, Y.; Deng, Y.; Chai, C.; Gu, R.; Yuan, Y.; Wang, G.; Cao, L. Doctopus: A System for Budget-aware Structural Data Extraction from Unstructured Documents. In Proceedings of the Companion of the 2025 International Conference on Management of Data, Berlin, Germany, 22–27 June 2025; pp. 275–278. [Google Scholar]
- Huang, X.; Surve, M.; Liu, Y.; Luo, T.; Wiest, O.; Zhang, X.; Chawla, N.V. Application of Large Language Models in Chemistry Reaction Data Extraction and Cleaning. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 3797–3801. [Google Scholar]
- Sui, Y.; Zhou, M.; Zhou, M.; Han, S.; Zhang, D. Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida, Mexico, 4–8 March 2024. [Google Scholar]
- Lee, Y.; Kim, S.; Rossi, R.A.; Yu, T.; Chen, X. Learning to reduce: Towards improving performance of large language models on structured data. arXiv 2024, arXiv:2407.02750. [Google Scholar] [CrossRef]
- Fang, X.; Xu, W.; Tan, F.A.; Zhang, J.; Hu, Z.; Qi, Y.; Nickleach, S.; Socolinsky, D.; Sengamedu, S.; Faloutsos, C. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding—A Survey. arXiv 2024, arXiv:2402.17944. [Google Scholar]
- Zhang, Y.; Zhong, M.; Ouyang, S.; Jiao, Y.; Zhou, S.; Ding, L.; Han, J. Automated Mining of Structured Knowledge from Text in the Era of Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6644–6654. [Google Scholar]
- Zou, Y.; Shi, M.; Chen, Z.; Deng, Z.; Lei, Z.; Zeng, Z.; Yang, S.; Tong, H.; Xiao, L.; Zhou, W. ESGReveal: An LLM-based approach for extracting structured data from ESG reports. J. Clean. Prod. 2025, 489, 144572. [Google Scholar] [CrossRef]
- Jiang, J.; Zhou, K.; Dong, Z.; Ye, K.; Zhao, W.X.; Wen, J.R. StructGPT: A general framework for large language model to reason over structured data. arXiv 2023, arXiv:2305.09645. [Google Scholar] [CrossRef]
- Chen, H.; Shen, X.; Wang, J.; Wang, Z.; Lv, Q.; He, J.; Wu, R.; Wu, F.; Ye, J. Knowledge Graph Finetuning Enhances Knowledge Manipulation in Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large Language Models for Data Annotation and Synthesis: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 930–957. [Google Scholar]
- Ghiani, G.; Solazzo, G.; Elia, G. Integrating Large Language Models and Optimization in Semi-Structured Decision Making: Methodology and a Case Study. Algorithms 2024, 17, 582. [Google Scholar] [CrossRef]
- Paoli, S.D. Performing an Inductive Thematic Analysis of Semi-Structured Interviews with a Large Language Model: An Exploration and Provocation on the Limits of the Approach. Soc. Sci. Comput. Rev. 2024, 42, 997–1019. [Google Scholar] [CrossRef]
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
- Loureiro, S.M.C.; Guerreiro, J.; Friedmann, E.; Lee, M.J.; Han, H. Tourists and artificial intelligence-LLM interaction: The power of forgiveness. Curr. Issues Tour. 2024, 28, 1172–1190. [Google Scholar] [CrossRef]
- Hou, Y.; Zhang, A.; Sheng, L.; Yang, Z.; Wang, X.; Chua, T.S.; McAuley, J. Generative Recommendation Models: Progress and Directions. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 13–16. [Google Scholar]
- Lin, J.; Dai, X.; Xi, Y.; Liu, W.; Chen, B.; Zhang, H.; Liu, Y.; Wu, C.; Li, X.; Zhu, C.; et al. How Can Recommender Systems Benefit from Large Language Models: A Survey. ACM Trans. Inf. Syst. 2025, 43, 28. [Google Scholar] [CrossRef]
- Zhao, Z.; Fan, W.; Li, J.; Liu, Y.; Mei, X.; Wang, Y.; Wen, Z.; Wang, F.; Zhao, X.; Tang, J.; et al. Recommender Systems in the Era of Large Language Models (LLMs). IEEE Trans. Knowl. Data Eng. 2024, 36, 6889–6907. [Google Scholar] [CrossRef]
- Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X.; et al. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 2024, 27, 42. [Google Scholar] [CrossRef]
- Zhang, J.; Xie, R.; Hou, Y.; Zhao, X.; Lin, L.; Wen, J.R. Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach. ACM Trans. Inf. Syst. 2025, 43, 114. [Google Scholar] [CrossRef]
- Xu, L.; Zhang, J.; Li, B.; Wang, J.; Chen, S.; Zhao, W.X.; Wen, J.R. Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis. ACM Trans. Knowl. Discov. Data 2024, 19, 105. [Google Scholar] [CrossRef]
- Lian, J.; Lei, Y.; Huang, X.; Yao, J.; Xu, W.; Xie, X. RecAI: Leveraging Large Language Models for Next-Generation Recommender Systems. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 1031–1034. [Google Scholar]
- Ren, Y.; Chen, Z.; Yang, X.; Li, L.; Jiang, C.; Cheng, L.; Zhang, B.; Mo, L.; Zhou, J. Enhancing Sequential Recommenders with Augmented Knowledge from Aligned Large Language Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 345–354. [Google Scholar]
- Chu, Z.; Wang, Z.; Zhang, R.; Ji, Y.; Wang, H.; Sun, T. Improve Temporal Awareness of LLMs for Domain-general Sequential Recommendation. In Proceedings of the ICML 2024 Workshop on In-Context Learning, Vienna, Austria, 27 July 2024. [Google Scholar]
- Na, H.; Gang, M.; Ko, Y.; Seol, J.; Lee, S.g. Enhancing Large Language Model Based Sequential Recommender Systems with Pseudo Labels Reconstruction. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 7213–7222. [Google Scholar]
- Yang, T.; Chen, L. Unleashing the Retrieval Potential of Large Language Models in Conversational Recommender Systems. In Proceedings of the 18th ACM Conference on Recommender Systems, Bari, Italy, 14–18 October 2024; pp. 43–52. [Google Scholar]
- Zhu, Y.; Wan, C.; Steck, H.; Liang, D.; Feng, Y.; Kallus, N.; Li, J. Collaborative Retrieval for Large Language Model-based Conversational Recommender Systems. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 3323–3334. [Google Scholar]
- Kim, S.; Kang, H.; Choi, S.; Kim, D.; Yang, M.; Park, C. Large Language Models meet Collaborative Filtering: An Efficient All-round LLM-based Recommender System. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 1395–1406. [Google Scholar]
- Zhu, Y.; Wu, L.; Guo, Q.; Hong, L.; Li, J. Collaborative Large Language Model for Recommender Systems. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 3162–3172. [Google Scholar]
- Qu, H.; Fan, W.; Lin, S. Generative Recommendation with Continuous-Token Diffusion. arXiv 2025, arXiv:2504.12007. [Google Scholar] [CrossRef]
- Wei, C.; Duan, K.; Zhuo, S.; Wang, H.; Huang, S.; Liu, J. Enhanced Recommendation Systems with Retrieval-Augmented Large Language Model. J. Artif. Int. Res. 2025, 28, 1147–1173. [Google Scholar] [CrossRef]
- Jeong, C.; Kang, Y.; Cho, Y.S. Leveraging Refined Negative Feedback with LLM for Recommender Systems. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 1028–1032. [Google Scholar]
- Fayyazi, A.; Kamal, M.; Pedram, M. FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering for Enabling Fair LLM-Based Recommender Systems. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
- Zhao, J.; Wang, W.; Xu, C.; Ng, S.K.; Chua, T.S. A Federated Framework for LLM-based Recommendation. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 30 April 2025; pp. 2852–2865. [Google Scholar]
- Wang, J.; Karatzoglou, A.; Arapakis, I.; Jose, J.M. Large Language Model driven Policy Exploration for Recommender Systems. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, Hannover, Germany, 10–14 March 2025; pp. 107–116. [Google Scholar]
- Tan, Z.; Zeng, Q.; Tian, Y.; Liu, Z.; Yin, B.; Jiang, M. Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 6476–6491. [Google Scholar]
- Zhao, S.; Hong, M.; Liu, Y.; Hazarika, D.; Lin, K. Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Wu, Z.; Jia, Q.; Wu, C.; Du, Z.; Wang, S.; Wang, Z.; Dong, Z. RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models. arXiv 2024, arXiv:2412.11068. [Google Scholar]
- Rajput, S.; Mehta, N.; Singh, A.; Keshavan, R.H.; Vu, T.; Heldt, L.; Hong, L.; Tay, Y.; Tran, V.Q.; Samost, J.; et al. Recommender Systems with Generative Retrieval. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Yu, Y.; Qi, S.a.; Li, B.; Niu, D. PepRec: Progressive Enhancement of Prompting for Recommendation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 17941–17953. [Google Scholar]
- Wang, H.; Wu, C.; Huang, Y.; Qi, T. Learning Human Feedback from Large Language Models for Content Quality-aware Recommendation. ACM Trans. Inf. Syst. 2025, 43, 86. [Google Scholar] [CrossRef]
- Shang, F.; Zhao, F.; Zhang, M.; Sun, J.; Shi, J. Personalized recommendation systems powered by large language models: Integrating semantic understanding and user preferences. Int. J. Innov. Res. Eng. Manag. 2024, 11, 39–49. [Google Scholar] [CrossRef]
- Liu, H.; Tang, X.; Chen, T.; Liu, J.; Indu, I.; Zou, H.P.; Dai, P.; Galan, R.F.; Porter, M.D.; Jia, D.; et al. Sequential LLM Framework for Fashion Recommendation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, FL, USA, 12–16 November 2024; pp. 1276–1285. [Google Scholar]
- Liu, X.; Wang, R.; Sun, D.; Hakkani Tur, D.; Abdelzaher, T. Uncovering Cross-Domain Recommendation Ability of Large Language Models. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 2736–2743. [Google Scholar]
- Weber, I. Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT. arXiv 2024, arXiv:2409.07732. [Google Scholar] [CrossRef]
- Huang, C.; Wu, J.; Xia, Y.; Yu, Z.; Wang, R.; Yu, T.; Zhang, R.; Rossi, R.A.; Kveton, B.; Zhou, D.; et al. Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models. arXiv 2025, arXiv:2503.16734. [Google Scholar]
- Zhang, J.; Liu, Z.; Lian, D.; Chen, E. Generalization Error Bounds for Two-stage Recommender Systems with Tree Structure. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Baumann, J.; Mendler-Dünner, C. Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Tian, J.; Wang, Z.; Zhao, J.; Ding, Z. MMREC: LLM Based Multi-Modal Recommender System. In Proceedings of the 2024 19th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP), Athens, Greece, 21–22 November 2024; pp. 105–110. [Google Scholar]
- Prahlad, D.; Lee, C.; Kim, D.; Kim, H. Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 1259–1263. [Google Scholar]
- Song, S.; Yang, C.; Xu, L.; Shang, H.; Li, Z.; Chang, Y. TravelRAG: A Tourist Attraction Retrieval Framework Based on Multi-Layer Knowledge Graph. ISPRS Int. J. Geo-Inf. 2024, 13, 414. [Google Scholar] [CrossRef]
- Banerjee, A.; Satish, A.; Wörndl, W. Enhancing Tourism Recommender Systems for Sustainable City Trips Using Retrieval-Augmented Generation. In Proceedings of the Recommender Systems for Sustainability and Social Good, Prague, Czech Republic, 22–26 September 2025; pp. 19–34. [Google Scholar]
- Yang, H.; Guo, J.; Qi, J.Q.; Xie, J.; Zhang, S.; Yang, S.; Li, N.; Xu, M. A Method for Parsing and Vectorization of Semi-structured Data used in Retrieval Augmented Generation. arXiv 2024, arXiv:2405.03989. [Google Scholar] [CrossRef]
- Li, J.; Yuan, Y.; Zhang, Z. Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases. arXiv 2024, arXiv:2403.10446. [Google Scholar] [CrossRef]
- Wu, Z.; Lin, X.; Dai, Z.; Hu, W.; Shu, Y.; Ng, S.K.; Jaillet, P.; Low, B.K.H. Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars. In Proceedings of the Annual Conference on Neural Information Processing Systems 2024, Vancouver, BC, Canada, 10–15 December 2024; pp. 122706–122740. [Google Scholar]
- Ding, Y.; Fan, W.; Huang, X.; Li, Q. Large Language Models for Graph Learning. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 1643–1646. [Google Scholar]
- Wang, X.; Cui, J.; Fukumoto, F.; Suzuki, Y. Enhancing High-order Interaction Awareness in LLM-based Recommender Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 11696–11711. [Google Scholar]
- Zhang, Q.; Dong, J.; Chen, H.; Zha, D.; Yu, Z.; Huang, X. KnowGPT: Knowledge Graph based Prompting for Large Language Models. In Proceedings of the Annual Conference on Neural Information Processing Systems 2024, Vancouver, BC, Canada, 10–15 December 2024; pp. 6052–6080. [Google Scholar]
- Qiu, Z.; Luo, L.; Zhao, Z.; Pan, S.; Liew, A.W.C. Graph Retrieval-Augmented LLM for Conversational Recommendation Systems. In Proceedings of the Advances in Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 June 2025; pp. 344–355. [Google Scholar]
- Yang, W.; Some, L.; Bain, M.; Kang, B. A comprehensive survey on integrating large language models with knowledge-based methods. Knowl.-Based Syst. 2025, 318, 113503. [Google Scholar] [CrossRef]
- Benjira, W.; Atigui, F.; Bucher, B.; Grim-Yefsah, M.; Travers, N. Automated mapping between SDG indicators and open data: An LLM-augmented knowledge graph approach. Data Knowl. Eng. 2025, 156, 102405. [Google Scholar] [CrossRef]
- Choi, S.; Jung, Y. Knowledge Graph Construction: Extraction, Learning, and Evaluation. Appl. Sci. 2025, 15, 3727. [Google Scholar] [CrossRef]
- Tsaneva, S.; Dessì, D.; Osborne, F.; Sabou, M. Knowledge graph validation by integrating LLMs and human-in-the-loop. Inf. Process. Manag. 2025, 62, 104145. [Google Scholar] [CrossRef]
- Yao, L.; Peng, J.; Mao, C.; Luo, Y. Exploring Large Language Models for Knowledge Graph Completion. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Chao, W.S.; Zheng, Z.; Zhu, H.; Liu, H. Make Large Language Model a Better Ranker. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 918–929. [Google Scholar]
- Hsu, S.; Khattab, O.; Finn, C.; Sharma, A. Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Bansal, H.; Hosseini, A.; Agarwal, R.; Tran, V.Q.; Kazemi, M. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Kumar, K.; Ashraf, T.; Thawakar, O.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Torr, P.H.; Khan, S.H.; Khan, F.S. LLM Post-Training: A Deep Dive into Reasoning Large Language Models. arXiv 2025, arXiv:2502.21321. [Google Scholar] [CrossRef]
- Zhang, J. Guided Profile Generation Improves Personalization with Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 4005–4016. [Google Scholar]
- Huang, T.J.; Yang, J.Q.; Shen, C.; Liu, K.Q.; Zhan, D.C.; Ye, H.J. Improving LLMs for Recommendation with Out-of-Vocabulary Tokens. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
- Thomas, V.; Ma, J.; Hosseinzadeh, R.; Golestan, K.; Yu, G.; Volkovs, M.; Caterini, A. Retrieval & Fine-Tuning for In-Context Tabular Models. In Proceedings of the Annual Conference on Neural Information Processing Systems 2024, Vancouver, BC, Canada, 10–15 December 2024; pp. 108439–108467. [Google Scholar]
- Liao, M.; Chen, W.; Shen, J.; Guo, S.; Wan, H. HMoRA: Making LLMs More Effective with Hierarchical Mixture of LoRA Experts. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Wei, Q.; Yang, M.; Wang, J.; Mao, W.; Xu, J.; Ning, H. TourLLM: Enhancing LLMs with Tourism Knowledge. arXiv 2024, arXiv:2407.12791. [Google Scholar]
- Jeong, C. Domain-specialized LLM: Financial fine-tuning and utilization method using Mistral 7B. J. Intell. Inf. Syst. 2024, 30, 93–120. [Google Scholar] [CrossRef]
- Sordoni, A.; Yuan, E.; Côté, M.A.; Pereira, M.; Trischler, A.; Xiao, Z.; Hosseini, A.; Niedtner, F.; Le Roux, N. Joint Prompt Optimization of Stacked LLMs using Variational Inference. In Proceedings of the Annual Conference on Neural Information Processing Systems 2023, New Orleans, LA, USA, 10–16 December 2023; pp. 58128–58151. [Google Scholar]
- Xu, X.; Ye, Q.; Ren, X. Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack. In Proceedings of the Annual Conference on Neural Information Processing Systems 2024, Vancouver, BC, Canada, 10–15 December 2024; pp. 15801–15840. [Google Scholar]
- Jesson, A.; Beltran-Velez, N.; Chu, Q.; Karlekar, S.; Kossen, J.; Gal, Y.; Cunningham, J.P.; Blei, D. Estimating the Hallucination Rate of Generative AI. In Proceedings of the Annual Conference on Neural Information Processing Systems 2024, Vancouver, BC, Canada, 10–15 December 2024; pp. 31154–31201. [Google Scholar]
- Ahmed, B.S.; Baader, L.O.; Bayram, F.; Jagstedt, S.; Magnusson, P. Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing. In Proceedings of the 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Naples, Italy, 31 March–4 April 2025; pp. 200–207. [Google Scholar]
- Qian, Y.; Ye, H.; Fauconnier, J.P.; Grasch, P.; Yang, Y.; Gan, Z. MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Zhang, X.; Wang, D.; Dou, L.; Zhu, Q.; Che, W. A Survey of Table Reasoning with Large Language Models. arXiv 2024, arXiv:2402.08259. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).