1. Introduction
Artificial Intelligence (AI) chatbots have significantly enhanced the way humans interact with technology. These innovative tools, such as ChatGPT, Bing Chat, and Google Bard, have been designed and developed to provide a user-friendly interface for communication. This emergent technology has become increasingly popular and widely available for public use. It utilizes advanced Natural Language Processing (NLP) algorithms and machine learning techniques to simulate human conversation, providing users with a seamless and efficient experience. Through their ability to understand and respond to natural language inputs, these chatbots have revolutionized the way individuals from a diverse range of backgrounds interact with technology, making it easier and more efficient to obtain information and complete specialized tasks.
Large Language Models (LLMs) play a crucial role in this digital transformation. They are potent machine-learning models designed to comprehend and generate natural language by leveraging massive amounts of text and code data, enabling them to learn patterns and relationships inherent in language. This makes them capable of generating text that is often indistinguishable from that written by humans. Their versatility allows them to perform tasks such as language translation, sentiment analysis, and creative text generation [
1]. They also exhibit proficiency in sentiment analysis, distinguishing between positive and negative tones in customer reviews. Moreover, LLMs excel to a large extent in performing other NLP tasks requiring creativity, such as generating code, poems, and musical compositions [
2], as well as generating or editing images according to users’ wishes [
3], demonstrating their broad applicability. Delving into the future implications of LLMs unveils their potential impact on job markets, communication dynamics, and societal paradigms, prompting a thoughtful consideration of the evolving landscape shaped by these powerful language models. Our initiative involves leveraging LLMs as the core reasoning component to introduce ChatGeoAI as an AI-empowered geographic information system (GIS).
Geospatial analysis is a key factor in making well-informed decisions in various fields and different applications. It heavily relies on the careful collection, examination, and representation of geographic data, while GIS has long been instrumental in this realm. However, the technical intricacies of GIS platforms often pose challenges, limiting access to those without a technical background who could otherwise benefit from geospatial intelligence. Wang et al. [
4] highlighted that the conventional interfaces of GIS can be daunting for non-experts, restricting their ability to leverage geospatial analysis for decision-making.
The challenges surrounding the usability and accessibility of GIS are multifaceted, encompassing complexity, technical skill requirements, cost, and integration issues. These barriers impede the broad adoption and effective use of GIS technology, particularly among non-expert users [
5]. To address these challenges and democratize GIS technologies, integrating Generative AI and LLMs with geospatial analysis tools could be a significant transformative step, making geospatial analysis available to all.
Our innovative approach aims to bridge the gap between complex geospatial analysis and users with limited or no GIS expertise. The core idea is to use natural language processing to interpret user requests and Generative AI to automatically generate executable code for geospatial analysis, which can be run automatically to perform analyses and create resulting maps. By enhancing accessibility and usability, this approach offers several benefits for non-expert users, including the following:
Intuitive interaction: Users can engage with GIS tools using natural language requests to perform complex geospatial analyses.
Wider adoption: The accessibility of this integration encourages the adoption of GIS across diverse fields and industries.
Efficiency: Automating the translation of user requests into executable codes reduces the time and resources needed for complex geospatial analyses.
Informed decision-making: Public users can make better decisions in areas such as urban planning, environmental conservation, emergency response, and even daily activities such as booking hotels based on spatial criteria.
Interdisciplinary collaboration: Enabling various disciplines to incorporate geospatial analysis into their research promotes collaboration and innovation.
The central challenge in implementing this idea lies in refining the decision-making core of LLMs to accurately respond to user queries and adapt to domain-specific requirements (geospatial analysis, in this case), ensuring reliable and precise results. In other words, the LLMs must acquire geospatial analysis knowledge. Additionally, geospatial tasks involve multiple operations that must be executed in a specific order. This implies that the LLMs should be equipped with the reasoning capacity to generate operations and execute them in the right order to perform a required geospatial analysis correctly.
In this paper, we propose a methodology for applying state-of-the-art LLMs to reason about geospatial analysis. Our specific aim is to design and evaluate a system architecture and methodology for an AI-empowered GIS called ChatGeoAI. This system leverages LLMs, which are fine-tuned on GIS tasks and contextualized through the integration of domain-specific ontologies, to understand natural language queries, generate executable code, and perform geospatial analysis to support the decision-making of users with no or limited GIS skills. To achieve this aim, the following research questions should be answered:
How can natural language processing be used efficiently to interpret and convert geospatial queries into executable GIS code?
What role do large language models play in enhancing the accessibility and usability of GIS for non-expert users, and what level of accuracy can they achieve in solving GIS tasks?
How can the fine-tuning of LLMs and the incorporation of geospatial entity recognition improve the accuracy and contextual relevance of responses generated by geospatial AI systems?
The remainder of this paper is organized as follows. The second section delves into the background of LLMs, exploring their applications, capabilities, and inherent limitations. Particular attention is paid to a literature review on LLM applications in geospatial analysis, highlighting both their potential and current challenges. The section concludes by situating this paper in context and clarifying its contribution to knowledge. Following this, the third and fourth sections describe, respectively, the architecture of the proposed system and its development steps, providing a detailed overview of its design and functionality. The subsequent two sections then present the results gleaned from the system’s performance analysis and in-depth discussions of its efficacy and limitations. Finally, the paper concludes by summarizing key insights, implications, and avenues for future research and development of such systems.
2. Materials and Methods
2.1. NLP Techniques for Geospatial Analysis
NLP techniques in geospatial analysis offer a systematic and interpretable way to extract and understand geographical information from text. Rule-based approaches, which rely on predefined linguistic rules and patterns, are commonly used to extract geospatial information from natural language. These rules, designed by domain experts, identify specific syntactic or semantic patterns that point to geospatial references or queries. While effective for straightforward, well-defined queries, these methods struggle with complex or ambiguous cases and require ongoing manual rule creation and maintenance. On the other hand, statistical models extract geospatial information by analyzing large text datasets and learning relationships between words, phrases, and geospatial entities. Commonly used techniques include Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. Trained on annotated datasets, these models can identify and extract geospatial entities, such as locations, addresses, and coordinates, based on observed statistical patterns.
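As an illustration of the statistical approach described above, the following minimal sketch uses spaCy’s pre-trained pipeline to pull place names and a distance cue out of a free-text request; the model name, the sample query, and the use of spaCy’s generic GPE/LOC/FAC labels as a stand-in for a dedicated geospatial NER model are assumptions made purely for illustration.
```python
# Minimal sketch: statistical NER extraction of geospatial references with spaCy.
# Assumes the small English model is installed (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find all pharmacies within 1000 m of the Grand Hotel in Lund, Sweden")

# Keep entities whose labels usually denote places: geopolitical entities, locations, facilities.
places = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")]

# A crude numeric cue for distance constraints.
distances = [tok.text for tok in doc if tok.like_num]

print(places)     # e.g. [('the Grand Hotel', 'FAC'), ('Lund', 'GPE'), ('Sweden', 'GPE')]
print(distances)  # e.g. ['1000']
```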
Machine learning has significantly enhanced spatial analysis by providing more sophisticated techniques for extracting, interpreting, and analyzing geographical information from text. For example, Syed et al. [
6] developed a platform for extracting and geo-referencing spatial information from textual documents. Hu et al. [
7] explored the concept of geo-text data, which encompasses texts that are linked to geographic locations either explicitly through geotags or implicitly through place mentions. Their work systematically reviewed numerous studies that have utilized geo-text data for data-driven research. Yin et al. [
8] developed an NLP-based question-answering framework that facilitates spatio-temporal analysis and visualization, although it considers only a limited number of operations using keyword-operation mapping.
NER models, such as Conditional Random Fields (CRFs) and deep learning-based approaches, are often used for identifying and classifying place-names [
9]. Geocoding algorithms, which leverage machine learning, convert textual descriptions into geographical coordinates and can achieve high accuracy [
10]. Spatial entity linking and toponym resolution models use contextual analysis and disambiguation techniques to associate place names with specific locations in geographical databases [
11]. Geospatial topic modeling applies NLP to detect and analyze the semantic, spatial, and temporal dynamics of geo-topics within geospatially-tagged documents [
12]. Sentiment analysis models classify and map sentiments expressed in text, offering insights into geographical trends. Spatial language understanding techniques extract and interpret spatial relationships described in the text, while spatial data fusion models integrate data from various sources for comprehensive analysis [
13].
A recent paper conducted an extensive evaluation of 27 widely used methods for geoparsing and location reference recognition, including rule-based, gazetteer matching-based, statistical learning, and hybrid approaches [
14]. This study concludes that deep learning is currently the most promising technique for location reference recognition. Furthermore, integrating multiple approaches through a voting mechanism enhances robustness and mitigates the limitations of individual methods. Performance varies depending on the type of text and location references, with notable differences in computational efficiency. Overall, machine learning algorithms outperform rule-based techniques, offering greater accuracy, efficiency, and depth in geospatial analysis across these diverse tasks.
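The voting idea mentioned above can be illustrated with a toy sketch: several independent location-reference recognizers each propose candidate spans, and a span is kept only when enough of them agree. The recognizer functions named in the usage comment are hypothetical placeholders, not components described in the cited study.
```python
# Toy majority-vote ensemble over hypothetical location-reference recognizers.
from collections import Counter


def majority_vote(text, recognizers, threshold=None):
    """Return location spans detected by at least `threshold` recognizers (default: simple majority)."""
    recognizers = list(recognizers)
    threshold = threshold or (len(recognizers) // 2 + 1)
    votes = Counter(span for rec in recognizers for span in rec(text))
    return [span for span, n in votes.items() if n >= threshold]


# Usage with three hypothetical recognizers (rule-based, gazetteer-based, deep learning-based):
# places = majority_vote(sentence, [rule_based_ner, gazetteer_ner, deep_ner])
```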
Existing approaches connecting NLP techniques with geospatial analysis still have several limitations and challenges that hinder their effectiveness and scalability, which can be summarized as follows:
Low level of semantic understanding: A major limitation is the inadequate semantic understanding of current approaches. Geospatial NLP systems struggle with nuanced queries and ambiguous language, which may lead to incorrect or incomplete results.
Inefficiency in query processing: Efficiently processing natural language queries over geospatial data is crucial. Traditional query methods are not well-suited for complex natural language queries. Query optimization, including rewriting and execution planning, is vital to ensure efficient retrieval of geospatial information. Innovative techniques for processing natural language queries and optimizing query execution are needed to improve system performance and responsiveness.
Struggling with data integration and handling heterogeneity: Geospatial data often come from diverse sources with varying formats, schemas, and spatial reference systems. Integrating and harmonizing these heterogeneous data sources with natural language interfaces is challenging. Techniques such as semantic mapping, data fusion, and data interoperability are necessary to address these challenges and enable seamless integration and retrieval of geospatial information.
Limited training data and domain-specific knowledge: Training effective NLP models for geospatial applications requires substantial labeled training data, which is costly and time-consuming to acquire. Additionally, geospatial domains have specific terminologies and concepts not well-represented in general-purpose language models. Building domain-specific knowledge bases and leveraging transfer learning techniques can help overcome these challenges and improve the performance of geospatial NLP systems.
In addition to these limitations, many existing methods encounter scalability issues when handling large volumes of geospatial data. Usually, the datasets involved are vast and complex, making natural language query processing computationally intensive. Scaling systems to manage high data volumes and real-time queries remains a significant challenge. Therefore, developing efficient retrieval mechanisms is crucial for optimizing both performance and scalability.
In the case of solving GIS tasks through code generation, NLP is often applied only to structured queries. One example is the extraction of data from complex datasets (e.g., a transportation database), where semantic role classification (a specific NLP technique) is used to convert queries stated in natural human language into formal database queries [
15].
LLMs can significantly enhance the integration of natural language processing with geospatial analysis thanks to their extensive training on diverse text corpora. These models excel in semantic understanding, which enables them to interpret complex queries with nuanced meanings more accurately. This capacity helps reduce semantic errors and ambiguities, making them highly effective at grasping contextual nuances in geospatial queries.
In terms of processing efficiency and data integration, LLMs excel in converting natural language into actionable analysis queries or code by understanding user intent and extracting relevant entities and conditions from complex sentences. Their advanced capabilities also make it easier to integrate heterogeneous data sources, leveraging semantic mapping and data fusion techniques to handle diverse schemas and formats seamlessly. This flexibility is crucial for harmonizing and integrating various geospatial datasets. LLMs also show promise in optimizing query execution by generating intermediate code and representations that align closely with the underlying data and software, streamlining query execution plans.
LLMs further address challenges related to the scalability of geospatial data processing and the need for extensive domain-specific training data. By utilizing techniques such as transfer learning and knowledge augmentation, LLMs can be fine-tuned with relatively small datasets to improve their domain-specific performance. Furthermore, their deployment in distributed computing environments enables large-scale data processing, essential for managing large volumes of geospatial data and real-time query processing.
2.2. Large Language Models
Pre-trained Language Models (PLMs) such as ELMo and BERT [
16] have revolutionized natural language processing by capturing context-aware word representations through pre-training on large-scale, unlabeled corpora. ELMo introduced bidirectional LSTM networks for initial pre-training, later fine-tuned for specific tasks, while BERT employed Transformer architecture with self-attention mechanisms for similar purposes. These models established the “pre-training and fine-tuning” paradigm, which inspired subsequent PLMs such as GPT-2 [
17] and BART [
18] to refine pre-training strategies and excel in fine-tuning for specific downstream tasks.
Scaling PLMs, by increasing the model size or expanding the training data, has often enhanced their performance on downstream tasks, as supported by the scaling law [
19]. Notable examples of progressively larger PLMs include GPT-3 (with 175 billion parameters) and PaLM (with 540 billion parameters), both of which have demonstrated emergent capabilities [
20]. Although these LLMs maintain similar architectures and pre-training tasks to smaller models such as BERT (330 million parameters) and GPT-2 (1.5 billion parameters), their large scale leads to distinct behaviors. For instance, LLMs, such as GPT-3, showcase emergent abilities in handling complex tasks, as shown by the adeptness of GPT-3 in few-shot learning through in-context adaptation [
20].
The typical architecture of LLMs consists of a layered neural network that integrates embedding, recurrent, feedforward, and attention layers [
21]. The embedding layer converts each word in the input text into a high-dimensional vector, capturing both semantic and syntactic information to facilitate contextual understanding. Subsequently, the feedforward layers apply nonlinear transformations to the input embeddings, enabling the model to learn higher-level abstractions. Recurrent layers process the input text sequentially, maintaining a dynamic hidden state that captures dependencies between words in sentences [
21]. The attention mechanism enhances the model’s focus on different parts of the input text, allowing it to selectively attend to relevant portions for more precise predictions. In summary, the architecture of LLMs is meticulously designed to process input text, extract its meaning, decipher word relationships, and ultimately generate precise predictions [
22].
A key advantage of LLMs is their capacity as few-shot learners, meaning they can grasp new tasks or understand novel information with only a minimal number of examples or “shots”. Unlike traditional models that require large datasets for effective learning, few-shot learners excel at generalizing from limited data, making them highly efficient and adaptable in scenarios with constrained data [
20]. This was notably demonstrated in an experiment by Zhou et al. [
23], who outperformed GPT-3 DaVinci003 by fine-tuning a LLaMA 1 model (with only 65 billion parameters) on just 1000 high-quality samples.
2.3. Applications of LLMs and Their Limitations
In practical implementations, LLMs and their associated products have been used in various applications. For example, ChatGPT has been used in robotics, focusing on prompt engineering and dialog strategies for different tasks ranging from basic logical and mathematical reasoning to complex tasks such as aerial navigation [
24]. Liang et al. [
25] presented the vision, key components, feasibility, and challenges of building an AI ecosystem that connects foundation models with millions of APIs for task completion. The idea suggests using existing foundation models as the core of the system and APIs of other AI models and systems as sub-task solvers to be able to solve specialized tasks in both digital and physical domains. Li et al. [
26] used ChatGPT to develop a generic interactive programming framework that aims to increase efficiency by automatic modeling, coding, debugging, and scaling in the energy sector. They argued that for complex tasks where engineers have no prior knowledge or totally new problems without readily available solutions, ChatGPT can reduce the learning cost by recommending appropriate algorithms and potential technology roadmap while auto-coding each step.
In the pursuit of building autonomous agents harnessing the power of LLMs, Richards (2023) [
27] introduced Auto-GPT. This agent-based application is equipped to independently initiate prompts and autonomously execute tasks using a blend of automated Chain of Thought prompting and reflective processes. This application possesses the capability to generate its own code and run scripts, enabling recursive debugging and development. In the same vein, Nakajima [
28] developed BabyAGI, which is an AI-empowered task management system leveraging OpenAI and embedding-based databases such as Chroma or Weaviate to streamline task creation, prioritization, and execution. BabyAGI features multiple LLM-based agents, each with distinct functions: one generates new tasks based on previous outcomes, another prioritizes tasks, and a third completes them while aligning with the specified objectives. To simplify the creation of such autonomous agents, Hong et al. [
29] developed MetaGPT, a meta-programming framework optimizing LLM-based multi-agent collaborations for complex tasks. MetaGPT encodes Standardized Operating Procedures into prompt sequences, streamlining workflows and reducing errors. Employing an assembly line paradigm, it efficiently delegates roles to agents, breaking down tasks into manageable subtasks.
2.4. LLMs for Geospatial Analysis
Large pre-trained models, known as foundation models, are gaining prominence in the field of GIS. In geospatial analysis, a foundation model is a sizable pre-trained machine learning model that can be fine-tuned for specialized tasks with minimal additional training, often through zero-shot or few-shot learning. Mooney et al. [
30] evaluated the performance of ChatGPT on a real-world GIS exam to assess its understanding of geospatial concepts and its ability to answer related questions. The results showed that both GPT-3.5 and GPT-4 could pass an introductory GIS exam, excelling in basic GIS data models. Notably, GPT-4 demonstrated significant improvements, scoring 88.3% compared with the 63.3% score of GPT-3.5 due to its more accurate handling of complex questions involving, for example, distance calculations. However, both models struggled with numerical computations, highlighting the limitations of LLMs in handling spatial analysis tasks that require advanced mathematical reasoning.
Another study [
31] explored the ability of GPT-4 to evaluate spatial information within the context of the Fourth Industrial Revolution (Industry 4.0). While ChatGPT exhibited a solid understanding of spatial concepts, its responses tended to emphasize technological and analytical aspects. The authors also raised concerns about potential hallucinations and fabrications, underscoring the need for ground truth validation when deploying LLMs in Industry 4.0 applications. Despite these challenges, the interpretability and broad training of ChatGPT offer promising opportunities for developing user-friendly, interactive GeoAI applications in spatial science, data analysis, and visualization.
Several studies have explored the potential of ChatGPT in related fields, such as remote sensing. Agapiou and Lysandrou [
32] conducted a literature review on using ChatGPT in remote sensing archaeology, while Guo et al. [
33] introduced “Remote Sensing ChatGPT,” which integrates ChatGPT and visual models to automate remote sensing interpretation tasks. This system aims to make remote sensing more accessible to researchers across different disciplines by understanding user queries, planning tasks, and providing responses. Quantitative and qualitative evaluations indicate that Remote Sensing ChatGPT demonstrated precision and efficiency in some tasks while achieving acceptable performance in others. IBM and NASA have jointly released a new geospatial foundation model (GFM) named Prithvi [
34], developed using self-supervised learning techniques on Harmonized Landsat and Sentinel-2 (HLS) data from across the continental United States. Prithvi, trained on satellite imagery that captures essential spectral signatures of the Earth’s surface, holds greater potential for enhancing geospatial tasks compared with models trained on other types of natural images.
Several studies have used LLMs to conceptualize or develop AI-empowered GIS. One such study explored the use of LLMs, such as ChatGPT and Bard, for interacting with geospatial data through natural language. This study presented a framework for LLM training, SQL query generation, and response parsing, demonstrating promising results in generating SQL code for spatial queries [
35]. A more recent study introduced a framework for autonomous GIS agents, which leverages LLMs to automate the discovery and retrieval of geospatial data for spatial analysis and cartography [
36]. This framework autonomously generates, executes, and debugs code to retrieve data from selected sources, guided by metadata and technical handbooks. In a similar vein, Li et al. [
37] developed LLM-Geo, a prototype system based on the GPT-4 API. LLM-Geo efficiently generated outcomes such as aggregated numbers, maps, and charts, significantly reducing manual processing time. Although the system is still in its early stages and lacks features such as logging and code testing, it shows great potential for advancing AI-empowered GIS. In addition, autonomous GIS frameworks that automate geospatial data retrieval and processing, such as MapGPT, are being designed for specific real-world applications. MapGPT utilizes LLMs to improve pathfinding by integrating visual and textual information with dynamically updated topological maps, enabling agents (robots) to perform adaptive path planning and multi-step navigation [
38]. Mai et al. [
39] examined the performance of various Foundation Models across a range of geospatial tasks, including Geospatial Semantics, Urban Geography, Health Geography, and Remote Sensing. The results showed that for tasks primarily reliant on text data, such as toponym recognition and location description recognition, task-agnostic LLMs outperformed task-specific models in zero-shot or few-shot learning scenarios. However, for tasks incorporating multiple data modalities, such as street view image-based urban noise intensity classification, POI-based urban function classification, and remote sensing image scene classification, existing foundation models lagged behind task-specific models. This study also emphasized the importance of improving computational efficiency during both the training and fine-tuning phases of large models. Additionally, this study highlighted the challenges of incorporating external knowledge into models without accessing their internal parameters, pointing to the complexities involved in leveraging supplementary information to enhance model performance.
Another study explored ChatGPT’s potential in map design and production using both thematic and mental mapping approaches [
40]. It demonstrated the advantages of LLMs in enhancing efficiency and fostering creativity but also acknowledged limitations such as user intervention dependency and unequal benefits.
To lower the barrier for non-professional users in addressing geospatial tasks, Zhang et al. [
41] introduced GeoGPT, a framework that combines the semantic understanding capabilities of LLMs with established tools from the GIS community. GeoGPT utilizes the LangChain framework with GPT-3.5-turbo as the controlling agent and integrates classical GIS operations such as Buffer, Clip, and Intersect alongside professional tools such as land use classification. This enables GeoGPT to handle a variety of geospatial tasks, such as data crawling, facility placement, spatial queries, and mapping. However, GeoGPT faces limitations in fine-tuning for geospatial knowledge. In fact, generic LLMs often lack precise geographic knowledge, as demonstrated by challenges in query-POI matching [
39]. Additionally, GeoGPT does not use an LLM specialized in coding, which could offer greater reliability in manipulating GIS tools directly and allow for a broader range of operations beyond its predefined set.
Our paper shares a similar objective to Zhang et al. [
41] but employs a different methodology. Rather than providing the model with a predefined pool of tools, we fine-tune the model on PyQGIS, enhancing its ability to learn a wide range of potential operation combinations.
Another study, which is related to our work, presented a framework that integrates LLMs with GIS and constrains them with a flood-specific knowledge graph to improve public understanding of flood risks through natural language dialogues. The LLM generates code that runs in ArcPy to perform flood analysis and provide personalized information aligned with domain-specific knowledge [
42]. While this study focuses on flood risk mapping, our paper proposes a system and methodology that can be used in any application.
It is worth noting that the commercial offerings of OpenAI, such as ChatGIS, GIS Expert, and GIS GPT, already provide users with general knowledge of geography, geology, and GIS through natural language conversation. Unlike our proposed system, ChatGeoAI, these models are not specifically designed to perform geospatial analysis. Instead, they aim to enhance user understanding of domain-specific concepts and provide generic solutions for simple tasks. These GPT models have not been fine-tuned for task-specific applications but use techniques such as Retrieval-Augmented Generation (RAG) to incorporate external knowledge dynamically. While this approach enables them to offer insights across a broad spectrum of topics, they do not perform specific geospatial analysis tasks, which is the goal of our system.
The existing literature reveals some gaps in applying LLMs to geospatial analysis. Current studies predominantly focus on using LLMs for tasks such as text generation and knowledge extraction via structured queries or integrating LLMs with GIS tools, where their role is limited to interpreting user intent and selecting operations. This leaves the potential of LLMs for direct code generation in geospatial analysis largely unexplored. Moreover, the few studies that use LLMs for code generation, in this context, do not consider fine-tuning and the integration of domain-specific knowledge, which could significantly enhance their capabilities.
2.5. Contribution to Knowledge
Generative AI, encompassing technologies capable of generating diverse types of data using various model architectures, intersects with two major AI domains: NLP and Computer Vision. NLP focuses on processing and generating text, while Computer Vision handles visual content. Within this intersection, LLMs primarily specialize in text processing and generation, leveraging transformer-based architectures.
In this study, we focus on integrating LLMs with Geographic Information Systems (GIS), a field that greatly benefits from advanced AI techniques to enhance geospatial analysis and decision-making. Our work represents a novel approach at the intersection of LLMs and GIS, using fine-tuned LLMs, enhanced by Generative AI methodologies, to perform GIS tasks.
Our paper offers several contributions. First, we extend the capabilities of LLMs by enabling them to control external tools such as QGIS (used in this study, version 3.30). Second, we fine-tune an LLM, specifically Llama 2, for domain-specific geospatial tasks, improving its performance through named entity extraction methods that provide more accurate context. Additionally, we introduce and evaluate a system architecture and methodology called ChatGeoAI, designed to create an AI-empowered GIS. Finally, we test and assess the proposed system’s performance, focusing on its ability to understand user requirements and execute a sequence of relevant functions to conduct geospatial analyses in response to those needs.
4. Results
In this section, we report the results of the evaluation of the proposed system and its contribution to improved performance compared with the baseline Llama 2 model. We begin by assessing our architecture, with a particular focus on the impact of fine-tuning on performance. Subsequently, we evaluate the system across multiple queries to determine its ability to generate accurate and executable code.
4.1. System Architecture Analysis
To assess the proposed system, we evaluate the impact of fine-tuning on the Llama 2 model. To ensure robust model performance and mitigate overfitting, we closely monitored the model’s performance on the validation set.
Figure 4 shows the training and validation loss curves, illustrating a substantial decrease from 1.062 to 0.0129 and from 1.16 to 0.09, respectively. This signifies that fine-tuning improves the Llama 2 model’s ability to generate code effectively. Such progression is in harmony with the inherent nature of LLMs as few-shot learners, supported by research indicating that approximately 1000 prompts can be adequate for fine-tuning LLMs [
14].
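For context, the snippet below sketches how parameter-efficient fine-tuning of a Llama 2 checkpoint can be configured with the Hugging Face transformers and peft libraries; the base checkpoint name and all hyperparameters are illustrative assumptions and not the exact configuration used to produce the loss curves in Figure 4.
```python
# Sketch: LoRA-based fine-tuning setup for a Llama 2 checkpoint (illustrative hyperparameters).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                    # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],             # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                   # only a small fraction of weights is trained
# The adapted model is then trained on prompt/PyQGIS-code pairs while monitoring validation loss.
```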
A reduction in loss alone does not fully capture the quality and accuracy of the generated code. Therefore, we assessed the code quality using various natural language processing metrics to obtain a more comprehensive evaluation and capture diverse aspects such as structural similarity, semantic accuracy, syntax correctness, and data flow consistency. This multi-metric approach provides a holistic view of the system’s performance and reliability, highlighting specific strengths and weaknesses. Additionally, using multiple metrics allows for comparing performance against the baseline across various dimensions, thus ensuring a balanced and thorough evaluation.
Table 1 shows the performance comparison of baseline Llama 2 and ChatGeoAI using the CodeBLEU metric. This metric comprises four sub-metrics: N-gram Match Score, Weighted N-gram Match Score, Syntax Match Score, and Data Flow Score. Assigning equal weight to all these aspects, ChatGeoAI achieved a CodeBLEU score of 23.58%, which is 2.35 points higher than the baseline Code-Llama 2. CodeBLEU scores typically range from 0 to 100, depending on the complexity of the tasks and models used. Scores between 20–30 are generally considered reasonable, with scores exceeding 40% indicating good performance. Ren et al. evaluated various LLMs on text-to-code tasks, achieving scores between 18.04 and 30.96% [
53]. Scores higher than 30 are often achieved when tested on simpler data such as CoNaLa, which includes simple, generic programming problems, such as sorting a list or file operations, that typically involve short, one-line code snippets that are easier for models to generate accurately from well-defined intents [
58]. So, the achieved score indicates moderate performance, which we analyze further through its sub-metrics. ChatGeoAI achieves slightly higher values in terms of N-gram Match Score (6.70 vs. 5.62) and Weighted N-gram Match Score (9.18% vs. 6.81%). Both metrics are typically low for code-generation models due to high syntax variability and structural differences in programming languages; the slightly higher values nevertheless indicate that ChatGeoAI generates code more similar in content to the reference code than the baseline model does. The Syntax Match Score assesses the syntactical correctness of the generated code by evaluating how well it matches the reference code’s syntax tree. By achieving a score of 28.19%, which is approximately 10% higher than the baseline, our system adheres more closely to correct syntax rules, making its output more likely to be executable and understandable. The Dataflow Match Score measures the accuracy of the data flow in the generated code; a higher score means the logical flow of data (variables, functions, etc.) is more consistent with the reference code, thus ensuring correctness in functionality. A 7.2% increase in this metric can potentially enhance the logical flow of the code and its correctness, which makes the proposed system more accurate. To summarize, the achieved values in CodeBLEU and its components indicate that the generated code maintains a reasonable structural resemblance to the reference code, with well-aligned data flow, despite the discrepancy in token sequences, likely caused by varying variable names and code structuring practices. At the character level, ChatGeoAI outperformed the baseline, achieving a ChrF score of 26.43% compared with the baseline score of 23.86%, suggesting better fine-grained similarity and the capture of more character-level details and nuances. To consolidate this interpretation of the syntax matching, we use semantic similarity metrics, specifically BERTScore.
ChatGeoAI outperforms the baseline Llama 2 model in both BERTScore and ROUGE-L metrics, as depicted in
Table 2. BERTScore, which uses BERT embeddings to evaluate similarity, shows a 4-point increase in recall for ChatGeoAI, suggesting that it generates a broader range of relevant content compared with the baseline. Precision also improves, though less markedly, indicating that the exactness of the generated instances remains relatively close to the baseline even as ChatGeoAI produces a wider range of relevant content or code syntax. The BERTScore F1-score of 83.56% demonstrates a balanced improvement in recall and precision over the baseline and reflects good performance, as scores above 80% are usually considered excellent.
ROUGE-L reflects the ability of the model to produce sequences of words that match the reference text by measuring the longest common subsequences. ChatGeoAI also outperformed baseline Llama 2 in ROUGE-L recall (36.71% vs. 32.53%), suggesting that ChatGeoAI is better at generating sequences of words that match those in the reference text, with somewhat better precision as well (29.66% vs. 25.38%). The F1-score of the ROUGE-L metric, at 32.81%, indicates a moderate degree of sequence overlap with the reference code. Typically, ROUGE-L scores range between 20% and 50%, with higher scores in this range achievable for simple and generic programming tasks such as those in the CoNaLa dataset [
55].
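For reference, the snippet below shows one way the surface-level and semantic metrics discussed above can be computed for a single generated/reference pair using the sacrebleu, bert-score, and rouge-score packages; the code pair is invented, and the exact evaluation pipeline used in this study may differ.
```python
# Sketch: computing ChrF, BERTScore, and ROUGE-L for one generated/reference code pair.
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score
from rouge_score import rouge_scorer

generated = "buf = hotel_geom.buffer(1000, 25)"
reference = "buffer_geom = grand_hotel.geometry().buffer(1000, 25)"

chrf = CHRF().sentence_score(generated, [reference]).score           # character n-gram F-score
P, R, F1 = bert_score([generated], [reference], lang="en")           # semantic similarity via BERT
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"]

print(f"ChrF={chrf:.1f}  BERTScore-F1={F1.item():.3f}  ROUGE-L F1={rouge_l.fmeasure:.3f}")
```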
To summarize, ChatGeoAI demonstrates slight to significant improvements over the baseline across all metrics. In addition, the high BERTScore and decent syntax and dataflow sub-metrics of CodeBLEU indicate good semantic understanding, structure, and logic, which signifies that the generated code can effectively capture the intended functionality despite differences in implementation details. Moderate scores in the N-gram matching component of CodeBLEU, ChrF, and ROUGE-L show the acceptable structure of the code but also some fine-grained variability. This variability is likely due, to some extent, to the expected differences in tokens, variable names, and coding style, which are typical due to the flexible nature of coding. In the next section, we will evaluate the impact of this fine-grained variability on the correctness and executability of the generated code.
4.2. System Performance Analysis
The comparative evaluation between Baseline Llama 2 and ChatGeoAI highlights significant enhancements in the performance of ChatGeoAI across various metrics, as shown in
Table 3. ChatGeoAI demonstrated a superior initial success rate of 25% on the first trial, a notable increase from the 15% baseline of Llama 2, indicating its enhanced ability to accurately process queries on the first attempt due to advanced NLP capabilities. Furthermore, ChatGeoAI required fewer average trials to achieve success, with only 8 compared with the 14 of baseline Llama 2, suggesting better efficiency in refining and correcting user queries through iterative interactions. This efficiency is critical in reducing user frustration and increasing engagement. Additionally, ChatGeoAI showed a lower rate of permanent failures at 20%, as opposed to 30% for Baseline Llama 2, reflecting its superior error handling and adaptive learning mechanisms that effectively manage complex or ambiguous queries. Overall, these improvements mark ChatGeoAI as a more robust, user-friendly system and valuable for practical applications where quick geospatial analysis is required.
4.3. System Demonstration
In this section, we present the results obtained from running the system using the queries designed for system evaluation. Due to space constraints, the results of just three queries with different levels of complexity (easy, intermediate, and difficult) have been described as examples. Further demonstrations have been provided in
Appendix A. The complexity of the query is determined based on three criteria: the number of steps, the complexity of the workflow, and the complexity of the underlying algorithm. For each query, we first detail the procedural steps executed within the code to complete the task. Next, we present the resulting map visualizations to demonstrate how users can visualize the output. This comprehensive approach enables a nuanced analysis of the system’s performance across different levels of query complexity, providing insights into both its capabilities and limitations.
Here is a detailed step-by-step breakdown of how the ChatGeoAI system would process the query “I need a list of pharmacies which are within 1000 m of the Grand hotel in Lund, Sweden”:
1. User Interface Component: The user enters the query into the natural language input module: “I need a list of pharmacies which are within 1000 m of the Grand hotel in Lund, Sweden”.
2. Task Definition: The system captures the user’s geospatial analysis task specification: finding pharmacies within a specific distance from a landmark.
3. Query Processing: Preprocessing and normalization using NLP techniques transform the raw query into a structured and clean format. The following operations standardize language variations: lowercasing, tokenization, stop-word removal, lemmatization, punctuation removal, and normalization (see the sketch after this list). The output is:
[“need”, “list”, “pharmacy”, “within”, “1000”, “meter”, “grand”, “hotel”, “lund”, “sweden”]
4. Named Entity Recognition (NER): The system identifies and extracts entities from the preprocessed text. The output is:
“Countries: Sweden
Cities: Lund
Specific Locations: Grand Hotel in Lund
Commercial and Healthcare Entities: Pharmacies
Proximity/Spatial Relationships: Within 1000 m”
These entities help contextualize and understand the query within WorldKG, enabling accurate and relevant results based on the spatial and relational context provided.
5. Enriched Context-Aware Query: Integration of the original query with the extracted named entities.
6. Llama 2 Code Generation: The enriched semantic context is fed into the fine-tuned Llama 2 to generate the necessary code for geospatial analysis.
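The following compact sketch illustrates steps 3–6 above for this query. The NLTK-based preprocessing, the unit-normalization table, the structure of the entity dictionary, the prompt template, and the path to the fine-tuned checkpoint are all assumptions made for illustration; ChatGeoAI’s exact implementation may differ.
```python
# Illustrative sketch of steps 3-6 (preprocessing, NER enrichment, code generation).
# Requires the NLTK punkt, stopwords, and wordnet data to be downloaded.
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import AutoModelForCausalLM, AutoTokenizer

query = "I need a list of pharmacies which are within 1000 m of the Grand hotel in Lund, Sweden"

# Step 3: preprocessing and normalization.
lemmatizer, stop_words = WordNetLemmatizer(), set(stopwords.words("english"))
unit_map = {"m": "meter", "km": "kilometer"}                     # assumed unit-normalization table
tokens = [t for t in word_tokenize(query.lower()) if t not in string.punctuation]
tokens = [unit_map.get(t, t) for t in tokens]                    # normalize units before filtering
tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
# tokens -> ['need', 'list', 'pharmacy', 'within', '1000', 'meter', 'grand', 'hotel', 'lund', 'sweden']

# Step 4: named entities, written out here as the output of the NER stage.
entities = {"Countries": "Sweden", "Cities": "Lund",
            "Specific Locations": "Grand Hotel in Lund",
            "Healthcare Entities": "Pharmacies",
            "Proximity": "within 1000 m"}

# Steps 5-6: enriched context-aware prompt fed to the fine-tuned model.
model_dir = "chatgeoai-llama2-pyqgis"                            # hypothetical checkpoint path
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
prompt = ("Write PyQGIS code that answers the user request.\n"
          f"Request: {query}\n"
          f"Recognized entities: {entities}\n"
          "Code:\n")
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tok.decode(output[0], skip_special_tokens=True))
```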
Table 4 shows the procedural steps executed by the fine-tuned Llama 2 of ChatGeoAI to provide a response to the user. The complexity level of the query is considered easy because it involves a limited number of steps (six) in a single chain without any hierarchical structure in the workflow, and only basic GIS operations are needed to run the analysis.
Looking at the 5th step in
Table 4, an interesting observation is that the generated code has calculated the distances from the Grand Hotel to the pharmacies and incorporated this information as a field within the newly created layer (a temporary memory layer), even though this was not explicitly requested by the user. The procedural steps respond accurately to the user’s intent. Regarding the code itself, there is potential for further optimization from an expert’s perspective. For instance, instead of iterating over all features, a spatial index could be created or utilized for the pharmacies’ dataset. This approach would allow faster identification of the pharmacies within a certain radius of the Grand Hotel. Additionally, experts might refine the process by adding a spatial query after applying the initial buffer to ensure that only pharmacies genuinely within 1000 m are considered.
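A possible form of this expert optimization is sketched below: a QgsSpatialIndex narrows the candidate pharmacies with a cheap bounding-box test before the exact distance check. The layer names, the field name, and the assumption of a projected CRS in metres are illustrative.
```python
# Sketch: spatial-index pre-filtering of pharmacies around the Grand Hotel (PyQGIS).
from qgis.core import QgsProject, QgsSpatialIndex, QgsFeatureRequest

pharmacies = QgsProject.instance().mapLayersByName("pharmacies")[0]   # placeholder layer names
hotels = QgsProject.instance().mapLayersByName("hotels")[0]

grand_hotel = next(f for f in hotels.getFeatures() if f["name"] == "Grand Hotel")
buffer_geom = grand_hotel.geometry().buffer(1000, 25)                 # 1000 m buffer (metric CRS assumed)

index = QgsSpatialIndex(pharmacies.getFeatures())                     # build the index once
candidate_ids = index.intersects(buffer_geom.boundingBox())           # cheap bounding-box pre-filter

request = QgsFeatureRequest().setFilterFids(candidate_ids)
nearby = [f["name"] for f in pharmacies.getFeatures(request)
          if f.geometry().distance(grand_hotel.geometry()) <= 1000]   # exact distance check
print(nearby)
```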
Figure 5 shows the output map in the ChatGeoAI user interface. The names of the pharmacies are printed in the output text area of the application, and a generated map, on which the locations of the pharmacies are highlighted, is also presented to the user. The system was thus able to generate and perform the required geospatial analysis to provide a response to the query. Experts can achieve similar results using GIS tools. In QGIS, for example, the Identify Features tool can be used to locate and select the Grand Hotel. A 1 km buffer around the hotel can then be created using “Vector > Geoprocessing Tools > Buffer”, and the list of pharmacies can be obtained using “Vector > Research Tools > Select by Location” to select pharmacy features within the buffer. Finally, one can add labels and adjust symbology as desired. To list the names of the selected pharmacies, the user can export them from the attribute table.
Figure 6 shows the result of this manual process, which closely resembles the outcome achieved by the proposed system.
Table 5 illustrates a query with intermediate complexity, entered by the user, and the procedural steps executed by the system to provide a response. The complexity of the query is considered intermediate because, although it involves a limited number of steps and uses a pre-developed path optimization algorithm, the system must cope with several challenges, such as determining the type of data required (network data), defining the right inputs and arguments for the functions, and generating and saving the expected output properly. In addition, it involves a workflow that is more complex than that of the first query, with additional data processing.
The code generated by the system successfully achieved the task of finding the shortest path between two attractions, Lund Cathedral and Monumentet, using the PyQGIS library. For other queries, ChatGeoAI uses local data downloaded from OSM; to run this query, however, the model contacted OpenStreetMap to retrieve the OSM network via the OSMnx library for Lund, Sweden. This capability is already defined in Llama 2, demonstrating its ability to access external geospatial datasets when needed. It then employs NetworkX to analyze the street network as a graph, enabling efficient computation of the shortest path between Lund Cathedral and Monumentet. An emergent behavior of the system is that it knows the coordinates of Lund Cathedral and Monumentet and uses them directly, bypassing the need to search local datasets.
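A minimal sketch of the OSMnx/NetworkX routine described above is shown below; the coordinates used for Lund Cathedral and Monumentet are approximate illustrative values rather than the ones emitted by the system.
```python
# Sketch: distance-based shortest path on the multimodal ("all") Lund network.
import osmnx as ox
import networkx as nx

G = ox.graph_from_place("Lund, Sweden", network_type="all")   # all walkable/drivable/bikeable edges

cathedral = (55.7036, 13.1935)    # (lat, lon), approximate
monumentet = (55.7054, 13.1964)   # (lat, lon), approximate

orig = ox.distance.nearest_nodes(G, X=cathedral[1], Y=cathedral[0])
dest = ox.distance.nearest_nodes(G, X=monumentet[1], Y=monumentet[0])

route = nx.shortest_path(G, orig, dest, weight="length")      # distance is the only criterion here
fig, ax = ox.plot_graph_route(G, route)
```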
The transportation mode in the code is set to “all” by the ChatGeoAI model for the network type parameter, implying that the network comprises all the edges (links) that are suitable for walking, driving, and biking. So, the shortest path shown in
Figure 7 is a type of multimodal path. This outcome is understandable, given that the query does not specify the mode of transportation. However, it would enhance the user experience if the code could visually distinguish between walking and driving segments along the route depicted on the map. If the transportation mode is explicitly specified in the query, such as “by car,” the system loads only the driving network from OSMnx, thus enabling the identification of a new route tailored specifically to driving conditions. The same applies to the criteria for determining the shortest path. Without explicit mention of alternative criteria such as travel time, the system considers only distance as the criterion for determining the shortest path.
Table 6 shows a query with a high level of complexity and the procedure for performing the analysis.
To achieve similar results using QGIS software (version 3.30), the user can use the “Shortest Path (Point to Point)” tool from “Processing > Toolbox > Network analysis”. The user should configure the tool by selecting the road network layer as the Input Layer and choosing the start and end points from the respective layers. Additional parameters, such as cost or direction, can be specified if needed. Finally, clicking “Run” generates the shortest path, which is added as a new layer in the project for visualization.
Figure 8 shows the result obtained using QGIS, which is quite similar to the one produced by ChatGeoAI, except for short segments, such as at road intersections and near the starting point, which are less accurate in the ChatGeoAI output.
To provide a response to this query, multiple analyses need to be conducted, including “query by attribute” and “spatial analysis” as well as “handling fuzzy proximities”. The generated code effectively handles multiple criteria for hotel recommendation. It filters hotels based on star ratings, ensuring that only those with four or five stars are considered. Additionally, it performs spatial analysis to determine proximity to highways, train stations, and shopping centers, which enhances the relevance of the results to the user’s requirements. By leveraging spatial relationships, the code accurately assesses the proximity of hotels to key amenities such as train stations and shopping centers. The system sets a threshold of 500 m as the maximum distance for a hotel to be categorized as “close” to an amenity; conversely, distances beyond 1000 m are classified as “away” in the solution. Interestingly, the system returns a message to the user indicating that the chosen distances can be customized based on the user’s preferences. The result is shown in
Figure 9. Considering the above, the complexity of the query is high because it involves multiple steps that must be run in a complex hierarchical workflow to produce the results. In addition, the system has to understand and decide about fuzzy variables.
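The multi-criteria filtering described above can be sketched as follows, reusing the 500 m (“close”) and 1000 m (“away”) thresholds; the layer names, the stars field, and the brute-force distance checks are simplifying assumptions for illustration.
```python
# Sketch: multi-criteria hotel filtering (star rating + proximity constraints) in PyQGIS.
from qgis.core import QgsProject

def layer(name):
    return QgsProject.instance().mapLayersByName(name)[0]     # placeholder layer names below

hotels, stations, malls, highways = (layer(n) for n in
                                     ("hotels", "train_stations", "shopping_centres", "highways"))

CLOSE, AWAY = 500, 1000                                        # metres; adjustable by the user

def min_distance(geom, other_layer):
    return min(f.geometry().distance(geom) for f in other_layer.getFeatures())

selected = []
for h in hotels.getFeatures():
    if str(h["stars"]) not in ("4", "5"):                      # keep only four- and five-star hotels
        continue
    g = h.geometry()
    if (min_distance(g, stations) <= CLOSE and                 # close to a train station
            min_distance(g, malls) <= CLOSE and                # close to a shopping centre
            min_distance(g, highways) > AWAY):                 # away from highways
        selected.append(h["name"])
print(selected)
```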
Using the same distances used by ChatGeoAI, results can be obtained using QGIS, as depicted in
Figure 10. The manual process involves filtering hotels by star rating, selecting those within the city center boundary, and creating buffers around the central train station and shopping centers to identify nearby hotels. After refining the selection by intersecting these buffers, a highway buffer is applied to exclude hotels near highways. The resulting layer contains hotel polygons that can be labeled, symbolized, or converted into point features with icons for better visualization.
In chat mode, users can iteratively refine their results, adjust parameters, and introduce additional constraints to their original queries without needing to restate the entire context. For example, the user might initially ask the system the following query: “Can you find all playgrounds in Lund? Please count these playgrounds and highlight them on the map”. The user then obtains the results as shown in
Figure 11. Then, in the same chat session, the query can be refined to: “I am interested only in the playgrounds that are located within or intersecting with parks. Please list their names and highlight these playgrounds on the map”. The system retains all previous interactions in the session, allowing it to understand that the playgrounds should still be in Lund, even though the location is not reiterated in the follow-up query. This session memory simplifies the user experience by enabling the addition of specific constraints while the system seamlessly maintains the broader context established in earlier exchanges.
This interactive setup not only aids in honing the query to meet specific needs but also allows the fine-tuned Llama model to generate the correct code and expected results. This is shown by the continuation of the example. When the user submits the refined query asking for the playgrounds that are located within or intersecting with parks, some errors were encountered while running the generated executable script. The first error, in this specific case, was an AttributeError due to incorrectly accessing feature attributes. As shown in
Figure 12, no results are displayed, and an error was caught and returned by the system in the chat. The user only has to copy the error message returned by the system and paste it back into the chat. The system then generates new code to correct the error. After fixing this issue, another AttributeError occurred because QVariant objects were not converted to strings before calling lower(). Users only need to send back the error messages, and the system will provide corrected code until it works and the desired results are achieved, as shown in
Figure 13.
If any syntax or semantic errors occur, they are displayed in the results area. Users can then copy and paste the error message back into the chat. The Llama model, which has the ability to recognize and understand these errors, processes the feedback and automatically corrects the code, and the system executes it again in the background. This feedback loop significantly enhances the model’s accuracy and the user’s ability to achieve desired outcomes efficiently. Similarly, the user can refine the results if they are not as expected. In this example, if the user considers that the system incorrectly counted individual features labeled as “playground” in OSM data, which led to over-counting when multiple features represented the same playground, then they can simply notify the system. Attempts to improve the code using DBSCAN from ‘sklearn’ failed because the library was not installed, and subsequent efforts with QGIS’s native DBSCAN faced parameter configuration issues until the desired result was achieved.
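The second AttributeError mentioned above has a simple fix, sketched here: attribute values returned by PyQGIS can be QVariant objects rather than Python strings, so they are cast with str() before lower() is called. The layer and field names are placeholders.
```python
# Sketch: casting QVariant attribute values to str before calling lower().
from qgis.core import QgsProject

playgrounds_layer = QgsProject.instance().mapLayersByName("playgrounds")[0]  # placeholder layer name
names = []
for feature in playgrounds_layer.getFeatures():
    value = feature["name"]                  # may be a QVariant (or NULL), not a Python str
    names.append(str(value).lower())         # str() avoids AttributeError: 'QVariant' has no 'lower'
```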
5. Discussion
The presented performance analysis reveals that ChatGeoAI exhibits a high failure rate of 75% on the first attempt. While this statistic initially appears disappointing, it underscores a broader challenge within the field: currently, LLMs struggle with generating correct executable code for GIS analysis, reflecting intrinsic limitations in their application to spatial data processing. Despite these limitations, the results should not be viewed purely in a negative light. In fact, in the context of software development, coding is inherently prone to errors requiring frequent debugging, even for skilled human programmers; thus, there is a need, at least at this stage, for iterative testing and refinement. Looking forward, the integration of an autonomous self-correcting mechanism could significantly enhance system performance by allowing the model to learn from its errors and dynamically improve its code output, although this development lies outside the scope of the current paper and is suggested as a primary direction for future research. In fact, Li and Ning [
35] tried to develop an autonomous GIS system with a quite similar purpose. The system first decomposes the spatial problem into a workflow of operations, each represented as a node in a solution graph. The system then generates Python code for each operation, assembling these into a complete program that executes the entire spatial analysis. However, the system is not fine-tuned on geospatial analysis tasks and lacks incorporated semantic knowledge, such as an ontology, specific to geospatial analysis.
Furthermore, supplementing the system with a repository of pre-defined, ready-to-use code snippets with a complete list of arguments for various GIS operations could potentially increase the success rate. To solve a complex task, the LLM would have to understand the query and subsequently build a series of operations from this pool. This approach would streamline the code generation process by providing reliable building blocks that reduce the complexity the model needs to handle directly, thus paving the way for more accurate and robust GIS application development through advanced AI models. However, this approach, similar to that used by Zhang et al. [
38], has certain limitations. Relying on a predefined set of operations could restrict the LLM’s flexibility and creativity, potentially overlooking more efficient or accurate solutions outside that set. In addition, keeping this pool comprehensive and up-to-date would be both challenging and resource-intensive. Therefore, a sounder solution could be a hybrid approach in which the system primarily relies on the method introduced in this paper to ensure coverage of a wide range of tasks and, in case of failure, switches to the aforementioned alternative by drawing a solution from the operation pool. This hybrid approach presents one of the most compelling perspectives arising from this paper. Another straightforward hybrid approach is to use LLMs for natural language processing while utilizing GIS tools for processing and analysis. However, the role of LLMs in this method is limited and strictly confined to understanding the user intent and does not involve any code generation.
It is important to note that widely used mapping services do not fully integrate advanced NLP for interaction with normal users, and they are not designed for geospatial analysis. For example, Google Maps supports natural language-like input through keyword searches and voice commands, but it is strictly limited to handling structured queries such as “restaurants near me” or “navigate to the nearest gas station”. It fails to respond to the type of queries presented in this paper.
The performance analysis of our system also reveals its proficiency in various aspects of geospatial code generation, spanning from geometry identification to error handling and validation. This performance is largely attributable to the robust capabilities of the Llama 2 model, further enhanced through meticulous fine-tuning. Below, we examine the system’s capabilities from different perspectives while noting limitations and failures that call for further research.
Geometry identification: The system can parse and understand textual descriptions or instructions that specify different types of geometry. For example, it recognizes phrases such as “find parks,” “identify roads,” or “locate buildings” as referring to different geometry types.
Feature geometry analysis: Given a specific geospatial analysis task, the system can generate code that checks the geometry type of features before performing operations on them. For instance, it can generate code that distinguishes between polygons, (multi)lines, points, etc., and applies appropriate operations or conditions based on the geometry type.
While the system demonstrates accuracy in many cases, there are certain scenarios where it falls short. One such instance occurs when the system creates a new “Point” layer to accommodate features with diverse geometries. The generated code may then assume uniform feature types, leading to errors when polygons are encountered: for instance, markers may be placed incorrectly for polygonal features instead of representing them by their centroids.
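A minimal PyQGIS sketch of the intended geometry-aware behaviour is shown below, representing non-point features by their centroids when populating a point memory layer; the input path and layer names are placeholders.

```python
from qgis.core import QgsVectorLayer, QgsFeature, QgsWkbTypes

# Placeholder input: a layer mixing point and polygon features.
source = QgsVectorLayer("/path/to/features.gpkg", "features", "ogr")

# Memory point layer created in the same CRS as the source.
point_layer = QgsVectorLayer("Point?crs=" + source.crs().authid(), "as_points", "memory")
provider = point_layer.dataProvider()

for feat in source.getFeatures():
    geom = feat.geometry()
    if geom.type() == QgsWkbTypes.PointGeometry:
        point_geom = geom             # already a point: keep as is
    else:
        point_geom = geom.centroid()  # polygons/lines: represent by the centroid
    new_feat = QgsFeature()
    new_feat.setGeometry(point_geom)
    provider.addFeatures([new_feat])

point_layer.updateExtents()
```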
Handling different geometries: The generated code can include logic to handle different geometries effectively. For example, when identifying parks or green spaces, it can filter features based on their geometry type to ensure that only polygons are considered. Similarly, when analyzing road networks, it can focus on features with line or multiline string geometries.
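A simplified PyQGIS sketch of such geometry-type filtering is given below; the layer paths and the OSM-style “leisure” field are assumptions made for illustration.

```python
from qgis.core import QgsVectorLayer, QgsWkbTypes

# Illustrative OSM-style layers; the "leisure" field is an assumption.
landuse = QgsVectorLayer("/path/to/osm_landuse.gpkg", "landuse", "ogr")
roads = QgsVectorLayer("/path/to/osm_roads.gpkg", "roads", "ogr")

# Parks/green spaces: keep only polygon geometries.
parks = [f for f in landuse.getFeatures()
         if f.geometry().type() == QgsWkbTypes.PolygonGeometry and f["leisure"] == "park"]

# Road network: keep only line/multiline geometries.
road_features = [f for f in roads.getFeatures()
                 if f.geometry().type() == QgsWkbTypes.LineGeometry]

print(len(parks), "park polygons;", len(road_features), "road lines")
```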
Error handling for incompatible geometries: In cases where the specified operation is incompatible with certain geometry types, the system can potentially generate error-handling mechanisms to address such situations. For example, it can include conditional statements to skip features with incompatible geometries or provide informative error messages.
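The sketch below illustrates this kind of guarded behaviour for a length calculation, skipping features whose geometry is incompatible and reporting why; it is a simplified pattern rather than code produced verbatim by the system.

```python
from qgis.core import QgsWkbTypes

def total_road_length(layer):
    """Sum line lengths, skipping features whose geometry is incompatible."""
    total = 0.0
    for feat in layer.getFeatures():
        geom = feat.geometry()
        if geom.type() != QgsWkbTypes.LineGeometry:
            # Length is not meaningful here: skip with an informative message.
            print(f"Feature {feat.id()}: not a line geometry, excluded from the sum.")
            continue
        total += geom.length()
    return total
```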
Attribute queries: The model can understand and process queries related to attribute data associated with geographic features. For example, it can filter features based on attribute values such as names, stars, and/or amenities. However, it may struggle with ambiguous attribute names or aliases, leading to incorrect filtering or retrieval of features. In OSM, for example, “highway” is a tag applied to all types of roads (including primary, secondary, and tertiary roads), whereas in natural language “highway” is a synonym for, e.g., “motorway”. So, when a user asks to find attractions near a “highway,” the system may be torn between selecting roads carrying the OSM highway tag and roads conventionally classified as highways (motorways).
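The two readings of “highway” can be expressed as different PyQGIS filter expressions, as in the illustrative sketch below; the layer path and the OSM field values are assumptions.

```python
from qgis.core import QgsVectorLayer, QgsFeatureRequest

# Illustrative OSM roads layer; field names follow OSM conventions.
roads = QgsVectorLayer("/path/to/osm_roads.gpkg", "roads", "ogr")

# Reading 1: every feature carrying any OSM "highway" tag (all road classes).
any_road = roads.getFeatures(
    QgsFeatureRequest().setFilterExpression('"highway" IS NOT NULL'))

# Reading 2: only roads conventionally called highways (motorways/trunks).
motorways = roads.getFeatures(
    QgsFeatureRequest().setFilterExpression("\"highway\" IN ('motorway', 'trunk')"))

print(sum(1 for _ in any_road), "features with any highway tag")
print(sum(1 for _ in motorways), "motorway/trunk features")
```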
Spatial relationships: The system can comprehend instructions involving spatial relationships between features, such as proximity, containment, and intersection. This enables it to perform tasks such as finding neighboring features and identifying features within a certain distance. When dealing with queries that include fuzzy relations, such as “near to,” the system suggests, for example, 500 m as an approximation of the distance and runs the analysis accordingly, while informing the user that the selected value (e.g., 500 m) can be adjusted as needed.
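A simplified sketch of this behaviour is shown below, approximating “near to” with a 500 m buffer; the layer paths are placeholders, and a metric (projected) CRS is assumed so that the buffer distance is in meters.

```python
from qgis.core import QgsVectorLayer

NEAR_DISTANCE_M = 500  # default approximation of "near to"; adjustable by the user

# Placeholder layers, assumed to be in a metric projected CRS.
attractions = QgsVectorLayer("/path/to/attractions.gpkg", "attractions", "ogr")
stations = QgsVectorLayer("/path/to/stations.gpkg", "stations", "ogr")

# Buffer each station and keep attractions intersecting any buffer.
station_buffers = [f.geometry().buffer(NEAR_DISTANCE_M, 25) for f in stations.getFeatures()]
nearby = [f for f in attractions.getFeatures()
          if any(f.geometry().intersects(buf) for buf in station_buffers)]

print(f"{len(nearby)} attractions within ~{NEAR_DISTANCE_M} m of a station "
      "(distance can be adjusted as needed)")
```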
Coordinate systems and projections: The model can identify the coordinate system of the map and create the output layers in that reference system accordingly, so the output results are correctly georeferenced. However, we noticed some incompatibility issues, for example when the model computes areas in square degrees rather than the conventional units, such as square meters, expected in the query. This disparity between user expectations (areas in square meters) and output in degrees may cause confusion and misinterpretation.
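One way the generated code can avoid this pitfall is to measure areas ellipsoidally with QgsDistanceArea, as in the sketch below; the layer path is a placeholder and input data in EPSG:4326 is assumed.

```python
from qgis.core import QgsVectorLayer, QgsDistanceArea, QgsProject

# Placeholder layer assumed to be stored in EPSG:4326; calling geometry().area()
# directly would return square degrees, which is the mismatch described above.
parks = QgsVectorLayer("/path/to/parks.gpkg", "parks", "ogr")

d = QgsDistanceArea()
d.setSourceCrs(parks.crs(), QgsProject.instance().transformContext())
d.setEllipsoid("WGS84")  # ellipsoidal measurement so areas come back in square meters

for feat in parks.getFeatures():
    print(feat.id(), round(d.measureArea(feat.geometry())), "m2 (ellipsoidal)")
```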
Spatial operations: The model can generate code for various spatial operations, such as buffer analysis, intersection, union, difference, centroid calculation, area calculation, and length calculation, on the fly, to perform simple or complex geospatial analyses in response to a user query. However, the system still struggles with edge cases. For example, consider a scenario in which a user requests the union of a neighborhood boundary and a park boundary that overlap because the park lies within the neighborhood. Despite this spatial relationship, when the model attempts to compute the union, it struggles to handle the shared boundaries accurately. Consequently, the resulting output may combine both boundaries incorrectly, potentially leading to flawed urban planning or resource allocation decisions.
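For reference, the toy example below shows the expected behaviour of a PyQGIS union (QgsGeometry.combine) when a park lies entirely within a neighborhood; simple cases like this can be used to diagnose the failure described above.

```python
from qgis.core import QgsGeometry

# Toy version of the edge case: a park polygon fully inside a neighborhood.
neighborhood = QgsGeometry.fromWkt("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))")
park = QgsGeometry.fromWkt("POLYGON((2 2, 4 2, 4 4, 2 4, 2 2))")

union = neighborhood.combine(park)  # geometric union in PyQGIS
print(union.area())                 # expected: 100.0 (the neighborhood area)
print(union.isGeosValid())          # expected: True, no broken shared boundaries
```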
In addition, the generated code typically does not incorporate spatial indexing, a technique crucial for efficient spatial data retrieval. The absence of spatial indexing leads to prolonged computation times due to exhaustive iteration through all features. Additionally, errors may occur when attempting to access features through a spatial index in PyQGIS, especially when the index involves multiple layers.
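For comparison, the sketch below shows the standard QgsSpatialIndex pattern (bounding-box candidates first, exact tests second) that the generated code typically omits; the layer paths are placeholders.

```python
from qgis.core import QgsVectorLayer, QgsSpatialIndex, QgsFeatureRequest

# Placeholder layers; the index avoids testing every building against every parcel.
buildings = QgsVectorLayer("/path/to/buildings.gpkg", "buildings", "ogr")
parcels = QgsVectorLayer("/path/to/parcels.gpkg", "parcels", "ogr")

index = QgsSpatialIndex(buildings.getFeatures())  # built once, reused for every query

for parcel in parcels.getFeatures():
    # Cheap bounding-box candidates first; exact intersection tests only on those.
    candidate_ids = index.intersects(parcel.geometry().boundingBox())
    request = QgsFeatureRequest().setFilterFids(candidate_ids)
    hits = [b for b in buildings.getFeatures(request)
            if b.geometry().intersects(parcel.geometry())]
    print(parcel.id(), len(hits), "intersecting buildings")
```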
Syntax accuracy: The system generally excels at producing code with correct syntax and can generate PyQGIS code with accurate function calls, variable assignments, and control structures. However, there have been instances in which function names from older versions of PyQGIS, with slightly different spellings, were used.
Map styling and visualization: The system can properly assist in tasks related to map styling and visualization. It can generate code for applying different styles, symbology, and labels to geographic features on a map. We only set the location of the SVG files for the markers, and the model was then able to choose the most suitable symbol for each feature. The icons used have predefined colors, sizes, and paths, ensuring a minimum level of visualization quality in the output maps. However, it is important to note that the primary focus of our manuscript is on the application of LLMs for GIS analysis rather than GIS visualization. The latter requires separate research and special attention, as the system may struggle to create visually appealing maps with legible labels and good aesthetics. This challenge becomes particularly apparent when the output contains dense features, necessitating post-processing of the styling using predefined code.
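A minimal PyQGIS sketch of SVG-based marker styling is given below; the point layer and the SVG path are placeholders, whereas in ChatGeoAI only the folder of icons is provided and the model selects a suitable symbol per feature class.

```python
from qgis.core import (QgsVectorLayer, QgsSvgMarkerSymbolLayer,
                       QgsMarkerSymbol, QgsSingleSymbolRenderer)

# Placeholder point layer and SVG icon path.
pois = QgsVectorLayer("/path/to/restaurants.gpkg", "restaurants", "ogr")

svg_marker = QgsSvgMarkerSymbolLayer("/path/to/icons/restaurant.svg")
svg_marker.setSize(5)  # marker size in the symbol's default units

symbol = QgsMarkerSymbol.createSimple({})
symbol.changeSymbolLayer(0, svg_marker)  # swap the default marker for the SVG icon

pois.setRenderer(QgsSingleSymbolRenderer(symbol))
pois.triggerRepaint()
```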
Data manipulation and processing: It can perform data manipulation and processing tasks, such as data conversion, merging, splitting, filtering, joining, and summarizing, on spatial datasets.
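The sketch below illustrates a simple filter-and-summarize pattern of this kind; the layer path and the “landuse” and “neighborhood” fields are assumptions for illustration, and areas are in the layer’s CRS units.

```python
from collections import defaultdict
from qgis.core import QgsVectorLayer, QgsFeatureRequest

# Placeholder layer; the "landuse" and "neighborhood" fields are assumptions.
landuse = QgsVectorLayer("/path/to/landuse.gpkg", "landuse", "ogr")

# Filter residential polygons, then summarize their total area per neighborhood.
request = QgsFeatureRequest().setFilterExpression("\"landuse\" = 'residential'")

area_per_neighborhood = defaultdict(float)
for feat in landuse.getFeatures(request):
    area_per_neighborhood[str(feat["neighborhood"])] += feat.geometry().area()

for name, total in sorted(area_per_neighborhood.items()):
    print(name, round(total, 1))  # areas are in the layer's CRS units
```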
Error handling and validation: The system can generate code with error handling mechanisms to validate input data, check for errors, handle exceptions, and provide informative messages to users.
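A typical validation pattern of this sort is sketched below; the input path is a placeholder, and the checks (layer validity, empty geometries, GEOS validity with makeValid) are illustrative rather than an exact reproduction of the generated code.

```python
from qgis.core import QgsVectorLayer

# Placeholder input; the checks below follow a common validation pattern.
layer = QgsVectorLayer("/path/to/input.gpkg", "input", "ogr")

if not layer.isValid():
    raise ValueError("Input layer could not be loaded; please check the path and format.")

for feat in layer.getFeatures():
    try:
        geom = feat.geometry()
        if geom.isEmpty():
            print(f"Feature {feat.id()}: empty geometry, skipped.")
            continue
        if not geom.isGeosValid():
            print(f"Feature {feat.id()}: invalid geometry, repairing with makeValid().")
            geom = geom.makeValid()
        # downstream analysis would operate on `geom` here
    except Exception as exc:
        print(f"Feature {feat.id()}: skipped due to unexpected error: {exc}")
```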