2.1. Domain and Phenomenon Characterisation
This section outlines the operational and functional requirements of the voice map application. Understanding the components and their interplay within the application is crucial. The web application system comprises three primary elements: Frontend, Backend, and Database. Users interact with the application through existing visualisations or adjust them using voice commands.
The application offers a range of thematic visualisations, including marker and choropleth maps. User profiles may vary based on expertise and capabilities. Its use cases span map exploration, visualising spatial relationships, educational settings, palette customisation, and data querying. Essential functions include voice-assisted navigation, cartographic display, base map selection, palette adjustments, and data analysis.
2.2. Methodology
Employing a theoretically driven model in research conceptualisation offers advantages such as a framework for formulating and testing hypotheses, improved communication, and enhanced comprehension. Systems that request geographical information in natural language often face difficulties due to vague references. Embedding natural language processing in a GIS interface therefore requires a schema or model that eases the retrieval of information from the database and reduces the cognitive load on the user. The PlanGraph theory, introduced by G. Cai, H. Wang, A. M. MacEachren and S. Fuhrmann [12], addresses this challenge and offers a solution that has already been implemented and tested by Hongmei Wang and colleagues in their GeoDialogue system [8]. The PlanGraph structure is built on three main concepts:
Recipes: A recipe describes the components of an action in terms of parameters, sub-actions, and constraints.
Plans: A plan corresponds to a schema describing not only how to act but, more importantly, the mental attitudes towards acting, such as beliefs, commitments, and execution status.
Actions: An action refers to a specific goal as well as the efforts needed to achieve it. An action can be basic or complex.
For example, suppose the goal of a task is to show a map. This task can be considered an action since it is an ultimate goal. A parameter related to this action is the map layer, and realising the action involves a sub-action for generating the map. Together, these structures form the recipe. The recipe definition also allows constraints to be defined, describing pre- or postconditions as well as partial orders for sub-actions. When specific conditions exist for executing the action or sub-actions, or for retrieving the parameters, the recipe structure is upgraded into a plan structure, which represents the complete flow.
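As an illustration, the recipe for the "show a map" task described above could be encoded as follows. This is a minimal Python sketch with hypothetical class and field names, not the data model used in the actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    basic: bool  # basic actions need no prior domain knowledge; complex ones do


@dataclass
class Recipe:
    """A recipe: parameters, sub-actions, and constraints of an action."""
    action: Action
    parameters: list
    subactions: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # pre/postconditions, partial orders


@dataclass
class Plan(Recipe):
    """A plan augments a recipe with mental attitudes towards execution."""
    beliefs: dict = field(default_factory=dict)
    commitments: list = field(default_factory=list)
    status: str = "pending"  # execution status


# The "show a map" example: the layer is a parameter,
# and generating the map is a sub-action.
show_map = Action("show_map", basic=False)
recipe = Recipe(show_map, parameters=["layer"],
                subactions=[Action("generate_map", basic=True)])
```

When execution conditions come into play, the same data would be instantiated as a `Plan`, adding beliefs, commitments, and an execution status on top of the recipe.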
A compatible, visible, relevant, and intuitive action set was required for effective geospatial data processing and visualisation in the Human-GIS computer context, leading to the categorisation of all actions into four primary types:
Type I: Acquisition of spatial data (e.g., map layer retrieval)
Type II: Analytical tasks (e.g., finding spatial clusters)
Type III: Cartographic and visualisation tasks (e.g., zooming, panning)
Type IV: Domain-specific tasks (e.g., evacuation planning during hurricanes)
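For illustration, the four categories might be encoded as an enumeration; the names and the example task mapping below are hypothetical:

```python
from enum import Enum


class ActionType(Enum):
    # the four PlanGraph action categories described above
    SPATIAL_DATA_ACQUISITION = 1   # e.g. map layer retrieval
    ANALYTICAL = 2                 # e.g. finding spatial clusters
    CARTOGRAPHIC = 3               # e.g. zooming, panning
    DOMAIN_SPECIFIC = 4            # e.g. evacuation planning during hurricanes


# hypothetical mapping of example tasks to types
TASK_TYPES = {
    "zoom_in": ActionType.CARTOGRAPHIC,
    "load_layer": ActionType.SPATIAL_DATA_ACQUISITION,
}
```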
The PlanGraph model facilitates user interaction through reasoning algorithms. Upon receiving user input, the system initiates a three-step process: interpreting the input, advancing available plans towards completion, and delivering responses to the user. The PlanGraph methodology was integrated into the application by first categorising all tasks into two main groups: cartographic tasks (visual map modifications) and analytical tasks (data filtering). Task extraction and categorisation were streamlined by leveraging the existing User Interface (UI) and features of the BStreams platform, a pre-existing interface for geospatial data visualisation. BStreams is a free data visualisation tool that enables users to create reports and visualisations in different formats of graphs and maps with a high degree of customisation. Over the past year, the platform has been extended with geospatial data visualisation and various other features that improve user-friendliness. This work was carried out in cooperation between the development team and the first author; the infrastructure is therefore flexible enough to accommodate new development.
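The three-step reasoning cycle (interpret, advance, respond) can be sketched as a toy Python loop. All names here (`interpret`, `SimplePlan`, `handle_user_input`) are illustrative inventions, and the intent extraction is deliberately naive:

```python
def interpret(utterance):
    # step 1: naive intent extraction — first word is the verb, rest are arguments
    verb, *args = utterance.lower().split()
    return {"verb": verb, "args": args}


class SimplePlan:
    """Toy plan: completes once every required parameter is filled."""

    def __init__(self, verb, required):
        self.verb = verb
        self.params = dict.fromkeys(required)  # unfilled parameters start as None

    def matches(self, intent):
        return intent["verb"] == self.verb

    def advance(self, intent):
        # fill parameters from the intent's arguments, in order
        for name, value in zip(self.params, intent["args"]):
            self.params[name] = value

    def complete(self):
        return all(v is not None for v in self.params.values())


def handle_user_input(utterance, plans):
    intent = interpret(utterance)           # 1. interpret the input
    for plan in plans:
        if plan.matches(intent):
            plan.advance(intent)            # 2. advance matching plans
    done = [p for p in plans if p.complete()]
    return f"completed {len(done)} plan(s)"  # 3. respond to the user


zoom = SimplePlan("zoom", ["direction"])
print(handle_user_input("zoom in", [zoom]))  # completed 1 plan(s)
```

In the real system each of these steps is far richer (recipes, constraints, mental attitudes), but the control flow follows the same interpret–advance–respond shape.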
The UI heavily relies on user-platform interaction for task execution and modification. However, an in-depth exploration of each tool and feature was necessary to determine the correct format, threshold, and structure of user inputs. Examining the core functions of the two visualisations in the backend and the default configuration file stored in the database provided insights into the necessary parameters and conditions for each derived task.
Once all the necessary components were identified, including parameters and constraints, the actions of the PlanGraph models were conceptualised. Following the outlined methodology, the process started with defining the root action/plan and then defining the parameters and constraints for each task, as depicted in Figure 2. Next, the type of action was identified based on the PlanGraph types discussed previously. Actions were classified as basic or complex, depending on whether users required prior domain knowledge to perform the task.
For example, tasks like “Zoom in” or “Pan to the left” were considered basic, while tasks like “Changing the steps for graduated colour” were deemed complex, involving sub-actions and plans. Each action type was determined based on its complexity, with Figure 3 illustrating the basic task “Zoom in/out” in the PlanGraph model. In this model, the root plan aims to achieve the zooming action, with parameters such as the map window size and a predefined scale, and the sub-action of reloading the basemap with pre-loaded data. In the user interface, the Zoom in/out task is primarily performed by using the mouse wheel or by enabling the Zoom in/out feature on the map. The window size is an important parameter derived from the UI code, and it plays a crucial role in recognising the specific visualisation the user is referring to. All of these parameters are retrieved and considered in the PlanGraph model; the parameters and sub-actions are mainly visible and useful in the technical approach and code implementation. The root plan was classified as a Type III action, focusing on visualisation modification, while the sub-action was considered Type I.
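Under the description above, the Figure 3 plan could be written down as a simple nested structure; the key names below are hypothetical, chosen only to mirror the prose:

```python
# Hypothetical dict-based encoding of the "Zoom in/out" plan of Figure 3.
# The root plan is a Type III (cartographic) action; its sub-action of
# reloading the basemap with pre-loaded data is Type I (data acquisition).
zoom_plan = {
    "action": {"name": "zoom_in_out", "type": "III"},
    "parameters": ["map_window_size", "predefined_scale"],
    "subactions": [
        {"name": "reload_basemap", "type": "I",
         "parameters": ["preloaded_data"]},
    ],
}
```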
2.3. Survey
The initial challenge was to comprehend the diverse range of voice commands that both expert and non-expert users might employ when interacting with the geospatial application. For this purpose, an English-language survey was designed and circulated among a diverse public to assess and extract vocabulary choices when interacting with web-based maps. The questionnaire contained twelve open-ended questions, enabling respondents to articulate their thoughts and ideas about tasks reformulated as questions. GIFs were incorporated to help users understand the tasks and commands better. The survey was embedded within a Google Form, and the links to the questionnaire were distributed through the authors’ personal social media channels, such as Facebook, LinkedIn, and Instagram.
The survey also collected demographic data such as age, gender, educational attainment, field of study, native language, and English proficiency to detect possible patterns or correlations. The survey reached 66 diverse respondents via social media and underwent data-cleaning procedures for result analysis. The analysis primarily revolved around identifying frequently used vocabulary and verbs to enhance human-computer interaction in the application design. Most respondents identified as men (63.6%), indicating a potential gender bias in interest towards the topic. The largest age group, with over 65% of participants, was between 18 and 30, suggesting that older generations may find voice technology on a map less appealing or unfamiliar.
Educational attainment was another factor examined, revealing that over 58% of respondents held a master’s degree or higher. Fields of study were also considered, with 50 respondents providing related information. Native language and English proficiency also played key roles. As anticipated, due to the survey distribution channels, the largest group of respondents were Italian speakers.
Figure 4 depicts the distribution of other native languages among the respondents, with Persian speakers constituting the second-largest group due to the first author’s ethnicity. Regarding English proficiency, most respondents were anticipated to have an advanced level, while none reported having basic proficiency. The survey aimed to collect specialised geospatial data visualisation and analysis terminology to create a structured archive accessible to users. The compiled terms were integrated into the application’s code to enhance the user experience by offering diverse inputs and improving response accuracy.
The frequency of the most commonly used verbs was estimated for each question, taking verb–word collocations into account. Each question was designed to fulfil a specific task and characteristic. The primary approach to analysing the results was vocabulary frequency: estimating how often particular terms were used made it possible to predict the most frequent verbs and words for the word cloud and for the overall human–computer interaction scenarios, given the human tendency towards commanding or ordering behaviour. This user-centric design approach aims to acknowledge and prioritise users’ needs, preferences, and experiences.
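A frequency analysis of this kind can be reproduced in a few lines of Python. The sample answers and verb list below are invented for illustration and are not the actual survey data:

```python
from collections import Counter

# hypothetical sample of free-text answers to one survey question
answers = [
    "show me the pyramid of giza",
    "show the pyramid on the map",
    "find the pyramid of giza",
    "zoom in on the pyramid",
]

# candidate command verbs to count
VERBS = {"show", "find", "search", "zoom", "locate", "move", "filter", "change"}

verb_counts = Counter()
collocations = Counter()
for answer in answers:
    tokens = answer.split()
    for i, tok in enumerate(tokens):
        if tok in VERBS:
            verb_counts[tok] += 1
            if i + 1 < len(tokens):
                # record the verb together with the word that follows it
                collocations[(tok, tokens[i + 1])] += 1

print(verb_counts.most_common(2))  # [('show', 2), ('find', 1)]
```

The same counts feed naturally into a word cloud or a ranked vocabulary list.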
Figure 5 presents the results from analysing questions in the survey, formulated to understand the terminologies people use to talk to a map. While the anticipated most frequent verbs were “find” and “search,” statistics reveal that “show” was used more often. This preference could be attributed to the visual nature of the tasks and the user’s interaction with the application, relying on the fact that the application should visually respond to commands.
One of the anticipated significant outcomes of the survey was the creation of a word cloud, which visually represents the most frequent words used in the collected archive, offering a quick overview and understanding of user tendencies. In this context, the word cloud helped pinpoint the most frequently used terminologies related to the application’s defined tasks.
Figure 6 portrays the word cloud visualisation that represents the frequency of terms derived from the survey data.
This visualisation provides a snapshot of user tendencies and the most commonly used terminologies in the context of the application’s defined tasks. Through this survey, we gained insights into the potential user interactions and vocabulary usage, enabling us to tailor the application to suit user preferences and needs best.
2.4. Comparison with ChatGPT
While the questionnaire provided valuable insights into geospatial interaction, we were concerned that our sample might be too limited to reflect the language patterns of the general public. To obtain a more representative picture, we turned to natural language processing, since large language models are trained on extensive data sources and can therefore provide a broader perspective on likely commands for map interactions. Consequently, we emulated the questionnaire using ChatGPT, an AI language model developed by OpenAI. The aim of comparing the responses generated by ChatGPT with the survey data was to gain a deeper understanding of the commands used and to corroborate the findings. Based on the questionnaire’s questions, a collection of voice commands was gathered, transcribed into text, and fed into ChatGPT for evaluation.
Our findings indicated a significant alignment of the ChatGPT answers with our survey results. Specifically, a Pearson correlation analysis revealed a notable correlation (r = 0.81, p < 0.01) between the probability scores given by the NLP model and the prevalence of terms observed in the survey data. This analysis was conducted based on 17 mutual words identified across both datasets. One of the intriguing disparities was the model’s preference for the term “filter”. In contrast, survey respondents favoured terms such as “show”, “select”, and “highlight”. This discrepancy suggests that the term “filter” might be more entrenched within the lexicon of database professionals, thereby being somewhat alien to the general user profile we interviewed.
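A correlation of this kind can be computed directly. The sketch below implements Pearson's r in plain Python (equivalent to the r returned by `scipy.stats.pearsonr`) on invented frequencies, not the actual 17-word dataset:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# hypothetical model probability scores vs. survey frequencies for shared terms
model = [0.30, 0.22, 0.15, 0.10, 0.08]
survey = [0.28, 0.25, 0.12, 0.11, 0.06]
print(round(pearson_r(model, survey), 2))
```

With the real 17 mutual words, the same computation (plus a significance test on r) yields the reported r = 0.81, p < 0.01.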
Structural directives like “top left/top right” and commands such as “zoom in/zoom out” showed considerable variability in usage: respondents deployed these phrases freely rather than adhering strictly to a fixed order or set of complements, indicating flexibility in phrasing preferences. Additionally, terms like “marker” and “provinces” appeared more frequently among survey participants than in the ChatGPT predictions. A plausible explanation lies in the phrasing of our survey questions: by featuring these terms prominently, the questions may have made them more salient to respondents. This effect might be amplified by participants interacting in a second language, where familiar terms from the question are more readily recalled and repeated.
Our analysis revealed the most commonly used verbs in voice commands for map interaction: “show”, “change”, “locate”, “zoom”, “find”, “move”, and “filter”. The likelihood of using these verbs varied depending on the specific interaction performed and the user’s background knowledge. For instance, users with an environmental engineering background preferred the verb “locate”, whereas those with a computer science background tended to use the verb “search”. Notably, the AI-generated answers closely mirrored responses from human participants: both sources used similar commands and verbs such as “show”, “find”, “zoom in”, and “locate”. However, there were differences in sentence structure. For short commands, humans tended to use very few words and simple sentences, whereas their responses became more complex for advanced tasks such as filtering geospatial data.
ChatGPT mostly understood the intention behind the questions, though the responses varied in terms of terminologies, sentence structures, and probability of appearance. For example, the first question asked after conveying the whole scenario and user profiles was, “Imagine users have a map and wish to command the map with their voice to visualise the Pyramid of Giza. What are the possible commands and their likelihoods in percentages?”. The AI’s responses mirrored the ones we collected from human respondents, with the verb “show” used most frequently, confirming our expectations.
A comparison of the two most frequent sentences from ChatGPT and the respondents revealed that AI-generated sentences tended to be more complex and informative, whereas humans typically used simpler and shorter sentences when interacting with a voice chatbot (Figure 7). In both parse trees, S represents the sentence, VP the verb phrase, VB the verb, NP the noun phrase, and det the determiner; the ChatGPT parse tree additionally contains VBZ for the linking verb and AdvP for the adverb phrase. In the first sentence, “Show me the Pyramid of Giza”, the VP consists of only one verb, “Show”, whereas in the second sentence, “Show me where the Pyramid of Giza is”, the VP consists of two verbs, “Show” and “is”, with the subordinate clause “where the Pyramid of Giza is” acting as the complement of the verb “Show”. This difference in VP complexity reflects the greater amount of information conveyed by the AI-generated sentence.
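The structural difference between the two sentences can be made concrete with bracketed parse trees. The trees below are hand-written approximations using the symbols just described (not the exact trees of Figure 7), and maximum bracket nesting depth serves as a crude proxy for syntactic complexity:

```python
def depth(tree):
    """Maximum bracket nesting depth of a bracketed parse-tree string."""
    best = cur = 0
    for ch in tree:
        if ch == "(":
            cur += 1
            best = max(best, cur)
        elif ch == ")":
            cur -= 1
    return best


# approximate bracketed trees for the human and ChatGPT sentences
human = "(S (VP (VB Show) (NP me) (NP (det the) Pyramid of Giza)))"
gpt = ("(S (VP (VB Show) (NP me) "
       "(S (AdvP where) (NP (det the) Pyramid of Giza) (VP (VBZ is)))))")

print(depth(human), depth(gpt))  # 4 5
```

The extra embedded clause in the AI-generated sentence shows up directly as one additional level of nesting.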
Our comparison between ChatGPT’s responses and those collected from human respondents highlighted both similarities and differences. In most cases, the frequency of verb and word usage was similar. However, the sentence structures showed that humans tend to interact with virtual assistants in more general and simpler terms. This tendency might be due to ease of use, limited time or patience, and advancements in natural language processing and machine learning that enable virtual assistants to comprehend and respond effectively to general queries. The probability of appearance of the mutual terminologies across the two sources (ChatGPT and the survey) is visualised in Figure 8.
To drive the implementation of our interactive map, we utilised two primary sources of information: the survey and the responses garnered through ChatGPT. The survey gave us insights into users’ linguistic preferences and terminologies when interacting with geospatial data. On the other hand, ChatGPT, with its vast database of natural language interactions, offered a broader understanding of likely voice command structures and terminologies. By combining the findings from both the survey and ChatGPT, we were able to develop our interactive map’s voice interface.