Subjective Assessment of a Built Environment by ChatGPT, Gemini and Grok: Comparison with Architecture, Engineering and Construction Expert Perception

Belaroussi, Rachid

doi:10.3390/bdcc9040100

by

Rachid Belaroussi

COSYS-GRETTIA, University Gustave Eiffel, F-77447 Marne-la-Vallée, France

Big Data Cogn. Comput. 2025, 9(4), 100; https://doi.org/10.3390/bdcc9040100

Submission received: 19 February 2025 / Revised: 18 March 2025 / Accepted: 1 April 2025 / Published: 14 April 2025

(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has made methods of artificial intelligence accessible to the general public in a conversational way. These models offer urban planning professionals tools for the automated visual assessment of the quality of a built environment without requiring specific technical knowledge of computing. We investigated the capability of MLLMs to perceive urban environments based on images and textual prompts. We compared the outputs of several popular models (ChatGPT, Gemini and Grok) to the visual assessment of experts in Architecture, Engineering and Construction (AEC) in the context of a real estate construction project. Our analysis was based on subjective attributes proposed to characterize various aspects of a built environment. Four urban identities served as case studies, set in a virtual environment designed using professional 3D models. We found that there can be an alignment between human and AI evaluation on some aspects such as space and scale and architectural style, and more general accordance in environments with vegetation. However, there were noticeable differences in response patterns between the AIs and AEC experts, particularly concerning subjective aspects such as the general emotional resonance of specific urban identities. This raises questions regarding the hallucinations of generative AI, where the AI invents information and behaves creatively but its outputs are not accurate.

Keywords: ChatGPT; Gemini; Grok; built environment; architecture

1. Introduction

Evaluating the built environment is important for sustainable policy making in the city. A well-designed urban environment can significantly influence travel behavior, encouraging a shift towards active modes like walking and cycling, with positive effects on public health and well-being. Pleasant urban spaces that prioritize walkability and cycling can lead to improved spirit, increased physical activity and enhanced urban vibrancy and vitality. Conversely, unpleasant environments can negatively impact individuals, fostering feelings of fear and stress, and potentially contributing to increased crime and car dependence. Therefore, prioritizing the quality of the built environment is essential for creating healthier, more sustainable and thriving communities.

Assessing the emotional response elicited from being in a particular architectural space is a difficult task because of its fundamentally subjective nature. Architectural ambiance refers to the overall atmosphere or mood created by the design and layout of a built environment, covering the sensory experience. The arrangement of spaces, vegetation, urban furniture, interaction and overall flow within a public space can influence the degree of liveliness within the space. Scale, proportion and volume impact the human experience of a given space. Sizable structures, large glass windows, balconies and open layouts can create a sense of openness, animation and grandeur, while smaller, enclosed spaces might evoke intimacy, protection or calmness. Architects often aim to create specific atmospheres through careful consideration of built environment elements, designing the ambiance to suit the intended functions and emotional responses desired for the space.

Traditional methods for evaluating the aesthetics and ergonomics of public spaces rely heavily on fieldwork, surveys, focus groups and document analysis [1]. These approaches are often labor intensive and time consuming. Studies involving human participants face challenges such as recruitment difficulties, as individuals may not be interested or available to participate within a reasonable timeframe. Recruiting experts in the Architecture, Engineering and Construction (AEC) field presents additional difficulties. Professionals in these sectors are often busy with their own projects, usually leaving only a small number of architects within an agency responsible for assessing the aesthetics and ergonomics of urban developments. Moreover, when evaluating prospective developments, multiple alternative projects must often be assessed before selection of the most suitable one. In such cases, a fast and automated tool could significantly aid decision-making.

Recent advances in artificial intelligence (AI) present promising opportunities for the development of computer-aided tools in the analysis of built environments. Multimodal Large Language Models (MLLMs) such as ChatGPT-4o, Gemini 2.0 and Grok 3 offer the general public a way to explore urban environments through images and conversation-based interactions. MLLMs can generate diverse creative content and provide informative answers, even to open-ended, complex or unusual questions. Despite their capabilities, the extent to which MLLM-generated insights align with human experiences of public spaces—particularly from the perspective of AEC experts—remains largely unknown. The potential of these models in analyzing architectural ambiance, a subject that is highly subjective, is still underexplored, leaving room for further research and development in this area.

The potential of AI in fields requiring subjective assessment, such as aesthetics, constitutes a research gap. Little is known about how AI generates outputs when questioned about its subjective perception of a built environment and how well these outputs align with a professional’s expertise. The research questions introduced in this paper can be formulated as follows:

What is the potential of AI in the aesthetic assessment of a built environment?
How does it compare to the subjective perception of AEC experts?
Are there significant differences between AI models and is there a more suitable MLLM for the task of evaluating the quality of a built environment?

For this study, we explored the potential of MLLMs in assessments related to the built environment. We focused on three state-of-the-art and widely used MLLMs: ChatGPT-4o, Gemini 2.0 and Grok 3. We applied these AI models to automate the analysis of four cityscapes represented in a virtual environment, in the context of a real estate project under development. Specific aspects of architectural ambiance were selected: space and scale, degree of enclosure, architectural style and overall feelings. For each aspect, a set of possible attributes was proposed to characterize the mood evoked by the built environment. We incorporated these quality criteria, derived from the literature, into input prompts fed to the MLLMs, asking them to assess scene images extracted from each urban identity. Additionally, AEC professionals who were part of the real estate development were asked to evaluate the same qualities to compare their assessments with those of the MLLMs. This comparison helped reveal both the potential and the limitations of MLLMs in evaluating the quality of public spaces.

2. Related Works

2.1. Human Assessment of the Built Environment

Estimating the visual quality of a built environment is a subjective task, usually involving the participation of either laypersons or AEC experts. Prior research has shown that the design of built environments, along with individuals’ perceptions of them, plays a significant role in shaping walking behavior and neighborhood residents’ active lifestyles [2,3].

Traditionally, subjective measurements are employed to understand individuals’ perceptions at a microscopic level. These methods consider various factors related to the streetscape that contribute to categorizing a specific area as safe and comfortable, often represented by a numerical score [4]. Ewing et al. [5] proposed categorizing streetscape features into imageability, visual enclosure, human scale, transparency and complexity.

Efforts have been made toward more objective qualification of urban environments, contributing to improved active mobility in general in urban areas by increasing the infrastructure satisfaction [6]. Geospatial building data coupled with descriptive data can also be used in various domains ranging from urban planning to energy research [7].

Evaluating the aesthetic quality of a future real estate project requires the construction of a virtual scene to be visited. Approaches using Building Information Modeling (BIM) present superior data precision. More often than not, the assessment of a built environment is made by few participants, especially if AEC experts are required.

Gomez et al. [8] studied immersive virtual reality (IVR) in the initial design stage for architecture students. The perception of spaces and usefulness of IVR were investigated with n = 12 students and compared to the impressions of n = 12 specialized architectural teachers in the subject of basic architectural design. Harputlugil et al. [9] performed an analysis of assessments of architectural design with n = 5 architects during face-to-face interviews but in a very theoretical way depending only on their professional experiences, i.e., not on drawings.

Corticelli et al. [10] distributed an online questionnaire to public bodies and n = 4 naval authority stakeholders to investigate a reconstruction project of the Rimini Canal in Italy from the point of view of infrastructure, transportation and public spaces. Alwah et al. [11] worked with n = 7 experts in selecting tools and factors for content validity indexes to determine to what extent public spaces meet users’ requirements. All the experts were professors in urban design and landscape architecture.

Rivière et al. [12] performed semi-structured interviews with n = 11 architects working in the design of hospital facilities about their use of virtual environments during the design phase. Architects usually confer with health care professionals to make their assessments of indoor environments.

Keil et al. [13] explored physiological sensors while projecting a virtual urban scene into a head-mounted display. They enrolled n = 6 participants to study the influence of traffic density and noise.

Scerri and Attar [14] created a virtual platform to engage the local community in street experimentation. To design it, they performed semi-structured interviews with n = 4 stakeholders to gather ideas about interventions, which were later presented to n = 16 laypersons as digital illustrations of the street with a short description of each intervention.

In [15], we investigated the evaluation protocol of Gomez et al. [16] to compare the perceptions of architectural ambiances of laypersons and n = 6 AEC experts. Over all the categories of characterization proposed, laypersons expressed more heterogeneous opinions, while there were fewer compromises in professionals’ evaluations and more optimistic feelings. In this article, we extend some aspects of this previous study to compare the outputs of AI to experts’ responses.

2.2. Artificial Intelligence for Urban Analytics

Natural Language Processing (NLP) can be applied to the analysis of urban data. Its application in urban studies, particularly urban design, is still recent, as noted by Cai [17]. Existing NLP approaches for urban design primarily focus on evaluating emotional responses to urban environments using textual input. Sentiment analysis, often applied to geotagged social media data, is a common method. Examples include analyzing hashtags from image-sharing platforms like Instagram [18] or Flickr [19], or processing reviews from platforms such as Yelp [20] and Airbnb [21].

Computer vision can also be a tool in AI-driven urban analytics, particularly in the semantic segmentation of street view imagery. A comprehensive review by Biljecki and Ito [22] highlights the widespread use of street view data across diverse research domains, including vegetation analysis and transportation studies. Machine learning algorithms [23,24,25] enable the processing of still images extracted from street view imagery, facilitating valuable insights for urban studies.

Multimodal Large Language Models (MLLMs) present a promising new approach to assessing architectural ambiance. Their ability to process both visual and textual information allows them to analyze images and provide textual interpretations. However, it remains to be seen how well these models handle the inherently subjective nature of ambiance assessment. Although they are not specifically trained for this purpose, these readily available MLLMs may still produce useful results.

The rise of Multimodal Large Language Models (MLLMs), such as GPT-4 [26], DeepSeek v3 [27], Grok 3 [28] and Gemini 2.0 [29], enables new opportunities for automated urban analytics by merging textual and visual processing. These models leverage the language understanding of LLMs along with image analysis, enabling seamless interpretation of both modalities. However, their application in urban scene analysis remains in its infancy [30,31].

Liang et al. [30] examined GPT-4’s ability to detect temporal changes in streetscapes by analyzing street view images from different time periods. Their study evaluated how well GPT-4 could assess visual quality based on prompts that required analyzing various aspects of a scene depicted in four street view images, captured from four different directions at two distinct time periods. The evaluation focused on nine independent aspects, including facades, road damage, signboards, greenery and street furniture. GPT-4 was given three response options: Positive, Negative or No Change.

Malekzadeh et al. [31] explored the use of the ChatGPT model to estimate urban attractiveness by analyzing street view imagery with simple text prompts. They instructed GPT-4 to rate the overall visual appeal of scenes on a scale from 1 to 7, and compared its rating to human assessments.

This paper investigates the capability of the latest versions of ChatGPT, Gemini and Grok in the assessment of architectural ambiance by comparing their interpretations of various urban scenes to the subjective grading of a corpus of human AEC experts. To predict, at the design stage, the urbanistic interest generated by a built environment, the similarity between their outputs and human sensations is measured using questions in which the participants express their feelings about dimensions such as scale and size, degree of enclosure, architectural style and overall impressions.

3. Materials and Methods

3.1. Case Study

In the context of a real estate construction project, several architecture firms—Atelier M3, Leclercq Associés, Atelier 2/3/4, Agence Pietri and BASE—joined efforts to produce BIMs of buildings, roads, sidewalks, landscape and urban furniture for a future 20-hectare borough in the south of Paris, France. Benefiting from the collaboration of the client of the real estate project—SEMOP Châtenay-Malabry Parc-Centrale—we collected these BIMs and assembled them in a 3D city model for visualization purposes. Four virtual reality visits were produced for different parts of the borough, as illustrated by Figure 1.

The virtual environment simulated a 3D architectural representation of several parts of a peri-urban borough on a sunny day. The colors are bright and clear, as they often are in a real estate project presentation. The images were taken at street level, which allows the buildings to be seen in their urban context. The slightly low-angle perspective highlights the height and structure of the buildings.

Scene 1 is the business district of the borough. It is made up of three main buildings: The first building has an angular and asymmetrical shape, with a largely glazed lower part and an upper part made up of several floors with regular rectangular windows. There is a significant gap between the glass base and the upper floors, creating an interesting visual effect. The regular arrangement of the windows of the upper floor gives an orderly appearance to the upper part of the building, contrasting with the freer and more open appearance of the glazed base. The base of the building is made up entirely of large bay windows, with an open space inside. This transparency also allows visual interaction with the external environment. There is variation in the heights and shapes of the buildings. The central one is distinguished by a canopy-shaped entrance, creating a dynamic and original visual effect. This variation breaks the monotony and brings a strong identity to the whole. A fence is also visible in the foreground, delimiting a private business space. Large glass surfaces are present on the facades, suggesting a desire to maximize the supply of natural light inside the buildings. The buildings are organized around a central pedestrian alley with light vegetation. This organization promotes circulation and interactions between the different buildings. There is a difference in height between the buildings, which energizes the whole.

Scene 2 illustrates a strip mall, an ensemble of shops at the ground floor of a series of buildings with different styles. The buildings have a contemporary style with clean lines and mostly white facades. We can see a variety in the heights and depths of the buildings, which energizes the perspective. Some buildings have balconies with glass or metal railings. The perspective is accentuated by the layout of the buildings and the vanishing line of the street, giving an impression of depth. The ground floor is clearly intended for commercial premises, with large windows. The spaces are available for rental to businesses. Some access doors are framed in bright colors, such as orange, which brings a touch of dynamism to the whole. It is a typical urban environment with the presence of a sidewalk, a curb and urban design elements such as bicycle racks and a fenced area. The presence of trees indicates a desire to integrate some greenery into the urban environment. Overall, the images give an impression of modernity, functionality and commercial potential. The architecture is simple but efficient, and the commercial spaces on the ground floor provide opportunities for various activities.

Scene 3 revolves around a typical suburban residential area. The buildings feature a modern style with clean lines and rectangular shapes. The emphasis is on simplicity and functionality. Materials are in a light, almost white color, which gives a clean and bright appearance. The windows are highlighted by black frames, creating a strong contrast. Black metal railings on balconies and windows also contribute to the contemporary style and provide functional safety. The arrangement of rectangular windows of different sizes creates an interesting visual rhythm on the facade. The balconies are clearly visible and appear spacious. They provide private outdoor space for residents. The most distinctive elements are the round corner buildings located on the street corners. Their architecture is contemporary, with curved lines and balconies that soften the angles. The public space in front of the buildings is treated with care, with trees planted along the street to provide shade and greenery. The linear perspective of the street invites the gaze towards the background of the image.

Scene 4 gives an illustration of a green urban identity. It is a more secluded residential complex, with more vegetation and a recreational area. The buildings have a contemporary design with clean lines, flat roofs and large windows. This feature softens the overall look and gives it a more organic aesthetic. The dominant colors are white and gray, creating a neutral and sophisticated look. The facades are mainly white, which gives them a bright appearance and highlights the openings. We observe an alternation of rectangular windows of different sizes, some with balconies or loggias, which energizes the composition. The ground is paved with light-colored slabs, creating a clean and uniform surface. There are numerous young, leafy trees scattered throughout the scene, indicative of a green space and recreational area. The buildings are situated in an open area, suggesting a spacious and potentially public plaza favorable for social interactions. The open space and the presence of trees were planned to create a pleasant and livable environment. The overall impression meant to be conveyed is one of a well-planned and aesthetically pleasing urban development.

3.2. Methodology

The experimental protocol is illustrated in Figure 2. The survey was conducted with n = 6 AEC experts involved in all the phases of the real estate project, from the start of the planning of the borough with the stakeholders to the drawing of the built environment and now the construction phase. The participants were three women and three men aged 32 to 45 years old. The voluntary participants were all adults. Each participant was sent a link to an online questionnaire made up of five main parts. Participants could access the questionnaire online from their computer, laptop or handheld devices as long as they had an internet connection.

The questionnaire included four identical parts, each featuring a 1 min video of a virtual tour. The complete protocol is described in [15]. Participants were invited to imagine themselves visiting the urban space and encouraged to express their feelings regarding scale and size, degree of enclosure and architectural style during the virtual tour. These same types of characterizations were made by ChatGPT (GPT-4o), Gemini 2.0 and Grok 3 based on the images illustrated in Figure 1. From a textual prompt we provided, they output their understanding of how humans are likely to perceive and react to these visual cues. Several architecture aesthetic aspects were investigated for each urban scene. For each one, a set of possible mood attributes was proposed, and they are summarized in Table 1. We selected space and scale and degree of enclosure because they are classical elements of urban design quality [5,32], and added architectural style and overall feelings from the protocol of Gomez et al. [15,16]. Other aspects such as imageability, transparency and complexity were not investigated because we wanted to focus our analysis on the response of AI, not on urban design qualities.

ChatGPT was developed by OpenAI, Gemini by Google DeepMind and Grok by xAI. The exact implementations of ChatGPT, Grok and Gemini are proprietary to these companies, but some technical details are disclosed in the literature or by questioning the MLLMs directly. An MLLM does not use a single discrete algorithm; its analysis is the result of a complex interplay of several machine learning techniques and models, primarily focused on Natural Language Processing and computer vision. It relies heavily on deep neural networks: typically a vision encoder (convolutional or transformer-based) for image tasks and an attention-based transformer LLM for text processing. Typically, the image is fed into the vision component, which extracts relevant features such as shapes, objects and textures, and the system analyzes the relationships between these features to understand the overall scene. It then connects the visual elements to the abstract concepts presented in the textual prompt provided by the user. For example, the presence of tall buildings and a narrow street suggests a feeling of enclosure. The LLM generates a coherent and descriptive text based on the information extracted from the image, structuring the text in a logical and informative way. For each aspect, ChatGPT (GPT-4o), Grok 3 and Gemini 2.0 proposed a single attribute from the available choices shown in Table 1, with an argumentative explanation of the choice they made.
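Since each model returns one attribute together with a free-text justification, the attribute has to be isolated from the reply before it can be compared with expert votes. The following is a minimal sketch of one way this could be done; it is not the procedure used in this study. The function name `extract_choice`, the option list and the sample reply are all hypothetical and serve only as illustration.

```python
import re
from typing import List, Optional

# Hypothetical option list for one aspect (degree of enclosure).
ENCLOSURE_OPTIONS = ["Protection", "Calmness", "Freedom", "Animation"]


def extract_choice(reply: str, options: List[str]) -> Optional[str]:
    """Return the option named in the reply, or None if zero or several are named."""
    found = [
        opt
        for opt in options
        if re.search(rf"\b{re.escape(opt)}\b", reply, flags=re.IGNORECASE)
    ]
    # An ambiguous reply (several options cited) is flagged rather than guessed.
    return found[0] if len(found) == 1 else None


# Made-up reply, written for this example only (not a quotation from any model).
reply = "I would choose Calmness: the clean lines and orderly layout feel tranquil."
choice = extract_choice(reply, ENCLOSURE_OPTIONS)
```

Returning `None` for ambiguous replies keeps the comparison conservative: a reply that weighs two attributes against each other would need to be read by a human rather than silently assigned to one of them.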

4. Results

The instruction given to the chatbots and AEC experts for each of the four virtual visits was to imagine visiting the urban environment shown. We asked them to characterize this space according to four factors: (1) space and scale, (2) enclosure, (3) architectural style and (4) general feelings.
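To make the structure of such an instruction concrete, the sketch below assembles a prompt from the four factors and their candidate attributes. The exact wording of the prompt used in the study is not reproduced here, so the sentences in `build_prompt` are assumptions; the attribute labels are those named in the results of this section, and the list for overall feelings may be incomplete with respect to Table 1.

```python
from typing import Dict, List

# Attribute labels as named in the Results; the prompt wording itself is assumed.
ASPECTS: Dict[str, List[str]] = {
    "Space and scale": ["Balance", "Grandeur", "Restlessness"],
    "Degree of enclosure": ["Protection", "Calmness", "Freedom", "Animation"],
    "Architectural style": [
        "Elegance/Satisfaction",
        "Simplicity/Serenity",
        "Eccentrism/Surprise",
    ],
    "Overall feelings": [
        "Indifference/Unnoticed",
        "Joy/Theatricality",
        "Sadness/Nostalgia",
        "Emotion/Spirituality",
    ],
}


def build_prompt(aspects: Dict[str, List[str]]) -> str:
    """Build a closed-choice prompt: one numbered line per aspect with its options."""
    lines = [
        "Imagine you are visiting the urban environment shown in the attached images.",
        "For each aspect below, choose exactly one attribute and justify your choice.",
        "",
    ]
    for i, (aspect, options) in enumerate(aspects.items(), start=1):
        lines.append(f"({i}) {aspect}: " + ", ".join(options))
    return "\n".join(lines)


prompt = build_prompt(ASPECTS)
print(prompt)
```

Listing the options explicitly in the prompt is what makes the model outputs directly comparable with the closed-choice questionnaire answered by the experts.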

The first aspects of architectural aesthetics are space and scale. The humans and MLLMs were asked which feeling the space evoked most strongly out of three choices: Balance, Grandeur and Restlessness. Figure 3 displays the results obtained for all the virtual visits. In three out of four virtual visits, the chatbots and the AEC experts were predominantly in accordance. The office district mostly evoked a sense of Grandeur among AEC experts and Grok, whereas ChatGPT and Gemini, aiming to reflect the likely experience of someone visiting this urban district, characterized it as Balance, agreeing with a minority of the experts.

For this urban identity, ChatGPT [33] argued that "The proportionate design of the buildings and the openness of the space offer a harmonious feel", while Gemini [34] stated that "The urban environment appears harmonious, with a mix of open spaces and built structures that create a sense of equilibrium." They made a similar analysis, as both cited the impression of open space and harmony. These arguments are convincing; however, they do not explain why the shown space does not feel vast and grand, which are some of the first things that can be noticed from the images illustrated in Figure 1a. This was captured by Grok [35], which stated that "The buildings in the images are tall, modern, and have a significant presence with their sleek, angular designs and large glass facades. (…) I would choose Grandeur as the feeling, as the architecture and layout suggest a sense of awe and majesty due to the height and expansive design of the buildings."

ChatGPT, Gemini and Grok classified the other scenes as Balance, agreeing with the majority of experts, but with other arguments. ChatGPT felt a sense of proportion and uniformity in the strip mall, harmony and a well-planned atmosphere in the urban residences and a sense of order and harmony in the green residences, without overwhelming Grandeur or Restlessness in any of the scenes. In the strip mall scene, Gemini deemed the buildings to be of moderate height, not towering or imposing, and the street widths to be comfortable, neither excessively wide nor narrow. In the urban residences scene, Gemini argued that the presence of trees and a pedestrian-friendly street level contributed to a sense of balance between the built environment and human scale. In the green residences, it viewed the urban environment as well proportioned, with a human-scale design. It did not judge any of the scenes to be vast and grand or cramped and restless. Grok argued that the scale feels harmonious and well proportioned in the strip mall and green residences, and feels proportionate and human scaled, neither cramped nor grandiose, in the urban residences.

Figure 4 illustrates the results obtained for the degree of enclosure aspect. Four options were available: Protection, Calmness, Freedom and Animation. ChatGPT and Gemini chose the same attribute for all the scenes, Calmness, whereas expert opinions showed more variety. Once again, the business district showed more differences between AI and human feelings; experts expressed mostly a feeling of Animation or Freedom, and not one of them felt Calmness. The AI based their choice on the clean lines, orderly arrangement and well-defined boundaries, which provide the sentiments of security and tranquility. Only Grok chose Animation for the degree of enclosure, as the space felt dynamic and engaging to it, with a sense of activity and movement encouraged by the design. Grok argued that the balance between openness and the presence of the buildings suggested a lively yet contained atmosphere.

Scene 2 showing the strip mall and scene 3 showing the urban residences triggered a lot of disagreement between experts, with no clear emotion winning the vote tally. This is a typical result when dealing with the subjective assessment of degree of enclosure. The only virtual visit where the opinion was mostly shared was the visit to the green residences, with 83% of experts voting for Calmness, which was to be expected in an environment with a lot of vegetation. ChatGPT and Gemini based their "Calmness" assessment mostly on the orderly arrangement of trees and the open space of the walkways, but, across all the virtual visits, their opinions lacked real variety for this enclosure aspect. More simply, they experienced difficulty in finding creative arguments on such a subjective aspect. It appears that the characterization of degree of enclosure by AI-based methods from a visual scene has room for improvement.

For the urban scene of the strip mall, Grok chose Animation as the degree of enclosure, as the space felt dynamic and inviting, with a sense of activity encouraged by the commercial ground-floor spaces and pedestrian-friendly design. This argument made Grok more convincing than the other two AIs, as it was less creative but more grounded in facts.

Architectural style was also investigated, with a set of emotions given as possible choices: Elegance/Satisfaction, Simplicity/Serenity and Eccentrism/Surprise. The AEC expert opinions are given as the first line of results in Figure 5. The responses show a mix of Elegance/Satisfaction, Simplicity/Serenity and Eccentrism/Surprise, depending on the visit. The business district had a mix of all three attributes but was mostly perceived as Eccentrism/Surprise, while the shopping street was dominated by Simplicity/Serenity. The urban residences had a mix of Simplicity/Serenity and some Eccentrism/Surprise, and, finally, the green residences strongly favored Simplicity/Serenity.

The business areas evoked more varied perceptions. Eccentrism/Surprise was the attribute most often associated with the business district by the experts, but the three AIs gave divergent opinions. Grok selected Eccentrism/Surprise, as did 50% of the experts, as the architecture stands out as unconventional and eye-catching, evoking a reaction of curiosity and amazement [35]. ChatGPT was second best as it selected Elegance/Satisfaction, like 33% of the experts, while Gemini chose Simplicity/Serenity, which was the third choice of the experts.
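The ranking of the three models for the business district can be made explicit by computing, for each model, the share of experts who selected the same attribute. The sketch below does this under one assumption: with n = 6 experts, the reported 50% and 33% correspond to 3 and 2 votes, and the single remaining vote is inferred to be Simplicity/Serenity, since it is described as the experts' third choice. The function name `agreement_share` is illustrative.

```python
from collections import Counter
from typing import Dict

# Expert votes for architectural style in the business district (n = 6).
# The single Simplicity/Serenity vote is inferred, not stated as a percentage.
expert_votes = Counter(
    {"Eccentrism/Surprise": 3, "Elegance/Satisfaction": 2, "Simplicity/Serenity": 1}
)

# Attribute selected by each model for the same scene, as reported in the text.
model_choice: Dict[str, str] = {
    "Grok": "Eccentrism/Surprise",
    "ChatGPT": "Elegance/Satisfaction",
    "Gemini": "Simplicity/Serenity",
}


def agreement_share(choice: str, votes: Counter) -> float:
    """Fraction of experts who picked the same attribute as the model."""
    total = sum(votes.values())
    return votes.get(choice, 0) / total


shares = {model: agreement_share(c, expert_votes) for model, c in model_choice.items()}
ranking = sorted(shares, key=shares.get, reverse=True)

for model in ranking:
    print(f"{model}: {shares[model]:.0%} of experts agree")
```

This share is a simple descriptive measure; with only six experts it should not be read as a statistically robust agreement score.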

Feelings of Eccentrism/Surprise were never chosen by Gemini and ChatGPT, reflecting a certain sense of conservatism in AI responses. AI models are trained to be reliable and predictable, which might make them less likely to assign extreme or unexpected emotions. This results in a safer, more uniform response like Simplicity/Serenity even in cases where humans would perceive more variation. Grok seems to have the ability to express more original feelings.

Figure 6 illustrates the responses of AEC experts and compares them to ChatGPT, Grok and Gemini's outputs regarding the overall feelings evoked by the four virtual visits. For this characterization of general feelings, the AEC experts had a less diverse emotional response to the urban spaces than for the previous factors, as they heavily tended to vote for Indifference/Unnoticed sensations, at least 50% of the time. They considered the business district as evoking mostly Indifference (50%), but also Joy/Theatricality (33%) and some Sadness/Nostalgia (17%). The shopping street and urban residences were predominantly characterized as evoking Indifference (83%), with a small percentage (17%) of Joy/Theatricality and Emotion/Spirituality. The green residences evoked a balanced mix of Indifference and Emotion/Spirituality.
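Because the panel comprises n = 6 experts, the reported percentages are rounded multiples of one sixth, so that 17% corresponds to a single expert and 83% to five of them. A minimal sketch of this conversion follows; the helper name `percent` is illustrative.

```python
N_EXPERTS = 6  # size of the AEC expert panel


def percent(votes: int, n: int = N_EXPERTS) -> int:
    """Vote count expressed as a percentage of the panel, rounded to an integer."""
    return round(100 * votes / n)


# Every percentage that can occur with six experts.
table = {votes: percent(votes) for votes in range(N_EXPERTS + 1)}
print(table)  # {0: 0, 1: 17, 2: 33, 3: 50, 4: 67, 5: 83, 6: 100}
```

Read this way, a difference of 17 points between two attributes amounts to the opinion of one person, which is worth keeping in mind when interpreting the figures.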

Overall, Gemini displayed a stronger differentiation between environments than ChatGPT, but the feelings of emotion it expressed in visit 1 were not shared by the experts. The AIs saw the business district and the urban residences as more engaging than the experts did. This could be due to biased training data which associate urban areas with vibrancy rather than neutrality. ChatGPT and Gemini aligned with the experts’ perceptions of the shopping street, while Grok chose the minority attribute. Conversely, Grok correctly captured the feeling of the majority of the experts in visit 3, while ChatGPT and Gemini were totally divergent from them. Half of the experts found the green residences scene to evoke a strong Emotion/Spirituality connection, but ChatGPT and Gemini classified it as purely Indifference. AI might lack an understanding of how nature and tranquility evoke deeper emotions. Grok also diverged from the experts, expressing Joy/Theatricality.
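The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the vote shares are the percentages quoted in the text, the "balanced mix" of the green residences is read as an even 50/50 split, and the helper names (`top_choices`, `agrees_with_experts`, `expert_support`) are hypothetical. Ties are kept, so an AI answer counts as agreeing when it matches any attribute with the highest expert share (a relative majority).

```python
# Expert vote shares (in %) quoted in the text for the "overall feeling" aspect.
# Only the two visits whose full distribution is stated are included here.
EXPERT_SHARES = {
    "business district": {
        "Indifference/Unnoticed": 50,
        "Joy/Theatricality": 33,
        "Sadness/Nostalgia": 17,
    },
    # Assumption: "balanced mix" is read as an even 50/50 split.
    "green residences": {
        "Indifference/Unnoticed": 50,
        "Emotion/Spirituality": 50,
    },
}


def top_choices(shares):
    """Return the set of attributes with the highest vote share (ties kept)."""
    best = max(shares.values())
    return {label for label, share in shares.items() if share == best}


def agrees_with_experts(shares, ai_choice):
    """True if the AI's single answer is among the experts' top choices."""
    return ai_choice in top_choices(shares)


def expert_support(shares, ai_choice):
    """Share of experts (in %) who picked the same attribute as the AI."""
    return shares.get(ai_choice, 0)


green = EXPERT_SHARES["green residences"]
# Indifference is the answer the text attributes to ChatGPT and Gemini.
print(agrees_with_experts(green, "Indifference/Unnoticed"))
# Joy/Theatricality is the answer the text attributes to Grok.
print(agrees_with_experts(green, "Joy/Theatricality"))
```

Reporting `expert_support` alongside the yes/no agreement would distinguish an answer that no expert gave from one that a sizeable minority shared.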

Let us explain some of the differences. In all visits, the majority of human experts did not find a relevant attribute other than Indifference because the other propositions did not fit their experience. ChatGPT and Gemini were in line with this attribute on two visits out of four, which can be interpreted as their natural inclination towards conservative assessment, leading to a neutral stance. The Grok model was inaccurate in assigning Theatricality to three of the visits, likely focusing on visual aesthetics and urban dynamism.

In fact, we found that the criterion of “overall feeling” was poorly defined [15]. The predominance of Indifference in the professionals’ opinions can likely be attributed to a lack of clarity regarding architectural ambiance as a general feeling. Participants may have struggled to choose between the given options because none fully aligned with their professional experience. Strong emotions like sadness, joy and spirituality are distinct, but subtler feelings might not have been adequately proposed.

This suggests room for improvement in the study, particularly in identifying nuanced emotional responses beyond indifference. This design flaw can nevertheless be used as a probe to exhibit a general behavior of AI: false invention. In contrast to the experts, the AI models appeared to be compelled to provide a distinctive response. In doing so, they generated hallucinations, fabricating interpretations and exhibiting creative tendencies while ultimately producing inaccurate outputs.

5. Discussion

5.1. Observations on AIs

On the first three aspects—scale and size, enclosure and style—Grok gave an answer corresponding to the (sometimes relative) votes of the majority of experts, while ChatGPT and Gemini were correct in 75% of the cases (nine times out of twelve). On the overall feeling aspect of architectural ambiance, ChatGPT and Gemini were correct in 50% of the visits because the question favored their tendency towards conservatism, while Grok was correct in only one visit out of four.
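The percentages above follow directly from the counts. As a quick check, the short sketch below recomputes them, assuming three aspects times four visits gives the twelve questions mentioned in the text; the variable and function names are illustrative only.

```python
from fractions import Fraction

ASPECTS = 3  # scale and size, enclosure, style
VISITS = 4   # the four urban identities

# (correct answers, questions asked), as stated in the text
TALLY = {
    ("ChatGPT and Gemini", "first three aspects"): (9, ASPECTS * VISITS),
    ("ChatGPT and Gemini", "overall feeling"): (2, VISITS),
    ("Grok", "overall feeling"): (1, VISITS),
}


def percent_correct(correct, total):
    """Exact agreement rate, expressed in percent."""
    return float(Fraction(correct, total) * 100)


for (model, scope), (correct, total) in TALLY.items():
    rate = percent_correct(correct, total)
    print(f"{model}, {scope}: {correct}/{total} = {rate:.0f}%")
```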

This led us to conclude that AI can be used as a helpful tool when the questions asked are accurate enough, which can be challenging to achieve when dealing with subjective notions. AI models may have a bias toward predictability and simplicity in their classifications. Humans experience a wider range of perceptions based on their personal experiences and, in the case of AEC experts, occupational habits. The AIs may need more nuanced training data to better capture the diversity of human perceptions in architecture and urban design. The datasets likely emphasize descriptive and functional attributes rather than emotional reactions, which are more complex and context dependent.

The AI results may have been more accurate if we had provided examples of urban scenes for each attribute to define them more clearly. However, the experiments were run as a zero-shot learning problem: in the test phase, an AI model observes samples from classes that were not observed during training and has to predict the class to which they belong. The human participants were given the same condition, with no clear definition of the attributes provided as a preamble to the questioning. Even though they were AEC experts, they were not familiar with the aesthetic attributes proposed in the experimental protocol. It is also important to use the AI in real conditions, where a person with no training on conversational agents can input a simple prompt for a quick answer.
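To make the zero-shot setting concrete, the following sketch composes a question with its closed list of options and nothing else (no definitions, no example scenes), then maps a free-text reply back to one option. It is a hypothetical illustration: the questions are those worded in the figure captions and the options come from Table 1, but the prompt template, the function names and the simple keyword matching are assumptions, not the prompts actually submitted to the chatbots. No model is called here; the image would be attached separately in the chat interface.

```python
# Answer options per aspect, taken from Table 1.
OPTIONS = {
    "Space and scale": ["Balance", "Grandeur", "Restlessness"],
    "Enclosure": ["Protection", "Calmness", "Freedom", "Animation"],
    "Style": ["Elegance/Satisfaction", "Simplicity/Serenity", "Eccentrism/Surprise"],
    "Overall": ["Indifference/Unnoticed", "Emotion/Spirituality",
                "Joy/Theatricality", "Sadness/Nostalgia"],
}

# Questions as worded in the figure captions.
QUESTIONS = {
    "Space and scale": "What feeling do you feel the most?",
    "Enclosure": "What degree of enclosure would you feel inside the scene?",
    "Style": "What do the buildings inspire in you?",
    "Overall": "What kind of feelings describe most your sensation inside the scene?",
}


def build_zero_shot_prompt(aspect):
    """Compose the question and its options, with no definitions or examples."""
    choices = ", ".join(OPTIONS[aspect])
    return f"{QUESTIONS[aspect]} Choose one of: {choices}."


def parse_choice(aspect, reply):
    """Return the option mentioned first in a free-text reply, or None."""
    lowered = reply.lower()
    best = None  # (position in reply, option label)
    for option in OPTIONS[aspect]:
        # Accept either half of a paired label such as "Joy/Theatricality".
        for part in option.lower().split("/"):
            pos = lowered.find(part)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, option)
    return best[1] if best else None


print(build_zero_shot_prompt("Space and scale"))
print(parse_choice("Style", "The buildings inspire surprise."))
```

Plain substring matching is deliberately naive: it would misread a negated answer such as "no grandeur", so a reply that mentions several attributes is better checked by hand.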

Finally, humans bring personal memories and cultural influences into their perceptions of spaces. AI models, on the other hand, rely on statistical patterns in text and may gravitate toward more common, neutral descriptions rather than capturing subtle emotional variations. Also, AI models rely on text-based pattern matching, which may not fully grasp why a space evokes feelings such as spirituality or nostalgia. Therefore, their outputs—despite being helpful—have to be handled with caution.

5.2. Limitations and Future Research

Our work was deliberately limited to a few aspects of urban design quality to focus on analyzing AI outputs. To address this intentional limitation, we plan to expand the set of questions to include other aspects, such as imageability, transparency and complexity, to cover the multifaceted nature of urban ambiances.

The prompts used as inputs to question the chatbots were simplified to closely match the questions submitted to participants. No explanations or definitions of any attribute were provided, requiring each participant to respond based on their personal interpretation of the propositions. This is a limitation of our study, as different prompts can lead to different AI responses [30,31]. An extension of this study could focus on refining the prompts to help AIs better align with the professional opinions.

The number of experts involved in the study was limited because of online recruitment and the need to ensure that all participants were AEC professionals. The only reliable way to verify their expertise was to recruit individuals directly involved in the real estate project we were partnered with. In the future, we plan to enlarge this panel with the participation of stakeholders and urban planning specialists.

For each visit, three points of view were arbitrarily selected to illustrate the portion of the neighborhood under study. We chose the most relevant point of view with a slightly low-angle perspective to highlight the height and structure of the buildings. However, different viewpoints might generate alternative perspectives on the scene, potentially leading to different AI evaluations. This assertion warrants further investigation.

The virtual environments were generated in daytime with clear weather; the nighttime context and seasons were not investigated during this study. These are important factors, as seasons have an effect on mood and nighttime has an effect on the sensation of security. Exploring these aspects could be a valuable direction for future research.

6. Conclusions

We investigated the capability of MLLMs in perceiving urban environments based on images and textual prompts. We compared the outputs of three popular models, ChatGPT, Gemini and Grok, with the visual assessments of AEC experts within the context of a real estate construction project.

Our analysis was based on subjective attributes used to characterize various aspects of the built environment. Four urban identities served as case studies in a virtual environment designed from professional 3D models.

We found that human and AI evaluations aligned on certain aspects, such as space, scale and architectural style, and showed greater agreement in environments with vegetation. Grok was notably more in line with the opinions of the experts in aesthetics assessment for the four urban identities, often showing less uniform judgment than ChatGPT or Gemini. However, notable differences existed in the response patterns, particularly regarding subjective aspects like the general feelings evoked by some urban identities. This raises questions about generative AI hallucinations, where the AI invents information and exhibits creative behavior yet produces inaccurate outputs.

AI-powered tools offer the potential to revolutionize public space audits by providing quick assessments and generating creative ideas. However, more research is needed to determine how well they capture real-world experiences and align with expert evaluations in urban planning and design. MLLMs offer valuable insights and can assist in the preliminary assessment of architectural projects, but expert evaluation remains essential to capture the nuances of human perception. AI models are trained on datasets that may emphasize functional descriptions of spaces rather than the emotional impact of urban design.

AI models currently prioritize consistency over diversity in perception, leading to a bias toward neutral and predictable descriptions such as Balance, Calmness or Simplicity/Serenity. With more specialized training datasets, multimodal learning and personalization, future AI models could more accurately reflect the range of human emotions tied to urban environments. For instance, AIs could be fine-tuned with human survey data, psychological studies on architecture and emotional reactions to spaces to better capture diverse perceptions.

Funding

The design of the 3D city model used in this work was funded by the E3S project, a partnership between Eiffage and the I-SITE FUTURE consortium. FUTURE benefits from French State funding managed by the Agence Nationale de la Recherche (ANR) under the Investissements d’Avenir program (reference ANR-16-IDEX-0003), in addition to the contributions of the institutions and partners involved. This work has received support under the program “France 2030” launched by the French Government and implemented by ANR, with the reference ANR-21-EXES-0007.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to Legal Regulations, Article L1121-1 Code de la santé publique (public health code defining the three categories of research involving humans): https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI000046125746, accessed on 1 February 2025.

Informed Consent Statement

Informed consent for publication was obtained from all identifiable human participants. This study was conducted in compliance with the General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679). All personal data were processed in accordance with GDPR principles, ensuring confidentiality, integrity and restricted access. Participants provided informed consent, and data were anonymized to protect individual privacy.

Data Availability Statement

Image data are available upon request to the author.

Acknowledgments

The author would like to thank Arcadis for providing the assembled IFC files of the 3D building models, as well as SEMOP Châtenay-Malabry Parc Centrale (Eiffage Aménagement and the City of Châtenay-Malabry) for granting permission to use them for publication.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Ewing, R.; Clemente, O.; Neckerman, K.M.; Purciel-Hill, M.; Quinn, J.W.; Rundle, A. Measuring Urban Design: Metrics for Livable Places; Island Press: Washington, DC, USA, 2013; Volume 200.
2. Koschinsky, J.; Talen, E.; Alfonzo, M.; Lee, S. How walkable is Walker’s paradise? Environ. Plan. B Urban Anal. City Sci. 2017, 44, 343–363.
3. Liao, B.; van den Berg, P.E.W.; van Wesemael, P.J.; Arentze, T.A. How does walkability change behavior? A comparison between different age groups in the Netherlands. Int. J. Environ. Res. Public Health 2020, 17, 540.
4. Karatas, P.; Tuydes-Yaman, H. Variability in sidewalk pedestrian level of service measures and rating. J. Urban Plan. Dev. 2018, 144, 04018042.
5. Ewing, R.; Handy, S.; Brownson, R.C.; Clemente, O.; Winston, E. Identifying and measuring urban design qualities related to walkability. J. Phys. Act. Health 2006, 3, S223–S240.
6. Belaroussi, R.; Issa, E.; Cameli, L.; Lantieri, C.; Adelé, S. Exploring Virtual Environments to Assess the Quality of Public Spaces. Algorithms 2024, 17, 124.
7. Biljecki, F.; Chow, Y.S.; Lee, K. Quality of crowdsourced geospatial building information: A global assessment of OpenStreetMap attributes. Build. Environ. 2023, 237, 110295.
8. Gomez-Tone, H.C.; Alpaca Chávez, M.; Vásquez Samalvides, L.; Martin-Gutierrez, J. Introducing Immersive Virtual Reality in the Initial Phases of the Design Process; Case Study: Freshmen Designing Ephemeral Architecture. Buildings 2022, 12, 518.
9. Harputlugil, T.; Gültekin, A.T.; Prins, M.; Topçu, Y.İ. Architectural design quality assessment based on analytic hierarchy process: A case study. METU J. Fac. Archit. 2014, 31, 139–161.
10. Corticelli, R.; Pazzini, M.; Mazzoli, C.; Lantieri, C.; Ferrante, A.; Vignali, V. Urban Regeneration and Soft Mobility: The Case Study of the Rimini Canal Port in Italy. Sustainability 2022, 14, 14529.
11. Alwah, A.A.; Li, W.; Alwah, M.A.; Shahrah, S. Developing a quantitative tool to measure the extent to which public spaces meet user needs. Urban For. Urban Green. 2021, 62, 127152.
12. Rivière, J.P.; Vinet, L.; Prié, Y. Towards the use of virtual reality prototypes in architecture to collect user experiences: An assessment of the comparability of patient experiences in a virtual and a real ambulatory pathway. Int. J. Hum.-Comput. Stud. 2024, 192, 103342.
13. Keil, J.; Edler, D.; Schmitt, T.; Dickmann, F. Creating Immersive Virtual Environments Based on Open Geospatial Data and Game Engines. KN-J. Cartogr. Geogr. Inf. 2021, 71, 53–65.
14. Scerri, K.; Attard, M. People as planners: Stakeholder participation in the street experimentation process using a virtual urban living lab. J. Urban Mobil. 2023, 4, 100063.
15. Belaroussi, R.; González, E.D.; Dupin, F.; Martin-Gutierrez, J. Appraisal of Architectural Ambiances in a Future District. Sustainability 2023, 15, 13295.
16. Gómez-Tone, H.C.; Martin-Gutierrez, J.; Bustamante-Escapa, J.; Bustamante-Escapa, P. Spatial Skills and Perceptions of Space: Representing 2D Drawings as 3D Drawings inside Immersive Virtual Reality. Appl. Sci. 2021, 11, 1475.
17. Cai, M. Natural language processing for urban research: A systematic review. Heliyon 2021, 7, e06322.
18. Jang, K.M.; Kim, Y. Crowd-sourced cognitive mapping: A new way of displaying people’s cognitive perception of urban space. PLoS ONE 2019, 14, e0218590.
19. Redi, M.; Aiello, L.M.; Schifanella, R.; Quercia, D. The spirit of the city: Using social media to capture neighborhood ambiance. Proc. ACM Hum.-Comput. Interact. 2018, 2, 1–18.
20. Olson, A.W.; Calderón-Figueroa, F.; Bidian, O.; Silver, D.; Sanner, S. Reading the city through its neighbourhoods: Deep text embeddings of Yelp reviews as a basis for determining similarity and change. Cities 2021, 110, 103045.
21. Wang, J.; Chow, Y.S.; Biljecki, F. Insights in a city through the eyes of Airbnb reviews: Sensing urban characteristics from homestay guest experiences. Cities 2023, 140, 104399.
22. Biljecki, F.; Ito, K. Street view imagery in urban analytics and GIS: A review. Landsc. Urban Plan. 2021, 215, 104217.
23. Liang, X.; Zhao, T.; Biljecki, F. Revealing spatio-temporal evolution of urban visual environments with street view imagery. Landsc. Urban Plan. 2023, 237, 104802.
24. Ibrahim, M.R.; Haworth, J.; Cheng, T. Understanding cities with machine eyes: A review of deep computer vision in urban analytics. Cities 2020, 96, 102481.
25. Chen, S.; Biljecki, F. Automatic assessment of public open spaces using street view imagery. Cities 2023, 137, 104329.
26. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774.
27. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437.
28. xAI. Grok 3 Beta—The Age of Reasoning Agents. 2025. Available online: https://x.ai/news/grok-3 (accessed on 21 February 2025).
29. Fu, C.; Zhang, R.; Wang, Z.; Huang, Y.; Zhang, Z.; Qiu, L.; Ye, G.; Shen, Y.; Zhang, M.; Chen, P.; et al. A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise. arXiv 2023, arXiv:2312.12436.
30. Liang, H.; Zhang, J.; Li, Y.; Wang, B.; Huang, J. Automatic Estimation for Visual Quality Changes of Street Space Via Street-View Images and Multimodal Large Language Models. IEEE Access 2024, 12, 87713–87727.
31. Malekzadeh, M.; Willberg, E.; Torkko, J.; Toivonen, T. Urban attractiveness according to ChatGPT: Contrasting AI and human insights. Comput. Environ. Urban Syst. 2025, 117, 102243.
32. Xiao, Y.; Song, M. How are urban design qualities associated with perceived walkability? An AI approach using street view images and deep learning. Int. J. Urban Sci. 2024, 1–26.
33. OpenAI. Analysis of Urban Design Factors Based on Visual Input, 2025. AI-Generated Response from ChatGPT on January 14, 2025. Retrieved via Interaction. Available online: https://chatgpt.com/ (accessed on 30 January 2025).
34. Gemini, G. Qualitative Image Analysis Using Gemini. 2025. Retrieved via Interaction. Available online: https://gemini.google.com/app (accessed on 30 January 2025).
35. Grok 3 (xAI). Personal Communication Regarding Urban Environment Analysis Based on Provided Images, 2025. Analysis Provided via AI Interaction. Available online: https://grok.com/ (accessed on 19 February 2025).

Figure 1. Overview of the large-scale virtual environment. Four virtual visits were designed, representing different urban identities. Sources: SEMOP Châtenay-Malabry Parc-Centrale.

Figure 2. Experimental protocol: comparison of AI-generated and expert opinions of mood attributes evoked by the urban environment.

Figure 3. Space and scale. What feeling do you feel the most? Gemini and GPT-4 modeled all the urban scenes as Balanced. Grok was more in line with AEC experts on all visits.

Figure 4. Enclosure. What degree of enclosure would you feel inside the scene? Human participants versus GPT-4 and Gemini categorization with four choices: Protection, Calmness, Freedom and Animation.

Figure 5. Architecture style. What do the buildings inspire in you? Human participants versus GPT-4 and Gemini categorization with three choices: Elegance, Simplicity and Eccentrism.

Figure 6. Overall feeling. What kind of feelings describe most your sensation inside the scene?

Table 1. Mood attributes and architectural dimensions investigated.

Aspect	Mood Attributes Evoked by the Urban Environment
Space and scale [5]	Balance	Grandeur	Restlessness
Enclosure [5]	Protection	Calmness	Freedom	Animation
Style [16]	Elegance/Satisfaction	Simplicity/Serenity	Eccentrism/Surprise
Overall [15,16]	Indifference/Unnoticed	Emotion/Spirituality	Joy/Theatricality	Sadness/Nostalgia

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Belaroussi, R. Subjective Assessment of a Built Environment by ChatGPT, Gemini and Grok: Comparison with Architecture, Engineering and Construction Expert Perception. Big Data Cogn. Comput. 2025, 9, 100. https://doi.org/10.3390/bdcc9040100

