1. Introduction
Evaluating the built environment is important for sustainable policy making in the city. A well-designed urban environment can significantly influence travel behavior, encouraging a shift towards active modes like walking and cycling, with positive effects on public health and well-being. Pleasant urban spaces that prioritize walkability and cycling can lead to improved mood, increased physical activity and enhanced urban vibrancy and vitality. Conversely, unpleasant environments can negatively impact individuals, fostering feelings of fear and stress, and potentially contributing to increased crime and car dependence. Therefore, prioritizing the quality of the built environment is essential for creating healthier, more sustainable and thriving communities.
Assessing the emotional response elicited from being in a particular architectural space is a difficult task because of its fundamentally subjective nature. Architectural ambiance refers to the overall atmosphere or mood created by the design and layout of a built environment, covering the sensory experience. The arrangement of spaces, vegetation, urban furniture, interaction and overall flow within a public space can influence the degree of liveliness within the space. Scale, proportion and volume impact the human experience of a given space. Sizable structures, large glass windows, balconies and open layouts can create a sense of openness, animation and grandeur, while smaller, enclosed spaces might evoke intimacy, protection or calmness. Architects often aim to create specific atmospheres through careful consideration of built environment elements, designing the ambiance to suit the intended functions and emotional responses desired for the space.
Traditional methods for evaluating the aesthetics and ergonomics of public spaces rely heavily on fieldwork, surveys, focus groups and document analysis [1]. These approaches are often labor intensive and time consuming. Studies involving human participants face challenges such as recruitment difficulties, as individuals may not be interested or available to participate within a reasonable timeframe. Recruiting experts in the Architecture, Engineering and Construction (AEC) field presents additional difficulties. Professionals in these sectors are often busy with their own projects, usually leaving only a small number of architects within an agency responsible for assessing the aesthetics and ergonomics of urban developments. Moreover, when evaluating prospective developments, multiple alternative projects must often be assessed before selection of the most suitable one. In such cases, a fast and automated tool could significantly aid decision-making.
Recent advances in artificial intelligence (AI) present promising opportunities for the development of computer-aided tools in the analysis of built environments. Multimodal Large Language Models (MLLMs) such as ChatGPT-4o, Gemini 2.0 and Grok 3 offer the general public a way to explore urban environments through images and conversation-based interactions. MLLMs can generate diverse creative content and provide informative answers, even to open-ended, complex or unusual questions. Despite their capabilities, the extent to which MLLM-generated insights align with human experiences of public spaces—particularly from the perspective of AEC experts—remains largely unknown. The potential of these models in analyzing architectural ambiance, a subject that is highly subjective, is still underexplored, leaving room for further research and development in this area.
The potential of AI in fields requiring subjective assessment, such as aesthetics, constitutes a research gap. Little is known about how AI generates outputs when questioned about its subjective perception of a built environment and how well these outputs align with a professional’s expertise. The research questions introduced in this paper can be formulated as follows:
What is the potential of AI in the aesthetic assessment of a built environment?
How does it compare to the subjective perception of AEC experts?
Are there significant differences between AI models and is there a more suitable MLLM for the task of evaluating the quality of a built environment?
For this study, we explored the potential of MLLMs in assessments related to the built environment. We focused on three state-of-the-art and widely used MLLMs: ChatGPT-4o, Gemini 2.0 and Grok 3. We applied these AI models to automate the analysis of four cityscapes represented in a virtual environment, in the context of a real estate project under development. Specific aspects of architectural ambiance were selected: space and scale, degree of enclosure, architectural style and overall feelings. For each aspect, a set of possible attributes was proposed to characterize the mood evoked by the built environment. We incorporated these quality criteria, derived from the literature, into input prompts fed to the MLLMs, asking them to assess scene images extracted from each urban identity. Additionally, AEC professionals who were part of the real estate development were asked to evaluate the same qualities to compare their assessments with those of the MLLMs. This comparison helped reveal both the potential and the limitations of MLLMs in evaluating the quality of public spaces.
3. Materials and Methods
3.1. Case Study
In the context of a real estate construction project, several architecture firms (Atelier M3, Leclercq Associés, Atelier 2/3/4, Agence Pietri and BASE) joined efforts to produce building information models (BIMs) of buildings, roads, sidewalks, landscape and urban furniture for a future 20-hectare borough in the south of Paris, France. Benefiting from the collaboration of the client of the real estate project, SEMOP Châtenay-Malabry Parc-Centrale, we collected these BIMs and assembled them in a 3D city model for visualization purposes. Four virtual reality visits were produced for different parts of the borough, as illustrated by Figure 1.
The virtual environment simulated a 3D architectural representation of several parts of a peri-urban borough on a sunny day. The colors are bright and clear, as they often are in a real estate project presentation. The images were taken at street level, which allows the buildings to be seen in their urban context. The slightly low-angle perspective highlights the height and structure of the buildings.
Scene 1 is the business district of the borough. It is made up of three main buildings: The first building has an angular and asymmetrical shape, with a largely glazed lower part and an upper part made up of several floors with regular rectangular windows. There is a significant gap between the glass base and the upper floors, creating an interesting visual effect. The regular arrangement of the windows of the upper floor gives an orderly appearance to the upper part of the building, contrasting with the freer and more open appearance of the glazed base. The base of the building is made up entirely of large bay windows, with an open space inside. This transparency also allows visual interaction with the external environment. There is variation in the heights and shapes of the buildings. The central one is distinguished by a canopy-shaped entrance, creating a dynamic and original visual effect. This variation breaks the monotony and brings a strong identity to the whole. A fence is also visible in the foreground, delimiting a private business space. Large glass surfaces are present on the facades, suggesting a desire to maximize the supply of natural light inside the buildings. The buildings are organized around a central pedestrian alley with light vegetation. This organization promotes circulation and interactions between the different buildings. There is a difference in height between the buildings, which energizes the whole.
Scene 2 illustrates a strip mall, an ensemble of shops at the ground floor of a series of buildings with different styles. The buildings have a contemporary style with clean lines and mostly white facades. We can see a variety in the heights and depths of the buildings, which energizes the perspective. Some buildings have balconies with glass or metal railings. The perspective is accentuated by the layout of the buildings and the vanishing line of the street, giving an impression of depth. The ground floor is clearly intended for commercial premises, with large windows. The spaces are available for rental to businesses. Some access doors are framed in bright colors, such as orange, which brings a touch of dynamism to the whole. It is a typical urban environment with the presence of a sidewalk, a curb and urban design elements such as bicycle racks and a fenced area. The presence of trees indicates a desire to integrate some greenery into the urban environment. Overall, the images give an impression of modernity, functionality and commercial potential. The architecture is simple but efficient, and the commercial spaces on the ground floor provide opportunities for various activities.
Scene 3 revolves around a typical suburban residential area. The buildings feature a modern style with clean lines and rectangular shapes. The emphasis is on simplicity and functionality. Materials are in a light, almost white color, which gives a clean and bright appearance. The windows are highlighted by black frames, creating a strong contrast. Black metal railings on balconies and windows also contribute to the contemporary style and provide functional safety. The arrangement of rectangular windows of different sizes creates an interesting visual rhythm on the facade. The balconies are clearly visible and appear spacious. They provide private outdoor space for residents. The most distinctive elements are the round corner buildings located on the street corners. Their architecture is contemporary, with curved lines and balconies that soften the angles. The public space in front of the buildings is treated with care, with trees planted along the street to provide shade and greenery. The linear perspective of the street invites the gaze towards the background of the image.
Scene 4 gives an illustration of a green urban identity. It is a more secluded residential complex, with more vegetation and a recreational area. The buildings have a contemporary design with clean lines, flat roofs and large windows. This feature softens the overall look and gives it a more organic aesthetic. The dominant colors are white and gray, creating a neutral and sophisticated look. The facades are mainly white, which gives them a bright appearance and highlights the openings. We observe an alternation of rectangular windows of different sizes, some with balconies or loggias, which energizes the composition. The ground is paved with light-colored slabs, creating a clean and uniform surface. There are numerous young, leafy trees scattered throughout the scene, indicative of a green space and recreational area. The buildings are situated in an open area, suggesting a spacious and potentially public plaza favorable for social interactions. The open space and the presence of trees were planned to create a pleasant and livable environment. The overall impression meant to be conveyed is one of a well-planned and aesthetically pleasing urban development.
3.2. Methodology
The experimental protocol is illustrated in Figure 2. The survey was conducted with n = 6 AEC experts involved in all phases of the real estate project, from the initial planning of the borough with the stakeholders to the design of the built environment and the current construction phase. The participants were three women and three men aged 32 to 45 years old, all taking part voluntarily. Each participant was sent a link to an online questionnaire made up of five main parts. Participants could access the questionnaire from a computer, laptop or handheld device as long as they had an internet connection.
Four of the questionnaire's parts were identical, each featuring a 1 min video of a virtual tour. The complete protocol is described in [15]. Participants were invited to imagine themselves visiting the urban space and encouraged to express their feelings regarding scale and size, degree of enclosure and architectural style during the virtual tour. The same characterizations were made by ChatGPT (GPT-4o), Gemini 2.0 and Grok 3 based on the images illustrated in Figure 1. From a textual prompt we provided, they output their understanding of how humans are likely to perceive and react to these visual cues. Several aspects of architectural aesthetics were investigated for each urban scene. For each one, a set of possible mood attributes was proposed; these are summarized in Table 1. We selected space and scale and degree of enclosure because they are classical elements of urban design quality [5, 32], and added architectural style and overall feelings from the protocol of Gomez et al. [15, 16]. Other aspects such as imageability, transparency and complexity were not investigated because we wanted to focus our analysis on the response of AI rather than on urban design qualities.
ChatGPT was developed by OpenAI, Gemini by Google DeepMind and Grok by xAI. The exact implementations of ChatGPT, Grok and Gemini are proprietary to these companies, but some technical details are disclosed in the literature or by questioning the MLLMs directly. An MLLM does not use a single discrete algorithm; its analysis results from the interplay of several machine learning techniques and models, drawn primarily from Natural Language Processing and computer vision. It relies heavily on deep neural networks, typically convolutional or transformer-based encoders for vision tasks and attention-based transformer LLMs for text processing. Typically, the image is fed into the computer vision system, relevant features such as shapes, objects and textures are extracted, and the system analyzes the relationships between these features to understand the overall scene. It connects the visual elements to abstract concepts presented in the textual prompt provided by the user. For example, the presence of tall buildings and a narrow street suggests a feeling of enclosure. The LLM generates a coherent and descriptive text based on the information extracted from the image, structuring the text in a logical and informative way. For each aspect, ChatGPT (GPT-4o), Grok 3 and Gemini 2.0 proposed a single attribute from the available choices shown in Table 1, with an argumentative explanation of the choice they made.
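The closed-choice questioning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute lists are those named in the text (Table 1 may contain more), while the prompt wording and the helper names `build_prompt` and `parse_choice` are hypothetical and do not reproduce the prompts actually submitted to the models.

```python
from typing import Optional

# Aspects and attribute choices as named in the text (Table 1 may list more).
ASPECTS = {
    "space and scale": ["Balance", "Grandeur", "Restlessness"],
    "degree of enclosure": ["Protection", "Calmness", "Freedom", "Animation"],
    "architectural style": [
        "Elegance/Satisfaction",
        "Simplicity/Serenity",
        "Eccentrism/Surprise",
    ],
    "overall feelings": [
        "Indifference/Unnoticed",
        "Joy/Theatricality",
        "Sadness/Nostalgia",
        "Emotion/Spirituality",
    ],
}


def build_prompt(aspect: str) -> str:
    """Return a closed-choice question for one aspect (hypothetical wording)."""
    options = ", ".join(ASPECTS[aspect])
    return (
        "Imagine you are visiting the urban environment shown in the images. "
        f"Regarding {aspect}, which single feeling best describes the space: "
        f"{options}? Choose exactly one option and justify your choice."
    )


def parse_choice(aspect: str, reply: str) -> Optional[str]:
    """Return the listed attribute mentioned earliest in a free-text reply."""
    text = reply.lower()
    best = None  # (position in reply, attribute label)
    for option in ASPECTS[aspect]:
        # A reply may cite only one half of a paired label such as
        # "Simplicity/Serenity", so each half is searched separately.
        for part in option.lower().split("/"):
            pos = text.find(part)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, option)
    return best[1] if best else None
```

Keeping the attribute lists in one dictionary makes it straightforward to put the same closed question to every model and to map each free-text justification back onto a single attribute for comparison with the expert votes.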
4. Results
The instruction given to the chatbots and AEC experts for each of the four virtual visits was to imagine visiting the urban environment shown. We asked them to characterize this space according to four factors: (1) space and scale, (2) enclosure, (3) architectural style and (4) general feelings.
The first aspect of architectural aesthetics is space and scale. The humans and MLLMs were asked which of three feelings they experienced most strongly: Balance, Grandeur or Restlessness. Figure 3 displays the results obtained for all the virtual visits. In three out of four virtual visits, the chatbots and the AEC experts were predominantly in agreement. The office district mostly evoked a sense of Grandeur among AEC experts and Grok, whereas ChatGPT and Gemini, aiming to reflect the likely experience of someone visiting this urban district, characterized it as Balance, agreeing with a minority of the experts.
For this urban identity, ChatGPT [33] argued that “The proportionate design of the buildings and the openness of the space offer a harmonious feel”, while Gemini [34] stated that “The urban environment appears harmonious, with a mix of open spaces and built structures that create a sense of equilibrium”. They made a similar analysis, as both cited the impression of open space and harmony. These arguments are convincing; however, they do not explain why the space shown does not feel vast and grand, although vastness and grandeur are among the first things one notices in the images illustrated in Figure 1a. This impression was captured by Grok [35], who stated that “The buildings in the images are tall, modern, and have a significant presence with their sleek, angular designs and large glass facades. (…) I would choose Grandeur as the feeling, as the architecture and layout suggest a sense of awe and majesty due to the height and expansive design of the buildings.”
ChatGPT, Gemini and Grok classified the other three scenes as Balance, agreeing with the majority of experts, but with different arguments. ChatGPT felt a sense of proportion and uniformity in the strip mall, harmony and a well-planned atmosphere in the urban residences and a sense of order and harmony in the green residences, without overwhelming Grandeur or Restlessness in any of the scenes. In the strip mall scene, Gemini deemed the buildings to be of moderate height, not towering or imposing, and the street widths to be comfortable, neither excessively wide nor narrow. In the urban residences scene, Gemini argued that the presence of trees and a pedestrian-friendly street level contributed to a sense of balance between the built environment and human scale. In the green residences, it viewed the urban environment as well proportioned, with a human-scale design. It did not judge any of the scenes to be vast and grand or cramped and restless. Grok argued that the scale feels harmonious and well proportioned in the strip mall and green residences, and feels proportionate and human scaled, neither cramped nor grandiose, in the urban residences.
Figure 4 illustrates the results obtained for the degree of enclosure aspect. Four options were available: Protection, Calmness, Freedom and Animation. ChatGPT and Gemini chose the same attribute, Calmness, for all the scenes, whereas expert opinions showed more variety. Once again, the business district showed the largest differences between AI and human feelings; experts mostly expressed a feeling of Animation or Freedom, and not one of them felt Calmness. ChatGPT and Gemini based their choice on the clean lines, orderly arrangement and well-defined boundaries, which convey security and tranquility. Only Grok chose Animation for the degree of enclosure, as the space felt dynamic and engaging to it, with a sense of activity and movement encouraged by the design. Grok argued that the balance between openness and the presence of the buildings suggested a lively yet contained atmosphere.
Scene 2, showing the strip mall, and scene 3, showing the urban residences, triggered a lot of disagreement between experts, with no clear emotion winning the vote tally. This is a typical result when dealing with the subjective assessment of degree of enclosure. The only virtual visit where the opinion was mostly shared was the visit to the green residences, with 83% of experts voting for Calmness, which was to be expected in an environment with a lot of vegetation. ChatGPT and Gemini based their “Calmness” assessment mostly on the orderly arrangement of trees and the open space of the walkways but, across all the virtual visits, their opinions on this enclosure aspect lacked real variety. Put simply, they had difficulty finding creative arguments on such a subjective aspect. It appears that the characterization of degree of enclosure by AI-based methods from a visual scene has room for improvement.
For the urban scene of the strip mall, Grok chose Animation as the degree of enclosure, as the space felt dynamic and inviting, with a sense of activity encouraged by the commercial ground-floor spaces and pedestrian-friendly design. This made Grok more convincing than the other two AIs, as its argument was less creative but more factual.
Architectural style was also investigated, with a set of emotions given as possible choices: Elegance/Satisfaction, Simplicity/Serenity and Eccentrism/Surprise. The AEC expert opinions are given as the first line of results in Figure 5. The responses show a mix of Elegance/Satisfaction, Simplicity/Serenity and Eccentrism/Surprise, depending on the visit. The business district had a mix of all three attributes but was mostly perceived as eccentric, while the shopping street was dominated by Simplicity/Serenity. The urban residences had a mix of Simplicity/Serenity and some Eccentrism/Surprise, and, finally, the green residences strongly favored Simplicity/Serenity.
The business district evoked more varied perceptions. The experts mostly associated it with Eccentrism/Surprise, but the three AIs gave divergent opinions. Grok selected Eccentrism/Surprise, as did 50% of the experts, as the architecture stands out as unconventional and eye-catching, evoking a reaction of curiosity and amazement [35]. ChatGPT came second, as it selected Elegance/Satisfaction, like 33% of the experts, while Gemini chose Simplicity/Serenity, which was the third choice of the experts.
Feelings of Eccentrism/Surprise were never chosen by Gemini and ChatGPT, reflecting a certain sense of conservatism in AI responses. AI models are trained to be reliable and predictable, which might make them less likely to assign extreme or unexpected emotions. This results in a safer, more uniform response like Simplicity/Serenity even in cases where humans would perceive more variation. Grok seems to have the ability to express more original feelings.
Figure 6 illustrates the responses of AEC experts and compares them to ChatGPT, Grok and Gemini’s outputs regarding the overall feelings evoked by the four virtual visits. For this characterization of overall feelings, the AEC experts showed a less diverse emotional response to the urban spaces than for the previous factors, as they tended heavily to vote for Indifference/Unnoticed sensations, in at least 50% of cases. They considered the business district as evoking mostly Indifference (50%), but also Joy/Theatricality (33%) and some Sadness/Nostalgia (17%). The shopping street and urban residences were predominantly characterized as evoking Indifference (83%), with a small percentage (17%) of Joy/Theatricality and Emotion/Spirituality. The green residences evoked a balanced mix of Indifference and Emotion/Spirituality.
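With a panel of six experts, the reported percentages correspond to whole votes, and the majority attribute can be read off mechanically. The following minimal sketch shows that conversion for the overall feelings aspect. It assumes the rounded shares quoted above, and it reads the "balanced mix" for the green residences as an even three-to-three split; the function names `to_counts` and `plurality` are illustrative only.

```python
N_EXPERTS = 6  # panel size stated in the methodology

# Overall feelings, business district: shares reported in the text.
business_district = {
    "Indifference": 0.50,
    "Joy/Theatricality": 0.33,
    "Sadness/Nostalgia": 0.17,
}

# Overall feelings, green residences: assumed to be an even split.
green_residences = {"Indifference": 0.50, "Emotion/Spirituality": 0.50}


def to_counts(shares, n=N_EXPERTS):
    """Convert rounded vote shares back into whole numbers of votes."""
    return {label: round(share * n) for label, share in shares.items()}


def plurality(counts):
    """Return the attribute(s) with the most votes; several labels mean a tie."""
    top = max(counts.values())
    return sorted(label for label, votes in counts.items() if votes == top)


business_counts = to_counts(business_district)
green_counts = to_counts(green_residences)
```

Here `plurality(business_counts)` returns a single label, whereas `plurality(green_counts)` returns two, which is how a tie between attributes appears when a model's single answer is compared with the expert majority.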
Overall, Gemini displayed a stronger differentiation between environments than ChatGPT, but the feelings of emotion it reported for visit 1 were not expressed by the experts. AI saw the business district and the urban residences as more engaging than the experts did. This could be due to biased training data which associate urban areas with vibrancy rather than neutrality. ChatGPT and Gemini aligned with the experts’ perceptions of the strip mall, while Grok chose the minority attribute. Conversely, Grok correctly captured the feeling of the majority of the experts in visit 3, while ChatGPT and Gemini diverged from them entirely. Half of the experts found that the green residences scene evoked strong Emotion/Spirituality, but ChatGPT and Gemini classified it as purely Indifference. AI might lack an understanding of how nature and tranquility evoke deeper emotions. Grok also diverged from the experts, expressing Joy/Theatricality.
Let us explain some of the differences. In all visits, the majority of human experts did not find a relevant attribute other than Indifference because the other propositions did not fit their experience. ChatGPT and Gemini were in line with this attribute on two visits out of four, which can be interpreted as their natural inclination towards conservative assessment, leading to a neutral stance. The Grok model was inaccurate in assigning Theatricality to three of the visits, likely focusing on visual aesthetics and urban dynamism.
In fact, we found that the criterion of “overall feeling” was poorly defined [15]. The predominance of Indifference in the professionals’ opinions can likely be attributed to a lack of clarity regarding architectural ambiance as a general feeling. Participants may have struggled to choose between the given options because none fully aligned with their professional experience. Strong emotions like sadness, joy and spirituality are distinct, but subtler feelings might not have been adequately proposed.
This suggests room for improvement in the study, particularly in identifying nuanced emotional responses beyond indifference, but this design flaw can be used as a probe to expose a general behavior of AI: false invention. In contrast to the experts, the AI models appeared to be compelled to provide a distinctive response. In doing so, they generated hallucinations, fabricating interpretations and exhibiting creative tendencies while ultimately producing inaccurate outputs.
5. Discussion
5.1. Observations on AIs
On the first three aspects—scale and size, enclosure and style—Grok gave an answer corresponding to the (sometimes relative) votes of the majority of experts, while ChatGPT and Gemini were correct in 75% of the cases (nine times out of twelve). On the overall feeling aspect of architectural ambiance, ChatGPT and Gemini were correct in 50% of the visits because the question favored their tendency towards conservatism, while Grok was correct in only one visit out of four.
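The agreement rates quoted above are simple ratios of matches to questions, and a short sketch makes the arithmetic explicit. It rests on two readings of the text that should be treated as assumptions: that Grok matched the expert majority in all twelve cases of the first three aspects, and that the nine-out-of-twelve figure applies to ChatGPT and Gemini individually. The combined rate over all sixteen questions is derived here for illustration and is not a figure reported in the study.

```python
# (matches with the expert majority, number of questions) per model.
# First three aspects: 3 aspects x 4 visits = 12 questions.
first_three = {"Grok": (12, 12), "ChatGPT": (9, 12), "Gemini": (9, 12)}
# Overall feeling aspect: 4 visits = 4 questions.
overall_feeling = {"Grok": (1, 4), "ChatGPT": (2, 4), "Gemini": (2, 4)}


def rate(matches, total):
    """Share of questions on which a model matched the expert majority."""
    return matches / total


def combined_rate(model):
    """Agreement over all 16 questions (derived here, not reported in the text)."""
    m1, t1 = first_three[model]
    m2, t2 = overall_feeling[model]
    return rate(m1 + m2, t1 + t2)


for model in first_three:
    print(
        f"{model}: first three aspects {rate(*first_three[model]):.0%}, "
        f"overall feeling {rate(*overall_feeling[model]):.0%}, "
        f"combined {combined_rate(model):.2%}"
    )
```

With only six experts and sixteen questions, these rates are coarse: a single changed answer moves a model's combined score by more than six percentage points.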
This led us to conclude that AI can be used as a helpful tool when the questions asked are accurate enough, which can be challenging to achieve when dealing with subjective notions. AI models may have a bias toward predictability and simplicity in their classifications. Humans experience a wider range of perceptions based on their personal experiences and, in the case of AEC experts, occupational habits. The AIs may need more nuanced training data to better capture the diversity of human perceptions in architecture and urban design. The datasets likely emphasize descriptive and functional attributes rather than emotional reactions, which are more complex and context dependent.
The AI results may have been more accurate if we had provided examples of urban scenes for each attribute to define them more clearly, but the experiments were run as a zero-shot learning problem where, in the test phase, an AI model observed samples from classes which were not observed during training, and had to predict the class that they belonged to. It was the same condition given to the human participants, with no clear definition of the attribute being given as a preamble to the questioning. Even though they were AEC experts, they were not familiar with the aesthetic attributes proposed in the experimental protocol. It is also important to use the AI in real conditions, where a person with no training on conversational agents can input a simple prompt for a quick answer.
Finally, humans bring personal memories and cultural influences into their perceptions of spaces. AI models, on the other hand, rely on statistical patterns in text and may gravitate toward more common, neutral descriptions rather than capturing subtle emotional variations. Also, AI models rely on text-based pattern matching, which may not fully grasp why a space evokes feelings such as spirituality or nostalgia. Therefore, their outputs—despite being helpful—have to be handled with caution.
5.2. Limitations and Future Research
Our work was deliberately limited to a few aspects of urban design quality to focus on analyzing AI outputs. To address this intentional limitation, we plan to expand the set of questions to include other aspects, such as imageability, transparency and complexity, to cover the multifaceted nature of urban ambiances.
The prompts used as inputs to question the chatbots were simplified to closely match the questions submitted to participants. No explanations or definitions of any attribute were provided, requiring each participant to respond based on their personal interpretation of the propositions. This is a limitation of our study, as different prompts can lead to different AI responses [30, 31]. An extension of this study could focus on refining the prompts to help AIs better align with the professional opinions.
The number of experts involved in the study was limited because of online recruitment and the need to ensure that all participants were AEC professionals. The only reliable way to verify their expertise was to recruit individuals directly involved in the real estate project we were partnered with. In the future, we plan to enlarge this panel with the participation of stakeholders and urban planning specialists.
For each visit, three points of view were arbitrarily selected to illustrate the portion of the neighborhood under study. We chose the most relevant point of view with a slightly low-angle perspective to highlight the height and structure of the buildings. However, different viewpoints might generate alternative perspectives on the scene, potentially leading to different AI evaluations. This assertion warrants further investigation.
The virtual environments were generated in daytime with clear weather; the nighttime context and seasons were not investigated during this study. These are important factors, as seasons have an effect on mood and nighttime has an effect on the sensation of security. Exploring these aspects could be a valuable direction for future research.
6. Conclusions
We investigated the capability of MLLMs in perceiving urban environments based on images and textual prompts. We compared the outputs of three popular models, ChatGPT, Gemini and Grok, with the visual assessments of AEC experts within the context of a real estate construction project.
Our analysis was based on subjective attributes used to characterize various aspects of the built environment. Four urban identities served as case studies in a virtual environment designed from professional 3D models.
We found that human and AI evaluations aligned on certain aspects, such as space, scale and architectural style, and showed greater agreement in environments with vegetation. Grok was notably more in line with the opinions of the experts in aesthetics assessment for the four urban identities, often showing less uniform judgment than ChatGPT or Gemini. However, notable differences existed in the response patterns, particularly regarding subjective aspects like the general feelings evoked by some urban identities. This raises questions about generative AI hallucinations, where the AI invents information and exhibits creative behavior yet produces inaccurate outputs.
AI-powered tools offer the potential to revolutionize public space audits by providing quick assessments and generating creative ideas. However, more research is needed to determine how well they capture real-world experiences and align with expert evaluations in urban planning and design. MLLMs offer valuable insights and can assist in the preliminary assessment of architectural projects, but expert evaluation remains essential to capture the nuances of human perception. AI models are trained on datasets that may emphasize functional descriptions of spaces rather than the emotional impact of urban design.
AI models currently prioritize consistency over diversity in perception, leading to a bias toward neutral and predictable descriptions such as Balance, Calmness or Simplicity/Serenity. With more specialized training datasets, multimodal learning and personalization, future AI models could more accurately reflect the range of human emotions tied to urban environments. For instance, AIs could be fine-tuned with human survey data, psychological studies on architecture and emotional reactions to spaces to better capture diverse perceptions.