The Effect of Audiovisual Environment in Rail Transit Spaces on Pedestrian Psychological Perception

Zhang, Mingli; Zou, Xinyi; Hu, Xuejun; Xie, Haisheng; Han, Feng; Meng, Qi

doi:10.3390/buildings15091400

by Mingli Zhang ¹, Xinyi Zou ¹, Xuejun Hu ², Haisheng Xie ³, Feng Han ³ and Qi Meng ²,*

¹ Suzhou Rail Transit Technology Innovation Research Institute Co., Ltd., Suzhou 215000, China

² Key Laboratory of Cold Region Urban and Rural Human Settlement Environment Science and Technology, School of Architecture and Design, Harbin Institute of Technology, Ministry of Industry and Information Technology, No. 92 Xidazhi Street, Harbin 150001, China

³ Suzhou Acoustic Technology Institute Co., Ltd., Suzhou 215513, China

* Author to whom correspondence should be addressed.

Buildings 2025, 15(9), 1400; https://doi.org/10.3390/buildings15091400

Submission received: 26 March 2025 / Revised: 16 April 2025 / Accepted: 18 April 2025 / Published: 22 April 2025

(This article belongs to the Special Issue Human-Centric Space Design: Occupant Comfort, Wellbeing, and Post-occupancy Evaluation of Multi-Scale Built Environment)

Abstract

The environmental quality of rail transit spaces has increasingly attracted attention, as factors such as train noise and visual disturbances from elevated lines can impact pedestrians’ psychological perception through the audiovisual environment in these spaces. This study first collects audiovisual materials from rail transit spaces and pedestrian perception data through on-site surveys, measurements, VR environment simulations, and custom Deep Learning (DL) models. Using cluster analysis, the environments are categorized based on visual and auditory perceptions and evaluations of rail transit stations, delineating and classifying the spaces into different zones. The study further explores the interactive effects of audiovisual environmental factors on psychological perception within these zones. The results indicate that, based on audiovisual perception, the space within 300 m of a rail transit station can be divided into three zones and four distinct types of audiovisual perception spaces. The effect of the type of auditory environment on visual indicators was smaller than the effect of the visual environment on auditory indicators, and the category of vision had the greatest effect on the subjective indicators of hearing within Zones 1 and 2. This study not only provides a scientific basis for improving the environmental quality of rail transit station areas but also offers new perspectives and practical approaches for urban transportation planning and design.

Keywords: transit spaces; audiovisual environment; soundscape

1. Introduction

Rail transit space refers to the space within a certain range around the rail transit exit area, which is generally defined as a circle centered on the station and radiating outward in the range of 500–800 m. In the field of transportation, scholars often use the concepts of TOD community, passenger attraction range, and rail transit station radiation area to elaborate [1,2,3]. In the context of the rapid construction of national rail transit, global scholars continue to pay attention to the interaction between the built environment and urban rail transit [4]. The problems of noise and visual interference of rail transit affect the psychological feelings of urban pedestrians [5,6,7]. Noise generated by rail operations, including wheel–rail friction and train vibrations, can lead to auditory discomfort and stress, while the visual impact of elevated tracks and station structures often disrupts the aesthetic continuity of urban landscapes. These issues not only diminish the quality of life for residents but also hinder the creation of pedestrian-friendly environments. While existing research has extensively studied the functional and operational aspects of rail transit spaces [3,8,9], a critical gap remains in understanding how the audiovisual environment of these spaces influences pedestrians’ psychological perception. This gap is particularly relevant given the rapid global expansion of urban rail systems and their growing role as public space hubs.

Delineating zones helps to target different design and management strategies at locations at different distances from the station. Current research on the spatial delineation of rail transit station areas has developed multiple zoning approaches, yet significant variations persist in their theoretical foundations and applications. TOD theory suggests that the vicinity of metro stations should be organized with the station as the central area, forming a layered spatial structure within a walking distance of 400–800 m, or a walking time of 5–10 min [4]. Alternative frameworks include pedestrian-shed analysis based on walking time thresholds (5/10/15-min isochrones) [2,3,10] and morphological zoning tied to land use gradients (core–transition–peripheral layers) [3]. Recent advancements have introduced behavioral-based segmentation using mobile phone data to map activity intensity patterns, while others employ space syntax to identify cognitive boundaries through visual graph analysis [11]. However, these methods predominantly focus on either physical accessibility or economic functionality, with limited consideration of multisensory perception.

Research on the pedestrian environment around rail transit stations reveals a consensus that creating a pedestrian-friendly atmosphere does not only depend on ensuring the accessibility of the road network [12,13], but also on improving the level of residents’ psychological perception through planning and layout maneuvers [3]. However, existing assessment frameworks exhibit two critical shortcomings: (1) overreliance on static environmental indicators such as sound pressure level (SPL) without capturing dynamic human–environment interactions [13,14]; (2) at the level of audiovisual interaction, the visual coherence between station buildings and their surroundings and their coupling with noise perception have not been fully analyzed yet.

The design of audiovisual environment in urban rail transit space directly affects the physical and mental health and experience of passengers [15]. Current research focuses on the operational efficiency of rail transit, station layout, etc., and lacks in-depth exploration of the impact of the environment of rail transit station space on pedestrians’ psychological feelings. In terms of acoustic environment, the current research mainly focuses on vehicle running noise [16], wheel–rail friction noise [17], and aerodynamic noise, focusing on physical noise reduction, with less investigation on psychological feeling. In terms of technology application, new environmental control technologies such as intelligent noise reduction technology, visual optimization design, etc., have problems such as high cost and difficult maintenance in practical application. The application of these environmental control technologies also needs to consider the coordination with the overall urban environment.

People’s perception of the urban environment comes from a combination of factors [18], among which the auditory and visual environments are two important aspects [19,20], and there is a coupling between the roles of vision and hearing on psychological perception [21,22]. Regarding the visual environment, the building volumes of aboveground and elevated stations often obscure the urban landscape and affect visual permeability. Residents around the aboveground stations reflect that the station buildings cause restricted views, affecting the continuity of the urban landscape. In addition, the harmonization and integrity of station buildings with the surrounding environment also affects the visual quality. However, there are fewer studies on these visual interference factors [4].

In summary, current studies primarily focus on quantitative metrics such as accessibility, flow efficiency, and noise levels, often overlooking the human-centered experience of these spaces. Three key limitations persist in the literature: (1) zoning approaches remain rigidly distance-based (e.g., 500 m buffers), failing to account for variations in perceptual sensitivity across different spatial contexts; (2) most assessments rely on physical measurements or theoretical models, lacking empirical integration of environmental psychology and behavioral responses; (3) the psychological perception of rail transit environments is typically examined through isolated factors (e.g., noise or visual obstructions), with little consideration of how auditory and visual elements interact to shape pedestrian experience.

This study aims to address these gaps by establishing a perceptual-centric framework for evaluating and designing rail transit spaces. Specifically, we seek to: (1) redefine zoning boundaries based on psychological response; (2) develop an integrated classification system for audiovisual environmental factors; and (3) quantify interaction effects between auditory and visual stimuli on pedestrian perception. By bridging the divide between physical design and psychological experience, our findings aim to inform more humane and perceptually optimized transit environments. Building on this foundation, our study unfolds across four key sections. We begin by detailing our field methodology in Section 2, where we capture real-world audiovisual data through systematic measurements and employ deep learning to extract essential environmental indicators. Section 3 then reveals how these measurements translate into meaningful patterns, answering our core questions about perceptual zoning, environmental classification, and their combined effects on pedestrian experience. Finally, Section 4 bridges research and practice, transforming these findings into design strategies while thoughtfully considering the study’s boundaries and future possibilities.

2. Materials and Methods

2.1. Survey Site

The selected rail transit line is located in Suzhou City, Jiangsu Province, China, where the rapid development of urban rail transit has triggered corresponding noise problems. This study focuses on the elevated aboveground station on Yangcheng Lake Middle Road of Suzhou Rail Transit Line 2; the main types of functional areas around the station are green areas, residential areas, and commercial areas. A grid of 30 m × 30 m was drawn with the rail station as the center. This resolution proved optimal for capturing fine-grained spatial variations in soundscape perception while maintaining measurement efficiency: it adequately resolves the 15–20 dB(A) noise attenuation gradients observed around elevated rail structures [23] and aligns with the 25–35 m visual recognition thresholds for architectural elements in transit environments [24,25]. The equivalent continuous A-weighted sound level was measured at each grid intersection for 3 min, and a soundwalk study was conducted. At each point, photographs were taken and 3 min long audio recordings were made both when a train was passing by and when no train was passing by. These materials were used to reproduce the audiovisual environment in the laboratory, to obtain the subjects’ subjective evaluations, and to analyze the visual and auditory environmental indicators. The procedure was carried out in the morning from 9:00 to 10:00 and repeated at night from 20:00 to 21:00, when traffic was low.

2.2. Panoramic Video

Panoramic video, binaural audio, and sound pressure level were measured simultaneously with the instrumentation arrangement shown in Figure 1a. The panoramic video was shot with an Insta360 panoramic camera (manufactured in Shenzhen, China) at the corresponding time and point, with the angle of view parallel to the railroad line and the shooting range matched to the range of human vision, in which the horizontal field of vision is 120° (Figure 1b). Vision is accurate only within the limited area on which the eyes focus. In the vertical plane, the eye can see 45 degrees upwards and 65 degrees downwards when necessary [26]. The SQobold binaural recording system captured spatially accurate 3D audio samples at each measurement location. Its wide dynamic range (117 dB) and flat frequency response (10 Hz–20 kHz) preserved the spectral characteristics of rail noise essential for creating ecologically valid VR auditory stimuli, including directional cues and distance effects. Table 1 details the key devices and their technical applications used for acoustic environment measurements and virtual reality (VR) simulations in this study, covering the entire tool chain from field data acquisition to laboratory simulation.

A total of 32 audiovisual segments were recorded, of which 12 were residential environments, 10 were commercial spaces, and 10 were green spaces. It should be noted that these spatial-functional categorizations represent only the functions of the sites around the points, and do not represent the main composition of the visual or auditory material.

Audio-video processing: For each recording, a segment of about 30 s containing a rail transit pass-by was extracted, together with a segment of the same length recorded at the same point and in the same time period without a pass-by. The audio and video were then combined into a single video file in the editing software Adobe Audition 2023. The videos were recorded during the daytime; the nighttime recordings served mainly as a control group with fewer vehicles.

2.3. Acoustic Measurement

Sound pressure levels were measured using a BSWA 801 multifunction sound level meter (test accuracy: 0.1 dBA). In accordance with the outdoor sound environment test standard (ISO 10847:1997) [27], the test was conducted on a sunny day with wind speed lower than 2 m/s, the sound level meter was placed at a height of 1.5 m, and the measurement time at each test point was 2 min. Prior to each measurement session, the BSWA CA111 acoustic calibrator was used to verify the sound level meter’s performance by generating a reference 94 dB tone at 1 kHz. This calibration process ensured measurement consistency within a ±0.3 dB tolerance, maintaining data reliability throughout the field study under varying environmental conditions. The equipment used for testing is shown in Table 1. The two-minute equivalent continuous A-weighted sound level at each point is shown in Figure 1f. The levels ranged between 65.4 and 78.5 dB(A), were higher close to the railroad station and the main road on which it is located, and decreased gradually with increasing distance towards the surrounding area [28].
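The equivalent continuous level reported above is an energy average rather than an arithmetic mean of decibel values. The following minimal sketch illustrates that calculation for equally spaced A-weighted samples; the function name and the sample values are hypothetical and are not taken from the study's measurements.

```python
import math


def equivalent_level(levels_db):
    """Energy-average a sequence of equally spaced A-weighted levels in dB."""
    if not levels_db:
        raise ValueError("at least one sample is required")
    # Convert each level to relative sound energy, average, convert back to dB.
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)


# Hypothetical one-second samples: steady background plus a short train pass-by.
samples = [60.0] * 100 + [80.0] * 20
print(round(equivalent_level(samples), 1))
```

Because the average is taken over energy, a short loud event such as a pass-by raises the result far more than it would raise an arithmetic mean of the same samples.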

2.4. Questionnaires

The experiment employed a systematically designed questionnaire comprising four components: (1) an audiovisual perception questionnaire, (2) an acoustic environment evaluation, (3) a visual environment evaluation, and (4) an EmojiGrid for holistic station assessment. The development process rigorously followed standardized protocols from ISO 12913-3 for soundscape evaluation [29] and incorporated validated visual assessment frameworks from urban design studies [30,31]. Table 2 outlines the questionnaire components, specific questions, response scales, and their research objectives.

The audiovisual perception questionnaire for rail stations used an 11-point scale, on which the subjects rated the extent to which they perceived the rail and rail noise to be dominant in the environment, with 0 representing not seeing the rail or hearing rail noise at all, and 10 representing the rail or its noise being dominant.

The evaluation indexes of the soundscape evaluation questionnaire refer to the perceptual and semantic dimensions of the ISO 12913-3 soundscape evaluation. The auditory perception questions include: acoustic comfort, using a 5-point scale, with 1 being extremely uncomfortable and 5 being extremely comfortable; perceived loudness, using a 5-point scale, with 1 being extremely quiet and 5 being “deafening”; and semantic evaluation, with three semantic dimensions, i.e., “eventfulness–non-eventfulness”, “pleasantness–annoyance”, and “vitality–boringness”. Together with “loudness–quietness”, the semantic evaluation and the sound descriptors were thus evaluated in four semantic dimensions. These dimensions were evaluated on a 7-point scale, where −3 represents the most compatible with the left description, 0 represents neutral, and 3 represents the most compatible with the right description.

The visual evaluation covers overall quality (including “comfortable” and “beautiful”), spatial impression (including “open” and “depressing”), and richness (including “wealthy” and “boring”). A 5-point scale was used, with 1 being not at all consistent with the description and 5 being fully consistent with the description.

For the overall evaluation of the audiovisual environment, the EmojiGrid (Figure 2) was used as a self-report tool to assess the subjects’ valence and arousal. The EmojiGrid is a visual tool designed to assess emotions using a coordinate system in which the verbal labels are replaced with emoji that depict facial expressions [32]. The advantage of this kind of evaluation is that it avoids experimental errors caused by subjects’ biased understanding of the evaluation words [33,35]. In addition, since the semantic dimensions of the visual and auditory environments have already been evaluated in the experiment, evaluating the overall environment in a similar way would be susceptible to interference from those earlier evaluations; a differentiated evaluation method was therefore chosen to minimize the interference. The EmojiGrid consists of two dimensions, valence and arousal, which are generally recognized in psychology as two dimensions of human psychological feelings. In this experiment, to make the EmojiGrid easier for subjects without a psychology background to understand, “valence” was replaced by “pleasantness”.

2.5. VR Experiment

The participants were asked to complete an audiovisual perception experiment in the VR lab using HTC Vive Pro Eye VR systems with professional audio equipment. The lab configuration of VR simulations and a picture of a participant during VR simulation are shown in Figure 3a and Figure 3b respectively. A total of 42 participants evaluated the audiovisual environments under standardized lighting (5000 K, 300 lux) and background noise conditions (≤25 dB(A)) [34], with the sample size determined through power analysis to ensure adequate statistical validity (power = 0.82 for η² ≥ 0.15) [36]. A person’s age, gender, and education can influence audiovisual perception [37,38]. To ensure a representative sample for the VR-based audiovisual perception experiment, 42 participants (21 male, 21 female) were recruited through stratified sampling across age groups (18–25, 26–35, 36–45 years) and educational backgrounds (high school, undergraduate, postgraduate). All participants were residents living within 2 km of elevated rail transit stations ensuring familiarity with the study context. Prior to selection, candidates were screened to ensure normal or corrected-to-normal vision and hearing (Snellen chart <20/40 [39], pure-tone audiometry ≤25 dB HL [40]).

AKG K712 Monitoring Headphones were used to reproduce the binaural recordings with exceptional clarity (10 Hz–39.8 kHz frequency response) during the VR experiments. The 32 audiovisual environments recorded at different distances from the rail transit station were recreated in VR, with each video lasting 3 min. Subjects scanned a QR code to fill out the audiovisual perception questionnaire and marked their pleasantness and arousal perceptions of the overall audiovisual environment on the EmojiGrid provided by the experimenter. Depending on the position of the marker in the coordinate system, values for both dimensions are output, each taking values between −1 and 1. The equipment used for testing is shown in Table 1.
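How a marker position on the EmojiGrid becomes two values between −1 and 1 can be illustrated with a simple linear mapping. This is a minimal sketch under stated assumptions: the marker is given in pixel coordinates with the origin at the top-left corner of the grid image, pleasantness runs left to right, and arousal runs bottom to top. The function name and the pixel convention are hypothetical and may differ from the tool actually used.

```python
def emojigrid_scores(x_px, y_px, width_px, height_px):
    """Map a marker position on the grid image to (pleasantness, arousal)."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("grid dimensions must be positive")
    if not (0 <= x_px <= width_px and 0 <= y_px <= height_px):
        raise ValueError("marker lies outside the grid")
    # Horizontal axis: left edge -> -1, right edge -> +1.
    pleasantness = 2 * x_px / width_px - 1
    # Vertical axis: image y grows downwards, so flip it (top edge -> +1).
    arousal = 1 - 2 * y_px / height_px
    return pleasantness, arousal


# A marker in the upper-right quadrant: pleasant and fairly aroused.
print(emojigrid_scores(450, 150, 600, 600))
```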

2.6. Extraction of Audiovisual Environment Indicators

In the extraction of audiovisual environment indicators, a multimodal research approach is used. The methods used for indicator extraction are shown in Figure 4. For street view images, a Faster R-CNN (region-based convolutional neural network) model is used to extract key indicators, identify and classify relevant elements in the cityscape. To analyze the spatial context, PSPNet (Pyramid Scene Parsing Network) [41] is used for semantic segmentation to delineate different objects and regions in the image. For planimetry, distance measurement is performed using a Feedforward Neural Network (FNN) and semantic segmentation is performed using PSPNet. The distance is analyzed using a feedforward neural network model that includes two hidden layers and an output layer. The input features include the latitude and longitude of the point and the latitude and longitude of the station, and the output is the distance from the point to the station. The mean square error is used as the loss function and the model is trained using the Adam optimizer.
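The text does not state how the reference distances used to train the feedforward network were obtained. One plausible source of such targets, offered here only as an assumption, is the great-circle (haversine) distance between a measurement point and the station. The function name and the coordinates below are hypothetical placeholders, not the study's data.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# Hypothetical station and measurement point (placeholder coordinates).
station = (31.3300, 120.6000)
point = (31.3310, 120.6010)
print(round(haversine_m(*point, *station), 1))
```

Pairs of (point latitude, point longitude, station latitude, station longitude) with distances computed this way would match the input and output layout described for the network, which is then fitted by minimizing the mean square error.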

For audio data processing, the recordings are classified by a convolutional neural network (CNN) classification model to identify and classify noise sources. Since the study addresses the rail transit space, the training dataset contains images with rail transit stations and recordings containing rail transit noise. This approach extracted audiovisual environment indicators and analyzed the audiovisual dynamics within the rail transit space.

The visual indicators used for the analysis are shown in Table 3, which includes the subjective evaluations of the subjects (taking values from −3 to 3), semantic segmentation indicators (taking values from 0 to 1), and object detection indicators (taking values from 0 to 1). The auditory indicators used for the analysis are shown in Table 4, which includes the subjective evaluations of the subjects (taking values from −3 to 3), audio classification indicators (taking values from 0 to 1), and sound level indicators, namely Overall_Leq (taking values from 0 to 1), Lmin, and Lmax.
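An indicator that takes values from 0 to 1 can be read as the share of an image occupied by one class. The sketch below shows how such a proportion could be derived from a per-pixel label mask of the kind a semantic segmentation model outputs. The mask, the class names, and the function name are hypothetical, and the study's actual indicator definitions are those given in Table 3.

```python
from collections import Counter


def class_ratios(label_mask):
    """Return the fraction of pixels assigned to each class in a 2D label mask."""
    counts = Counter(label for row in label_mask for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the mask contains no pixels")
    return {label: n / total for label, n in counts.items()}


# Tiny hypothetical mask (2 rows x 3 columns) standing in for a segmented image.
mask = [
    ["sky", "sky", "tree"],
    ["road", "tree", "tree"],
]
print(class_ratios(mask))
```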

3. Results

3.1. Division of Zones

We refer to the ring buffer area centered on the station, including the pedestrian catchment area (PCA) derived from simulating actual walking paths, as the “station domain” [42].

The classification of the audiovisual environment of rail transit space is of great significance in transit-oriented development (TOD) because it directly affects the environmental quality, design strategy, and planning direction of transit station-centered development projects. This classification provides a scientific basis for acoustic environment optimization, visual environment design, spatial layout planning, and synergistic integration of the environment and transportation network [43]. Through this classification, urban planners and designers can better optimize the environmental quality around the station, design an audiovisual environment that meets the needs of passengers, plan a spatial layout that is compatible with the TOD concept, and create a transportation spatial network that is in harmony with the overall urban environment [44]. The use of clustering methods in mathematics can enhance the effectiveness of rail transit classification by extracting meaningful patterns and objectively classifying transit systems.

The relationships between distance and auditory perception and between distance and visual perception are shown in Figure 5a and Figure 5b, respectively. Subjects’ audiovisual perceptual dominance data about the railroad station in each audiovisual environment were imported into Past4 [45] for principal component analysis and cluster analysis. Cluster analysis with the Paired Group Algorithm, based on Euclidean distance, showed that the subjects hardly perceived the sound of passing trains beyond 200 m from the railroad station, but still perceived the railroad station visually at 300 m.
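The paired group algorithm is average-linkage agglomerative clustering: at each step, the two clusters with the smallest mean pairwise Euclidean distance are merged. The following minimal sketch illustrates the idea in plain Python; it is not the Past4 implementation, and the sample ratings (visual dominance, auditory dominance on the 0–10 scale) are hypothetical.

```python
import math


def paired_group(points, n_clusters):
    """Average-linkage (paired group) clustering on Euclidean distance.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def average_distance(a, b):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Hypothetical (visual dominance, auditory dominance) ratings for six points.
ratings = [(9, 9), (8, 9), (5, 6), (4, 5), (0, 2), (1, 1)]
print(paired_group(ratings, 3))
```

With these made-up ratings the six points fall into three groups of two, analogous to grouping measurement points by how strongly the station dominates what is seen and heard.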

Through the cluster analysis, the points around the rail station can be roughly divided into three zones (Figure 5c,d). The first zone is within 50 m of the rail station (hereinafter referred to as Zone 1), the second zone is from 50 m to 150 m from the rail station (hereinafter referred to as Zone 2), and the third zone is from 150 m to 300 m from the rail station (hereinafter referred to as Zone 3); see Figure 5e.
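The zone boundaries can be expressed as a simple lookup by distance. In this minimal sketch the function name is hypothetical, and assigning a point lying exactly on a boundary (50 m or 150 m) to the inner zone is an assumption, since the text does not specify it.

```python
def zone_for_distance(distance_m):
    """Return the zone number (1, 2 or 3) for a distance from the station.

    Returns None beyond 300 m, which lies outside the zoned area.
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m <= 50:
        return 1
    if distance_m <= 150:
        return 2
    if distance_m <= 300:
        return 3
    return None


# Points spaced along one axis of the 30 m grid.
print([zone_for_distance(d) for d in range(0, 331, 30)])
```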

3.2. Visual and Auditory Perception

The 30 m × 30 m grid measurements revealed distinct spatial patterns in equivalent sound levels (LAeq), with daytime values ranging from 65.3 dB(A) in Zone 1 (0–50 m) to 58.7 dB(A) in Zone 3 (150–300 m), while nighttime levels dropped by 4.2–6.8 dB(A) across zones. Visual metrics derived from panoramic images showed that Zone 1 had the highest rail structure visibility (78% of images contained dominant rail elements), decreasing to 12% in Zone 3. Green space coverage increased with distance from the station, from 15% (Zone 1) to 42% (Zone 3).

Through cluster analysis of pre-experimental data from 32 sampling points, we confirmed that the visual environment of urban rail transit spaces can be effectively classified into three distinct categories: (1) natural environments with positive evaluation (VT1), (2) artificial environments with positive evaluation (VT2), and (3) artificial environments with negative evaluation (VT3). The clustering was performed using standardized comfort ratings (5-point scale) and green coverage ratios (50% threshold) as primary variables, yielding high silhouette coefficients (0.68 ± 0.07) indicating strong category separation. While these two parameters proved sufficient for robust classification in our rail transit context, the pre-experiment also revealed meaningful correlations between the established categories and unmeasured morphological factors—VT2 sites consistently exhibited moderate building density (observational FAR estimate: 1.8–2.2) and organized spatial patterns, whereas VT3 areas showed more chaotic urban forms. This classification framework provides a reliable foundation for perceptual environment assessment while acknowledging opportunities for future enhancement through additional built environment metrics.

The sound environment classification captures key spectral characteristics by comparing daytime (high traffic) and nighttime (low traffic) conditions with/without passing trains. Daytime recordings show dominant mid–high frequencies (2–8 kHz) from combined road and rail noise, while nighttime features stronger low frequencies (63–250 Hz) from structural vibrations and distant traffic. Trains passing by introduce distinct high-frequency components (>800 Hz), creating different spectral patterns: broadband masking during daytime versus clear low-high frequency peaks at night. This operational approach effectively represents perceptually relevant spectral differences, though future work could add detailed octave-band analysis [19,20].

Analysis of variance (ANOVA) is a statistical method that determines whether the mean values of two or more groups differ from one another. Before performing t-tests and ANOVA, the data were checked against the test assumptions. First, normality was tested with the Shapiro–Wilk test (normality is supported when the W statistic is close to 1 and the p-value > 0.05). Second, homogeneity of variance across groups was tested with the Levene test (supported when the F statistic is small and the p-value > 0.05). Third, data independence was ensured through the design of the experiment, and outliers were identified using the Z-score (values with |Z| < 3 were retained). Lastly, when the ANOVA was significant, the multiple comparisons were corrected using the Bonferroni correction, the Tukey HSD test (p-value corresponding to the q-statistic < 0.05), or the Holm correction to ensure the reliability of the results.
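Two of the corrections named above, Bonferroni and Holm, can be written in a few lines, which makes their difference concrete: Bonferroni multiplies every p-value by the number of comparisons, whereas Holm applies a step-down multiplier and is therefore less conservative. This is a minimal sketch with hypothetical function names and made-up p-values, not the study's analysis script.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
print(bonferroni_adjust(raw))
print(holm_adjust(raw))
```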

The results of the ANOVA test examining the relationships between different types of visual environments and ratings of pleasantness and arousal are presented in Table 5. The analysis revealed that in Zone 1, variations in visual environments without railroad noise showed significant associations with pleasantness ratings but not with arousal measures. When railroad noise was present, significant correlations were observed between the auditory stimulus and pleasantness ratings across all three visual environment types, while arousal measures only demonstrated significant associations in VT2 environments (those with positive visual evaluations). In Zone 2, visual environment differences were significantly correlated with both pleasantness and arousal ratings, with stronger associations observed for pleasantness. The presence of railroad noise showed significant relationships with pleasantness ratings in VT1 and VT2 (positive visual) environments, and with arousal measures in VT2 and VT3 (artificial visual) environments. For Zone 3, only the VT2 visual environment type showed a significant association with pleasantness ratings, with the strength of this relationship reaching a medium effect size according to standard benchmarks. These zonal patterns in the audiovisual relationships suggest that the interplay between visual environments and auditory stimuli follows distinct spatial gradients in their associations with affective responses.

Figure 6 displays the variations in pleasantness and arousal ratings across the three visual environment types during evening conditions, comparing scenarios with and without railroad traffic. Independent samples t-tests revealed several significant associations between railroad sound presence and affective responses in different zones (Table 6). In Zone 1, statistically significant relationships were observed between the auditory stimulus and both pleasantness and arousal measures when paired with VT1 or VT2 visual environments. Zone 2 demonstrated significant associations between railroad sound and affective responses in specific visual contexts: both pleasantness and arousal showed significant correlations in VT2 environments, while only pleasantness was significantly associated in VT1 settings and only arousal in VT3 conditions. For Zone 3, the analysis found no statistically significant associations between the presence of railroad sound and either pleasantness or arousal ratings across all visual environment types. These zonal patterns in the audiovisual relationships suggest that the perceptual integration of visual and auditory stimuli follows distinct spatial gradients in their association with affective responses.

Figure 7 displays the variations in pleasantness and arousal ratings across the three visual environment types during daytime conditions with vehicular traffic noise, comparing scenarios with and without railroad traffic. Independent samples t-tests revealed several significant associations between railroad sound presence and affective responses in different zones (Table 7). The analysis revealed that the magnitude of associations between railroad sounds and affective responses was generally weaker during daytime compared to nighttime conditions. In Zone 1, statistically significant associations were observed between railroad sounds and pleasantness ratings in both VT1 and VT3 visual environments, though these relationships were less pronounced than during nighttime conditions. Additionally, a significant association was found between the auditory stimulus and arousal measures specifically in VT2 environments. For Zone 2, railroad noise showed a significant association with pleasantness ratings exclusively in VT3 visual environments. In Zone 3, the only significant relationship emerged between railroad sounds and pleasantness ratings in VT2 settings. These temporal and zonal patterns in the audiovisual relationships suggest that the presence of background vehicular traffic noise may modulate how rail transit sounds associate with affective responses across different visual contexts.

3.3. Interactive Effects

Interactions among audiovisual indicators were analyzed by Analysis of Variance (ANOVA) [46], with effect sizes used as quantitative indicators of the magnitude of the effect of audiovisual environment categories on each indicator. For one-way ANOVA, the effect size η² is interpreted as follows: η² ≥ 0.01 indicates a small effect, η² ≥ 0.06 a medium effect, and η² ≥ 0.14 a large effect [47].
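To make the effect size thresholds concrete, the sketch below computes η² for a one-way design as the ratio of the between-group sum of squares to the total sum of squares and then labels it using the cut-offs stated above. The sample groups are hypothetical, and the "negligible" label for values below 0.01 is an assumption added for completeness; the source defines only the small, medium, and large bands.

```python
from statistics import mean


def eta_squared(groups):
    """Eta squared for one-way ANOVA: SS_between / SS_total."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total


def effect_label(eta2):
    """Map eta squared to the bands used in the text (0.01 / 0.06 / 0.14)."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"  # assumed label; not defined in the source


# Hypothetical ratings for three visual environment types.
ratings_by_type = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
eta2 = eta_squared(ratings_by_type)
print(eta2, effect_label(eta2))

# Applying the same bands to two values reported in Table 5.
print(effect_label(0.097), effect_label(0.021))
```

Under these bands, the largest value in Table 5 (0.131, nighttime pleasantness in Zone 2 without transit) falls just short of a large effect.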

The magnitude of effect sizes demonstrating associations between categories of the visual environment and auditory indicators is shown in Figure 8. Overall, the strongest correlations between visual categories and subjective auditory perceptions were observed within Zones 1 and 2. In Zone 2, variations in visual environment types showed significant associations with changes in the percentage of bird song, rail noise, and honking in the environment, as well as with sound pressure levels, while these acoustic indicators showed no significant correlation with visual categories in Zone 3. In Zone 3, the visual environment types exhibited weaker associations with auditory indicators compared to Zones 1 and 2, but showed relatively stronger correlations with ratings of auditory comfort and sense of calm. These zonal patterns in audiovisual relationships suggest that designers should consider differentiated strategies, with greater emphasis on coordinated audiovisual optimization in proximal areas, while focusing more on visual design enhancements to support psychological comfort in distal zones.

The magnitude of effect sizes for the associations between auditory environment types and visual indicators is shown in Figure 9. The strength of association between auditory environment types and visual indicators was found to be weaker than that between visual environments and auditory indicators. In Zone 1, significant associations were observed between different auditory environment types and the visual proportions of roads, railroads, buildings, and rail stations within the spatial environment. Zone 2 demonstrated statistically significant relationships between variations in auditory types and participants’ ratings on scales measuring beautiful, depressing, and open characteristics. For Zone 3, the auditory environment types showed measurable associations with three specific visual perception measures: sense of security, perception of natural elements, and green visual coverage. These zonal patterns in audiovisual relationships suggest that visual characteristics may play a more substantial role in shaping auditory perceptions than vice versa across the studied rail transit environments.

4. Discussion

4.1. Comparative Analysis and Methodological Advancements

The advantages of this approach are threefold. First, by identifying perceptually sensitive zones (0–50 m, 50–150 m, 150–300 m) grounded in empirical data, it overcomes the limitations of rigid distance-based zoning. Second, it goes beyond purely physical noise reduction strategies [19,20] by quantifying how visual design (e.g., VT2’s positive artificial environments) mitigates auditory discomfort, a linkage underexplored in prior work [25,26]. Third, the use of VR-controlled experiments and cluster analysis overcomes the shortcomings of theoretical models [12,49] by directly correlating environmental features with human perception. The methodology integrates environmental psychology principles with behavioral surveys, addressing a critical gap in existing research that either isolates sensory factors [23,24] or prioritizes infrastructural efficiency over experiential quality [48].

These findings call for a paradigm change in urban transportation design. Design standards [18] should give integrated audiovisual solutions (such as aesthetically designed sound barriers with vegetation) precedence over isolated noise-control measures, as supported by the observed link between visual harmony and noise tolerance in Zones 1–2. Furthermore, the perceptual zoning concept gives planners evidence-based cutoff points for effective resource allocation: strict interventions in high-impact zones (0–50 m) and landscape-based improvements in transitional areas (50–150 m). Because this study bridges the gap between technical measures and human experience [14], policymakers can use it to transform rail transit areas from functional hubs into psychologically optimal public spaces.

4.2. Design Strategies

The effect of the type of auditory environment on visual indicators is smaller than the effect of the visual environment on auditory indicators, which differs somewhat from findings in other urban public spaces [50]. As a special type of urban open space, rail transit space requires targeted methods and strategies in design and research.

First, the noise generated by train operations (e.g., wheel–rail noise and whistles) is much higher in rail spaces than in other urban open spaces (e.g., parks and squares). This noise is persistent, high-frequency, and high-intensity, and it has a significant negative impact on the auditory comfort of pedestrians. The noise problem is especially prominent in Zone 1 (within 50 m). Second, elements such as elevated tracks, train operations, and track facilities (e.g., power lines and signaling equipment) cause significant interference with the visual environment, and this visual interference may trigger a sense of depression or insecurity in pedestrians, especially within Zones 1 and 2. Third, the rail transit space is highly dynamic, with train operations, pedestrian density, and environmental noise changing over time. In addition, the rail space is usually closely intertwined with other functional spaces in the city (e.g., commercial and residential areas), which increases the complexity of the environment. Dynamic design strategies are therefore needed, such as combining real-time noise monitoring with adaptive sound barrier technologies and accounting for the needs of different time periods (e.g., the morning and evening peaks versus off-peak periods). Finally, rail space is both a functional transportation space and an urban public space. It needs to meet the dual needs of transportation efficiency (e.g., train operation, passenger gathering and dispersal) and public space comfort (e.g., pedestrian walking, leisure).

Zone division provides a clear reference basis for the design of rail transit space. For urban planners, the zone division framework offers a systematic approach to optimize rail transit space design. The distinct characteristics of each zone—from the high-impact Zone 1 (0–50 m) requiring intensive noise and visual mitigation, to the transitional Zone 2 (50–150 m) needing visual buffering, and the peripheral Zone 3 (150–300 m) focusing on aesthetic integration—provide clear spatial parameters for land use planning. Planners can utilize this zoning system to strategically allocate functions around stations, ensuring that high-density developments align with the acoustic and visual requirements of each zone while maintaining pedestrian comfort and urban design coherence.
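As an illustration of how the zoning framework could be applied in a planning tool, the following sketch maps a distance from the station to one of the three zones and returns the design emphasis described above. The function name, the dictionary, and the handling of the boundary values (50 m and 150 m are assigned to the nearer zone) are assumptions made for this example; the study reports the ranges 0–50 m, 50–150 m, and 150–300 m without specifying boundary membership.

```python
# Design emphasis per zone, paraphrased from the zone descriptions in the text.
ZONE_EMPHASIS = {
    1: "intensive noise and visual mitigation",
    2: "visual buffering",
    3: "aesthetic integration",
}


def zone_for_distance(distance_m):
    """Return the zone (1, 2, or 3) for a distance in metres from the station.

    Returns None beyond 300 m, which lies outside the studied area.
    Boundary values are assigned to the nearer zone (an assumption).
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m <= 50:
        return 1
    if distance_m <= 150:
        return 2
    if distance_m <= 300:
        return 3
    return None


# Hypothetical site distances in metres.
for distance in (20, 120, 280, 400):
    zone = zone_for_distance(distance)
    emphasis = ZONE_EMPHASIS.get(zone, "outside the studied 300 m range")
    print(f"{distance} m -> zone {zone}: {emphasis}")
```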

Policymakers can leverage these zoning insights to develop more nuanced urban design guidelines and regulations. The framework suggests implementing tiered environmental standards, with Zone 1 requiring stringent noise control measures like mandatory sound barriers and restricted sensitive land uses, Zone 2 benefiting from landscape-based solutions such as minimum greening requirements, and Zone 3 focusing on visual harmony provisions. This graduated approach enables more targeted policy interventions that balance infrastructure needs with quality-of-life considerations, while providing measurable benchmarks for evaluating station area developments.

For engineers and designers, the zone-specific findings translate into practical technical solutions. In Zone 1, this means developing integrated systems combining advanced noise insulation materials (such as composite acoustic panels) with visually appealing treatments like artistic baffles or living walls. Zone 2 solutions might involve engineered landscape elements—water features with sound-masking properties, strategically planted vegetation belts for visual screening, and wayfinding elements that subtly redirect attention. Zone 3 implementations could focus on seamless transitions between station areas and surrounding neighborhoods using natural topography and view corridors. This zoned approach ensures technical solutions are appropriately scaled to their perceptual impact areas, optimizing both resource allocation and user experience.

4.3. Limitations

This study has several limitations. First, the VR environment simulation may differ from the real environment in visual detail, sound realism, and dynamic environmental changes; further studies could conduct additional experiments in real environments to verify the accuracy of the results. Second, the study did not assess the effects of long-term exposure to the spatial environment of rail transit; long-term follow-up studies could help to reveal the lasting effects of the environment on pedestrians’ psychological perception. Third, the study focused on audiovisual perception and did not cover the effects of other senses, such as touch and smell, on psychological perception; multisensory interaction studies could provide a more comprehensive understanding of the integrated effects of the environment. Fourth, the study did not fully consider dynamic environmental factors (e.g., weather, time of day, and crowd density), and including these factors could improve the real-world applicability of the findings. Finally, the study may be limited by its specific cultural and social context (e.g., the urban environment in China), so comparative studies in different cultural and social contexts are needed to verify the generalizability of its conclusions. Addressing these limitations would further enhance the scientific and practical value of the study.

5. Conclusions

The audiovisual environment of rail transit station spaces affects the psychological perception of pedestrians, but current research focuses on operational efficiency and physical noise reduction and has not sufficiently explored psychological perception. The main problems concern visual interference, the application of sound environment control technology, and lagging engineering design standards. Multidisciplinary approaches are needed to optimize the design of audiovisual environments, enhance pedestrian friendliness and public space quality, and promote the construction of healthy cities. This study therefore explored the role of the visual and auditory environments of rail transit spaces in the psychological perception of pedestrians from psychological and behavioral perspectives.

Through cluster analysis, the environment was divided into zones and categories according to the visual and auditory perception and evaluation of rail transit stations, and the interactive effects of audiovisual environmental factors on psychological perception within different zones were explored. The results show that, according to audiovisual perception, the space within 300 m of a rail transit station can be divided into three zones and four space types with different audiovisual perceptions. The ANOVA results indicate that the effect of the visual environment on pleasantness and arousal varies with zone distance. In Zones 1 and 2, the visual environment had a significant effect on pleasantness, and railroad noise significantly reduced pleasantness; in Zone 2, the visual environment also significantly affected arousal. In Zone 3, only the positive visual environment had a moderate effect on pleasantness. Overall, the audiovisual interaction was significant at close range (Zones 1 and 2), and the visual environment at far range (Zone 3) indirectly enhanced pleasantness mainly through psychological feelings. The effect of the type of auditory environment on visual indicators was smaller than the effect of the visual environment on auditory indicators, and the category of vision had the greatest effect on subjective indicators of hearing within Zones 1 and 2.

This study reveals the critical role of audiovisual environments in shaping pedestrian psychological perception around rail transit stations. The zone-based framework provides actionable guidance for stakeholders: urban planners can strategically allocate land uses according to each zone’s acoustic and visual requirements (0–50 m for intensive mitigation, 50–150 m for transitional buffering, 150–300 m for aesthetic integration); policymakers can establish tiered regulations with strict noise controls in Zone 1 and landscape-based solutions in Zone 2; while engineers can implement zone-specific technical solutions ranging from advanced sound barriers in Zone 1 to natural topography treatments in Zone 3, ensuring optimal resource allocation and enhanced user experience throughout the station area.

The findings offer valuable applications across multiple domains. For public health, the demonstrated link between optimized audiovisual environments and reduced stress responses suggests station-area design could be incorporated into urban mental health initiatives, particularly near sensitive facilities like hospitals. In urban noise management, zone-specific impact data enable more precise interventions, from acoustic engineering solutions in high-noise zones to psychoacoustic approaches in transitional areas. Regarding pedestrian comfort, the visual dominance thresholds identified can guide wayfinding systems and amenity placement to minimize cognitive load while maximizing restorative qualities of station environments. These applications collectively contribute to creating transit spaces that support both functional mobility and psychological wellbeing.

Future research directions will focus on three key expansions of this work. First, the zoning framework will be validated and adapted across diverse rail transit environments, including underground stations and at-grade crossings, to test its universal applicability. Second, the Deep Learning models will be refined to incorporate dynamic predictors like real-time crowd flows and weather conditions, enhancing their precision for practical implementation. Third, building on the current audiovisual foundation, integrated multisensory assessment tools will be developed to account for tactile and olfactory dimensions of transit environments. These extensions, coupled with cross-cultural validation studies, will strengthen the model’s robustness while addressing the current limitations identified in VR simulation fidelity, long-term exposure effects, and contextual factors.

Author Contributions

Conceptualization, Q.M. and M.Z.; methodology, Q.M. and X.H.; validation, X.H. and X.Z.; formal analysis, X.H.; investigation, X.H., X.Z. and F.H.; resources, H.X. and F.H.; data curation, H.X.; writing—original draft preparation, X.H.; writing—review and editing, X.Z. and Q.M.; visualization, X.H.; supervision, Q.M. and M.Z.; project administration, M.Z.; funding acquisition, Q.M. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research Project of Suzhou Rail Transit Technology Innovation Research Institute Co., Ltd., “Joint Research on Vibration and Noise Reduction and Soundscape Integration Technology for Municipal (Suburban) Railway Transportation” [KCY-02-FW-14-0003].

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Authors Mingli Zhang and Xinyi Zou are employed by Suzhou Rail Transit Technology Innovation Research Institute Co., Ltd. Authors Haisheng Xie and Feng Han are employed by Suzhou Acoustic Technology Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TOD	Transit-oriented development
VT	Visual environment type
SE	Sound environment
FAR	Floor Area Ratio

References

1. Liu, Q. Walking scale of TOD area along rail transit line. City Plan. Rev. 2019, 43, 88–95. (In Chinese)
2. Calthorpe, P. The Next American Metropolis: Ecology, Community, and the American Dream; Princeton Architectural Press: New York, NY, USA, 1993.
3. Nasri, A.; Zhang, L. The analysis of transit-oriented development (TOD) in Washington, DC and Baltimore metropolitan areas. Transp. Policy 2014, 32, 172–179.
4. Huang, Y.; Zhang, Z.; Xu, Q.; Dai, S.; Chen, Y. Causality between Multi-Scale Built Environment and Rail Transit Ridership in Beijing and Tokyo. Transp. Res. Part D Transp. Environ. 2024, 130, 104150.
5. Song, L.; Zhang, J.; Liu, Q.; Zhang, L.; Wu, X. Characteristics of Noise Caused by Trains Passing on Urban Rail Transit Viaducts. Sustainability 2024, 17, 94.
6. Chakour, V.; Eluru, N. Examining the influence of stop level infrastructure and built environment on bus ridership in Montreal. J. Transp. Geogr. 2016, 51, 205–217.
7. Yu, B.Y.; Wen, L.J.; Bai, J.; Chai, Y.Y. Effect of Road and Railway Sound on Psychological and Physiological Responses in an Office Environment. Buildings 2022, 12, 6.
8. Deng, X.; Chai, T.; He, Q.; Fei, N. Exploration of the current status of land utilization circle layer attenuation and spatial differentiation in urban rail transit stations under the TOD model-Taking Chengdu as an example. Mod. Urban Rail Transit 2025, 2, 18–25. (In Chinese)
9. Cervero, R.; Kockelman, K. Travel demand and the 3Ds: Density, diversity, and design. Transp. Res. Part D Transp. Environ. 1997, 2, 199–219.
10. O’Sullivan, S. Walking Distances to and from Light Rail Transit Stations in Calgary. Master’s Thesis, University of Calgary, Calgary, AB, Canada, 1996.
11. Jaafar Sidek, M.F.; Bakri, F.A.; Kadar Hamsa, A.A.; Aziemah Nik Othman, N.N.; Noor, N.M.; Ibrahim, M. Socio-economic and Travel Characteristics of transit users at Transit-oriented Development (TOD) Stations. Transp. Res. Procedia 2020, 48, 1931–1955.
12. Yang, Q.; Zhang, Z.; Cai, J.; Ding, M.; Li, L.; Zhang, S.; Song, Z.; Chen, F.; Ling, Y. Quality of Pedestrian Networks Around Metro Stations: An Assessment Based on Approach Routes. Systems 2025, 13, 63.
13. Olaru, D.; Curtis, C. Designing TOD precincts: Accessibility and travel patterns. Eur. J. Transp. Infrastruct. Res. 2015, 15, 6–26.
14. Jeffrey, D.; Boulangé, C.; Giles-Corti, B.; Washington, S.; Gunn, L. Using walkability measures to identify train stations with the potential to become transit oriented developments located in walkable neighbourhoods. J. Transp. Geogr. 2019, 76, 221–231.
15. Yao, C.; Li, G.; Yan, S. Design Strategies to Improve Metro Transit Station Walking Environments: Five Stations in Chongqing, China. Buildings 2024, 14, 1025.
16. Wang, Q.; Wang, H.; Yang, C.; Zhang, G. Developing Multivariate Models for Predicting the Levels of Multi-Dimensional Critical Perceptions Due to Metro Noise inside Buildings. Appl. Acoust. 2022, 200, 109083.
17. Hou, B.; Li, J.; Gao, L.; Wang, D. Multi-Source Coupling Based Analysis of the Acoustic Radiation Characteristics of the Wheel–Rail Region of High-Speed Railways. Entropy 2021, 23, 1328.
18. Southworth, M. The sonic environment of cities. Environ. Behav. 1969, 1, 49–70.
19. Nasar, J.L. Perception, cognition, and evaluation of urban places. Public Places Spaces 1989, 31–56.
20. Rock, I.; Harris, C.S. Vision and touch. Sci. Am. 1967, 216, 96–104.
21. Heng, L.; Siu-Kit, L. A review of audio-visual interaction on soundscape assessment in urban built environments. Appl. Acoust. 2020, 166, 107372.
22. Hu, X.; Meng, Q.; Yang, D.; Li, M. Facial expression recognition, a predictive tool for perceiving urban open space environments under audio-visual interaction. Energy Build. 2024, 318, 114456.
23. Margaritis, E.; Kang, J. Relationship between Green Space-Related Morphology and Noise Pollution. Ecol. Indic. 2017, 72, 921–933.
24. Merchan, C.I.; Diaz-Balteiro, L. Noise pollution mapping approach and accuracy on landscape scales. Sci. Total Environ. 2013, 449, 115–125.
25. Verma, D.; Jana, A.; Ramamritham, K. Predicting Human Perception of the Urban Environment in a Spatiotemporal Urban Setting Using Locally Acquired Street View Images and Audio Clips. Build. Environ. 2020, 186, 107340.
26. Hossain, S.M.M.; Ferdausi, J.; Waz, S.N. Mechanical Properties for Fabric-Woven and Knitted Supported by Composite Material. Int. Res. J. Multidiscip. Sci. Technol. (IRJMRS) 2016, 1, 78–83.
27. ISO 10847:1997; Acoustics—In-Situ Determination of Insertion Loss of Outdoor Noise Barriers of All Types. International Organization for Standardization: Geneva, Switzerland, 1997.
28. Blanc, N.; Steux, B. LaRASideCam: A Fast and Robust Vision-based Blind Spot Detection System. In Proceedings of the 2007 IEEE Intelligent Vehicles Symposium, Istanbul, Turkey, 13–15 June 2007; pp. 480–485.
29. ISO 12913-3; Acoustics Soundscape Part 3: Data Analysis. International Organization for Standardization: Geneva, Switzerland, 2019.
30. Jeon, J.Y.; Jo, H.I. Effects of audio-visual interactions on soundscape and landscape perception and their influence on satisfaction with the urban environment. Build. Environ. 2020, 169, 106544.
31. Garzón, L.; Bravo-Moncayo, L.; Arellana, J.; Ortúzar, J.D. On the relationships between auditory and visual factors in a residential environment context: A SEM approach. Front. Psychol. 2023, 14, 1080149.
32. Toet, A.; van Erp, J.B. The EmojiGrid as a Tool to Assess Experienced and Perceived Emotions. Psych 2019, 1, 469–481.
33. Kaneko, D.; Toet, A.; Ushiama, S.; Brouwer, A.-M.; Kallen, V.; van Erp, J.B. EmojiGrid: A 2D pictorial scale for cross-cultural emotion assessment of negatively and positively valenced food. Food Res. Int. 2019, 115, 541–551.
34. Kern, A.C.; Ellermeier, W. Audio in VR: Effects of a Soundscape and Movement-Triggered Step Sounds on Presence. Front. Robot. AI 2020, 7, 20.
35. Kaye, L.K.; Malone, S.A.; Wall, H.J. Emojis: Insights, affordances, and possibilities for psychological science. Trends Cogn. Sci. 2017, 21, 66–68.
36. Cohen, J. Statistical Power Analysis for the Behavioral Sciences; Routledge: London, UK, 1988.
37. Liu, J.; Kang, J.; Luo, T.; Behm, H. Landscape Effects on Soundscape Experience in City Parks. Sci. Total Environ. 2013, 454–455, 474–481.
38. Yu, L.; Kang, J. Effects of social, demographic and behavioural factors on sound level evaluation in urban open spaces. J. Acoust. Soc. Am. 2008, 123, 772–783.
39. Snellen, H. Probebuchstaben zur Bestimmung der Sehschärfe; Van de Weijer: Utrecht, The Netherlands, 1862.
40. ISO 7029:2017; Acoustics—Statistical Distribution of Hearing Thresholds Related to Age and Gender. 3rd ed. International Organization for Standardization: Geneva, Switzerland, 2017.
41. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
42. Andersen, J.L.E.; Landex, A. Catchment areas for public transport. WIT Trans. Built Environ. 2008, 101, 175–184.
43. Pishro, A.A.; L’hostis, A.; Chen, D.; Pishro, M.A.; Zhang, Z.; Li, J.; Zhao, Y.; Zhang, L. The Integrated ANN-NPRT-HUB Algorithm for Rail-Transit Networks of Smart Cities: A TOD Case Study in Chengdu. Buildings 2023, 13, 1944.
44. Shaofei Niu, A.; Shen, H.Z.; Huang, Y.; Mou, Y. Measuring the built environment of green transit-oriented development: A factor cluster analysis of rail station areas in Singapore. Front. Archit. Res. 2021, 10, 652–668.
45. Hammer, Ø.; Harper, D.A. PAST: Paleontological Statistics Software Package for Education and Data Analysis. Palaeontol. Electron. 2001, 4, 1–9.
46. Stahle, L.; Wold, S. Analysis of variance (ANOVA). Chemom. Intell. Lab. Syst. 1989, 6, 259–272.
47. Olejnik, S.; Algina, J. Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs. Psychol. Methods 2003, 8, 434–447.
48. Pan, H.; Li, J.; Shen, Q.; Shi, C. What determines rail transit passenger volume? Implications for transit oriented development planning. Transp. Res. Part D Transp. Environ. 2017, 57, 52–63.
49. Lamour, Q.; Morelli, A.M.; Marins, K.R.d.C. Improving walkability in a TOD context: Spatial strategies that enhance walking in the Belém neighbourhood, in São Paulo, Brazil. Case Stud. Transp. Policy 2019, 7, 280–292.
50. Arras, F.; Massacci, G.; Pittaluga, P. Soundscape perception in Cagliari, Italy. Acta Acust. United Acust. 2003, 89, 1–6.

Figure 1. (a) Research location; (b) shooting angle; (c) research area selection; (d) main noise sources analysis map; (e) other sound sources in the vicinity; (f) material collection positions.

Figure 2. EmojiGrid.

Figure 3. (a) Lab configuration of VR simulations; (b) picture of a participant during VR simulation.

Figure 4. Indicator extraction methods.

Figure 5. (a) Relationship between distance and auditory perception, (b) Relationship between distance and visual perception, (c) Cluster analysis results of audiovisual perception, (d) Principal component analysis (PCA) results based on audiovisual perception, (e) the zoning result.

Figure 6. Average and 95% confidence intervals of pleasantness and arousal in three types of visual environments in the evening without railroad transit passing (a–c) and with railroad transit passing (d–f).

Figure 7. Average and 95% confidence intervals of pleasantness and arousal in three types of visual environments in daytime without railroad transit passing (a–c) and with railroad transit passing (d–f).

Figure 8. Effect of visual type on auditory indicators, reporting effect size.

Figure 9. Effect of auditory type on visual indicators, reporting effect size.

Table 1. Measurement and VR simulation equipment specifications.

Instrument	Model	Manufacturer	Application
Sound Level Meter	BSWA 801	BSWA Tech (Beijing, China)	On-site LAeq measurements
Acoustic Calibrator	BSWA CA111	BSWA Tech (Beijing, China)	Sound level meter calibration
Binaural Recorder	HEAD acoustics SQobold	HEAD acoustics (Stuttgart, Germany)	3D audio capture for VR stimuli
Video Camera	Insta360 ONE	Insta360 (Shenzhen, China)	Visual environment documentation
VR Headset	HTC Vive Pro Eye	HTC (Taipei, China)	Audiovisual stimulus presentation
Monitoring Headphones	AKG K712 Pro	AKG (Vienna, Austria)	Binaural audio playback in VR

Table 2. Questions and objectives.

Questionnaire Module	Specific Questions	Scale	Measurement Objective
Audiovisual Dominance [32]	1. “How clearly can you hear the train noise in this environment?” 2. “How visually dominant are the rail structures in your view?”	11-point numeric scale (0 = “Not at all” → 10 = “Extremely dominant”).	Quantify salience of rail-related auditory/visual stimuli
Acoustic Evaluation [32]	1. “Rate your comfort level with the current sound environment” 2. “How loud do you perceive the surroundings to be?”	5-point Likert (1 = “Very uncomfortable” → 5 = “Very comfortable”). 5-point loudness scale (1 = “Very quiet” → 5 = “Deafening”).	Assess subjective noise perception and tolerance
Soundscape Semantics [32]	“The soundscape feels”: • Calm ↔ Chaotic; • Pleasant ↔ Annoying; • Vibrant ↔ Monotonous.	7-point bipolar semantic differential (−3 to +3, with 0 = neutral).	Capture emotional and qualitative soundscape attributes
Visual Quality [33]	1. “This space appears well-maintained.” 2. “The architectural design feels harmonious.”	5-point Likert (1 = “Strongly disagree” → 5 = “Strongly agree”).	Evaluate aesthetic and functional visual perception
Spatial Perception [33]	“The area looks”: • Open ↔ Confined; • Interesting ↔ Boring.	7-point bipolar semantic differential (−3 to +3, with 0 = neutral).	Measure spatial comfort and engagement
Integrated Evaluation [34]	EmojiGrid (5 × 5) with facial expressions 😊 → 😞.	Nonverbal selection.	Assess instant emotional response to combined audiovisual stimuli

Table 3. List of visual indicators.

Indicator Type	Indicators (Visual)
Evaluation	Comfort evaluation, Safety, Beautiful, Wealthy, Depressing, Boring, Natural, Openness
Semantic segmentation	Green_ss, Water_ss, Sky_ss, Road_ss, Building_ss, Green_pl_ss, Building_pl_ss
Object detection	People_od, Car_od, Rail_od, Station_od

Table 4. List of auditory indicators.

Indicator Type	Indicators (Auditory)
Evaluation	Comfort Evaluation, Loudness, Eventfulness, Vibrancy, Pleasantness, Calmness
Audio Classification	Bird_ac, People_ac, Horn_ac, Vehicles_ac, Transmit_ac
Sound level	Overall_Leq, Lmin, Lmax

Table 5. The differences in the effects of different types of visual environments on pleasantness and arousal (ANOVA).

		Zone 1		Zone 2		Zone 3
		Without Transit	With Transit	Without Transit	With Transit	Without Transit	With Transit
Nighttime	Pleasantness	0.097 **	0.021	0.131 **	0.054 *	0.065 **	0.053 *
Nighttime	Arousal	0.003	0.032 *	0.033 *	0.025	0.021	0.017
Daytime	Pleasantness	0.025 *	0.043 *	0.013	0.003	0.017	0.052 *
Daytime	Arousal	0.021 *	0.018	0.005	0.008	0.003	0.032

* represents p < 0.05, ** represents p < 0.01.

Table 6. Effect of the presence of rail sound on pleasantness and arousal in nighttime (t-test).

		Zone 1		Zone 2		Zone 3
		Mean Difference	Cohen’s d	Mean Difference	Cohen’s d	Mean Difference	Cohen’s d
VT1	Pleasantness	0.38 *	0.35	0.44 *	0.67	0.29	0.22
VT1	Arousal	−0.64 **	0.74	0.13	0.34	0.01	0.14
VT2	Pleasantness	0.42 *	0.56	0.82 **	0.91	0.32 *	0.31
VT2	Arousal	−0.26	0.14	0.25 *	0.64	0.06	0.09
VT3	Pleasantness	0.65 **	0.72	0.24	0.29	0.26	0.43
VT3	Arousal	0.09	0.13	0.34 *	0.54	0.09	0.19

* represents p < 0.05, ** represents p < 0.01.

Table 7. Effect of the presence of rail sound on pleasantness and arousal in daytime (t-test).

		Zone 1		Zone 2		Zone 3
		Mean Difference	Cohen’s d	Mean Difference	Cohen’s d	Mean Difference	Cohen’s d
VT1	Pleasantness	0.23 *	0.16	0.03	0.08	0.27	0.32
VT1	Arousal	−0.03	0.12	0.16	0.23	0.02	0.17
VT2	Pleasantness	−0.13	0.32	−0.34	0.43	−0.67 *	0.75
VT2	Arousal	0.48 *	0.48	−0.03	0.09	−0.19	0.21
VT3	Pleasantness	0.29 *	0.29	0.65 **	0.76	0.62	0.37
VT3	Arousal	0.22	0.21	0.04	0.13	0.17	0.31

* represents p < 0.05, ** represents p < 0.01.
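Tables 6 and 7 pair each mean difference with a Cohen’s d. As a rough illustration of how such a pair can be computed, the sketch below assumes two independent groups of ratings (with and without rail sound) and the pooled-standard-deviation form of Cohen’s d. The tables do not say which variant of d was used or which group is subtracted from which, so both choices are assumptions here, as are the function name and the sample ratings, which are hypothetical and not taken from the study. Because the tables list positive d values even where the mean difference is negative, the sketch returns d as a magnitude.

```python
from math import sqrt
from statistics import mean, stdev


def mean_difference_and_cohens_d(with_rail, without_rail):
    """Return (mean difference, |Cohen's d|) for two independent rating groups.

    Assumption: difference = mean(with_rail) - mean(without_rail), and d uses
    the pooled standard deviation of the two groups.
    """
    n1, n2 = len(with_rail), len(without_rail)
    diff = mean(with_rail) - mean(without_rail)
    # Pooled SD built from the two sample variances (n - 1 denominators).
    pooled_var = (
        (n1 - 1) * stdev(with_rail) ** 2 + (n2 - 1) * stdev(without_rail) ** 2
    ) / (n1 + n2 - 2)
    return diff, abs(diff) / sqrt(pooled_var)


# Hypothetical pleasantness ratings, for illustration only.
ratings_with_rail = [2, 3, 3, 4, 3]
ratings_without_rail = [3, 4, 4, 5, 4]
diff, d = mean_difference_and_cohens_d(ratings_with_rail, ratings_without_rail)
print(f"mean difference = {diff:.2f}, Cohen's d = {d:.2f}")
```

By the usual rule of thumb, d near 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large, which is one way to read the Cohen’s d columns alongside the significance markers.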

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
