1. Introduction
Research on assessing the visual quality of landscapes has gradually emerged since the 1960s, with early studies focusing on subjective evaluations of landscape aesthetics, concentrating on experts’ aesthetic criteria and public preferences [1]. For example, Crowe and Litton addressed the aesthetic impacts of rural landscapes and woodland landscapes, respectively [2,3]. The expert assessment model was soon adopted by government agencies in several countries, including U.S. federal agencies [4]. The Scenic Beauty Estimation (SBE) method proposed by Daniel and Boster (1976) was a major breakthrough in quantifying public preferences [5]. Since then, new perspectives and dimensions have gradually been incorporated into the assessment of landscape visual quality, allowing for a more comprehensive understanding of landscape aesthetics. Examples include ecological quality and biodiversity [6,7,8], multisensory landscape experience [9,10,11], attractiveness and security [12], accessibility [13,14,15], and more.
Despite the achievements of SBE methods and other traditional techniques in landscape aesthetic assessment, they have limitations in capturing landscape diversity and dynamic character. Traditional methods tend to ignore complexity in changing urban environments and natural landscapes [16]. In addition, expert-led assessment methods face challenges of subjectivity and repeatability [17,18].
In recent years, there has been a significant shift in the field of visual quality assessment. The use of big data sources such as street view images instead of traditional film photographs is becoming mainstream, an approach that captures a broader and more realistic view of the landscape and provides more comprehensive data support [19,20,21,22,23,24]. In addition, the application of convolutional neural networks (CNNs) has revolutionized the efficiency and accuracy of image aesthetic quality assessment. CNNs are able to not only process large amounts of image data but also learn complex visual features to more accurately predict the aesthetic quality of an image. The application of these new techniques improves efficiency while providing new possibilities for understanding the diversity and subjectivity of landscape aesthetics.
The application of neural networks in the visual arts is evolving, with photography and painting being key areas of research [25]. For example, DiffRankBoost, based on RankBoost and support vector techniques, was proposed in 2010 to explore methods for estimating aesthetic scores at a fine-grained level [26]. Subsequently, Deep Multi-Patch Aggregation Network methods were used to address image style, aesthetics, and quality assessment in high-resolution images [27]. In the same year, new frameworks also emerged, such as methods for learning aesthetic features using convolutional networks and perceptual calibration systems for the automatic aesthetic assessment of photographic images [28,29]. Meanwhile, image segmentation models play an important role in landscape quantification studies: convolutional neural network-based segmentation models can effectively identify and separate key visual elements in an image, providing more in-depth visual information for aesthetic assessment [30].
While these early approaches have made progress in image aesthetic quality assessment, they usually fail to fully understand and model the complexity of human perception and response to aesthetics. They struggle to accurately assess aesthetic quality in the absence of referents and have limitations in dealing with highly subjective and complex aesthetic assessments [31].
In recent years, the development of deep learning techniques, especially the application of CNNs, has opened up new possibilities for image aesthetic quality assessment. For example, the NIMA (Neural Image Assessment) model uses CNNs to predict the distribution of human opinion scores rather than a single score [32]. The PSAA algorithm for aesthetic attributes based on unsupervised learning [33], the MRACNN method that utilizes multimodal information for aesthetic prediction [34], and natural language processing (NLP) techniques can provide additional contextual information to aid computers in understanding aesthetic value [35]. Deep neural networks have been trained with real-life photos and corresponding scoring data to recognize visual features of urban street vitality for a better understanding of the development and evolution of cities [36]. Simultaneous human–machine–environment quantitative analysis techniques have been demonstrated to be useful for improving the reliability and accuracy of visual quality assessment results in linear landscapes [37]. All these methods and techniques provide new research perspectives for assessing the aesthetic quality of images.
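To make the idea of predicting a rating distribution concrete, the following minimal sketch shows how a single aesthetic score and its spread can be derived from such a distribution. It is an illustration only and not the NIMA implementation: the function name, the 1 to 5 rating scale, and the probabilities are assumptions chosen for the example.

```python
def mean_and_std_from_distribution(probs, scores=None):
    """Return the mean and standard deviation of a discrete rating distribution."""
    if scores is None:
        # Assumed rating scale: 1..N, one score per probability bin.
        scores = list(range(1, len(probs) + 1))
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize so the bins form a proper probability distribution.
    p = [x / total for x in probs]
    mean = sum(s * pi for s, pi in zip(scores, p))
    variance = sum(pi * (s - mean) ** 2 for s, pi in zip(scores, p))
    return mean, variance ** 0.5


# Hypothetical predicted distribution over a 1-5 rating scale.
predicted = [0.05, 0.10, 0.20, 0.40, 0.25]
mean_score, spread = mean_and_std_from_distribution(predicted)
print(f"mean score = {mean_score:.2f}, spread = {spread:.2f}")
```

The mean summarizes how attractive an image is predicted to be, while the spread gives a rough indication of how much raters are expected to disagree.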
In this study, we use CNNs and big data technology, building on deep learning methods and related technical tools developed from 2015 to the present, to improve the efficiency and accuracy of assessing the visual quality of landscapes and the aesthetic quality of images in landscape assessment, for example in the assessment of urban waterfront streetscape images.
In the subsequent sections, we discuss the following aspects in detail: (1) the selection of the study area and the basis for its delineation; (2) the process of fine-tuning the CNN model based on the SBE samples and the performance evaluation; (3) the prediction of the landscape quality of all the streetscapes in the study area using the fine-tuned model; and (4) the main findings of the study and the potential for the application of deep learning techniques in the discipline of landscaping.
4. Discussion
4.1. Main Findings
The analysis results show obvious fluctuations in the landscape beauty rating values of the riverfront ecological space, other areas of the monitoring area, and the built-up area, with the built-up area showing particularly significant fluctuations. This indicates that there are large differences in the visual attractiveness of natural landscape features and architectural styles in different areas, which also reflects the diversity of visual landscapes and the subjectivity of evaluation groups.
By fine-tuning the pre-trained CNN and selecting appropriate hyperparameters, we were able to enhance the model’s generalization ability on small datasets. Ultimately, Model_3 is considered the optimal model because of its high agreement with human evaluation, as measured by the Pearson and Spearman correlation coefficients. In addition, the utility of Model_3 is validated by the prediction of a large-scale collection of street view images and the visualization of kriging interpolation in ArcGIS Pro 3.0.2. The cross-validation results show that Model_3 has a high prediction accuracy, and its root-mean-square error and average standard error are at a low level, which supports the accuracy and reliability of the interpolation results. The above metrics show that CNNs with a smaller number of parameters are able to effectively learn the features of human subjective evaluations and make predictions over a wide geographic range.
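As an illustration of the agreement metrics mentioned above, the following sketch computes the Pearson and Spearman correlation coefficients between human ratings and model predictions using only the Python standard library. The function names and the rating values are hypothetical and are not taken from the study's data.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean human ratings and model predictions for five images.
human = [3.2, 4.5, 2.1, 3.9, 4.8]
model = [3.0, 4.1, 2.5, 4.0, 4.6]
print(f"Pearson r = {pearson(human, model):.3f}")
print(f"Spearman rho = {spearman(human, model):.3f}")
```

Pearson measures linear agreement between the two sets of scores, whereas Spearman only checks whether the model orders the images in the same way as the human raters, so reporting both gives a more complete picture.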
According to the prediction results of Model_3, we find that the built-up area has the lowest mean value of aesthetics, while the riverfront ecological space and other areas of the monitoring area have the highest mean value of aesthetics. This indicates that landscapes with high naturalness play an important role in enhancing the visual quality of streetscape aesthetics. Overall, our study not only demonstrates the potential application of deep learning models in streetscape aesthetics evaluation but also provides an important visual aesthetics evaluation tool for urban planners.
4.2. Importance of Fine-Tuning for Performance Improvement
In analyzing the relationship between human ratings and each model (including one pre-trained model and four screened fine-tuned models), we found the following key phenomena. First, all the fine-tuned models showed a positive correlation with human ratings, indicating that these models were able to mimic human rating patterns to some extent. However, for the pre-trained model, the trend line showed a negative slope, suggesting a negative correlation with human ratings (Figure 7). This suggests that without fine-tuning, the model’s scoring patterns are significantly different from human scoring patterns. This finding highlights the importance of fine-tuning in machine learning, especially in scenarios where models are required to accurately understand and mimic human behavior [75]. Performance differences between fine-tuned models further reveal the possible impact of different fine-tuning strategies, highlighting the need to optimize models for specific tasks [76]. These observations not only demonstrate the potential of machine learning to mimic complex human tasks but also highlight the importance of the fine-tuning process in enabling highly specialized machine learning applications [77].
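The sign of a trend line can be checked with an ordinary least-squares fit of model scores against human ratings. The sketch below is a minimal illustration with hypothetical values, chosen so that the pre-trained scores fall and the fine-tuned scores rise as human ratings increase; it does not reproduce the data behind Figure 7.

```python
def trend_line(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical scores: human ratings and two sets of model outputs.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
pretrained = [5.2, 5.0, 4.9, 4.7, 4.6]
finetuned = [1.4, 2.1, 2.9, 3.8, 4.6]

slope_pre, _ = trend_line(human, pretrained)
slope_ft, _ = trend_line(human, finetuned)
print(f"pre-trained slope = {slope_pre:.2f}, fine-tuned slope = {slope_ft:.2f}")
```

A negative slope means the model tends to score images lower as human ratings rise, which is the pattern described above for the model without fine-tuning.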
4.3. Application of Deep Learning Techniques in Landscape Disciplines
In recent years, deep learning techniques have become an important means of quantifying the perception of the human living environment in the landscape discipline. Two types of deep learning models are widely used in this field: image segmentation models and natural language processing models.
4.3.1. Image Segmentation Model
The role of image segmentation models is to segment an image into different regions or objects with semantic information, which facilitates target detection and recognition tasks. In the discipline of landscape, SegNet was used to quantify the greenness, openness, and enclosure of Beijing’s hutong streetscapes to analyze the connection between the physical quality of the environment and human subjective perception [78]. The association between green visibility and residents’ mental health can be analyzed by using the green visibility indicator derived from segmented streetscape images [79]. Image segmentation can also be used to study walkability and street safety [80,81].
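Indicators such as greenness or openness are commonly computed as the share of pixels that a segmentation model assigns to the relevant classes. The following sketch illustrates this calculation on a tiny hand-written label mask; the class identifiers and the mask are hypothetical and do not correspond to the label set of SegNet or any other specific model.

```python
from collections import Counter


def class_share(label_mask, class_ids):
    """Fraction of pixels in a 2D label mask that belong to the given classes."""
    counts = Counter(label for row in label_mask for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("label mask is empty")
    return sum(counts[c] for c in class_ids) / total


# Hypothetical class ids; a real segmentation model defines its own label set.
SKY, BUILDING, ROAD, VEGETATION = 0, 1, 2, 3

# Toy 4x4 segmentation result for one street view image.
mask = [
    [SKY, SKY, SKY, SKY],
    [BUILDING, VEGETATION, VEGETATION, SKY],
    [BUILDING, VEGETATION, VEGETATION, BUILDING],
    [ROAD, ROAD, ROAD, ROAD],
]

green_view = class_share(mask, {VEGETATION})  # share of vegetation pixels
sky_view = class_share(mask, {SKY})           # a simple proxy for openness
print(f"green view = {green_view:.4f}, sky view = {sky_view:.4f}")
```

In practice, the mask would come from running a segmentation model on each street view image, and the resulting shares would be aggregated per sampling point or street segment.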
Compared to the widespread use of image segmentation models, models based on image aesthetic quality assessment have not yet received a great deal of attention in the landscape discipline. The two types of models yield different results: image segmentation models are mainly used to quantify landscape elements in streetscape images [82], while image aesthetic quality assessment models are mainly used to quantify the overall aesthetic quality of a streetscape image [83]. In addition, image segmentation models often do not need to be fine-tuned and can be applied directly in their pre-trained form, while image aesthetic quality assessment models need to be fine-tuned with sample data to match experimental and practical needs [84]. Despite the generalizability of pre-trained image aesthetic quality assessment models, they do not reflect the subjective needs of a specific population [85]. The image aesthetic quality assessment model, with its prominent beauty metrics, can provide an intuitive and direct reflection of the landscape aesthetics of the study population without the need to rely on data other than street view images and human ratings.
4.3.2. Natural Language Processing Models
Natural language processing is an important branch at the intersection of computer science, artificial intelligence, and linguistics, which is dedicated to the study and development of computer systems that are capable of effectively understanding, interpreting, and modeling human language [86]. Natural language processing covers a wide range of topics including, but not limited to, speech recognition, natural language understanding, natural language generation, machine translation, and sentiment analysis. Among them, sentiment analysis is the main direction in which landscape disciplines use natural language processing models. Such studies usually take social media comments as research samples, analyze the sentiment polarity (positive or negative labels) of the comments, for example with the sentiment analysis technique based on the EASYDL deep learning platform [87], identify the parts of a venue associated with positive and negative sentiment, and then carry out correlation analysis with other variables [88]. Sentiment analysis can also be used to identify urban park attributes associated with positive emotions, with findings showing that visitors to different types of urban parks have different levels of positive emotions [89]. Recently, scholars have begun to combine sentiment analysis techniques with street scene image segmentation techniques. Coupling remote sensing imagery, streetscape images, social media comments, and PPGIS platform data is advocated to assist informal green space identification in the context of urban renewal [90].
Natural language processing (NLP) models and image aesthetic quality assessment models differ significantly in their core purposes, with natural language processing models focusing on interpreting and generating human language [91], while image aesthetic quality assessment models are dedicated to quantifying the overall aesthetic quality of images. However, the two exhibit certain similarities when dealing with complex data: methodologically, natural language processing models usually require a large amount of linguistic data for training in order to understand and model different linguistic structures and meanings, similar to image aesthetic quality assessment models that require a large number of images and related aesthetic scores to train their judgment criteria [92]. Furthermore, similar to image aesthetic quality assessment models, natural language processing models face the challenge of needing to adapt to specific application scenarios, such as parsing and generating text within a specific scenic area or in a specific cultural context [93]. However, despite the ability of NLP models to understand and reflect the emotion or style of a text in certain situations, this ability is not always a complete substitute for subjective human judgments, similar to the limitations of image aesthetic quality assessment models in evaluating the aesthetics of a specific population [94]. In summary, despite the different purposes for which natural language processing models and image aesthetics quality assessment models can be applied in landscape disciplines, they show some similarities in terms of dealing with highly complex datasets, the challenges they need to face, and the search for a balance between model generalizability and specific needs.
4.4. Advantages and Limitations
Compared with traditional methods, landscape assessment using convolutional neural network models has the following advantages. First, it can reduce the time and economic cost required for landscape assessment. Traditional urban landscape assessment methods require a large amount of human resources for research, while the convolutional neural network model can predict the quality of a large range of landscapes by learning from a small number of samples, which effectively reduces the cost of research and improves efficiency. Second, the model is very easy to adjust. The pre-trained convolutional neural network model only needs to be fine-tuned with a small number of samples, so it is more targeted in predicting landscape quality and can meet the aesthetic needs of different landscape types and different interest groups. Finally, artificial intelligence, and computer vision in particular, has made rapid progress in recent years, and many excellent models have emerged, such as the image segmentation model SAM (Segment Anything Model) and the generative model DALL-E 2 [95,96]. These models have strong generalization properties and do not require post-process fine-tuning by the user. Since image aesthetic quality prediction models share similar principles with these models, strong generalization performance can theoretically be achieved as well.
Of course, there are some limitations to the current use of convolutional neural networks for landscape assessment. The first is the limitation of training samples. Although the sample size required for fine-tuning the model is small, the characteristics of the raters who produce the sample ratings directly affect whether the model’s predictions are universally representative. Representativeness can only be achieved with some knowledge of the demographic composition of the study area. In addition, the street view images themselves are subject to further variables (e.g., seasons, weather), making it difficult to present the landscape in a completely objective and consistent manner. Second, there are limitations in the hardware used to analyze the data. Large models with high generalizability often need to be obtained by high-level research institutions after long training on large-scale computing devices. Individual users or small-scale research institutions may find it difficult to fine-tune large models due to hardware constraints, limiting the popularity and scope of application of customized models. Despite these limitations, the advantages described above make convolutional neural network modeling a promising emerging technique in the field of landscape assessment.
5. Conclusions
In conclusion, this study successfully demonstrated the feasibility of employing convolutional neural networks to emulate human subjective assessments and forecast landscape visual quality, affirming their efficacy in this domain with outcomes closely mirroring human judgments, thereby attesting to a high level of precision and robustness. Moreover, it validated the potential for achieving superior performance through model fine-tuning with a minimal dataset, broadening practical applications. By visually rendering these predictive analyses, our work fosters an intuitive grasp of varying regional landscape aesthetics, thereby enriching the understanding and practical toolbox of urban planners, landscape designers, and environmental assessment specialists. In doing so, it not only supplies these professions with innovative techniques for assessing and comprehending landscape beauty but also imparts significant insights and empirical data to the realms of computer vision and machine learning research. Ultimately, this research endeavor contributes significantly to elevating societal consciousness and appreciation concerning the value of environmental aesthetics. Our research findings furnish municipal administrations with theoretical underpinnings and hold considerable promise across multiple domains. Primarily, this technology facilitates the development of a multidimensional urban landscape aesthetics evaluation framework, encompassing both natural and man-made landscapes, as well as visual and cultural–historical values, enabling comprehensive and objective assessments tailored to the distinct characteristics of each city. Furthermore, it empowers urban administrators to conduct simulation studies on the impact of various policy interventions, such as increasing green spaces, enhancing street scenery, or preserving historical structures, on the city’s aesthetic appeal. 
This step is pivotal in demonstrating the potential benefits of specific changes to policymakers. Augmented by social media, policy formulation bodies can also engage citizens in the appraisal of landscape aesthetics, collecting public perceptions and preferences regarding different urban areas. This process enhances our understanding of the key elements vital for improving urban aesthetics, informing policy recommendations more attuned to popular sentiment. Lastly, we may establish a sustained monitoring and evaluation mechanism for urban landscapes, periodically assessing whether policy implementations meet expectations, if aesthetic enhancements are notable, and how these transformations influence residents’ quality of life, urban attractiveness, and tourism economy, among others. Such a system not only facilitates timely strategic adjustments but also furnishes invaluable data for subsequent research endeavors. In summary, we believe that this approach has a wide potential for application in real life and much room for development. This method brings a new interpretation to the traditional means of environmental perception, so we will continue to explore the better application of this technology in urban landscape planning and habitat improvement.