Analysis of Influencing Factors and Distribution Simulation of Budget Hotel Room Pricing Based on Big Data and Machine Learning from a Spatial Perspective

Hu, Tao; Song, Haoyu

doi:10.3390/su15010617


by Tao Hu and Haoyu Song *

School of Tourism, Hainan University, Haikou 570228, China

* Author to whom correspondence should be addressed.

Sustainability 2023, 15(1), 617; https://doi.org/10.3390/su15010617

Submission received: 10 November 2022 / Revised: 24 December 2022 / Accepted: 26 December 2022 / Published: 29 December 2022


Abstract

The goal of investors in the hotel business is to maximize profits, and price is an important means of achieving this goal. This has led many scholars to study the spatiotemporal relationship between hotel room prices and their possible influencing factors from different perspectives. However, most existing studies adopt the linear assumption of the hedonic model, consider a limited set of features, and lack a feature selection procedure. Additionally, few studies forecast hotel pricing from a spatial perspective. To address these gaps, this study applies linear and nonlinear machine learning methods to “big data” from Sanya City to explore the factors influencing budget hotel pricing. From a spatial perspective, 81 potential factors were considered and then screened using a feature selection method called recursive feature elimination. Six machine learning algorithms were evaluated and compared: random forest, extreme gradient boosting, multiple linear regression, support vector regression, multilayer perceptron regression, and K-nearest neighbor regression. The best-performing model was then used to calculate feature importance. The results revealed 40 important influencing features and predicted the spatial distribution of hotel pricing.

Keywords:

hotel pricing; machine learning; recursive feature elimination; spatial perspective

1. Introduction

1.1. Background

As one of the three pillar industries of tourism, the hospitality industry has attracted the attention of policymakers and practitioners in both the public and private sectors for a long time. The hotel industry usually contributes the most to tourism revenue [1]. The development of the hotel industry will promote the development of the urban economy and contribute to the attraction of the city’s investment. The hotel industry has a strong connection with the development of tourism, construction, transportation, and commerce. At the same time, as a tertiary industry, the hotel industry is a labor-intensive industry, which can bring many employment opportunities for the city [2,3], relieve employment pressure, and stabilize society. Therefore, hotel pricing is important in government and tourism enterprises’ strategic planning. Hotel pricing is not only representative of the hotel’s image and quality but also is the basis of market segmentation and positioning.

With the development of computer technology and artificial intelligence (AI), some scholars and hospitality practitioners have considered phasing in AI algorithms for revenue management to replace traditional management techniques and improve performance [4]. Al et al. [5] used four models to forecast hotel prices along the time dimension: the seasonal autoregressive integrated moving average (SARIMA) model, the restricted Boltzmann machine as a deep belief network model, the polynomial smooth support vector machine (SVM) model, and the adaptive network-based fuzzy inference system (ANFIS) model. Song et al. [6] used a classical unsupervised machine learning technique, Latent Dirichlet Allocation, to process a large number of unstructured online hotel reviews for hotel satisfaction research. AI models such as SVM, random forest (RF), K-nearest neighbor (KNN), and eXtreme gradient boosting (XGBoost) [7,8,9,10] have thus proven their value in recent years.

Currently, with the introduction of big data approaches and open data sources, interest in new tools such as machine learning algorithms has increased [11,12], and both scholars and practitioners have recommended or used machine learning models. Many researchers have reported significantly higher prediction accuracy with machine learning models than with statistical models. Therefore, this study adopts a variety of machine learning algorithms to explore the factors influencing hotel prices and to predict hotel pricing.

1.2. Literature Review

Research on hotel room prices is abundant in the field of hedonic price theory. Hedonic price theory states that a good is a bundle of features and that these features together determine the utility the good provides. Each feature, or level of quality, corresponds to an implicit price, and these implicit prices are called hedonic prices [13]. In recent years, hedonic price analysis has been increasingly applied to provide insights into hotel pricing, for which hotels are ideal objects. Some scholars have tried to evaluate the importance of certain services, facilities, and other internal factors in hotel pricing; for example, Zhang et al.’s [14] study on star-rated hotels in Beijing shows that star rating, age, and the number of rooms are determinants of hotel pricing. Other internal features on which hotels compete include chain affiliation, Internet access, swimming pools, breakfast, and parking lots [15,16,17,18,19]. Some scholars have focused on the influence of external factors on hotels. For example, Somphong et al. [20] investigated the impact of distance from the beach on hotel pricing. Kim et al. [21] determined how the relationship between hotel pricing and the distance to airports, highways, and tourist attractions varies over space. Chica-Olmo [22] measured the impact of historical sites on hotel pricing in Seville, Spain. However, according to Zhang et al. [14], the only widely recognized price determinant in the lodging industry is location; that is, hotel pricing is influenced by location (external factors). No consensus has been reached on other variables, and results cannot be generalized because the research destinations differ.

Additionally, with the advent of the Internet era, the lodging industry has been heavily influenced by online platforms. The competition on the Internet platform is becoming increasingly fierce [23]. These platforms help consumers search for information about internal factors and hotel reviews. Online reviews and ratings are becoming increasingly important in tourism and hospitality research, while online booking is becoming increasingly popular in the hospitality industry. Some scholars have confirmed that online scores significantly affect consumers’ purchasing decisions and service providers’ online sales [24] and impact prices [25]. El-Said et al. [26] analyzed the impact of online reviews on hotel reservations, and the results showed that negative reviews strongly impacted reservation intention, which further moderated the price. Other scholars have studied the impact of online evaluations on prices directly. Palić et al. [27] believe that evaluating the relationship between online evaluation and pricing is necessary for appropriate pricing in the hotel industry. Online ratings have a significantly positive impact on hotel prices. Hotels with high ratings could charge higher prices. In addition, the impact of online evaluation on hotel room rates is heterogeneous, with low-end hotels having a greater impact than high-end hotels [25].

Although existing studies have explored the relationship between hotel pricing and influencing factors from different perspectives, they also have some limitations. First, most previous studies on hotel pricing used the hedonic model as the primary research method. Hedonic models assume a linear relationship between hotel prices and influencing factors [14,28], but there are many nonlinear characteristics in the real world [29,30]. Therefore, this method ignores the influence of nonlinear relations, and the results may not conform to reality. Second, most previous studies studied the influencing factors of hotel pricing or predicted hotel prices from the perspective of time [5,31], and few predicted hotel prices from the perspective of space. Third, most of the research on hotel pricing is on star hotels [28,32], and there is less research on budget hotels. In addition, there is a problem of multicollinearity in multivariate analysis, which affects the accuracy of the model due to the mutual influence of explanatory variables. However, most studies have not addressed this question. In most previous studies, a feature selection method that could remove irrelevant features was still lacking. In short, there is a lack of research on the nonlinear relationship and distribution simulation of budget hotel pricing and its influencing factors from a spatial perspective to solve the above problems.

This paper aims to investigate the factors influencing budget hotel pricing and to forecast hotel prices from a spatial perspective. It is one of the few attempts to explore budget hotel pricing using nonlinear machine learning models; it proposes hotel pricing prediction at the spatial level and improves model effectiveness by using RFE to filter features. By adopting an innovative methodology and integrating multi-source data, this study enriches the research on budget hotel pricing. Specifically, this paper adopts six models, random forest (RF), XGBoost, multiple linear regression (MLR), support vector regression (SVR), multilayer perceptron (MLP) regression, and K-nearest neighbor (KNN) regression, to model the factors influencing hotel prices in Sanya City. Because the large number of variables may include redundant ones, recursive feature elimination (RFE) is used to select the most important factors. A geographic information system (GIS) was used to visualize and spatially analyze the factors. Finally, the XGBoost model is used to analyze the relationship between hotel pricing and its influencing factors and to simulate the spatial distribution of hotel prices.

2. Materials and Methods

2.1. Data Source

This study used a budget hotel dataset from Sanya, China. Sanya is located in the south of Hainan Island, with a land area of 1921 square kilometers and a sea area of 3226 square kilometers. It is an international tourist city characterized by tropical coastal scenery and is known as the “Oriental Hawaii.” Sanya has famous scenic spots, including Yalong Bay, Tianya Haijiao, Sanya Bay, Luhuitou, Dadonghai, Xiaodongtian, and Wuzhizhou Island. According to the Sanya Bureau of Statistics, the domestic Chinese market is Sanya’s primary tourism market. In 2021, the city received 21,620,400 overnight tourists, an increase of 19.7% from the previous year, and the total annual tourism revenue was 74.703 billion yuan, an increase of 65.2%. As shown in Figure 1, the Sanya urban area was selected as the sample area. Sanya also contains some very small islands, which were excluded because they were not conducive to prediction. As a tourist city with a large number of hotels, Sanya is a good research sample [33].

We chose budget hotels in Sanya City as the research sample, which has two advantages. On the one hand, budget hotels are non-star hotels, whereas previous studies on hotel pricing have mostly taken star hotels as the research object [28,32]; this study therefore enriches the research on budget hotel pricing. On the other hand, taking budget hotels as the research object can eliminate the influence of internal factors on hotel prices, which enables us to focus on the impact of external factors and online evaluations. The core idea of budget hotels is no-frill cost, no-frill service, and products at no-frill prices [34]. Good price and location are the most important factors affecting customer choices [35]. Hung et al. [1] used quantile regression to study the main determinants of hotel pricing strategies, and the results show that the number of rooms and the number of room attendants per room had no significant effect on hotel prices at low price quantiles. People’s expectations of budget hotels are mainly hot water, showers, a standard sanitary environment, and wireless networks [36]. Sanya’s budget hotels met these requirements.

The data in Table 1 were obtained in April 2022. This study collected experimental data from several sources, including hotel data, point of interest (POI), nighttime light, housing transaction data, and road and water data.

2.2. Data Collection and Processing

2.2.1. Budget Hotel

In this study, a budget hotel is defined as a hotel that the booking website lists as a budget hotel and whose double room costs less than 300 yuan per night. A Python script was used to collect budget hotel data from the website (https://www.ly.com/ (accessed on 30 April 2022)), and the price of a double room was taken as the hotel price. The final price was the average price of a double room in April. Because online ratings play a more important role for budget hotels [36], we also collected each hotel’s online rating. After the original data were collected, they were preprocessed: duplicate hotels were removed, and hotels with no rating were excluded, leaving data for 904 budget hotels. Hotel price is the dependent variable in this study. The three-sigma rule was applied to check the hotel prices for outliers, and all prices fell within the range [µ − 3σ, µ + 3σ], where µ is the mean and σ is the standard deviation. We then obtained point data for each budget hotel.
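The three-sigma check described above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample prices, and the use of the population standard deviation are assumptions, not details reported in the study.

```python
from statistics import mean, pstdev

def three_sigma_outliers(prices):
    """Return the prices that fall outside [mu - 3*sigma, mu + 3*sigma]."""
    mu = mean(prices)
    sigma = pstdev(prices)  # assumption: population standard deviation
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [p for p in prices if p < lower or p > upper]

# Hypothetical nightly double-room prices in yuan (illustrative values only).
sample_prices = [120, 135, 150, 160, 168, 175, 180, 199, 210, 230, 260, 289]
outliers = three_sigma_outliers(sample_prices)
```

An empty list means that every price passes the check, which is the outcome the study reports for its 904 hotels.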

2.2.2. Point of Interest

POI data have been used as a source for studying the spatial relationships of hotels [37]. For example, Fang et al. [38] studied the correlation between hotel scores and location-related features, including accessibility to points of interest. The results showed that attractions, airports, universities, public transport, and green spaces were important determinants. This study uses POI data obtained from the Amap API open platform. After screening, we adopted the platform classification and obtained 19 types of POI data. The data were then preprocessed. First, missing values in the database were filled with the median. Next, the collected POI data were positioned on the map to generate and adjust the POI-related features. Table 2 lists the relevant points of interest and their respective numbers. As point data, POIs contain spatial and attribute information, but they cannot be used directly as explanatory variables. Therefore, this study created two kinds of POI-related features with each budget hotel as the basic unit: (1) the number of POIs within 1, 2, and 3 km of the hotel; (2) the POI density calculated over all 19 types, which reflects the overall distribution of POIs.
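As a rough illustration of the first kind of feature, the sketch below counts the POIs within 1, 2, and 3 km of a single hotel. The study performed this step in a GIS; the haversine distance, the cumulative counting, the function names, and the coordinates used here are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def poi_counts(hotel, pois, radii_km=(1, 2, 3)):
    """Count the POIs within each radius of one hotel (counts are cumulative)."""
    distances = [haversine_km(hotel[0], hotel[1], lat, lon) for lat, lon in pois]
    return {r: sum(d <= r for d in distances) for r in radii_km}

# Hypothetical coordinates near Sanya (illustrative values only).
hotel = (18.25, 109.50)
pois = [(18.25, 109.505), (18.26, 109.51), (18.265, 109.515), (18.40, 109.70)]
counts = poi_counts(hotel, pois)
```

In practice the same function would be applied separately to each of the 19 POI types, giving one count per type and radius.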

2.2.3. Nighttime Light

At present, nighttime light data are widely used to reflect socioeconomic characteristics, such as GDP and population density, at a fine scale with good results [39,40]; these characteristics are considered the basis of the spatial distribution pattern of urban hotels [41,42]. It is therefore theoretically feasible to use nighttime light as a surrogate variable for population or GDP data when estimating hotel prices, particularly when fine-scale GDP or population data within cities are difficult to obtain. Accordingly, this study introduces nighttime light data as a data source for hotel price prediction. The nighttime light images used in this study were taken from the 2021 NPP-VIIRS monthly nighttime light data of the National Oceanic and Atmospheric Administration. The data were captured by the Suomi-NPP satellite, whose VIIRS sensor has a Day Night Band (DNB) that provides highly sensitive (250 times better than OLS) nighttime light observations of the Earth’s surface once a day at a resolution of 500 m (six times better than OLS). NPP-VIIRS nighttime light data have become the main source of nighttime light observations [43]. The monthly NPP-VIIRS data were projected, transformed, corrected, clipped, and resampled to a 1000 m spatial resolution. The images from January to December 2021 were then stacked, and the median of the stack at each pixel was assigned to that pixel to generate the median NPP-VIIRS image for 2021. Because the NPP-VIIRS data are not filtered for fires, waste burning, or background noise, negative pixels in the median image were regarded as background noise and replaced with a value of 0 to reduce the impact of these uncertainties. For outlier processing, the city center was assumed to be the brightest area, and any pixel brighter than the city center was considered an outlier. In the subsequent calculations, each outlier was replaced by the highest brightness value of the city center.
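The compositing and cleaning steps can be sketched as follows. This is a toy illustration on nested lists rather than real rasters; the function names, the tiny 2 × 2 images, and the city-center maximum of 60.0 are assumptions for the example.

```python
from statistics import median

def median_composite(monthly_images):
    """Pixel-wise median over a list of equally sized 2D grids (lists of lists)."""
    rows, cols = len(monthly_images[0]), len(monthly_images[0][0])
    return [[median(img[r][c] for img in monthly_images) for c in range(cols)]
            for r in range(rows)]

def clean_composite(image, center_max):
    """Set negative pixels to 0 and cap pixels brighter than the city-center maximum."""
    return [[min(max(value, 0.0), center_max) for value in row] for row in image]

# Three hypothetical monthly images (illustrative radiance values only).
monthly = [
    [[1.0, -0.5], [30.0, 5.0]],
    [[2.0, -0.2], [90.0, 6.0]],
    [[3.0, 0.1], [95.0, 7.0]],
]
composite = median_composite(monthly)
cleaned = clean_composite(composite, center_max=60.0)
```

Here the negative median pixel becomes 0 and the pixel that exceeds the assumed city-center maximum is capped at that maximum.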

2.2.4. Housing Transaction Data

We used housing transaction data as a data source, in line with cost-oriented pricing [44]. The data are housing transaction prices in Hainan before 2022, taken from the official Anjuke website. Because the transactions took place in different years, the prices are not directly comparable and cannot be fed to machine learning models as they are. We took 2021 as the appraisal time point and adjusted every residential price to its 2021 level using the residential price index of Sanya City; that is, each price was multiplied by the ratio of the 2021 residential price index of Sanya City to the index of the transaction year. After eliminating outliers, we obtained 1212 sample points, each giving the average housing transaction price of a residential area. Point data cannot be used directly as an explanatory variable, so we applied inverse distance weighting (IDW) interpolation to obtain a housing transaction price surface for the entire downtown area of Sanya. Then, taking each budget hotel point as the basic unit, the surface value was extracted to the point to obtain the corresponding housing transaction price for each hotel.
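The two computations in this subsection, the index adjustment and the inverse distance weighting, can be sketched as below. The function names, the planar coordinates, and the distance power of 2 are assumptions for illustration; the study does not report the interpolation settings it used in the GIS.

```python
import math

def adjust_to_base_year(price, index_transaction_year, index_base_year):
    """Rescale a transaction price to the base year using a price index ratio."""
    return price * index_base_year / index_transaction_year

def idw(target, samples, power=2):
    """Inverse distance weighted estimate at target from ((x, y), value) samples."""
    numerator = denominator = 0.0
    for (x, y), value in samples:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0:
            return value  # the target coincides with a sample point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Hypothetical index values and sample points (illustrative only).
price_2021 = adjust_to_base_year(20000, index_transaction_year=100, index_base_year=110)
samples = [((1.0, 0.0), 10.0), ((-1.0, 0.0), 20.0)]
estimate = idw((0.0, 0.0), samples)
```

A target that is equally far from both samples receives their plain average, and a target that sits on a sample point receives that sample's value.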

2.2.5. Road and Sea Data

Following [19,45], we included the distance to the nearest road and the distance to the sea as features. Road and sea data were obtained from OpenStreetMap. For each budget hotel point, the distance to the nearest road and the distance to the sea were calculated. All calculations were carried out in a GIS.

The details of the 81 features are shown in Table 3.

2.3. Standardization of Data

To transform data of different magnitudes into a unified measure and improve the performance of the model, we applied Z-score normalization to all numerical variables. Equation (1) is calculated as follows:

X_{\mathrm{transform}} = \frac{X - \mu}{\sigma}, \qquad (1)

where µ represents the mean value and σ represents the standard deviation.
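Equation (1) can be applied to one feature column as in the following minimal sketch. The function name and the sample values are hypothetical, and the population standard deviation is assumed here.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize a list of numbers to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column (illustrative values only): mean 5, standard deviation 2.
scaled = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transformation, features measured in very different units (for example, distances in meters and POI counts) share a common scale.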

After data processing, we obtained a dataset comprising 81 explanatory variables and 904 samples.

2.4. The Modeling Method

As mentioned in Section 1, this study aims to examine the factors influencing budget hotel pricing in Sanya City from a spatial perspective, study the relationship between them through machine learning modeling of the data, and make spatial predictions. Six machine learning algorithms–RF, XGBoost, MLR, SVR, MLP, and KNN regression–were used to simulate the relationship between hotel pricing and influencing factors, and the same sample data were used to train each model and compare its accuracy. The prediction model with the highest accuracy was selected to estimate hotel pricing on the 1000 m resolution grid. Then the spatial distribution map of hotel pricing in the study area was drawn.

2.4.1. Random Forest

Breiman [46] proposed random forest (RF), an ensemble learning method that combines many decision trees. RF is a representative bagging ensemble algorithm in which every base estimator is a decision tree. A forest composed of classification trees is called an RF classifier, and a forest composed of regression trees is called an RF regressor. The core idea of bagging is to construct multiple independent estimators and then average their predictions, or take a majority vote, to determine the result of the ensemble. Specifically, Tree-1 is built on a sample drawn with replacement from the full dataset (a bootstrap sample). Tree-2, Tree-3, ..., Tree-n are built in the same way, and the average of the trees’ predictions is used as the final prediction. Within each tree, the variable used at each split is chosen according to the Gini index, which is calculated using Equation (2).

\mathrm{Gini}(D) = 1 - \sum_{k=1}^{\gamma} p_{k}^{2}, \qquad (2)

where γ represents the number of categories in the label and p_{k} represents the probability of the kth class.
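As a small worked example of Equation (2), the sketch below computes the Gini index of a set of class labels, with each class probability estimated as its share of the labels. The function name and the labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["a", "a", "a", "a"])   # a single class
mixed = gini(["a", "a", "b", "b"])  # two equally frequent classes
```

A node that contains a single class has a Gini index of 0, and an even two-class split gives 0.5, the maximum for two classes.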

2.4.2. XGBoost

Designed by Chen et al. [47], XGBoost aims to push tree boosting algorithms to their computational limits in order to achieve fast operation and excellent performance. It can be faster than other ensemble algorithms that use gradient boosting and has been recognized as an advanced estimator with very high performance in classification and regression. Compared with traditional gradient boosting algorithms, XGBoost offers many improvements. A gradient boosting regression tree is a boosting ensemble model built on regression trees, and its modeling process is roughly as follows: first, one tree is built; then the model iterates, adding one tree in each iteration, so that a strong estimator that integrates many trees is gradually formed. For a random forest regression tree, the value at each leaf node is the mean of all the samples at that leaf node. For XGBoost, however, the predicted result for each sample can be expressed as the weighted sum of the results of all trees:

\hat{y}_{i}^{(K)} = \sum_{k=1}^{K} l_{k} h_{k}(x_{i}), \qquad (3)

where K is the total number of trees, k indexes the kth tree, l_{k} is the weight of the kth tree, and h_{k}(x_{i}) is the prediction of the kth tree for sample x_{i}.

2.4.3. Multi-Linear Regression

When a regression analysis includes only one independent variable and one dependent variable, and the relationship between them can be approximated by a straight line, it is called a simple (unary) linear regression. If two or more independent variables are included and the dependent variable is linearly related to them, the analysis is an MLR. To reduce the influence of multicollinearity, the Lasso algorithm is adopted for MLR in this study. Linear regression models are often fitted by least-squares approximation [48]. The general form of MLR can be expressed as follows:

y = b_{1} x_{1} + b_{2} x_{2} + \dots + b_{n} x_{n} + \varepsilon, \qquad (4)

where b_{i} is the regression coefficient of the ith independent variable x_{i}, ε is the error term, and n is the number of independent variables.

2.4.4. Support Vector Regression

SVR is the variant of SVM designed for regression problems. Its aim is to find a function f(x_{i}) that deviates from the actual target y_{i} by no more than a given tolerance and, at the same time, is as flat as possible. Given the training sample set (x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{m}, y_{m}), with y_{i} ∈ R, the function obtained by SVR is as follows:

f(x) = w^{T} x + b, \qquad (5)

For each sample, the loss is calculated from the difference between the model output f(x) and the real output y, and SVR tolerates a deviation of at most ε between f(x) and y. This is equivalent to constructing a band of width 2ε centered on f(x); a sample that falls within this band is considered correctly predicted and incurs no loss. f(x_{i}) approaches y_{i} by finding the w and b that solve the following problem [49]:

\min_{w, b} \frac{1}{2} \| w \|^{2} + C \sum_{i=1}^{m} l_{\varepsilon}(f(x_{i}) - y_{i}), \qquad (6)

where C is the penalty coefficient, l_{ε} is the ε-insensitive loss function, and \frac{1}{2} \| w \|^{2} is the L_{2} regularization term.
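To make the objective in Equation (6) concrete, the sketch below evaluates the insensitive loss and the regularized objective for a linear model. It only evaluates the objective for given parameters and does not solve the optimization problem; the function names and the toy data are assumptions, and the tolerance is written as eps in the code.

```python
def eps_insensitive_loss(residual, eps):
    """Zero inside the tolerance band, linear in the excess deviation outside it."""
    return max(0.0, abs(residual) - eps)

def svr_objective(w, b, X, y, C, eps):
    """0.5 * ||w||^2 plus C times the summed insensitive loss of f(x) = w.x + b."""
    regularization = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for x, target in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, x)) + b
        loss += eps_insensitive_loss(prediction - target, eps)
    return regularization + C * loss

# Hypothetical one-feature data (illustrative values only).
objective = svr_objective(w=[1.0], b=0.0, X=[[1.0], [2.0]], y=[1.0, 3.0], C=2.0, eps=0.5)
```

In this toy case the first sample lies inside the band and contributes nothing, while the second misses by 1.0 and contributes a loss of 0.5.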

2.4.5. K-Nearest Neighbor Regression

Cover et al. [50] originally proposed the KNN method, which can be used to solve both classification and regression problems. Given a training set with known target values, KNN regression predicts the target for each point in the test set. It computes the distance between the test point and every training point, takes the target values of the k nearest training points, and averages them as the prediction. The final predicted value \hat{y} is the average output of the k nearest neighbors, as shown in Equation (7) [51]:

\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_{i}(x), \qquad (7)

where y_{i}(x) is the target value of the ith nearest neighbor of x.
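Equation (7) corresponds to the following minimal sketch of KNN regression. The Euclidean distance, the function name, and the toy data are assumptions for illustration.

```python
import math

def knn_predict(query, train_X, train_y, k=3):
    """Average the targets of the k training points closest to the query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(query, train_X[i]))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical standardized features and prices in yuan (illustrative values only).
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [100.0, 110.0, 120.0, 300.0]
prediction = knn_predict((0.1, 0.1), train_X, train_y, k=3)
```

The distant fourth point is ignored, so the prediction is the mean of the three nearby prices.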

2.4.6. Multilayer Perceptron Regression

A multilayer perceptron (MLP) is an artificial neural network that maps a set of input vectors to a set of output vectors. An MLP can be viewed as a directed graph consisting of at least three layers of nodes, each fully connected to the next layer. Except for the input nodes, each node is a neuron with a nonlinear activation function. The output of a neuron can be expressed as follows [52]:

S_{i} = F\left(\omega_{0} x_{0} + \sum_{j=1}^{n} \omega_{j} \cdot x_{j}\right), \qquad (8)

where F is the activation function, ω_{0} denotes a threshold value, x_{0} is always 1, and ω_{j} are the weights.
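The output of a single neuron in Equation (8) can be computed as in the sketch below. The function name, the choice of tanh as the default activation, and the input values are assumptions for illustration; the study does not state which activation function it used.

```python
import math

def neuron_output(inputs, weights, bias_weight, activation=math.tanh):
    """Apply the activation to the threshold term plus the weighted sum of the inputs."""
    # x_0 is fixed at 1, so the threshold term w_0 * x_0 reduces to bias_weight.
    weighted_sum = bias_weight * 1.0 + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

def relu(value):
    """Rectified linear unit, a common nonlinear activation."""
    return max(0.0, value)

out_tanh = neuron_output([1.0, 2.0], [0.5, -0.25], bias_weight=0.0)
out_relu = neuron_output([1.0, 2.0], [1.0, 1.0], bias_weight=-1.0, activation=relu)
```

An MLP stacks many such neurons in layers, feeding the outputs of one layer to the next as inputs.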

We list the advantages and disadvantages of the above models in Table 4. Except for MLR, all of these algorithms can handle nonlinear problems. Because each model has its own advantages, it is difficult to determine the most suitable one on theoretical grounds alone. Therefore, each model is trained on the processed data and their accuracies are compared; the model with the highest accuracy is then combined with the feature selection model to model the data.

2.5. Feature Selection

Eighty-one features were included in this study. Although some noisy features were excluded after data preprocessing, the remaining features may still contain noisy information. Therefore, a feature extraction model is required to screen features and improve the model results.

2.5.1. Embedded Method

The embedded method lets the algorithm itself decide which features to use [53]. We use SelectFromModel to filter the features. With the embedded method, a machine learning model is first trained to obtain a weight coefficient for each feature, and features are then selected in descending order of these coefficients. The weight coefficients often represent the contribution or importance of each feature to the model. For example, the feature_importances_ attribute of tree ensemble models lists the contribution of each feature to building the trees, so the most useful features can be identified. For a model with feature_importances_, a feature whose importance is lower than the provided threshold parameter is considered unimportant and removed. The values of feature_importances_ lie in the range [0, 1]. A small threshold removes only the features that contribute almost nothing to the prediction, whereas a threshold very close to 1 may leave only one or two features.

2.5.2. Wrapper Method

The wrapper method also performs feature selection and model training at the same time and, like the embedded method, relies on the chosen algorithm itself. In this study, we screened features using recursive feature elimination (RFE). RFE is a greedy optimization algorithm that aims to find the best-performing subset of features. It repeatedly fits the model and, in each iteration, either sets aside the best features or eliminates the worst ones. The next model is then built on the features that remain, and the process continues until all features have been exhausted. Finally, the features are ranked according to the order in which they were retained or removed, and the best subset is selected [54]. Among feature selection methods, RFE is reported to be the most conducive to improving model performance and can achieve excellent results with few features [55]. It has two important parameters: n_features_to_select is the number of features to be selected, and step is the number of features to be removed in each iteration.
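The elimination loop can be sketched in plain Python as follows. This is a simplified illustration of the idea rather than the implementation used in the study: the function rfe, the callback fit_importances, and the fixed toy scores are all hypothetical, and in a real run the callback would refit a model (for example, XGBoost) on the remaining features and return its feature importances.

```python
def rfe(n_features, fit_importances, n_features_to_select, step=1):
    """Drop the `step` least important features per round until the target count is left."""
    remaining = list(range(n_features))
    eliminated = []  # feature indices in the order they were removed
    while len(remaining) > n_features_to_select:
        # Refit on the remaining features; one importance score per remaining feature.
        importances = fit_importances(remaining)
        n_drop = min(step, len(remaining) - n_features_to_select)
        ranked = sorted(zip(importances, remaining))  # ascending importance
        drop = [index for _, index in ranked[:n_drop]]
        eliminated.extend(drop)
        remaining = [index for index in remaining if index not in drop]
    return remaining, eliminated

# Toy stand-in for a model: fixed scores instead of importances from a refitted model.
toy_scores = [0.30, 0.05, 0.20, 0.01, 0.44]

def toy_importances(features):
    return [toy_scores[i] for i in features]

selected, removed = rfe(5, toy_importances, n_features_to_select=2, step=2)
```

The parameters n_features_to_select and step play the same roles as the two RFE parameters described above.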

2.6. Evaluation Criteria

The following error metrics were used to evaluate the accuracy of the models: the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R-squared (R^{2}), calculated by Equations (9) to (12):

MAE = \frac{1}{n} \sum_{i=1}^{n} | y_{i,o} - y_{i,p} |, \qquad (9)

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i,o} - y_{i,p})^{2}, \qquad (10)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i,o} - y_{i,p})^{2}}, \qquad (11)

R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i,o} - y_{i,p})^{2}}{\sum_{i=1}^{n} (y_{i,o} - \bar{y}_{o})^{2}}, \qquad (12)

where n is the number of budget hotel samples, y_{i,o} is the original price of the ith hotel point, y_{i,p} is the predicted price of the ith hotel point, and \bar{y}_{o} is the average of the actual values of all samples. Low values of MAE, MSE, and RMSE, as well as high values of R-squared, indicate a good fit. To prevent overfitting, scholars in the field of machine learning often use cross-validation (CV) techniques [56]. In this study, the accuracy of the hotel pricing prediction model was evaluated using 5-fold cross-validation. This method makes full use of limited sample data and reduces the randomness of the accuracy verification. In 5-fold cross-validation, the sample data are divided equally into five groups, four of which are used for model training, while the remaining group is used for accuracy verification. The model is run five times, with different training and test data in each run, and the average accuracy of the five runs is used as the final validation result of the model. Results obtained in this way are less likely to reflect overfitting.
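The four metrics and the 5-fold split can be sketched as follows. The function names and the sample values are hypothetical, and the folds here are contiguous blocks without shuffling, which is an assumption; the study does not describe how the samples were assigned to folds.

```python
import math

def regression_metrics(y_obs, y_pred):
    """Compute MAE, MSE, RMSE, and R-squared from observed and predicted values."""
    n = len(y_obs)
    errors = [o - p for o, p in zip(y_obs, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_obs = sum(y_obs) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}

def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Hypothetical observed and predicted prices in yuan (illustrative values only).
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
folds = list(k_fold_indices(904, k=5))
```

With 904 samples, four folds hold 181 test samples and one holds 180; the metrics would be computed on each test fold and then averaged.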

3. Results

3.1. Algorithm Comparison

The processed data were input into each model for comparison, as shown in Table 5. The MLR scores were low compared with those of the other five models, possibly because MLR relies on the assumption of linearity. The results support our hypothesis that nonlinear machine learning algorithms perform better in predicting hotel pricing. In addition, XGBoost had a higher R-squared than the other models, which suggests that XGBoost is better suited to modeling problems with many different features. As a tree-based model, XGBoost is little affected by multicollinearity among the features. The bootstrapping process also helps the model generalize and makes it less sensitive to noisy data.

3.2. Feature Selection

We chose XGBoost as the model for feature selection. This is not only because it outperforms the other models in Table 5 but also because it and the RF model have the feature_importances_ attribute to help with feature selection and provide the importance of each feature. As it is difficult to determine the best number of features, we choose R-squared as the evaluation metric of the model and find the best number of features through the learning curve.

3.2.1. Embedded Method

When a large number of features contribute to the model and their contributions are uneven, it is difficult to define a valid threshold value in advance. In this case, the threshold on the model weight coefficient is treated as a hyperparameter, and a learning curve is needed to determine its optimal value. Specifically, we divided the interval from 0 to 0.1 into 10 parts, increased the threshold step by step, and plotted the learning curve, as shown in Figure 2. As the threshold increases, the R-squared score of the model decreases: the more features that are deleted, the greater the information loss. Therefore, this method is not suitable for the models in this paper.
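The threshold sweep can be illustrated with a minimal sketch. The helper names (`threshold_grid`, `select_by_threshold`) are assumptions introduced here. The six importance values are borrowed from Table 7 purely for illustration; they come from the final 40-feature model, not from the 81-feature model on which the embedded method was actually run. The sketch shows only how the retained feature set shrinks as the threshold rises, without refitting or scoring a model.

```python
def threshold_grid(low=0.0, high=0.1, parts=10):
    """Thresholds obtained by dividing [low, high] into equal parts."""
    step = (high - low) / parts
    return [low + i * step for i in range(parts + 1)]

def select_by_threshold(importances, threshold):
    """Keep the features whose importance is at least the threshold."""
    return [name for name, w in importances.items() if w >= threshold]

# Illustrative subset of importances taken from Table 7.
importances = {
    "Density of restaurants": 0.1116,
    "Number of middle schools within 1 km": 0.0647,
    "Density of companies": 0.0574,
    "Density of parking lots": 0.0356,
    "Hotel ratings": 0.0111,
    "Number of ports within 1 km": 0.0037,
}

# Number of features that survive at each threshold on the grid.
kept_counts = [len(select_by_threshold(importances, t))
               for t in threshold_grid()]
```

In a full experiment, each threshold would be followed by refitting the model on the surviving features and recording its cross-validated R-squared, which gives the learning curve of Figure 2.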

3.2.2. Wrapper Method

To select the number of features, we first varied the number of retained features n from 1 to 81 in steps of 5 and drew the learning curve, as shown in Figure 3a. The XGBoost-RFE model has the highest R-squared score when n is set to 46. We then refined the search from 16 to 76 in steps of 2. As shown in Figure 3b, the R-squared score of the XGBoost-RFE model was highest when n was set to 40. Therefore, we selected 40 of the 81 features and input them into the six regression models. Table 6 shows the modeling performance with the 40 features. A comparison with Table 5 shows that the performance of all six models improved after feature selection. MLR shows the most obvious improvement (for example, its MAE falls from 0.1606 to 0.1497), whereas XGBoost shows the least (its R-squared rises only from 0.818 to 0.820). This indicates that XGBoost is not susceptible to noisy data.
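The wrapper procedure combines two ideas: recursive elimination of the weakest feature, and a coarse-then-fine search over the number of features to keep. The sketch below illustrates both under stated assumptions. The function names, the fixed toy weights, and `toy_score` are hypothetical; in the study the importances come from a refitted XGBoost model and the score is the cross-validated R-squared.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Drop the least important feature repeatedly until n_keep remain."""
    remaining = list(features)
    while len(remaining) > n_keep:
        weights = importance_fn(remaining)  # stands in for refitting the model
        weakest = min(remaining, key=lambda f: weights[f])
        remaining.remove(weakest)
    return remaining

def best_n(candidates, score_fn):
    """Return the candidate feature count with the highest score."""
    return max(candidates, key=score_fn)

# Candidate feature counts for the two passes described in the text.
coarse = list(range(1, 82, 5))   # 1, 6, 11, ..., 81
fine = list(range(16, 77, 2))    # 16, 18, 20, ..., 76

# Toy score peaking at n = 40; a real score would come from cross-validation.
def toy_score(n):
    return -abs(n - 40)

coarse_best = best_n(coarse, toy_score)
fine_best = best_n(fine, toy_score)

# Toy elimination run with fixed, made-up weights for four features.
toy_weights = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}
survivors = recursive_feature_elimination(
    ["a", "b", "c", "d"],
    importance_fn=lambda feats: {f: toy_weights[f] for f in feats},
    n_keep=2,
)
```

Both reported optima lie on their respective grids: 46 is among the coarse candidates and 40 is among the fine ones. The coarse pass is cheap and locates the promising region; the fine pass then resolves the optimum with a smaller step.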

3.3. Hotel Pricing Simulation

Sanya City was divided into 1 km × 1 km grids, yielding more than 2000 grid cells. The independent variables at each grid centroid were fed into the trained XGBoost model (with the hotel rating set to the mean rating of the existing hotels as a standard value), and the predicted price at each centroid was used to represent the hotel room price of that grid cell. Figure 4 shows the resulting distribution of hotel room prices per square kilometer. We used the natural breaks (natural discontinuity point) method to classify hotel prices (yuan) into five categories; green indicates a low price, and red indicates a high price. As shown in Figure 4, high-priced hotels are mainly located in the south and east of Sanya. For convenience of analysis, we marked several regions in the figure, all of which are regions with higher prices.
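The gridding step can be sketched as follows, assuming the study area is described by a bounding box in a projected coordinate system measured in metres. The function names and the two callables `features_at` and `predict_price` are hypothetical placeholders for the feature extraction and the trained model, which are not specified at this level of detail in the text.

```python
import math

def grid_centroids(x_min, y_min, x_max, y_max, cell=1000.0):
    """Centroids of cell-by-cell squares covering a bounding box (metres)."""
    n_cols = math.ceil((x_max - x_min) / cell)
    n_rows = math.ceil((y_max - y_min) / cell)
    return [
        (x_min + (col + 0.5) * cell, y_min + (row + 0.5) * cell)
        for row in range(n_rows)
        for col in range(n_cols)
    ]

def predict_grid_prices(centroids, features_at, predict_price):
    """Map every centroid to the price predicted from its feature vector."""
    return {pt: predict_price(features_at(pt)) for pt in centroids}

# Toy 3 km x 2 km box with made-up feature and price functions.
demo_cells = grid_centroids(0.0, 0.0, 3000.0, 2000.0)
demo_prices = predict_grid_prices(
    demo_cells,
    features_at=lambda pt: [pt[0] / 1000.0, pt[1] / 1000.0],
    predict_price=lambda feats: sum(feats),
)
```

A real workflow would also discard cells that fall outside the city boundary and would compute the POI counts, densities, and distances for each centroid before prediction.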

3.4. Feature Importance

After parameter optimization and feature selection, an XGBoost model with optimized parameters and 40 selected features was obtained. XGBoost was then used to measure the importance of these 40 features, which are listed in Table 7. Because some features describe the same dimension or belong to the same category, we grouped them into four broad aspects: online reputation, traffic, business, and public service. For example, the density of parking lots, the number of coach stations within 1 km, and the density of car services all describe the traffic situation around the hotel, while the densities of restaurants, companies, and integrated markets all describe the business conditions around the hotel. The important factors affecting hotel pricing are therefore discussed under these aspects for illustrative purposes.
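Grouping features by aspect and picking the most important ones in each group is a simple bookkeeping step. The sketch below uses a subset of rows copied from Table 7, deliberately listed out of order. The function name `top_features_by_aspect` is an assumption introduced for illustration.

```python
from collections import defaultdict

# (aspect, feature, importance) rows copied from Table 7, in arbitrary order.
rows = [
    ("Traffic", "Density of coach stations", 0.0261),
    ("Business", "Density of integrated markets", 0.0386),
    ("Public service", "Density of medical care", 0.0401),
    ("Traffic", "Density of parking lots", 0.0356),
    ("Business", "Density of restaurants", 0.1116),
    ("Public service", "Number of parks within 1 km", 0.0320),
    ("Traffic", "Density of car services", 0.0305),
    ("Business", "Density of supermarkets", 0.0275),
    ("Public service", "Number of middle schools within 1 km", 0.0647),
    ("Traffic", "Number of coach stations within 1 km", 0.0344),
    ("Business", "Density of companies", 0.0574),
    ("Public service", "Density of middle schools", 0.0449),
]

def top_features_by_aspect(rows, k=3):
    """Group rows by aspect and keep the k most important features of each."""
    grouped = defaultdict(list)
    for aspect, feature, importance in rows:
        grouped[aspect].append((importance, feature))
    return {
        aspect: [name for _, name in sorted(items, reverse=True)[:k]]
        for aspect, items in grouped.items()
    }

top3 = top_features_by_aspect(rows, k=3)
```

The three features returned for each aspect are the ones mapped in Figures 5, 6, and 7.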

3.4.1. Traffic

In this study, 15 traffic-related features were considered. The three most influential features were mapped, with the natural breaks (natural discontinuity) method used to divide the values into five categories. The number of coach stations within 1 km takes only four distinct values, so it could be divided into only four categories. Figure 5 shows the distributions of these three features. Comparing Figure 4 and Figure 5 shows that the distributions in Figure 5a,c are very similar to the price distribution in Figure 4. In regions 2, 3, and 4, most areas with higher feature values also have higher hotel prices per square kilometer. It can be inferred that good traffic conditions may be one reason for higher hotel prices.

3.4.2. Business

The second key factor summarized in this study is business. Figure 6 shows the distribution of the top three business features. Combining Figure 4 and Figure 6a shows that the distribution of restaurant density is highly consistent with the distribution of hotel prices. The higher-priced regions 1, 2, 3, and 4 all have a certain density of restaurants, so it is inferred that a prosperous catering industry has a positive impact on hotel pricing. Figure 6b,c suggest that the high hotel pricing in regions 2 and 4 may be due to high company density and high integrated market density.

3.4.3. Public Service

Public services are also an essential factor affecting hotel pricing. Figure 7 shows the distribution of the top three public service features. Comparison with Figure 4 suggests that a certain density of public services may be one of the reasons for the high price of hotel rooms in these areas.

4. Discussion

In this paper, Sanya, a tourist city, was selected as a case study to explore the influencing factors of hotel pricing and simulate its spatial distribution, so as to provide a reference for studying this problem in other cases.

This study makes several theoretical contributions: (a) the demonstrated superiority of nonlinear algorithms complements current methods of predicting hotel prices, which are mainly based on linear algorithms. In this paper, nonlinear machine learning algorithms such as RF, XGBoost, SVR, MLP, and KNN all performed better than MLR; (b) this study enriches research on predicting budget hotel pricing at the spatial grid scale, using a variety of spatial data sources to predict the distribution and influencing factors of budget hotel pricing; (c) RFE is an effective way to eliminate the noise that affects hotel pricing models, which helps identify the most appropriate influencing factors for this case. After features were selected by RFE, the performance of all six models improved. Moreover, irrelevant and undifferentiated features were deleted because they contributed little to the model; (d) this study enriches research on budget hotel pricing. Forty important features of budget hotels in Sanya City were obtained, and a price distribution map of budget hotels was predicted.

There are also several practical contributions: (a) the influencing factors extracted in this paper are useful for city planning. For example, a small number of local restaurants may be detrimental to the development of budget hotels in an area. Consequently, governments and hotel investors may want to consider the factors identified here during planning; (b) the results of the pricing distribution simulation will benefit hotel investors by supporting feasibility evaluation and providing recommendations for hotel locations. Specifically, the main goal of a hotel investor is to make a profit, and hotel profit comes mainly from room revenue, so the results of this paper are highly relevant to feasibility evaluation. They can help estimate a hotel’s revenue and thus help investors decide whether to build a hotel at a selected location. By obtaining such competitive intelligence, hotels can gain a sustainable competitive advantage [57].

However, this study has certain limitations: (a) the results come only from numerical experiments. Owing to the black-box nature of machine learning, although we obtained the importance of each influencing factor, we do not know the form of the nonlinear relationship between each factor and hotel pricing. Further research is therefore required for those who are more concerned with causality or who want to study pricing adjustments or trends in local hotels; (b) Sanya is a small-to-medium-sized city with a pronounced gap between urban and rural areas, so the distribution of budget hotels is uneven. Studying additional cases may yield further findings; (c) this study simulated and mapped the spatial distribution of budget hotel pricing only for April 2022. If data for successive time points can be collected in the future, a continuous spatial distribution atlas of hotel pricing can be formed to support dynamic, big-data-based prediction of hotel pricing.

5. Conclusions

This study used linear and nonlinear machine learning models to reveal the influencing factors and simulate the spatial distribution of budget hotel pricing in Sanya City. We constructed six models (RF, XGBoost, MLR, SVR, MLP, and KNN) to predict budget hotel pricing and compared their accuracy. RFE was applied to select the most suitable explanatory variables, and 40 important features were retained and grouped into four categories. The selected features were then input into the regression model to predict pricing on a grid with 1000 m resolution, yielding a spatial distribution map of hotel pricing in the study area. The results show that online reputation, traffic, business, and public service all have an impact on budget hotel prices. This study provides theoretical support and a scientific basis for hotel planning and site selection.

Author Contributions

T.H.: conceptualization, methodology, article structure design, supervision, project administration, resources and funding acquisition; H.S.: software, data curation, data analysis, validation, visualization, writing—original draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Provincial Science Foundation of Hainan (under no. 2019RC060), the National Natural Science Foundation of China (under no. 72162014) and the Provincial Science Foundation of Hainan (under no. YSPTZX202035).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data connected to this research are available from the corresponding author upon request.

Acknowledgments

We would like to express our sincere gratitude to those who participated in the study and provided us with valuable information.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hung, W.T.; Shang, J.K.; Wang, F.C. Pricing determinants in the hotel industry: Quantile regression analysis. Int. J. Hosp. Manag. 2010, 29, 378–384.
2. Mercade Mele, P.; Molina Gomez, J.; Garay, L. To Green or Not to Green: The Influence of Green Marketing on Consumer Behaviour in the Hotel Industry. Sustainability 2019, 11, 4623.
3. Benkő, B.; Dávid, L.; Farkas, T. Opportunities for the development of innovation among hotels in Northern Hungary. Geo J. Tour. Geo. 2022, 40, 267–273.
4. Pereira, L.N.; Cerqueira, V. Forecasting hotel demand for revenue management using machine learning regression methods. Curr. Issues Tour. 2022, 25, 2733–2750.
5. Al Shehhi, M.; Karathanasopoulos, A. Forecasting hotel room prices in selected GCC cities using deep learning. J. Hosp. Tour. Manag. 2020, 42, 40–50.
6. Song, Y.; Liu, K.; Guo, L.; Yang, Z.; Jin, M. Does hotel customer satisfaction change during the COVID-19? A perspective from online reviews. J. Hosp. Tour. Manag. 2022, 51, 132–138.
7. Shao, M.; Wang, X.; Bu, Z.; Chen, X.; Wang, Y. Prediction of energy consumption in hotel buildings via support vector machines. Sustain. Cities Soc. 2020, 57, 102128.
8. Ray, B.; Garain, A.; Sarkar, R. An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews. Appl. Soft Comput. 2021, 98, 106935.
9. Kaya, K.; Yılmaz, Y.; Yaslan, Y.; Öğüdücü, Ş.G.; Çıngı, F. Demand forecasting model using hotel clustering findings for hospitality industry. Inform. Process. Manag. 2022, 59, 102816.
10. Spiliotis, E.; Abolghasemi, M.; Hyndman, R.J.; Petropoulos, F.; Assimakopoulos, V. Hierarchical forecast reconciliation with machine learning. Appl. Soft Comput. 2021, 112, 107756.
11. Sánchez-Medina, A.J.; C-Sánchez, E. Using machine learning and big data for efficient forecasting of hotel booking cancellations. Int. J. Hosp. Manag. 2020, 89, 102546.
12. Anis, S.; Saad, S.; Aref, M. Sentiment analysis of hotel reviews using machine learning techniques. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics, Cairo, Egypt, 19–21 October 2020; Springer: Cham, Switzerland, 2020; pp. 227–234.
13. Sirmans, S.; Macpherson, D.; Zietz, E. The composition of hedonic pricing models. J. Real Estate Lit. 2005, 13, 1–44.
14. Zhang, H.; Zhang, J.; Lu, S.; Cheng, S.; Zhang, J. Modeling hotel room price with geographically weighted regression. Int. J. Hosp. Manag. 2011, 30, 1036–1043.
15. Yang, Y.; Mueller, N.J.; Croes, R.R. Market accessibility and hotel prices in the Caribbean: The moderating effect of quality-signaling factors. Tour. Manag. 2016, 56, 40–51.
16. Guizzardi, A.; Pons, F.M.E.; Ranieri, E. Advance booking and hotel price variability online: Any opportunity for business customers? Int. J. Hosp. Manag. 2017, 64, 85–93.
17. Torres-Bagur, M.; Ribas, A.; Vila-Subirós, J. Incentives and Barriers to Water-Saving Measures in Hotels in the Mediterranean: A Case Study of the Muga River Basin (Girona, Spain). Sustainability 2019, 11, 3583.
18. Schamel, G. Weekend vs. midweek stays: Modelling hotel room rates in a small market. Int. J. Hosp. Manag. 2012, 31, 1113–1118.
19. Latinopoulos, D. Using a spatial hedonic analysis to evaluate the effect of sea view on hotel prices. Tour. Manag. 2018, 65, 87–99.
20. Somphong, C.; Udo, K.; Ritphring, S.; Shirakawa, H. An estimate of the value of the beachfront with respect to the hotel room rates in Thailand. Ocean Coast. Manag. 2022, 226, 106272.
21. Kim, J.; Jang, S.; Kang, S.; Kim, S. Why are hotel room prices different? Exploring spatially varying relationships between room price and hotel attributes. J. Bus. Res. 2020, 107, 118–129.
22. Chica-Olmo, J. Effect of monumental heritage sites on hotel room pricing. Int. J. Hosp. Manag. 2020, 90, 102640.
23. Fuentes-Moraleda, L.; Lafuente-Ibáñez, C.; Muñoz-Mazón, A.; Villacé-Molinero, T. Willingness to Pay More to Stay at a Boutique Hotel with an Environmental Management System. A Preliminary Study in Spain. Sustainability 2019, 11, 5134.
24. Gavilan, D.; Avello, M.; Martinez-Navarro, G. The influence of online ratings and reviews on hotel booking consideration. Tour. Manag. 2018, 66, 53–61.
25. Wang, M.; Lu, Q.; Chi, R.T.; Shi, W. How word-of-mouth moderates room price and hotel stars for online hotel booking: An empirical investigation with Expedia data. J. Electron. Commer. Res. 2015, 16, 72.
26. El-Said, O.A. Impact of online reviews on hotel booking intention: The moderating role of brand image, star category, and price. Tour. Manag. Perspect. 2020, 33, 100604.
27. Palić, I.; Palić, P.; Banić, F. The pre-pandemic role of customer online satisfaction in price determination: Evidence from hotel industry. Croatian Rev. Econ. Bus. Soc. Stat. 2021, 7, 50–60.
28. Sánchez-Pérez, M.; Illescas-Manzano, M.D.; Martínez-Puertas, S. Modeling hotel room pricing: A multi-country analysis. Int. J. Hosp. Manag. 2019, 79, 89–99.
29. Yousefzadeh Barri, E.; Farber, S.; Jahanshahi, H.; Beyazit, E. Understanding transit ridership in an equity context through a comparison of statistical and machine learning algorithms. J. Transp. Geogr. 2022, 105, 103482.
30. Chen, X.; Zheng, H.; Wang, H.; Yan, T. Machine learning algorithms perform better than multiple linear regression in predicting manure nitrogen output from lactating dairy cows. Anim. Sci. Proc. 2022, 13, 45–46.
31. Wang, X.; Sun, J.; Wen, H. Tourism seasonality, online user rating and hotel price: A quantitative approach based on the hedonic price model. Int. J. Hosp. Manag. 2019, 79, 140–147.
32. Zhang, Z.; Ye, Q.; Law, R. Determinants of hotel room price: An exploration of travelers’ hierarchy of accommodation needs. Int. J. Cont. Hosp. Manag. 2011, 23, 972–981.
33. Ma, Y.; Li, H.; Tong, Y. Distribution Differentiation and Influencing Factors of the High-Quality Development of the Hotel Industry from the Perspective of Customer Satisfaction: A Case Study of Sanya. Sustainability 2022, 14, 6476.
34. Ruetz, D.; Marvel, M. Budget hotels: Low cost concepts in the US, Europe and Asia. Trends Issues Glob. Tour. 2011, 2011, 99–124.
35. Nash, R.; Thyne, M.; Davies, S. An investigation into customer satisfaction levels in the budget accommodation sector in Scotland: A case study of backpacker tourists and the Scottish Youth Hostels Association. Tour. Manag. 2006, 27, 525–532.
36. Ren, L.; Qiu, H.; Wang, P.; Lin, P.M.C. Exploring customer experience with budget hotels: Dimensionality and satisfaction. Int. J. Hosp. Manag. 2016, 52, 13–23.
37. Cagliero, L.; La Quatra, M.; Apiletti, D. From Hotel Reviews to City Similarities: A Unified Latent-Space Model. Electronics 2020, 9, 197.
38. Fang, L.; Li, H.; Li, M. Does hotel location tell a true story? Evidence from geographically weighted regression analysis of hotels in Hong Kong. Tour. Manag. 2019, 72, 78–91.
39. Chen, X.; Nordhaus, W.D. VIIRS nighttime lights in the estimation of cross-sectional and time-series GDP. Remote Sens. 2019, 11, 1057.
40. Kumar, P.; Sajjad, H.; Joshi, P.K.; Elvidge, C.D.; Rehman, S.; Chaudhary, B.S.; Tripathy, B.R.; Singh, J.; Pipal, G. Modeling the luminous intensity of Beijing, China using DMSP-OLS night-time lights series data for estimating population density. Phys. Chem. Earth Parts A/B/C 2019, 109, 31–39.
41. Yang, Y.; Mao, Z.; Tang, J. Understanding guest satisfaction with urban hotel location. J. Travel Res. 2018, 57, 243–259.
42. Xie, J.; Tveterås, S. Economic decline and the birth of a tourist nation. Scand. J. Hosp. Tour. 2020, 20, 49–67.
43. Chen, Z.; Yu, B.; Hu, Y.; Huang, C.; Shi, K.; Wu, J. Estimating house vacancy rate in metropolitan areas using NPP-VIIRS nighttime light composite data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2188–2197.
44. Kim, W.G.; Han, J.; Hyun, K. Multi-stage synthetic hotel pricing. J. Hosp. Tour. Res. 2004, 28, 166–185.
45. Conroy, S.J.; Toma, N.; Gibson, G.P. The effect of the Las Vegas Strip on hotel prices: A hedonic approach. Tour. Econ. 2020, 26, 622–639.
46. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
47. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
48. Fitzenberger, B. The moving blocks bootstrap and robust inference for linear least squares and quantile regressions. J. Econom. 1998, 82, 235–287.
49. Vapnik, V.; Vashist, A. A new learning paradigm: Learning using privileged information. Neural Netw. 2009, 22, 544–557.
50. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
51. Song, Y.; Liang, J.; Lu, J.; Zhao, X. An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing 2017, 251, 26–34.
52. Velo, R.; López, P.; Maseda, F. Wind speed estimation using multilayer perceptron. Energy Convers. Manag. 2014, 81, 1–9.
53. Liu, H.; Zhou, M.; Liu, Q. An embedded feature selection method for imbalanced data classification. J. Autom. Sin. 2019, 6, 703–715.
54. Han, Y.; Huang, L.; Zhou, F. A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers. Bioinformatics 2021, 37, 2183–2189.
55. Park, D.; Lee, M.; Park, S.E.; Seong, J.K.; Youn, I. Determination of optimal heart rate variability features based on SVM-recursive feature elimination for cumulative stress monitoring using ECG sensor. Sensors 2018, 18, 2387.
56. Berrar, D. Cross-validation. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019; pp. 542–545.
57. Casado Salguero, G.; Fernández Gámez, M.Á.; Aldeanueva Fernández, I.; Ruíz Palomo, D. Competitive Intelligence and Sustainable Competitive Advantage in the Hotel Industry. Sustainability 2019, 11, 1597.

Figure 1. Study area scope and budget hotel sample point.

Figure 2. Embedded selection learning curve.

Figure 3. Wrapper selection learning curve: (a) curve in steps of 5, (b) curve in steps of 2.

Figure 4. Simulated distribution of price.

Figure 5. The distribution of (a) density of parking lots, (b) number of coach stations within 1 km, (c) density of car services.

Figure 6. The distribution of (a) density of restaurants, (b) density of companies, (c) density of integrated markets.

Figure 7. The distribution of (a) number of middle schools within 1 km, (b) density of middle schools, (c) density of medical care.

Table 1. Data sources.

Data	Data Sources
Budget hotel	https://www.ly.com/ (accessed on 30 April 2022)
POI	https://ditu.amap.com/ (accessed on 30 April 2022)
Nighttime light	National Oceanic and Atmospheric Administration
Housing transaction data	https://sanya.anjuke.com/ (accessed on 30 April 2022)
Road and sea data	OpenStreetMap

Table 2. Point of interest.

Index	POI	Count
1	Restaurant	13,317
2	Supermarket	582
3	Scenic spot	504
4	Port	30
5	Colleges and Universities	19
6	Bus stop	777
7	Company	3648
8	Park	96
9	Railway stations and airports	7
10	Car service	1360
11	Automobile maintenance	364
12	Market	135
13	Life service	9040
14	Parking lot	1363
15	Primary school	118
16	Medical care	1453
17	Coach station	10
18	Middle school	54
19	Integrated market	708

Table 3. Feature descriptions.

Data	Feature	Number
Budget hotel	Hotel ratings	1
POI	Number of POI within (1, 2, 3) km, Density of POI	76
Nighttime light	Nighttime light value	1
Housing transaction data	Housing transaction price	1
Road and sea data	Distance to the nearest road and sea	2

Table 4. Advantages and disadvantages of the model.

Algorithms	Advantages	Disadvantages	Literature Source
XGBoost	●Strong robustness ●Good at processing large-scale data sets	●Unable to handle image, voice, text and other high dimensional data well	Chen et al. [47]
RF	●Less prone to overfitting ●Good at handling large data sets	●Requires more computational power and resources	Breiman [46]
SVR	●Good at solving high-dimensional problems ●Improves model generalization performance	●Sensitive to missing data ●Not good at handling noisy data	Vapnik et al. [49]
KNN	●Retraining is less costly ●Good at dealing with class domain cross or overlap of more sample sets to be divided	●The output is not very interpretable ●Not good at dealing with uneven samples	Song et al. [51]
MLR	●Basic, simple	●Only deal with linear problems ●High requirements on data quality	Fitzenberger [48]
MLP	●Considers nonlinear and latent relationships ●Works well with large input data	●Easy to overfit ●Sensitive to feature scaling	Velo et al. [52]

Table 5. Model comparison.

Algorithms	MAE	MSE	RMSE	R-Squared
XGBoost	0.0600	0.0090	0.0943	0.818
RF	0.0710	0.0103	0.1015	0.789
SVR	0.0994	0.0152	0.1235	0.689
KNN	0.0347	0.0098	0.0987	0.801
MLR	0.1606	0.0372	0.1931	0.286
MLP	0.0998	0.0184	0.1357	0.624

Table 6. Modeling performance.

Algorithms	MAE	MSE	RMSE	R-Squared
XGBoost	0.0581	0.0088	0.0940	0.820
RF	0.0698	0.0097	0.0987	0.806
SVR	0.0980	0.0143	0.1195	0.708
KNN	0.0333	0.0090	0.0947	0.817
MLR	0.1497	0.0335	0.1831	0.315
MLP	0.0902	0.0162	0.1274	0.669

Table 7. Selected features.

Aspect	Feature	Importance	Number
Online reputation	Hotel ratings	0.0111	1
Traffic	Density of parking lots	0.0356	15
	Number of coach stations within 1 km	0.0344
	Density of car services	0.0305
	Density of coach stations	0.0261
	Density of railway stations and airports	0.0225
	Number of parking lots within 1 km	0.0200
	Density of automobile maintenance	0.0191
	Number of automobile maintenance within 1 km	0.0173
	The distance to the nearest road	0.0172
	The distance to the nearest sea	0.0167
	Number of car services within 1 km	0.0163
	Density of ports	0.0143
	Number of bus stops within 1 km	0.0131
	Density of bus stops	0.0121
	Number of ports within 1 km	0.0037
Business	Density of restaurants	0.1116	10
	Density of companies	0.0574
	Density of integrated markets	0.0386
	Density of supermarkets	0.0275
	Number of integrated markets within 1 km	0.0208
	Density of markets	0.0193
	Number of companies within 1 km	0.0169
	Transaction price of house	0.0166
	Number of restaurants within 1 km	0.0105
	Number of markets within 1 km	0.0060
Public service	Number of middle schools within 1 km	0.0647	14
	Density of middle schools	0.0449
	Density of medical care	0.0401
	Number of parks within 1 km	0.0320
	Density of life service	0.0281
	Density of scenic spots	0.0252
	Density of parks	0.0229
	Density of primary schools	0.0211
	Number of medical care	0.0185
	Number of life service within 1 km	0.0163
	Density of colleges and universities	0.0160
	Number of colleges and universities within 1 km	0.0142
	Number of scenic spots within 1 km	0.0119
	Number of primary schools within 1 km	0.0090
Total		1.0000	40

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
