A Sustainable Price Prediction Model for Airbnb Listings Using Machine Learning and Sentiment Analysis

Alharbi, Zahyah H.

doi:10.3390/su151713159


by Zahyah H. Alharbi

Department of Management Information Systems, College of Business Administration, King Saud University, Riyadh 12372, Saudi Arabia

Sustainability 2023, 15(17), 13159; https://doi.org/10.3390/su151713159

Submission received: 16 August 2023 / Revised: 27 August 2023 / Accepted: 29 August 2023 / Published: 1 September 2023

(This article belongs to the Special Issue E-learning, Digital Learning, and Digital Communication Used for Education Sustainability Volume II)


Abstract

Since 2008, the company Airbnb has brought significant changes to the hospitality industry worldwide. Experiencing remarkable growth, it currently offers over six million listings in 191 countries across one hundred thousand cities. Airbnb has gained immense popularity among travellers seeking accommodations globally. Consequently, Airbnb generates extensive datasets from its listings that contain rich features that have captured the attention of researchers. These datasets offer potentially valuable information that can be extracted to greatly assist individuals and governments in making more informed decisions. Pricing rental properties on Airbnb still presents a challenge for owners, as it directly impacts customer demand. This research aimed to address this challenge by developing a sustainable price prediction model for Airbnb listings by incorporating property specifications, owner information and customer reviews. By utilising this model, owners can estimate the expected value of their Airbnb listings. We trained and fine-tuned several machine learning models using an Airbnb listing dataset from Barcelona. Performance evaluation metrics, such as mean squared error (MSE), mean absolute error (MAE), root mean square error (RMSE) and $R^{2}$ score were then used to compare the models. To enhance the performance of the predictive models, sentiment analysis was used to extract relevant features from customer reviews. Feature importance analysis was also conducted to determine which attributes were the most influential on listing price predictions. The results show that the Lasso and Ridge models outperformed the others considered in the study, with an average $R^{2}$ score of 99%. We found that amenities-related features had a negligible impact on all models’ performance. The most significant features found were polarity (positive/negative sentiment), the number of bedrooms, the accommodation’s maximum capacity, the number of beds and the quantity of reviews received by the listing in the past 12 months, respectively. We found that certain room types (categorised as entire home/apartment, private room or shared room) are associated with lower predicted prices.

Keywords: Airbnb; sharing economy; sustainable price; sentiment analysis; machine learning; regression

1. Introduction

Peer-to-peer markets allow individuals to participate in offering their goods or services and dealing directly with customers. A considerable number of arguments and controversies have been generated around companies that enable this kind of business. Airbnb is a very popular example of this type of company, as it has enabled individuals to discover, list and book more than six million accommodations in more than one hundred thousand cities around the world through its online peer-to-peer marketplace [1].

Airbnb works as an intermediary between producers and consumers to reduce the cost and risk of offering a short-term rental at a location and to earn money from this previously unused capacity. The fast growth of Airbnb has raised many issues and questions for interested researchers to discuss and analyse [2,3,4]. One of the most discussed issues is the economic impact of Airbnb on cities, including its effect on hotels and local residents. Many researchers have used data-mining algorithms on Airbnb datasets to investigate its characteristics from an economic impact perspective [5,6,7,8,9].

Many researchers have used an Airbnb dataset to predict users’ booking destinations [10,11,12]. Several researchers have used data-mining algorithms on Airbnb listings to predict rental prices [13,14,15,16,17,18], while other researchers have examined the factors that impact room prices on Airbnb [19,20,21]. Some researchers have investigated users’ preferences and experiences based on reviews, comments and recommendations on the Airbnb platform [22,23,24]. How hosts describe themselves and their perceived trustworthiness on Airbnb has also been analysed by researchers [25,26]. A few researchers [27] aimed to predict the spatial distribution of Airbnb by using its dataset.

The prices of listings on Airbnb are typically decided independently by the hosts, in contrast with hotels, especially those that are part of larger organisations, which often have their own pricing mechanisms. This presents challenges for new hosts and those with new listings, as they must strive to set prices that are both competitive and appealing to be successful. Similarly, as consumers can compare prices among similar listings, they will continue to seek information on the value and timing of their bookings [14].

Machine Learning’s Impact on Sustainable Education

To align with global sustainability objectives, higher education institutions (HEIs) around the world must adapt their policies, curricula and practices. These institutions play a pivotal role in fostering a sustainable future. Over the past decade, the incorporation of sustainable development into higher education has grown, enhancing the impact of related research. Advances in artificial intelligence and machine learning (ML) not only emulate human intelligence, but also enrich the educational landscape. These technologies equip students with novel skills and foster a collaborative learning environment in HEIs, heralding significant future implications [28].

Leading HEIs globally are increasingly recognizing the transformative potential of ML in shaping modern educational paradigms and driving societal advancements. These cutting-edge technologies redefine the pedagogical landscape by offering enriched, interactive learning experiences. A testament to this shift is that 65% of U.S. universities have now integrated artificial intelligence and ML-enhanced learning tools into their curricula [29].

The positive trajectory of this integration is evident in the growth statistics; there was a remarkable 47.5% increase in the adoption of artificial intelligence within U.S. educational frameworks from 2017 to 2021 [29]. Leading this movement are institutions such as the University of Derby, which has instituted a predictive analytics system to proactively identify potential student attrition, and Deakin University, which leverages IBM Watson to address student inquiries in real time [30].

Research by Ilić underscores the multifaceted benefits of ML, from enhancing institutional security to fostering an environment conducive to collaborative learning and research [31]. In a parallel vein, Gollapalli et al. [32] have embarked on an exhaustive analysis of data mining and ML methodologies, assessing their efficacy in accurately predicting accreditation metrics, thereby spotlighting optimal strategies for sustainable educational excellence.

Given these insights, the research methodology we applied also demonstrated significant relevance to the realm of higher education. We strongly advocate that educational institutions strategically embrace and prioritise sustainable practices driven by ML and sentiment analysis. This approach can foster innovation, sustainability and societal progression.

The main purpose of this research was to determine how well the best listing prices for Barcelona on Airbnb can be predicted from multiple features and to understand exactly what affects pricing. The aim was to help Airbnb hosts set appropriate prices for their listings. To the best of our knowledge, no previous research has studied the relationship between the opinions of Airbnb users and the prediction of prices. Previous studies have also tended to focus on predicting prices from the point of view of those booking rooms (i.e., guests). In this research, however, we focused on helping Airbnb hosts achieve a sustainable estimate of their listings’ value on the Airbnb platform.

This paper begins with a review of the relevant literature in Section 2, followed by a description of the methodology employed in the data collection and analysis in Section 3. Section 4 presents the findings and the ensuing discussion of the study. Finally, Section 5 contains concluding remarks regarding this study.

2. Literature Review

This section discusses several types of research that have focused on Airbnb prices and how ML methodologies can be beneficial in the field of education. The datasets used by various researchers, as well as their approaches and results, are presented.

2.1. Using Machine Learning in Higher Education as a Sustainable

Wen et al. [33] employed sentiment analysis and data mining to investigate public views on the World University Rankings (WURs). In this study, sentiment analysis paired with latent Dirichlet allocation (LDA) topic modelling was utilised to dissect the emotional and cognitive responses found in 18,466 pieces of Chinese feedback regarding WURs. The findings suggest that, while the Chinese public exhibits varied reactions to WURs, the predominant sentiment is positive. Their perceptions of WURs can be classified into four primary cognitive categories: standpoint, dialectical, interest and cultural. The primary concern among the public is whether the WURs align with their personal views, reflect their interests and validate their unique experiences. Notably, there is limited engagement with the more nuanced dialectical issues of rankings. Despite the longstanding importance of rankings, their inherent challenges have frequently been overshadowed, resulting in the paradoxical evolution of the ranking system. One limitation of this study is the absence of model evaluation metrics in the results.

Shi and Guo’s [34] research offers a mechanism to bolster sustainable online teaching, addressing teachers’ skills and technological system functions. Using traditional ML, the model analyses lecturer behaviour to predict teacher types, pinpoint roles and guide adaptation for online teaching sustainability. The system predicts teacher retention with 77% accuracy using three weeks of data. The key activities and roles were identified, such as in-class exercises and technology guidance. While traditional online teaching roles emphasise a shift from in-person to virtual moderation, effective roles depend on pedagogical beliefs and system features, sometimes diverging from expert views. Given that influential activities for teacher retention change across different settings, the mechanism should be regularly integrated into platforms to ensure a harmonious evolution between teaching methods and technological features.

2.2. Factors That Impact Room Prices

Chattopadhyay and Mitra [19] used three different algorithms (random forest, decision tree and ordinary least squares (OLS)) to identify significant variables that impact room prices on the Airbnb platform. The researchers applied the algorithms to a large dataset of Airbnb listings in 11 U.S. cities. OLS regression was implemented on the full records of the listing data and detected 53 important factors that influenced room pricing. The other two models (decision tree and random forest) could not be applied to the full listing dataset because of the high number of computations required. Therefore, each of these models was applied to samples from the 11 cities. The results showed that the random forest had the best performance among the three algorithms. A limitation of this study is that the researchers worked with the listing price, not the actual price at which the rental was sold [19].

Kakar et al. [20] investigated the impact of the available online information of hosts, such as gender and race, on Airbnb price listings in San Francisco. They used a model of one dependent variable (price per unit listing) and several independent variables (host features, rental-listing features, user reviews and neighbourhood value). An estimation technique was applied to the proposed model, and a regression equation specification error test (RESET) was then performed on the data. The results showed that Asian and Hispanic hosts in general had lower prices by 9.3% and 9.6%, respectively, compared with their white counterparts after controlling for other factors, such as user reviews, unit values in the area and the characteristics of the property. There were no significant effects of gender on price listings. This research verified that there was racial discrimination in the online marketplace. The raw data taken from the website Inside Airbnb consisted of approximately 6000 listings dating back to 2 September 2015 [35]. The researchers used a sample of only 715 listings that included different features of race, gender and sexual orientation from the larger set of Airbnb listings to obtain their results. This smaller sample can be considered a limitation of this research [20].

Teubner et al. [21] studied trust building on Airbnb as an economic value and analysed the effect of common reputation features as determinants of the price; the examined features included ID verification, duration of membership, number of ratings, average rating score, photos and host status. A large-scale dataset was used that covered 86 German cities. After feature extraction, a price regression model was employed. The results showed that the membership duration and the rating scores of hosts were the factors that most impacted the price. In addition, the availability of photos for a property affects its listing price. A limitation of this study was that the results cannot be generalised to other sharing-economy platforms because the reputation features, type of resource and user motives are different for each platform [21].

2.3. Price Prediction

The following studies are relevant to our work, as they fall within a distinct category of studies that examine hotel and sharing economy rental prices. Gomez-Cravioto et al. [13] analysed information on previous listings in Mexico City from an Airbnb dataset to predict the price of a new listing on the platform. The researchers used ML techniques and statistical methods to analyse the information after it was extracted from the Airbnb website. First, the data were preprocessed by removing records in which more than 50% of the values were missing. An imputation method was used to replace missing data in records where a smaller share of the values was missing. An experiment was then conducted to compare multinomial logistic regression [36], a generalised additive model (GAM) [37] and quantile regression [38] in predicting a listing price on Airbnb. The applied techniques were evaluated using R-squared, the Akaike information criterion (AIC) and residual standard error. GAM outperformed the other models in predicting a new Airbnb listing price. A limitation of this study is that the GAM is an additive technique and does not consider the interactions of variables, so the prediction accuracy was not very high. Therefore, the researchers aimed to integrate more characteristics, such as poverty and race rates, into the model and to use more tree-based and deep-learning algorithms [13].

Kalehbasti et al. [39] used deep learning, ML and natural language processing (NLP) techniques to develop a price prediction model to help customers as well as owners in pricing and evaluating a given price by offering them nominal knowledge of properties’ best values. First, an initial preprocessing was performed on the data, which removed records with irreparable missing values, converted features to binary variables, converted some features into floating-point representations, removed any irrelevant features and normalised the features. The data were then divided into three sets: training, validation and testing. Second, specific features of customer reviews, owner characteristics and rentals were selected as predictors. Several ML algorithms, including k-means clustering, support vector regression, linear regression and neural networks, were then applied to the prediction model. An Airbnb dataset from New York City was used, which included 50,221 entries, each with 96 features. The results showed that the large number of features resulted in weak performance; because of high variation on the validation set, the results of the training set were better [39].

Luo et al. [14] used different ML algorithms to predict Airbnb listing prices in three cities (Paris, New York City (NYC) and Berlin). The researchers split each city dataset into three subsets (training, validation and testing). As a pre-processing step, they then eliminated some features, such as country code and market (because most of the data had the same values), ID and host ID (which appeared to be noise) and jurisdiction and licence names (where most of the data values were null). Next, ML algorithms were applied to the selected features. The neural network achieved the best result, with an R-squared value of 0.746 on the Paris dataset, 0.739 on the NYC dataset and 0.816 on a combined dataset of both Paris and NYC. The results showed that neural networks built on the combined datasets of NYC and Paris, rather than on the individual datasets, could efficiently predict prices in each city. However, the prediction accuracy was not very high. Therefore, the researchers intended to enhance neural network performance with additional feature selection techniques and hyperparameter tuning.

Fuentes [15] aimed to predict the price of Airbnb listings by testing a hypothesis that the Airbnb listing prediction would be affected the most by the location of the property rather than other characteristics. The researcher used three machine-learning algorithms to analyse the Airbnb features: linear regression, artificial neural network and k-nearest neighbours. The results showed that location was the main factor in being able to predict the price [15].

Several studies in the property pricing literature have focused on predicting purchase or rental prices for non-shared properties. For instance, Ma et al. [40] analysed warehouse rental prices in Beijing and employed regression tree, linear regression, gradient boosting regression trees and random forest regression methods. Their findings indicated that the tree regression model performed best, yielding an RMSE of 1.05 CNY/m²-day.

In a different study, Yu and Wu [41] attempted to predict real estate prices using random forest regression, linear regression and support vector regression (SVR) in conjunction with feature importance analysis. They also explored price classification using seven classes with logistic regression, naive Bayes, random forest and support vector classification (SVC). Their SVR model outperformed the others with an RMSE of 0.53, while the SVC model with principal component analysis (PCA) achieved a classification accuracy of 69%.

Masiero et al. [42] conducted a study that employed a quantile regression model to analyse the relationships between holiday homes, travel traits and hotel prices. Similarly, Wang and Nicolau [43] investigated the factors influencing sharing-economy prices by analysing Airbnb listings using quantile regression and OLS analysis. In another study, Li et al. [44] utilised the multi-scale affinity propagation clustering method. Linear regression was then employed on the obtained clusters to explore price predictions for Airbnb in different cities. Notably, a significant feature of the clustering method is the distance between listed properties and city landmarks.

Therefore, in the current study, a variety of feature selections and ML techniques were examined to add to the literature on price prediction and to aid property hosts in price evaluation. Additionally, sentiment analysis was used to leverage customer reviews. The use of sentiment analysis and feature importance analysis to predict rental prices is novel and distinct from the existing literature (except for a study conducted by Kalehbasti et al. [16]). While their work incorporated neural networks and sentiment analysis in rental price prediction, it is important to emphasise that our approach and methodology differ significantly in terms of both the ML techniques used and the level of detail. Moreover, our study presents a feature importance analysis that is absent in [16]. Therefore, our contribution offers unique insights and advancements in this area, extending beyond the existing research. This research used Barcelona Airbnb listings because, to the best of our knowledge, the city has not been studied extensively in this regard, although it is a popular tourist attraction, making it a potentially productive candidate for observing Airbnb activity [45].

3. Research Methodology

The research methodology used is shown in Figure 1 and comprised six phases: data collection, data preprocessing, sentiment analysis of the reviews, feature construction, regression models and model performance evaluation. The following sections explain these phases in more detail.

3.1. Data Collection

The data utilised in this study were obtained from InsideAirbnb.com [1], an independent platform providing tools and data for exploring Airbnb usage across various cities worldwide. We specifically selected the city of Barcelona and downloaded both the listings.csv.gz and reviews.csv.gz files (both accessed on 15 February 2023), which offered the most comprehensive information. Compiled on 11 December 2022, the dataset provided a snapshot in time with 15,778 listing entries and 669,754 review entries, respectively. The files included 75 features encompassing lodging details, neighbourhood information, host characteristics, availability and customer reviews.

3.2. Data Pre-Processing

To preprocess the data, we carefully examined each feature in the dataset. Our pre-processing steps involved the following:

Remove features with frequent and irreparable missing fields.
Convert certain features to floating-point representations by eliminating currency symbols (e.g., dollar signs in prices).
Eliminate irrelevant or uninformative features, such as ‘picture_url’, ‘listing_url’, ‘host_id’, ‘host_name’, ‘description’ and more.
We also eliminated constant-valued fields and duplicate features.
Additionally, we transformed Boolean values into binary (zero or one) representations and converted ordinal values to numeric values as part of the data preprocessing phase. We converted the values of the features into integer values to easily train the dataset using different regression approaches.
One particular column of interest was ‘amenities’, a text variable with over 2000 unique values. To handle this, we examined, cleaned and aggregated its values and then split the column into individual binary variables. Each binary variable represented whether a particular amenity was included in the listing, with 1 indicating its presence and 0 indicating otherwise.
The top 25 amenities were identified by drawing on research studies that investigated the impact of amenities. Notable among these was the study by Garcia et al. [45], which provided valuable insights and guided the selection of the most influential amenities.
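
The cleaning steps listed above can be illustrated with a minimal sketch. It is not the code used in the study: the helper names (`parse_price`, `to_binary`, `amenity_flags`, `preprocess_row`) are hypothetical, the Boolean columns shown are only examples, and the sketch assumes that Boolean fields are stored as 't'/'f' strings and that ‘amenities’ is stored as a JSON-style list string.

```python
import json


def parse_price(raw):
    """Turn a price string such as '$1,250.00' into a float."""
    return float(raw.replace("$", "").replace(",", "").strip())


def to_binary(flag):
    """Map Boolean-like strings to 1/0 (assumes 't'/'f' style values)."""
    return 1 if str(flag).strip().lower() in {"t", "true", "1"} else 0


def amenity_flags(raw_amenities, top_amenities):
    """Return one 0/1 indicator per amenity named in top_amenities."""
    present = {a.strip().lower() for a in json.loads(raw_amenities)}
    return {
        "amenity_" + name.lower().replace(" ", "_"): int(name.lower() in present)
        for name in top_amenities
    }


def preprocess_row(row, top_amenities):
    """Clean one listing record (a dict keyed by column name)."""
    clean = {
        "price": parse_price(row["price"]),
        # Example Boolean columns; the real feature list is longer.
        "host_is_superhost": to_binary(row["host_is_superhost"]),
        "instant_bookable": to_binary(row["instant_bookable"]),
    }
    clean.update(amenity_flags(row["amenities"], top_amenities))
    return clean


# Illustrative record, not taken from the dataset.
row = {
    "price": "$1,250.00",
    "host_is_superhost": "t",
    "instant_bookable": "f",
    "amenities": '["Wifi", "Air conditioning", "Kitchen"]',
}
features = preprocess_row(row, ["Wifi", "Pool", "Air conditioning"])
```

In practice the same functions would be mapped over every row of listings.csv.gz, with the list of 25 selected amenities passed as `top_amenities`.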

3.3. Sentiment Analysis on the Reviews

Given the significance of customer reviews in determining Airbnb listing prices, we incorporated the review column into our dataset after filtering out non-English reviews and irrelevant content. The reviews for each listing were subjected to sentiment analysis using the TextBlob [46] library.

TextBlob is a versatile text processing package available for Python 2 and 3. It offers a convenient API that facilitates various NLP tasks, including sentiment analysis, classification, noun phrase extraction, translation and part-of-speech (POS) tagging. When provided with a sentence as input, TextBlob provides the following two properties [47]:

Polarity. This measures the sentiment expressed in a sentence, ranging from −1 to 1. A value of −1 signifies a negative sentiment, while a value of 1 represents a positive sentiment.
Subjectivity. This captures the degree to which the sentence reflects personal states, such as opinions, emotions and beliefs. It is measured on a scale from 0 to 1, where values closer to 0 indicate an objective sentence based on factual information, while values closer to 1 indicate a subjective statement influenced by personal views [48].

TextBlob handles unfamiliar terms by ignoring them during the analysis. It considers the intensity and context of words and expressions to derive the overall sentiment score.

This analysis assigned sentiment scores to each reviewed text. For each listed property, the reviews were analysed, with scores averaged across all the reviews for that specific listing. This average score was then incorporated into the predictive model as a new feature. It should be noted that alternative approaches for mining customer opinions [49] can be explored in future work.
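
The per-listing averaging described above might look like the following sketch. The function names are hypothetical and the paper does not publish its code. `textblob_scores` relies on the `sentiment` property of a `TextBlob` object, which exposes `polarity` and `subjectivity`; a stand-in scorer is used in the demonstration so that the sketch runs even where TextBlob is not installed.

```python
from statistics import mean


def textblob_scores(text):
    """Return (polarity, subjectivity) for one review using TextBlob."""
    from textblob import TextBlob  # third-party package, imported lazily

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def listing_sentiment(reviews, score_fn=textblob_scores):
    """Average per-review scores for every listing.

    reviews: iterable of (listing_id, review_text) pairs.
    Returns {listing_id: {"polarity": ..., "subjectivity": ...}}.
    """
    scores = {}
    for listing_id, text in reviews:
        scores.setdefault(listing_id, []).append(score_fn(text))
    return {
        listing_id: {
            "polarity": mean(p for p, _ in pairs),
            "subjectivity": mean(s for _, s in pairs),
        }
        for listing_id, pairs in scores.items()
    }


def toy_scores(text):
    """Stand-in scorer for the demo only; not a real sentiment model."""
    return (0.5, 1.0) if "great" in text.lower() else (-0.5, 0.5)


demo = listing_sentiment(
    [(1, "Great flat"), (1, "Noisy street"), (2, "Great host")],
    score_fn=toy_scores,
)
```

Calling `listing_sentiment(reviews)` without `score_fn` would use TextBlob itself. The resulting averages can then be joined to the listing table on the listing identifier.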

3.4. Feature Construction

Figure 2 presents the final set of features used in our study. The data dictionary provided by InsideAirbnb.com offers a comprehensive and clear description of the dataset features [50].

To analyse the impacts on price, we first employed scatterplots to visually examine the correlations between price and each feature. We observed that dividing our features into three datasets would be best for presenting our results. Therefore, we created three distinct feature sets. Some details of these sets are illustrated in Figure 3.

3.5. Regression Models

Regression is a methodology employed in two distinct contexts [51]. As part of ML applications, regression analysis is primarily used for predicting and forecasting numeric variables [52]. The second purpose of regression analysis is to establish causal relationships between independent and dependent variables. It is worth noting that regression alone reveals associations solely between a dependent variable and a fixed dataset comprising various variables. In regression models, independent variables serve as predictors for the dependent variables [51].

Therefore, in this study, we applied various regression approaches, such as least absolute shrinkage and selection operator regression (Lasso) [53,54], Ridge regression [55], Bayesian regression [56], K-nearest neighbours (KNN), SVR and decision tree regression, to each feature set. We conducted two sets of tests, one with the inclusion of the amenity features and the other without them.
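
For reference, the two regularised linear models named above differ only in their penalty term. The following is the standard textbook form rather than a formulation taken from this study; $x_{i}$ (the feature vector of listing $i$), $\beta$ (the coefficient vector), $p$ (the number of features) and $\lambda \geq 0$ (the regularisation strength) are symbols introduced here for illustration, the intercept is omitted for brevity, and software libraries may scale the terms differently.

```latex
\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{k} \left( y_{i} - x_{i}^{\top} \beta \right)^{2} + \lambda \sum_{j=1}^{p} \beta_{j}^{2}

\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{k} \left( y_{i} - x_{i}^{\top} \beta \right)^{2} + \lambda \sum_{j=1}^{p} \left| \beta_{j} \right|
```

The squared penalty in Ridge shrinks all coefficients towards zero, whereas the absolute-value penalty in Lasso can set some coefficients exactly to zero and therefore also acts as a form of feature selection.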

To ensure accuracy and prevent overfitting, we performed a 10-fold cross-validation of the data. The dataset was divided into ten parts, with each run utilising nine parts for training and one part for testing. In each subsequent run, a different part was used for testing, while the remaining nine parts were used for training. The final error was calculated as the average of the errors obtained from all 10 runs.
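
The cross-validation procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the shuffling seed is arbitrary, and a trivial mean-price predictor stands in for the regression models.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, error, k=10, seed=0):
    """Average the held-out error over k train/test splits."""
    errors = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Train on the other k-1 folds, evaluate on the held-out fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        errors.append(error([y[i] for i in test_idx], preds))
    return sum(errors) / len(errors)


def fit_mean(X_train, y_train):
    """Toy baseline model: remember the mean training price."""
    return sum(y_train) / len(y_train)


def predict_mean(model, X_test):
    """Predict the stored mean for every test row."""
    return [model] * len(X_test)


def mean_abs_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Illustrative data: 20 listings that all cost the same.
X = [[i] for i in range(20)]
y = [100.0] * 20
cv_error = cross_validate(X, y, fit_mean, predict_mean, mean_abs_error)
```

Any model exposing a fit step and a predict step can be substituted for the toy baseline, and any of the error measures from Section 3.6 can be passed as `error`.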

3.6. Model Performance Evaluation

To assess the performance of the regression techniques, we employed a diverse range of evaluation metrics to allow for a comprehensive evaluation of the model’s performance. Specifically, we calculated metrics such as mean squared error (MSE), mean absolute error (MAE), root mean square error (RMSE) and $R^{2}$ score. For the MSE, MAE and RMSE, smaller values indicate higher prediction quality from the model. Additionally, the $R^{2}$ score, which ranges from 0 to 1, serves as an indication of how well the model fits the data. A value closer to 1 implies a better fit. The formulas of the evaluation metrics are described in the following paragraphs.

MSE represents the variability of the residuals, which measure the distance between the data points and the regression line. The MSE is always nonnegative, and smaller values indicate a better fit of the estimator to the data [57]. The RMSE provides an indication of the dispersion of these residuals. In other words, it tells us how tightly the data points are clustered around the line of best fit [58]. RMSE is commonly employed in fields such as climatology, prediction and regression analysis to assess the accuracy of test results.

The formula for calculating the RMSE is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (y_{i} - \hat{y}_{i})^{2}}, \tag{1}$$

where $k$ denotes the number of observations, $y_{i}$ represents the observed value and $\hat{y}_{i}$ represents the predicted value.

The $R^{2}$ score, known as the coefficient of determination (or the coefficient of multiple determination in multiple regression), serves as a statistical measure indicating how closely the data points align with the fitted regression line. It provides insights into the proportion of variability in the response data that is explained by the model [57]. The $R^{2}$ score typically ranges between 0% and 100%. A value of 0% implies that the model fails to explain any variability around the mean of the response data, while a value of 100% indicates that the model accounts for all the variability around the mean. The mathematical expression for the $R^{2}$ score is provided as follows:

$$R^{2} = 1 - \frac{\sum_{i=1}^{k} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{k} (y_{i} - \bar{y})^{2}}, \tag{2}$$

where $y_{i}$ represents the actual value (here, the listing price), $\hat{y}_{i}$ denotes the predicted value and $\bar{y}$ represents the average of the actual values.

MAE measures the average magnitude of errors in a set of predictions, disregarding their direction [59]. It computes the average of the absolute differences between each individual prediction and its corresponding true observation with equal weight given to each difference. The MAE value is obtained by applying the following equation:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_{i} - \hat{y}_{i} |, \tag{3}$$

where $n$ represents the number of predictions and $| y_{i} - \hat{y}_{i} |$ denotes the absolute error of the $i$-th prediction.
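
As a worked illustration of these definitions, the four metrics can be computed directly from a list of observed prices and a list of predicted prices. The function name `regression_metrics` and the three sample values are assumptions made for this sketch; they are not results from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R^2 from observed and predicted values."""
    n = len(y_true)
    residuals = [a - p for a, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "MAE": sum(abs(r) for r in residuals) / n,
        "RMSE": math.sqrt(mse),
        # Undefined when all observed values are identical (ss_tot == 0).
        "R2": 1 - ss_res / ss_tot,
    }


# Illustrative prices only.
metrics = regression_metrics([100.0, 150.0, 200.0], [110.0, 150.0, 190.0])
```

For these three listings the residuals are -10, 0 and 10, so the MSE is 200/3, the MAE is 20/3, the RMSE is the square root of 200/3 and the $R^{2}$ score is 1 - 200/5000 = 0.96.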

4. Results and Discussion

In our research, we conducted experiments using Python 3.7.6 on a Surface Pro 7 with an Intel Core i7 processor and 16 GB of memory. The research findings revealed that some features directly affect the price of the property.

4.1. Feature Correlation Analysis

Figure 4 shows the visual analysis of Feature Set 1. Figure 4a presents the relationship of the Price versus Neighbourhood_group variables. It can be observed that the price values ranged from approximately 100 to 2000+. The Neighbourhood_group values, defined by integers, mostly lie between 0 and 6, indicating that the majority of the listings in the dataset belonged to these groups. However, fewer occurrences of Neighbourhood_group values 9 and 10 were observed in a specific price range of about 200 to 500. The Neighbourhood_group values of 0–8 exhibited a wider distribution of prices, covering a range of approximately 100 to 2000+. Overall, the chart shows that the majority of the groups had a diverse range of price values, while the groups identified as 9 and 10 had a limited presence in a specific price range, and the groups labelled 0–8 encompassed a broader distribution of prices.

Figure 4b shows that the price values for the property types ranged from approximately 100 to 2000+. The property types identified by integers were mostly between 0 and 20, indicating that the majority of the listings in the dataset belonged to these types. However, as the value of the property type increased, there was a decreasing trend in prices. In other words, property types 0–20 had the largest range of price values, while the price values decreased for property types with identifiers above 20. This suggests that there may be a correlation between the type of property and its price, with certain types commanding higher prices compared to others.

Figure 4c shows a relationship between price and room type. Specifically, different room types were associated with varying price ranges. When the room type was categorised as 1, the price was expected to be higher, possibly reaching 5000+ in value. This indicates that accommodations with Room Type 1 likely offered more luxurious or premium features, amenities or larger space, resulting in higher pricing.

In contrast, when the room type was classified as 2, the price range was expected to be in the range of 4000+, but less than 5000. This suggests that accommodations with Room Type 2 still offered desirable features and amenities, albeit at a slightly lower pricing level compared to Room Type 1. For Room Type 3, the associated price range was approximately 2000, indicating that accommodations with this room type offered more modest or standard features and amenities, leading to a relatively lower price. Lastly, for Room Type 4, the price was expected to be around 1000. This suggests that accommodations with Room Type 4 provided basic amenities and may have had a more economical or budget-friendly pricing structure.

In summary, the information suggests a correlation between room types and corresponding price ranges. Accommodations with higher-cost room types tended to command higher prices, while accommodations with lower-cost room types were generally associated with lower price points.

It can be observed from Figure 4d that there was a relationship between price and accommodation factors. Specifically, when the accommodation value ranged from 0 to 6, there was a higher likelihood of encountering higher prices, particularly in the range of 2000 to 5000+. This suggests that accommodations with more features or amenities tend to command higher prices. Conversely, when the accommodation value was between 6 and 16, the corresponding price range tended to be lower, typically falling between 100 and 2000. This implies that accommodations with fewer features or simpler amenities were generally associated with lower price points. Therefore, the data indicate a correlation between the accommodation value and the corresponding price range. As the level of accommodation increased from 0 to 6, the price range tended to shift towards higher values, while for accommodation values ranging from 6 to 16, the associated price range was typically lower.

Additionally, Figure 4e shows a relationship between price and bathroom type. The 35 different types of bathrooms were associated with varying price ranges. When the bathroom type was between 0 and 25, the corresponding price values were expected to fall within the range of 100 to 1500. This suggests that accommodations with these bathroom types offered a range of features and amenities that corresponded to a moderate price range. However, it is important to note that within this range, there may be certain bathroom types, such as Bathroom Type 10 and below, where the price can exceed the upper limit and reach higher values. This indicates that accommodations with these particular bathroom types may have had additional luxury features or unique characteristics that justified a higher price point. In contrast, when the bathroom type was higher than 25, the associated price values tended to decrease. This implies that the accommodations identified with these bathroom types were more basic or had fewer amenities, resulting in a lower price range.

In summary, the information suggests a correlation between bathroom types and corresponding price ranges. Accommodations with bathroom types between 0 and 25 generally commanded prices ranging from 100 to 1500, with the possibility of higher prices for specific bathroom types, such as 10 and below. However, for the bathroom types identified beyond 25, the price range decreased, indicating accommodations with simpler or more basic bathroom features.

In addition, there was a relationship between price and bedroom type, as shown in Figure 4f. The numerical values assigned to the bedroom types ranged from 0 to 20 and were associated with different price ranges. For Bedroom Types 4–7, the corresponding price values ranged from 100 to 1500. This suggests that the accommodations with these bedroom types offered a range of features and amenities that commanded a moderate price range. Additionally, for bedroom types below 4, although the price range was also 100 to 1500 most of the time, there were instances where the prices could reach 5000+ for certain accommodations. This indicates that accommodations with fewer bedrooms but higher price points may possess unique characteristics or luxurious features that justify higher prices. When the bedroom types were 8–11, the associated price values ranged from 500 to approximately 1800. This suggests that accommodations with these bedroom types offered a relatively lower price range compared to the previous categories. It is important to note that bedroom types from 13 to 20 indicate unsold accommodations, meaning they had no assigned price value.

In summary, the information suggests a correlation between bedroom types and corresponding price ranges. Accommodations with Bedroom Types 4–7 offered a moderate price range of 100 to 1500, while those below Bedroom Type 4 could occasionally have higher prices reaching 5000+. Bedroom Types 8–11 commanded a lower price range of 500 to approximately 1800. Lastly, Bedroom Types 13–20 indicate unsold accommodations with no assigned price values.

Figure 4g shows a relationship between price and bed types. The bed types were represented by numerical values ranging from 0 to 30 and were associated with different price ranges. For bed types ranging from 0 to 15, the corresponding price values mostly fell within the range of 100 to 1000. Occasionally, the prices could be slightly higher but still in proximity to this range. This suggests that accommodations with these bed types offered a range of prices, typically between 100 and 1000. However, it is important to note that the price values tended to shift slightly to the left, indicating that some accommodations within this range may have had prices expected to be between 500 and 1000.

However, it was possible for Bed Type 5 and below to have price values reaching 5000+. This suggests that accommodations with fewer beds but higher prices may possess unique features or luxurious amenities that justify the higher price points.

For bed types with numerical identifiers above 20, there were no assigned price values. This indicates that accommodations with these bed types may have been unsold or not included in the dataset.

Figure 5 shows the visual analysis of Feature Set 2, which relates price to the variables minimum_nights, maximum_nights, number_of_reviews, number_of_reviews_ltm and number_of_reviews_l30d. The price values were influenced by the values of these variables. For the variable minimum_nights, when the value ranged from 0 to 100, the corresponding price values were at their highest, ranging from 100 to 5000+. As the minimum_nights value increased beyond 100, the price decreased, and when it reached a critical value of around 400, the price became zero. Similarly, for the variable maximum_nights, when the value ranged from 0 to 1000, the corresponding price values were at their highest, ranging from 100 to 5000+. However, when the maximum_nights value exceeded 1000, the price became zero.

Regarding the variable number_of_reviews, when the value was at 0, the price values were at their maximum, ranging from 100 to 5000+. As the number_of_reviews increased, the price gradually decreased. Once the number of reviews surpassed 1000, the price became zero. The variables number_of_reviews_ltm and number_of_reviews_l30d showed similar patterns. When these variables had a value of 0, the price values were at their highest, ranging from 100 to 5000+. As the number of reviews increased, the price decreased, and it became zero when the number of reviews exceeded a certain threshold, likely around 1000. In summary, the analysis indicated that the price values were influenced by the values of minimum_nights, maximum_nights, number_of_reviews, number_of_reviews_ltm and number_of_reviews_l30d. Higher price values were associated with lower values of these variables, while higher values of the variables led to lower prices. However, it is important to note that the specific thresholds and exact relationships may vary and require further analysis to determine precise patterns.
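The visual patterns described above could be complemented by a numerical summary. The following sketch computes Pearson and Spearman correlation coefficients between one feature and price. It uses synthetic data rather than the Barcelona dataset, and the helper names `pearson` and `spearman` are hypothetical; the decaying price curve is an assumption chosen only to mimic the reported shape.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    The double argsort gives ranks only when there are no ties, which
    holds for the continuous synthetic data used below.
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


# Synthetic stand-in: price falls off as the review count grows.
rng = np.random.default_rng(0)
number_of_reviews = rng.uniform(0, 500, size=200)
price = 3000.0 * np.exp(-number_of_reviews / 100.0) + rng.normal(0, 10, size=200)

pearson_r = pearson(number_of_reviews, price)
spearman_r = spearman(number_of_reviews, price)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```

For a monotone but non-linear relationship such as this one, the Spearman coefficient is typically larger in magnitude than the Pearson coefficient, so reporting both can indicate whether a pattern seen in a scatterplot is linear.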

From the scatterplot shown in Figure 6a, we can see that the polarity values ranged from −1 to 1. Polarity values between −1 and 0 had some effect on price, with prices ranging from ≅100 to ≅500. The polarity value −1 was associated with a single price point of ≅100, while values ranging from 0 to 1 had the greatest effect on prices, which ranged from ≅100 to 5000+.

Additionally, from the scatterplot shown in Figure 6b, we can see that the subjectivity values ranged from 0 to 1. Values from 0.2 to 1 had a greater effect on the price than those from 0 to 0.1. The subjectivity values from 0.4 to 0.8 affected the price in a range from ≅100 to ≅5000+, similar to the polarity.

4.2. Regression Model Analysis

Table 1 presents the analysis of the effects of applying various regression algorithms to the Feature Set 1 dataset. Several observations can be drawn from these results. The outcomes from the Lasso regression closely mirror those from the Ridge regression in terms of error metrics and $R^{2}$ scores. Both Ridge and Lasso regressions exhibited outstanding performance with this dataset. In contrast, Bayesian regression presents marginally higher errors and a somewhat diminished $R^{2}$ score. This may suggest that the Bayesian regression model is slightly less intricate and adaptable than the Ridge and Lasso regressions.

SVR, with and without amenities, displayed low errors and high $R^{2}$ scores close to 1, indicating a strong fit with the data. KNN regression had significantly higher errors and a lower $R^{2}$ score compared to Ridge, Lasso and Bayesian regression. A lower $R^{2}$ score indicates that KNN might not capture the relationships in the data as effectively as linear-regression-based models. Decision tree regression performed slightly worse than the other models.

Table 2 showcases the analysis results of applying different regression algorithms to the Feature Set 2 dataset. From the results, we can also make numerous observations. Ridge and Lasso regression stand out as top performers, minimizing errors and maximizing the $R^{2}$ score, thus providing strong predictive performance. While Bayesian regression and SVR also perform commendably, they lag slightly behind Ridge and Lasso regression in terms of error metrics and $R^{2}$ scores. The KNN regression performed similarly for both datasets, with and without amenities, resulting in relatively high MSE and RMSE values and a moderate $R^{2}$ score. Decision tree regression, in comparison, offers moderate performance, registering lower $R^{2}$ scores. Overall, the Lasso and Ridge models, with or without amenities, seemed to be the most effective regression algorithms for the Feature Set 2 dataset.

Table 3 presents the results of applying various regression algorithms to the Feature Set 3 dataset. Notably, Ridge, Lasso and Bayesian regressions demonstrated superior performance, with near-perfect $R^{2}$ scores close to 1, indicating robust predictive capabilities. While SVR also showed commendable results, it lagged slightly in terms of error metrics. The KNN regression produced a moderate performance with an $R^{2}$ score of 0.5605, while the decision tree regression reported a notable $R^{2}$ score of 0.9043. Interestingly, the inclusion of amenities did not significantly influence performance metrics in this analysis.
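The comparison procedure behind Tables 1–3 can be sketched as follows. This is a minimal illustration and not the author's pipeline: it assumes scikit-learn, uses synthetic data with a linear price signal in place of the Barcelona listings, and all hyperparameters (`alpha`, `C`, `n_neighbors`) are placeholder values rather than the tuned settings of the study.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a feature set: five numeric features, two of
# which carry no signal. The linear form is an assumption.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
true_coef = np.array([30.0, 20.0, 10.0, 0.0, 0.0])
y = 100.0 + X @ true_coef + rng.normal(scale=5.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The six regression families named in the paper, with placeholder settings.
models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Bayesian": BayesianRidge(),
    "SVR": SVR(kernel="linear", C=10.0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Decision tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "MSE": float(mse),
        "MAE": float(mean_absolute_error(y_test, pred)),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(r2_score(y_test, pred)),
    }

for name, scores in results.items():
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

Evaluating every model on the same held-out split keeps the four metrics comparable across rows. Distance-based models such as KNN and SVR are sensitive to feature scale, so standardising the inputs before fitting is usually advisable on real data.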

4.3. Feature Importance Analysis

Table 4 presents the top five features identified by the top-performing models as key determinants in evaluating an Airbnb listing’s price. Rankings were derived from their respective coefficients, and it is noteworthy that we utilized p-values to ascertain the significance of each coefficient. Evidently, bedrooms (representing the number of bedrooms), accommodates (indicative of the maximum capacity of the listing), and beds (denoting the number of beds) consistently emerged across all models. Notably, the top-ranked feature, ‘polarity’, bore a positive coefficient of 51.8069, signifying that listings with higher polarity values (more positive sentiment) correlated with increased predicted prices. Yet, in the Lasso model, the features ‘polarity’ and ‘subjectivity’ had coefficients of 0.0, which implies that these features were effectively omitted from the model due to L1 regularization. Further, within the Ridge model, the feature room_type (categorized as either an entire home/apartment, private room or shared room) carried a negative coefficient of −33.3805. This suggests that certain room types tend to be associated with reduced predicted prices. These findings potentially offer valuable insights into the predominant criteria employed when determining an Airbnb listing’s price.
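The observation that Lasso can assign a coefficient of exactly 0.0 while Ridge does not follows from the form of the two penalties. The sketch below illustrates the mechanism on synthetic data. It is an assumption-laden illustration, not the study's model: the coordinate-descent solver, the function names and the four synthetic features are all hypothetical, and the result does not reproduce the coefficients in Table 4.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value towards zero; return exactly 0 inside the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w


def ridge_closed_form(X, y, alpha):
    """Minimise ||y - Xw||^2 + alpha * ||w||_2^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data: two strong features, one weak feature, one pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X -= X.mean(axis=0)
true_w = np.array([50.0, 30.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=1.0, size=300)
y -= y.mean()

lasso_w = lasso_coordinate_descent(X, y, alpha=2.0)
ridge_w = ridge_closed_form(X, y, alpha=2.0)
print("Lasso:", np.round(lasso_w, 4))
print("Ridge:", np.round(ridge_w, 4))
```

The soft-thresholding step sets a coefficient to exactly zero whenever the feature's correlation with the residual falls below `alpha`, whereas the Ridge solution only shrinks coefficients towards zero. A zero Lasso coefficient therefore means that the feature's marginal contribution fell below the penalty threshold, not that the feature is unrelated to price.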

5. Conclusions, Implications and Future Works

This research aimed to develop an effective predictive model for Airbnb prices using limited features, such as owner information, property specifications and reviews from customers. The objective of price optimisation was to assist Airbnb hosts in determining the optimal price for their listings when sharing their homes. Various ML techniques, such as Lasso regression, Ridge regression, Bayesian regression, KNN regression, decision tree regression and SVR, were employed, along with feature importance analysis to optimise the performance based on evaluation metrics, such as MSE, MAE, RMSE and $R^{2}$ score.

In initial experiments with a baseline model, a high number of features led to high variance and poor performance. To address this issue, pre-processing techniques were implemented, including dividing attributes into three distinct sets and converting them into integer values. Feature importance analysis further helped to reduce the variance. Across all the feature datasets with and without the amenity features, advanced models, such as Ridge and Lasso regression, performed the best of all the models tested. They achieved an impressive $R^{2}$ of 99% and a low MSE of 0.001. This level of accuracy is noteworthy considering the dataset’s heterogeneity and the presence of hidden factors, such as the personal characteristics of the owners, that were challenging to incorporate.

Additionally, this study examined the impact of lodging-related attributes on the performance of Airbnb listings. The findings reveal that the total number of beds, bedrooms and accommodation capacity of the listing exert a strong positive impact on its price. Interestingly, while specific amenities were not singled out as significant contributors, the overall sentiment—whether positive or negative—expressed by guests was pivotal in shaping the listing’s price. Notably, the elucidation of feature importance in our results can enrich perceived value, potentially guiding Airbnb hosts in more strategically pricing their properties.

The findings of this study have significant implications for practitioners regarding factors that indirectly influence listing prices. First, hosts are advised to implement strategies that enhance their personal exposure, such as improving their rates of promptly responding to guest inquiries. Investing in amenities has also been proven to be a worthwhile tactic. Above all, obtaining and maintaining a superhost badge is crucial, as it serves as a symbol of trustworthiness and quality. Second, owners of underperforming lodgings can utilise the predictive model developed in this study to conduct what-if analyses. This allows them to evaluate alternative actions and determine an optimal strategy for increasing occupancy, bookings and revenue. Furthermore, taxation authorities can leverage the predictive model to identify hosts who may be engaging in tax evasion. By comparing hosts’ declared profits with predictions from the model, authorities can identify discrepancies and take appropriate actions. Lastly, the Airbnb industry is characterised by its intricate and multifaceted nature. Its recent progression has been significantly influenced by advancements in information technology. Among these advancements, ML and sentiment analysis stand out. Unlike traditional IT methodologies, they refine data structures based on insights into consumer behaviour. Shifting our focus to the educational realm, institutional leaders grapple with the challenge of predicting student performance. However, through predictive analytics and strategic interventions, there is the potential to elevate student outcomes. This article explores how the transformative impact of ML in the Airbnb sector might provide valuable insights into advancements in higher education. Additionally, this piece offers insights that could guide courses in emphasizing key components of online Airbnb marketing [60].

Future work in this domain could explore alternative ML techniques, random forest feature importance, and neural networks [61,62,63]. Additionally, experimenting with Airbnb datasets from other cities is important to assess the generalisability of these findings, as the current results are specific to the Barcelona dataset. Leveraging specialised hardware, such as those used in ML and deep learning models, could enhance computational efficiency. Furthermore, improving sentiment analysis could involve weighing recent reviews more heavily and incorporating additional metrics during training alongside the average sentiment score for each listing. Finally, a comparative analysis of different model configurations in future studies could provide deeper insights into the efficacy of these regression algorithms.
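One way the suggested recency weighting could be realised is an exponentially decaying weight per review. The sketch below is purely illustrative: the function `recency_weighted_sentiment`, the 180-day half-life and the sample polarity values are assumptions, not part of the study.

```python
import numpy as np


def recency_weighted_sentiment(polarities, ages_in_days, half_life_days=180.0):
    """Average polarity in which a review's weight halves every half-life."""
    polarities = np.asarray(polarities, dtype=float)
    ages = np.asarray(ages_in_days, dtype=float)
    weights = 0.5 ** (ages / half_life_days)
    return float(np.sum(weights * polarities) / np.sum(weights))


# Hypothetical listing: two recent positive reviews, one old negative one.
polarities = [0.8, 0.6, -0.4]
ages = [10, 200, 700]  # days since each review was posted

plain_average = float(np.mean(polarities))
weighted_average = recency_weighted_sentiment(polarities, ages)
print(f"plain = {plain_average:.3f}, recency-weighted = {weighted_average:.3f}")
```

With these inputs the weighted score is higher than the plain mean because the old negative review contributes little. The half-life would need to be tuned, for example by validation on held-out listings.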

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the author. The raw data is publicly available on Inside Airbnb’s website http://insideairbnb.com/ (accessed on 15 February 2023).

Conflicts of Interest

The author declares no conflict of interest.

References

1. Airbnb. About inside Airbnb. Available online: https://www.airbnb.com/about/about-us (accessed on 22 March 2023).
2. Carrasco-Santos, M.J.; Peña-Romero, A.; Guerrero-Navarro, D. A Luxury Tourist Destination in Housing for Tourist Purposes: A Study of the New Airbnb Luxe Platform in the Case of Marbella. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 1020–1040.
3. Suh, J.; Tosun, C.; Eck, T.; An, S. A Cross-Cultural Study of Value Priorities between US and Chinese Airbnb Guests: An Analysis of Social and Economic Benefits. Sustainability 2022, 15, 223.
4. Tian, F.; Sun, F.; Hu, B.; Dong, Z. The Impact on Bed and Breakfast Prices: Evidence from Airbnb in China. Sustainability 2022, 14, 13834.
5. Gyódi, K. Airbnb in European cities: Business as usual or true sharing economy? J. Clean. Prod. 2019, 221, 536–551.
6. Barron, K.; Kung, E.; Proserpio, D. The Effect of Home-Sharing on House Prices and Rents: Evidence from Airbnb. Mark. Sci. 2020, 40, 23–47.
7. Sheppard, S.; Udell, A. Do Airbnb properties affect house prices. Williams Coll. Dep. Econ. Work. Pap. 2016, 3, 43.
8. Ndaguba, E.; Zyl, C.V. Professionalizing Sharing Platforms for Sustainable Growth in the Hospitality Sector: Insights Gained through Hierarchical Linear Modeling. Sustainability 2023, 15, 8267.
9. Sutherland, I.; Kiatkawsin, K. Determinants of guest experience in Airbnb: A topic modeling approach using LDA. Sustainability 2020, 12, 3402.
10. Zhang, K.; Pan, Z.; Shi, S. The Prediction of Booking Destination on Airbnb Dataset; UC San Diego: San Diego, CA, USA, 2015.
11. Wu, Y.; Zhou, Z. New User Booking Prediction for Airbnb Historical Data; UC San Diego: San Diego, CA, USA, 2015.
12. Ulfsson, H. Predicting Airbnb User’s Desired Travel Destinations. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2017.
13. Gómez, D.; Cantu-Ortiz, F.; Contreras, V.; Diaz Ramos, R. Mexico city’s airbnb listing price analysis using regression. In Proceedings of the 6th IADIS International Conference Connected Smart Cities, Virtual Conference, 21–23 July 2020.
14. Luo, Y.; Zhou, X.; Zhou, Y. Predicting Airbnb Listing Price Across Different Cities; Stanford University: Stanford, CA, USA, 2019.
15. Fuentes, J.E.G. Airbnb Listings in New York City: Price Prediction and Analysis. Ph.D. Thesis, Utica College, Utica, NY, USA, 2020.
16. Rezazadeh Kalehbasti, P.; Nikolenko, L.; Rezaei, H. Airbnb Price Prediction Using Machine Learning and Sentiment Analysis. In Proceedings of the Machine Learning and Knowledge Extraction: 5th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2021, Virtual Event, 17–20 August 2021; Proceedings 5. Springer: Berlin/Heidelberg, Germany, 2021; pp. 173–184.
17. Zhao, C.; Wu, Y.; Chen, Y.; Chen, G. Multiscale Effects of Hedonic Attributes on Airbnb Listing Prices Based on MGWR: A Case Study of Beijing, China. Sustainability 2023, 15, 1703.
18. Zhang, Z.; Chen, R.J.; Han, L.D.; Yang, L. Key factors affecting the price of Airbnb listings: A geographically weighted approach. Sustainability 2017, 9, 1635.
19. Chattopadhyay, M.; Mitra, S. Do airbnb host listing attributes influence room pricing homogenously? Int. J. Hosp. Manag. 2019, 81, 54–64.
20. Kakar, V.; Voelz, J.; Wu, J.; Franco, J. The visible host: Does race guide Airbnb rental rates in San Francisco? J. Hous. Econ. 2018, 40, 25–40.
21. Teubner, T.; Hawlitschek, F.; Dann, D. Price determinants on AirBnB: How reputation pays off in the sharing economy. J. Self-Gov. Manag. Econ. 2017, 5, 53–80.
22. Cheng, M.; Jin, X. What do Airbnb users care about? An analysis of online review comments. Int. J. Hosp. Manag. 2019, 76, 58–70.
23. Abdar, M.; Yen, N. Analysis of user preference and expectation on shared economy platform: An examination of correlation between points of interest on Airbnb. Comput. Hum. Behav. 2020, 107, 105730.
24. Mohsin, A.; Lengler, J. Airbnb hospitality: Exploring users and non-users’ perceptions and intentions. Sustainability 2021, 13, 10884.
25. Ma, X.; Hancock, J.T.; Lim Mingjie, K.; Naaman, M. Self-disclosure and perceived trustworthiness of Airbnb host profiles. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, Portland, OR, USA, 25 February–1 March 2017; pp. 2397–2409.
26. Ma, X.; Neeraj, T.; Naaman, M. A computational approach to perceived trustworthiness of airbnb host profiles. In Proceedings of the International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017; Volume 11, pp. 604–607.
27. Quattrone, G.; Greatorex, A.; Quercia, D.; Capra, L.; Musolesi, M. Analyzing and predicting the spatial penetration of Airbnb in US cities. EPJ Data Sci. 2018, 7, 31.
28. Kuleto, V.; Ilić, M.; Dumangiu, M.; Ranković, M.; Martins, O.M.; Păun, D.; Mihoreanu, L. Exploring opportunities and challenges of artificial intelligence and machine learning in higher education institutions. Sustainability 2021, 13, 10424.
29. Chang, R. Report Artificial Intelligence to Grow 47.5 Years. 2017. Available online: https://thejournal.com/articles/2017/03/24/ai-market-to-grow-47.5-percent-over-next-four-years.aspx (accessed on 21 August 2023).
30. Lacity, M.; Scheepers, R.; Willcocks, L.; Craig, A. Reimagining the University at Deakin: An IBM Watson Automation Journey. The Outsourcing Unit Working Research Paper Series; OUWP: London, UK, 2017.
31. Ilić, M.P.; Păun, D.; Popović Šević, N.; Hadžić, A.; Jianu, A. Needs and Performance Analysis for Changes in Higher Education and Implementation of Artificial Intelligence, Machine Learning, and Extended Reality. Educ. Sci. 2021, 11, 568.
32. Gollapalli, M.; Rahman, A.; Alkharraa, M.; Saraireh, L.; AlKhulaifi, D.; Salam, A.A.; Krishnasamy, G.; Alam Khan, M.A.; Farooqui, M.; Mahmud, M.; et al. SUNFIT: A Machine Learning-Based Sustainable University Field Training Framework for Higher Education. Sustainability 2023, 15, 8057.
33. Wen, Y.; Zhao, X.; Li, X.; Zang, Y. Explaining the Paradox of World University Rankings in China: Higher Education Sustainability Analysis with Sentiment Analysis and LDA Topic Modeling. Sustainability 2023, 15, 5003.
34. Shi, Y.; Guo, F. Exploring Useful Teacher Roles for Sustainable Online Teaching in Higher Education Based on Machine Learning. Sustainability 2022, 14, 14006.
35. Said, C. Window into Airbnbs hidden impact on S.F. San Francisco Chronicle, June 2014. Available online: https://www.sfchronicle.com/business/item/window-into-airbnb-s-hidden-impact-on-s-f-30110.php (accessed on 5 March 2023).
36. Deisenroth, M.; Faisal, A.; Ong, C. Mathematics for Machine Learning; Cambridge University Press: Cambridge, UK, 2020.
37. Mason, C.; Quigley, J. Non-parametric hedonic housing prices. Hous. Stud. 1996, 11, 373–385.
38. Koenker, R. Quantile Regression in R: A Vignette. 2012. Available online: https://cran.r-project.org/web/packages/quantreg/vignettes/rq.pdf (accessed on 10 November 2019).
39. Kalehbasti, P.; Nikolenko, L.; Rezaei, H. Airbnb price prediction using machine learning and sentiment analysis. arXiv 2019, arXiv:1907.12665.
40. Ma, Y.; Zhang, Z.; Ihler, A.; Pan, B. Estimating warehouse rental price using machine learning techniques. Int. J. Comput. Commun. Control. 2018, 13, 235–250.
41. Yu, H.; Wu, J. Real Estate Price Prediction with Regression and Classification; CS229 (Machine Learning) Final Project Reports; Stanford University: Stanford, CA, USA, 2016.
42. Masiero, L.; Nicolau, J.L.; Law, R. A demand-driven analysis of tourist accommodation price: A quantile regression of room bookings. Int. J. Hosp. Manag. 2015, 50, 1–8.
43. Wang, D.; Nicolau, J.L. Price determinants of sharing economy based accommodation rental: A study of listings from 33 cities on Airbnb.com. Int. J. Hosp. Manag. 2017, 62, 120–131.
44. Li, Y.; Pan, Q.; Yang, T.; Guo, L. Reasonable price recommendation on Airbnb using Multi-Scale clustering. In Proceedings of the 2016 35th Chinese Control Conference (CCC), Chengdu, China, 27–29 July 2016; pp. 7038–7041.
45. Garcia-López, M.À.; Jofre-Monseny, J.; Martínez-Mazza, R.; Segú, M. Do short-term rental platforms affect housing markets? Evidence from Airbnb in Barcelona. J. Urban Econ. 2020, 119, 103278.
46. Loria, S.; Keen, P.; Honnibal, M.; Yankovsky, R.; Karesh, D.; Dempsey, E. Textblob: Simplified text processing. Second. Textblob Simpl. Text Process. 2014, 3, 2014.
47. Abiola, O.; Abayomi-Alli, A.; Tale, O.A.; Misra, S.; Abayomi-Alli, O. Sentiment analysis of COVID-19 tweets from selected hashtags in Nigeria using VADER and Text Blob analyser. J. Electr. Syst. Inf. Technol. 2023, 10, 5.
48. Abayomi-Alli, A.; Abayomi-Alli, O.; Misra, S.; Fernandez-Sanz, L. Study of the Yahoo-Yahoo Hash-Tag tweets using sentiment analysis and opinion mining algorithms. Information 2022, 13, 152.
49. Petz, G.; Karpowicz, M.; Fürschuß, H.; Auinger, A.; Stříteskỳ, V.; Holzinger, A. Opinion mining on the web 2.0–characteristics of user generated content and their impacts. In Proceedings of the Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data: Third International Workshop, HCI-KDD 2013, Held at SouthCHI 2013, Maribor, Slovenia, 1–3 July 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 35–46.
50. Airbnb. Airbnb Data Assumptions. Available online: http://insideairbnb.com/data-assumptions/ (accessed on 15 June 2023).
51. Maulud, D.; Abdulazeez, A.M. A review on linear regression comprehensive in machine learning. J. Appl. Sci. Technol. Trends 2020, 1, 140–147.
52. Frank, E.; Trigg, L.; Holmes, G.; Witten, I.H. Naive Bayes for regression. Mach. Learn. 2000, 41, 5–25.
53. Ranstam, J.; Cook, J. LASSO regression. J. Br. Surg. 2018, 105, 1348.
54. Li, Y.; Yang, R.; Wang, X.; Zhu, J.; Song, N. Carbon Price Combination Forecasting Model Based on Lasso Regression and Optimal Integration. Sustainability 2023, 15, 9354.
55. McDonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. 2009, 1, 93–100.
56. Bishop, C.M.; Tipping, M.E. Bayesian regression and classification. Nato Sci. Ser. Sub Ser. III Comput. Syst. Sci. 2003, 190, 267–288.
57. Khan, M.A.; Khan, R.; Algarni, F.; Kumar, I.; Choudhary, A.; Srivastava, A. Performance evaluation of regression models for COVID-19: A statistical and predictive perspective. Ain Shams Eng. J. 2022, 13, 101574.
58. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
59. De Myttenaere, A.; Golden, B.; Le Grand, B.; Rossi, F. Mean absolute percentage error for regression models. Neurocomputing 2016, 192, 38–48.
60. Bangare, M.L.; Bangare, P.M.; Ramirez-Asis, E.; Jamanca-Anaya, R.; Phoemchalard, C.; Bhat, D.A.R. Role of machine learning in improving tourism and education sector. Mater. Today Proc. 2022, 51, 2457–2461.
61. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
62. Aggarwal, K.; Kirchmeyer, M.; Yadav, P.; Keerthi, S.S.; Gallinari, P. Conditional generative adversarial networks for regression. arXiv190512868 Cs Stat. 2019, 133, 142–146.
63. Yu, J.; Wen, Y.; Yang, L.; Zhao, Z.; Guo, Y.; Guo, X. Monitoring on triboelectric nanogenerator and deep learning method. Nano Energy 2022, 92, 106698.

Figure 1. The research methodology, including six phases.

Figure 2. The final set of features utilised in this study.

Figure 3. The main attributes of each feature set.

Figure 4. The visual analysis of Feature Set 1. (a) The relationship between Price and Neighbourhood_group features; (b) The relationship between Price and Property_type features; (c) The relationship between Price and Room_type features; (d) The relationship between Price and Accommodates features; (e) The relationship between Price and Bathrooms features; (f) The relationship between Price and Bedrooms features; (g) The relationship between Price and Beds features.

Figure 5. The visual analysis of Feature Set 2.

Figure 6. The visual analysis of Feature Set 3. (a) The relationship between Price and Polarity features. (b) The relationship between Price and Subjectivity features.

Table 1. Evaluation of model performance using Feature Set 1.

Model Name	MSE	MAE	RMSE	$R^{2}$
KNN regression	4369.473	2.2966	66.102	0.9728
SVR (without amenities)	0.0007	0.0025	0.0257	0.995
SVR (with amenities)	0.0006	0.0019	0.0245	0.996
Decision tree regression (without amenities)	5518.656	23.7928	74.2877	0.9656
Decision tree regression (with amenities)	5855.591	23.8494	76.5219	0.9635
Ridge Regression	0.0006	0.0016	0.0253	0.997
Lasso Regression	0.0006	0.0019	0.0258	0.997
Bayesian Regression	0.0007	0.0179	0.02759	0.985

Table 2. Evaluation of model performance using Feature Set 2.

Model Name	MSE	MAE	RMSE	$R^{2}$
KNN regression	2,268,707.48	42.7961	1506.2229	0.2998
SVR (without amenities)	0.0006	0.0024	0.0245	0.9998
SVR (with amenities)	0.0005	0.0008	0.02144	0.9998
Decision tree regression (without amenities)	2,257,681.98	50.4614	1502.5585	0.3032
Decision tree regression (with amenities)	2,156,169.00	49.2969	1468.3899	0.3345
Ridge Regression	0.0005	0.0016	0.0241	0.9998
Lasso Regression	0.0005	0.0016	0.0241	0.9998
Bayesian Regression	0.0008	0.0018	0.0284	0.9992

Table 3. Evaluation of model performance using Feature Set 3.

Model Name	MSE	MAE	RMSE	$R^{2}$
KNN regression	843,216.88	16.12	918.27	0.5605
SVR (without amenities)	0.0008	0.0015	0.0278	0.99995
SVR (with amenities)	0.0008	0.0013	0.0277	0.99996
Decision tree regression (without amenities)	183,687.91	29.87	428.59	0.9043
Decision tree regression (with amenities)	183,687.91	29.87	428.59	0.9043
Ridge Regression	0.0005	0.0016	0.0234	0.99995
Lasso Regression	0.0005	0.0016	0.0233	0.99995
Bayesian Regression	0.0005	0.0016	0.0234	0.99952
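
The four metrics reported in Tables 1–3 can be reproduced from a vector of true prices and a vector of predictions. The following is a minimal sketch in plain Python, not the author's implementation; the function name `regression_metrics` and the sample prices are hypothetical and are not taken from the Barcelona dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n       # mean squared error
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    rmse = math.sqrt(mse)                      # root of the MSE

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "R2": r2}


# Hypothetical nightly prices, for illustration only.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because RMSE is the square root of MSE, the two columns in each table row should agree up to rounding. For example, the KNN row of Table 1 reports an MSE of 4369.473 and an RMSE of 66.102, and the square root of 4369.473 is approximately 66.10.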

Table 4. Feature ranking for combined dataset: top 5.

Ridge Regression	Lasso Regression	Bayesian Regression
subjectivity	bedrooms	polarity
polarity	accommodates	subjectivity
bedrooms	beds	bedrooms
accommodates	bathrooms	accommodates
beds	property type	beds
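
One common way to obtain a ranking like the one in Table 4 from a linear model is to sort features by the absolute value of their fitted coefficients. This table does not state how the rankings were derived, so the sketch below is an assumption rather than the author's procedure: it presumes the features were standardised before fitting, and the helper `top_features` and all coefficient values are hypothetical placeholders.

```python
def top_features(coefficients, k=5):
    """Return the k feature names with the largest absolute coefficient."""
    ranked = sorted(
        coefficients.items(),
        key=lambda item: abs(item[1]),  # magnitude, ignoring sign
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Placeholder coefficients, NOT values reported in the paper.
example_coefficients = {
    "bedrooms": 0.8,
    "polarity": -0.6,
    "beds": 0.1,
    "accommodates": 0.5,
    "bathrooms": -0.3,
    "subjectivity": 0.2,
}
top_three = top_features(example_coefficients, k=3)
print(top_three)
```

Coefficient magnitudes are only comparable across features when the inputs share a common scale, which is why the sketch assumes standardised features.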

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
