3.1. Model Proposal
With the widespread development and adoption of various travel apps and websites, tourists often refer to evaluations of hotels by other users. They combine these evaluations with their own preferences for hotel attributes such as service, price, and location when assessing and choosing hotels. However, due to time constraints and the abundance of hotel review information, tourists can only access a portion of the information, leading to a lack of comprehensive and objective assessments of hotels. Addressing the challenge of providing valuable information to tourists from vast amounts of data is essential, and recommendation algorithms are a commonly used approach to solve this problem.
Current research on hotel recommendation algorithms rarely integrates users’ historical data, such as textual reviews. Probabilistic linguistic term sets (PLTSs) are well suited to filling this gap: they can capture users’ perceptions and intentions from textual data, which enables more precise recommendations. Hence, this paper proposes a model that combines PLTSs with online reviews to improve the accuracy of hotel recommendation algorithms. The model consists of three main modules.
Module 1: Sentiment Analysis.
Using the jieba tool for tasks such as word segmentation and part-of-speech tagging, the original review text is processed. Sentiment words and intensity adverbs are analyzed to construct the PLTS for the review text.
Module 2: Attribute Weight Calculation and Similarity Computation.
Based on the PLTS of the review text, attribute weights and similarities are computed to obtain a hotel similarity evaluation matrix.
Module 3: Prediction and Recommendation.
Using the hotel similarity evaluation matrix derived from the probabilistic linguistic term sets, together with users’ historical data, the model generates a recommendation list.
3.2. Material and Methods
With the large-scale development and popularization of tourism software packages and websites, tourists prefer to read other users’ evaluations of hotels and then evaluate and choose hotels based on their own preferences for hotel attributes, such as service, price, and transportation. However, because time is limited and the volume of hotel review information is large, tourists can obtain only partial information and lack a comprehensive and objective evaluation of each hotel. Determining how to provide tourists with valuable information from large volumes of data is an urgent problem, and recommendation algorithms are commonly used to solve it.
In this paper, one set represents the hotels, and a second set represents the hotel attributes that users consider when consulting other users’ reviews during hotel selection. A large number of online review texts are processed using probabilistic linguistic term sets, and once the useful information has been extracted, the system recommends multiple hotels according to tourists’ historical hotel selections. The overall approach is shown in Figure 1.
3.3. Data Acquisition and Preprocessing
After the research content is determined, the data are crawled and processed to prepare for the subsequent recommendation algorithm. The specific steps are shown in Figure 2.
3.3.1. Data Crawling
In this paper, Python 3.6 is used to crawl hotel review data from a tourism website, with all reviews taken from the same period of time. The crawled hotels form the hotel set. The review information for each hotel is exported to a CSV file.
3.3.2. Online Comment Text Processing
After the review data are obtained, the jieba tool performs word segmentation and part-of-speech tagging on the original review text. In addition, we use the UTF-8 stop word table provided by IKAnalyzer to delete words with little or no value, such as modal words and function words, so as to reduce the interference of stop words in sentence analysis. Sentences with ambiguous semantics are processed manually. After word segmentation and stop word removal, the words that describe hotel attributes and evaluations in each review are obtained. These words are collected into one word set per review statement of each hotel.
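The preprocessing described above could be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the function names, the `(word, flag)` token layout, and the stop word file path are assumptions introduced here. The jieba import is placed inside `segment` so that the stop word filter can be used and tested without the third-party package installed.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech flag); layout assumed for this sketch


def segment(text: str) -> List[Token]:
    """Segment and POS-tag one review with jieba's posseg module."""
    import jieba.posseg as pseg  # third-party package: pip install jieba

    return [(pair.word, pair.flag) for pair in pseg.cut(text)]


def load_stop_words(path: str) -> Set[str]:
    """Read a UTF-8 stop word table with one word per line (path is hypothetical)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def remove_stop_words(tokens: Iterable[Token], stop_words: Set[str]) -> List[Token]:
    """Drop whitespace-only tokens and tokens listed in the stop word table."""
    return [(word, flag) for word, flag in tokens
            if word.strip() and word not in stop_words]
```

A typical call would be `remove_stop_words(segment(review_text), load_stop_words("stopwords.txt"))`. Degree adverbs should be kept out of the stop word table in this setting, since the later sentiment rules depend on them.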
3.3.3. Word Analysis of Hotel Evaluation
- (1) Construct word knowledge base
As there is currently no sentiment dictionary specifically for hotel review analysis, a dedicated vocabulary was developed for this model. It was constructed by a professional familiar with the terminology of online hotel reviews, who manually examined the review content. Tourists’ online evaluations of hotels generally involve multiple hotel attributes and express different attitudes, such as attitudes towards hotel facilities and services, as shown in Table 1. By examining online reviews of various hotels, we can identify the attribute objects evaluated in the reviews and judge their sentiment orientation. After consulting experts, we finally built a word knowledge base for analyzing hotel reviews [25].
- (2) Establish standard attribute words
In the model, the attribute words in the reviews reflect which aspects of a hotel tourists pay attention to when choosing it, but different reviewers express the same attribute in slightly different words. Therefore, before the attribute words are analyzed, the word frequencies of the reviews are computed, the standard attribute words are selected, and the common attribute words are classified under them. The hotel attributes form one set, and for each review of each hotel, the attribute words it mentions form another. For a common attribute word found in a review and a selected standard attribute word, their similarity is calculated with the semantic similarity method of the Synonymous Word Forest [19], and the common attribute word is then transformed into the matching standard attribute word.
- (3) Determine sentiment analysis rules
Since reviews are expressed very flexibly, the HowNet sentiment dictionary and manual annotation are used for sentiment analysis of review statements [22]. In addition to distinguishing positive from negative emotion words, a 7-granularity scale is adopted to distinguish the degree of emotion. As shown in Table 2, the polarity of an emotion word expresses the tourist’s positive, neutral, or negative attitude towards the hotel, while degree adverbs such as “extremely” and “very” describe the intensity of that attitude.
In this paper, we construct a more detailed 7-granularity classification of emotion words, transform degree adverbs into 7-granularity linguistic terms, and use them to capture the intensity of emotion words. Suppose $S = \{s_{-3}, s_{-2}, s_{-1}, s_0, s_1, s_2, s_3\}$ is the 7-granularity scale set, where the sign of the subscript of $s$ indicates positive or negative emotion, $s_0$ represents a neutral attitude, and the magnitude of the subscript represents the degree of the attitude: the larger the magnitude, the stronger the attitude. $s_{-3}$ represents “extremely poor”, $s_{-2}$ represents “very poor”, $s_{-1}$ represents “bad”, $s_0$ represents “general”, $s_1$ represents “good”, $s_2$ represents “very good”, and $s_3$ represents “extremely good”.
To make the degree of positive and negative emotions more precise, the classification follows these rules:
- (a) Positive or negative words without degree adverbs are classified as $s_1$ or $s_{-1}$;
- (b) Emotion words modified by any other degree adverb are uniformly classified as $s_2$ or $s_{-2}$, according to the polarity of the emotion word;
- (c) Emotion words modified by intensity words containing “extreme”, “invincible”, “only”, or “very” are classified as $s_3$ or $s_{-3}$, according to the polarity of the emotion word.
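Rules (a) to (c) can be read as a small lookup from polarity and degree adverb to a term subscript. The sketch below assumes the seven terms are indexed by integer subscripts from -3 to 3 and that polarity has already been determined from the sentiment dictionary. The function name and the marker tuple are hypothetical, and the markers are the English glosses quoted in the rules rather than the original Chinese adverbs.

```python
from typing import Optional

# English glosses of the strongest intensity markers named in rule (c);
# a real system would list the corresponding Chinese degree adverbs instead.
STRONGEST_MARKERS = ("extreme", "invincible", "only", "very")


def term_subscript(polarity: int, degree_adverb: Optional[str] = None) -> int:
    """Map an emotion word to a subscript in -3..3.

    polarity: +1 for a positive word, -1 for a negative word, 0 for neutral.
    degree_adverb: the adverb modifying the emotion word, or None if absent.
    """
    if polarity not in (-1, 0, 1):
        raise ValueError("polarity must be -1, 0, or 1")
    if polarity == 0:
        return 0  # neutral attitude, regardless of any adverb
    if degree_adverb is None:
        level = 1  # rule (a): no degree adverb
    elif any(marker in degree_adverb.lower() for marker in STRONGEST_MARKERS):
        level = 3  # rule (c): strongest intensity markers
    else:
        level = 2  # rule (b): any other degree adverb
    return polarity * level
```

For example, a negative word modified by “extremely” maps to subscript -3, while a positive word with no adverb maps to 1.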
3.3.4. Conversion of Online Comments
After the hotel reviews are preprocessed, the data are described and counted using probabilistic linguistic term sets (PLTSs). After data crawling, the attributes in the q reviews of each hotel are transformed into linguistic terms, and the frequency of each linguistic term is counted for every attribute. The result is represented by a probabilistic linguistic term set. The conversion is divided into the following three steps, as shown in Figure 3:
Step 1: For a given hotel, analyze its review text set and attribute word set, and then transform the emotion words in the evaluation text into 7-granularity linguistic terms according to the rules.
Step 2: Calculate the frequency of each evaluation linguistic term for each attribute word, and count the total frequency for each attribute.
Step 3: Calculate the probability of each evaluation term for each attribute word, using Formula (6). After the probabilities for all attribute words have been calculated, use a probabilistic linguistic term set to describe the evaluation set of each attribute of the hotel.
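Steps 2 and 3 amount to turning term counts into relative frequencies. The following sketch assumes that Formula (6) is the plain relative frequency (term count divided by the total count for the attribute) and that each linguistic term is represented by its integer subscript. `build_plts` and the dictionary representation are illustrative choices, not the paper's notation.

```python
from collections import Counter
from typing import Dict, Iterable


def build_plts(subscripts: Iterable[int]) -> Dict[int, float]:
    """Build a PLTS for one attribute of one hotel.

    subscripts: the term subscripts observed for this attribute,
    one entry per mention across the hotel's reviews.
    Returns a mapping from term subscript to its probability.
    """
    counts = Counter(subscripts)
    total = sum(counts.values())
    if total == 0:
        return {}  # the attribute was never mentioned
    return {term: counts[term] / total for term in sorted(counts)}


# Four mentions of "service": two rated s_2, one s_1, one s_{-1}.
service_plts = build_plts([2, 2, 1, -1])
```

Here `service_plts` assigns probability 0.5 to subscript 2 and 0.25 each to subscripts 1 and -1.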
3.4. Steps for Recommendation Algorithms
After the reviews are converted into probabilistic linguistic term sets, the recommendation algorithm is constructed. The specific steps are shown in Figure 4.
The evaluation matrix of the hotels is constructed to describe the obtained probabilistic linguistic term sets. Each entry of the matrix represents the evaluation information about one hotel under one attribute, and the hotel and attribute indices are integers.
Each tourist has different needs, so tourists pay different degrees of attention to the attributes of hotels [18]. In this paper, the maximum deviation method is adopted to determine the attribute weights. In general, the more the hotels deviate from one another under an attribute, the more strongly that attribute differentiates them, and the greater the weight it receives [20]. For a given hotel and a given attribute, the deviation degree between that hotel and each other hotel is calculated first, followed by the total deviation degree between that hotel and all other hotels. Next, the total deviation over all attributes in the evaluation matrix is calculated, and the maximum deviation optimization model is constructed. Finally, the Lagrange function is used to solve the model, and the standardized attribute weights are obtained. The specific calculations are as follows:
Suppose that there is a set of attribute weights. According to Formula (3), under a given attribute, the deviation degree between one hotel and the other hotels is:
The total deviation degree between that hotel and each of the other hotels is:
In the evaluation matrix, the total deviation degree over all attributes is:
The maximum deviation optimization model is constructed as follows:
Then, the Lagrange function is constructed to solve the model:
Finally, the normalized attribute weights are obtained:
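The weight calculation can be illustrated with a short sketch. Maximizing the weighted total deviation subject to the squared weights summing to one gives weights proportional to each attribute's total deviation, and normalizing them to sum to one yields each attribute's deviation divided by the sum over all attributes. The distance below (absolute difference of expected subscripts) is only a hypothetical stand-in for Formula (3), and the dictionary representation of a PLTS and all function names are assumptions of this example.

```python
from typing import Dict, List

PLTS = Dict[int, float]  # term subscript -> probability (assumed representation)


def expected_subscript(plts: PLTS) -> float:
    """Probability-weighted mean subscript of a PLTS."""
    return sum(term * prob for term, prob in plts.items())


def plts_distance(a: PLTS, b: PLTS) -> float:
    """Hypothetical stand-in for Formula (3): gap between expected subscripts."""
    return abs(expected_subscript(a) - expected_subscript(b))


def max_deviation_weights(matrix: List[List[PLTS]]) -> List[float]:
    """Normalized attribute weights; matrix[i][j] is hotel i under attribute j."""
    n_attrs = len(matrix[0])
    deviations = []
    for j in range(n_attrs):
        column = [row[j] for row in matrix]
        # Total deviation under attribute j over all ordered hotel pairs.
        deviations.append(sum(plts_distance(a, b) for a in column for b in column))
    total = sum(deviations)
    if total == 0:
        # No attribute separates the hotels: fall back to equal weights.
        return [1.0 / n_attrs] * n_attrs
    return [d / total for d in deviations]


# Two hotels, two attributes: the first attribute separates them more strongly.
example = [[{1: 1.0}, {2: 1.0}],
           [{-1: 1.0}, {1: 1.0}]]
weights = max_deviation_weights(example)
```

In this toy case the first attribute receives twice the weight of the second, because the hotels differ by two scale steps on it and by only one step on the other.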
Weighted similarity is used to calculate the similarity between hotels. First, the similarity of two hotels under each attribute is calculated. Next, the attribute weights are applied to obtain the weighted similarity. Finally, the similarity matrix between hotels is constructed, from which the similarity between any two hotels can be read directly.
Because cosine similarity tends to give optimistic results [19], while modified cosine similarity tends to give pessimistic results, this paper adopts a compromise in the form of an improved similarity formula, shown in Formula (13), where α is the adjustment coefficient:
- (1) According to Formula (13), calculate the similarity of two hotels under the same attribute.
- (2) The weighted similarity between the two hotels is obtained according to the attribute weights.
- (3) Construct the hotel similarity matrix M:
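One way to picture steps (1) to (3) is the sketch below. It rests on several assumptions that are not confirmed by the text: each PLTS is turned into a seven-element probability vector over subscripts -3 to 3, the "modified" cosine is taken to be the mean-centred cosine, and the compromise is a linear blend controlled by α. The actual Formula (13) may differ, and all names are illustrative.

```python
import math
from typing import Dict, List, Sequence

PLTS = Dict[int, float]  # term subscript -> probability (assumed representation)
SCALE = range(-3, 4)     # subscripts of the 7-granularity scale


def to_vector(plts: PLTS) -> List[float]:
    return [plts.get(term, 0.0) for term in SCALE]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centred_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine after subtracting each vector's own mean (assumed 'modified' form)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def attribute_similarity(a: PLTS, b: PLTS, alpha: float = 0.5) -> float:
    """Assumed blend: alpha * cosine + (1 - alpha) * mean-centred cosine."""
    u, v = to_vector(a), to_vector(b)
    return alpha * cosine(u, v) + (1.0 - alpha) * centred_cosine(u, v)


def similarity_matrix(matrix: List[List[PLTS]], weights: Sequence[float],
                      alpha: float = 0.5) -> List[List[float]]:
    """Weighted hotel-to-hotel similarity; matrix[i][j] is hotel i, attribute j."""
    n = len(matrix)
    return [[sum(w * attribute_similarity(matrix[i][j], matrix[k][j], alpha)
                 for j, w in enumerate(weights))
             for k in range(n)]
            for i in range(n)]
```

With α = 1 this reduces to the plain cosine and with α = 0 to the mean-centred cosine, so α moves the result between the optimistic and pessimistic extremes described above.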
Based on the similarity matrix obtained in Step (3) and the tourist’s historical hotel stays, the hotels whose similarity reaches the threshold are sorted by similarity and recommended to the tourist. Given a hotel where the tourist has previously stayed and a recommendation threshold, the system recommends every other hotel whose similarity to that hotel reaches the threshold; otherwise, the hotel is not recommended.
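The final selection step could look like the following sketch. It assumes that "reaching the threshold" means a similarity greater than or equal to the threshold and that hotels are identified by their row index in the similarity matrix. The function name and the example values are made up for illustration.

```python
from typing import List, Sequence, Tuple


def recommend(sim_matrix: Sequence[Sequence[float]], stayed: int,
              threshold: float) -> List[Tuple[int, float]]:
    """Return (hotel index, similarity) pairs for hotels similar to `stayed`.

    Only hotels other than `stayed` whose similarity is at least `threshold`
    are kept, ordered from most to least similar.
    """
    candidates = [(k, sim) for k, sim in enumerate(sim_matrix[stayed])
                  if k != stayed and sim >= threshold]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Illustrative similarity matrix for four hotels (values are made up).
M = [[1.0, 0.8, 0.3, 0.9],
     [0.8, 1.0, 0.4, 0.7],
     [0.3, 0.4, 1.0, 0.2],
     [0.9, 0.7, 0.2, 1.0]]
ranked = recommend(M, stayed=0, threshold=0.5)
```

For a tourist who stayed at hotel 0, this returns hotel 3 followed by hotel 1, while hotel 2 falls below the threshold and is left out.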