1. Introduction
Tourists typically make informed decisions before traveling to a country or a specific destination. Their final choice may depend on several factors. For example, they may choose their destination based on local culture, seeking insights into its rich history, traditions, and customs [
1]. The cultural distance between the country of origin and the destination may also be a driving factor [
2]. They may wish to explore historic landmarks and archeological sites to understand the destination’s past. They may be driven by the natural beauty of the destination, including the presence of unique landscapes, flora, and fauna [
3]. Of course, they may also act based on the search for personal relaxation or recreation [
4], as well as to participate in an event [
5] or being moved by their personal nostalgia for the place [
6]. In some cases, tourists may visit a country a second time or even become loyal visitors, choosing that country as their travel destination multiple times over the years. Tourists deciding to revisit a country have a special appeal, since it was shown in [
7] that loyal visitors are more likely to spread word-of-mouth advertising and offer a lower risk associated with their profitability. However, first-time visitors are reported in that paper to be less price-sensitive and spend more.
Entities in charge of promoting tourism have a significant interest in identifying would-be visitors. Such entities may be tourism ministries, national or regional tourism boards, chambers of commerce, private tourism companies, and event and conference organizers. There are several reasons for that interest. Identifying potential visitors allows tourism entities to tailor their marketing efforts to specific demographics, interests, and preferences. This targeted approach increases the chances of attracting the right audience for a destination. Knowing the characteristics and preferences of potential visitors may enable tourism organizations to create customized promotional campaigns, such as personalized advertising, special promotions, and tailored messages that resonate with the target audience. Furthermore, by understanding the demographics and geographic locations of potential visitors, tourism entities can allocate their resources more efficiently and even plan and develop the necessary infrastructure to accommodate the increased demand, e.g., by improving transportation networks, expanding accommodation options, and enhancing other essential services.
However, where can tourism-promoting agencies spot would-be visitors?
Social media is gaining an increasing role in tourism promotion [
8,
9,
10], acting as a modern and wide-ranging version of word-of-mouth (eWoM) mechanisms [
11,
12]. eWoM is particularly significant for tourism [
13,
14,
15]. Tourism promotion bodies may use social media to disseminate destination-related information. Social media is also a means through which past or prospective visitors express their opinions, influencing other would-be visitors. Travel blogs have emerged as a way to reach a wide audience and share the blogger’s travel experiences [
16]. Social media influencers may turn into real Internet celebrities whose endorsement may significantly increase the popularity of a destination [
17]. The influence on would-be visitors may be quite remarkable. For millennials, it has been shown that sharing luxury travel experiences and benign envy toward the experience sharer stimulates consumers’ own intentions to visit the same destination [
18].
It is then a natural strategy for tourism promotion bodies to monitor social media and detect would-be visitors. This could be the first step towards opinion analysis tasks and more targeted promotional activities.
The literature has, so far, focused on identifying or checking the antecedents for visiting (or revisiting) a specific destination. Our purpose is, instead, to predict whether a specific prospective tourist manifests an intention to visit a destination. Detecting such an intention may be the first step in an organized strategy to transform that intention into a real visit. The information we exploit to perform that prediction is the text expressing prospective tourists’ opinions on social media. This is a further difference with respect to the literature where questionnaires (i.e., a structured tool rather than a freeform tool) are employed to elicit tourists’ attitudes. As far as the authors know, this is the first such attempt.
In this paper, we propose a sentence-transformer architecture to predict visit intention based on the text of posts on social media. We employ two datasets extracted from Twitter (now X) as the social media of choice and approach the problem as a classification task using logistic regression as a classifier. We show that we reach an average accuracy of 90%, with minimal deviations from that average performance figure.
Our original contributions are the following:
We built two labeled datasets from social media containing opinions about Italy as a tourism destination;
We applied machine learning techniques based on sentence transformers to predict visit intentions, considering both the case of a largely imbalanced dataset and that of a mild or nonexistent imbalance;
We show the results of a preliminary explainability analysis to identify the most significant linguistic features that are associated with posts showing visit intentions.
2. Literature Review
The analysis of visit intentions as expressed on social media has gathered much attention in the literature. In this section, we review the literature, dividing it into two main streams: the papers examining the determinants of the intention to visit and those investigating the use of social media to express or trigger the intention to visit a destination. Much work has been devoted to analyzing either of two well-known investigation areas, which are the
tourism destination image and the
Product Country Image. The relevance of social media as a means to promote tourism by national tourism organizations was already established in [
19]. However, as far as the authors know, there is no paper addressing the same research question as the one dealt with in this paper, i.e., predicting the intention to visit a destination based on the text posted on social media. It is to be noted that our intention is not to understand the determinants of the intention to visit a country but to understand whether a post on social media expresses such an intention or not.
As factors that delimit the scope of the literature analysis, we considered the country of origin, the visit status (currently visiting versus not), and the method of opinion sampling (structured, as in a survey, or unstructured, as in posts freely submitted on a social media platform).
We can first consider the papers aiming to identify the determinants of the intention to visit (or revisit) a country.
A seminal paper is attributed to [
20], who built a structural equation model to describe travel intentions based on both the tourism destination image and the Product Country Image paradigms.
In [
21], a survey was employed to find out whether social media content about Saudi Arabia influences the decision to revisit the country, also through the mediation of perceived value and perceived trust. The survey was distributed to tourists from all over the world who were visiting the town of Al-Ahas in Saudi Arabia. The method employed to assess the degree of influence was standard Structural Equation Modeling (SEM).
Satisfaction was considered to mediate the intention to revisit for British tourists currently visiting Spain or Turkey in [
22]. The research methodology employed a self-administered questionnaire.
The impact of risk and uncertainty on the intention to visit was examined in [
23]. The destination country of choice was Australia, and panels from South Korea, China, and Japan were interviewed through a structured questionnaire for that purpose. A similar study was conducted for Table Mountain National Park in South Africa [
24]. The perception of risk was also found to be critical for the decision to attend mega-events [
25].
Instead, the influence of e-reputation, destination image, and social media marketing efforts on the intention to visit was examined in [
26] through the Stimulus–Organism–Response (SOR) approach. The data were collected again through a survey distributed to tourists in a specific location in India.
The influence of food on the intention to revisit was examined in [
27]. The sample was collected in Delhi, India, and was examined through SEM. The same aspect was investigated in [
28], where the impact of the food images of France, Italy, and Thailand was assessed through an online survey targeting online travel and food groups from
Yahoo.com and
MSN.com.
The image of the destination is also a relevant factor, as shown in [
29], where a SEM-based approach was employed to study its impact for American tourists traveling to Cuba. Four aspects that influence the perception of a tourism destination image were identified in [
30]: experience, history and culture, leisure services, and tourist destination. Data employed in [
30] came from an online travel platform (Ctrip Travel, considered to be the largest in China). Again, the most relevant dimensions were identified in [
31], where posts from TripAdvisor and Mafengwo were analyzed. Sentiment analysis was carried out through VADER. A questionnaire-based survey was further employed to identify which attributes represent the image of a specific tourism destination.
A systematic literature review was conducted in [
32] to investigate the use and impact of user-generated photos.
Two constructs were examined in [
33] to predict the intention to visit a country: destination brand personality self-congruity and perceived risks derived from criminal activities. The data were collected in the USA and concerned the intention to visit Mexico (it was, therefore, a deviation from the other papers where the data collected concerned tourists from different countries). The method chosen for the analysis was, again, SEM.
The topics of interest in a tourism destination were determined by analyzing microblog social media in [
34], both on-site and after the trip.
A standard SEM model was employed in [
35] to investigate the effects of motivation, past experience, perceived constraint, and attitude on revisit intention for mainland Chinese tourists traveling to Hong Kong as their destination. Post-visit impressions were also examined in [
36] by collecting data from micro-forums, blogs, e-commerce platforms, and websites.
Of course, the characteristics of tourists may also play a role in their destination decisions, namely due to the presence of an emotional connection between tourists’ perceived self-image and the brand personality of destinations [
37].
We can now turn to the investigation of social media as a means to express or strengthen the intention to visit a destination.
As to the relevance of social media, it was found that word-of-mouth spread through social media is a significant factor in triggering the intention to visit [
38].
A specific analysis was carried out through questionnaires, considering Malaysia as the destination for medical tourism [
39]. The role of social media (in comparison with other information sources) was found to be particularly relevant for attitudinal loyalty, i.e., the intention to visit coupled with past visits [
40]. Again, the questionnaire tool was employed, targeting tourists from European countries. Social media was also found to be a mediating device for the influence of the subjective norm (i.e., the way others perceive the tourist’s behavior) on the intention to visit a destination [
41]. The impact of social media on the effect that the destination image has on the intention to visit was also found to be significant in [
42], where questionnaires compiled from Malaysian tourists were examined for a very specific destination in Saudi Arabia. A specific investigation concerning the role of posts on Facebook in triggering visit intentions through benign envy was examined in [
43]. A questionnaire-based survey, though restricted to Vietnamese nationals, was carried out in [
44] to determine the intention to visit as a function of the desire to visit, the destination image, envy, and the attitude towards visiting a travel destination. A SEM model was employed. Furthermore, the impact of user-generated content on the intention to revisit and word-of-mouth was analyzed in [
45] through questionnaires collected in a specific destination (the Gulangyu island). The sentiment towards a tourism destination image was also analyzed in [
46], employing nine measurement dimensions.
The opposite target was adopted in [
47], where factors for non-revisiting were investigated using sentiment analysis techniques to label reviews posted on social media and, subsequently, examine the factors leading to positive or negative sentiments.
The relationship between the information quality of social media and the intention to travel was investigated in [
48]. The method of analysis was based on the Elaboration Likelihood Model framework and employed multiple linear regression to test the research hypotheses. The viewpoint adopted in [
48] is, however, different from ours: they study how the intention to travel of a social media user is influenced by other social media users through their posts (and the quality of the information they convey), while we considered whether a post by a social media user expresses their intention to travel.
Similarly, in [
49], the information posted on social media was used as a source of recommendations for tourist destinations, which are filtered and summarized by a BERT architecture. The same considerations put forward earlier apply to the target of the investigation, though a machine learning tool, rather than multiple linear regression, was employed.
Again, questionnaires were employed in [
50] to investigate the impact of social media on tourists’ destination decisions, where the SEM method was employed to analyze the questionnaires. The same considerations made for the previous papers, regarding both the focus of the study and the method of choice, apply here.
The impact of advertising (namely the types of destination advertising) through social media on tourists’ travel intentions was examined in [
51], where an experiment was carried out, and statistical hypothesis testing was adopted.
Tourists’ level of satisfaction, as well as their interaction with digital marketing channels, were investigated as the antecedents of tourists’ behavioral intentions in their destination selection [
52]. The method was, again, SEM, relying on a set of questionnaires.
Means of avoiding the negative impacts of social media-induced tourism were studied in [
53], where the leveraging of social media to encourage desirable traveler behaviors was also proposed. Social media were seen in this case as a tool to be managed and exploited to obtain desirable behavior, rather than as a source of information.
A similar role for social media by influencers, who can actively influence the travel decisions of other social media users, was advocated in [
54], where a mediation model was built and tested through SEM.
The reasons for using social media to plan one’s travels were investigated in [
55], where technological convenience and perceived enjoyment were found to be the biggest motivations for using social media.
The attitude of the endorser, namely their facial expression (e.g., whether smiling or not), was also found to be a significant factor in the success of the influence attempt [
56].
The choice of sampling in a specific geographical location, however important, is a significant limitation of the studies conducted so far, with just a few exceptions.
With respect to the literature, our paper differs as follows. Our research question concerned detecting the intention to visit a destination rather than understanding the factors that influence that decision. Furthermore, we did not rely on questionnaires but drew on the potentially much larger data source represented by social media. The opinions we collected came from users that were not limited to a specific geographical area, unlike in most of the literature. Questionnaires impose limitations because respondents can only answer the questions included in the questionnaire, while posts may include any opinion the person posting wishes to express. Considering social media posts rather than questionnaires thus represents an improvement, since it widens the number of participants, their geographical origins, and their possibilities of expression. Finally, we employed machine learning techniques rather than techniques based on a superimposed structured model like SEM. Structural Equation Models are linear by nature, while a machine learning approach allowed us to capture intrinsically nonlinear relationships, hence offering a more flexible representation of reality. The papers employing machine learning in the literature surveyed above do so for sentiment analysis, which is a task different from the one we were tackling here. This is the case of [
36] (which employs BERT) and [
47], which uses a selection of methods, including decision trees and Support Vector Machines. Latent Dirichlet allocation is employed in [
30] to identify the topics of interest, which, again, is a task different from what we investigated here.
Overall, we can conclude that the goal that we considered here has not been pursued yet in the literature, so a comparison of our results with the literature cannot be made. On the other hand, the methods attempted so far to elicit the opinion of potential visitors (which, again, is a different task than that dealt with here) rely on tools (like questionnaires) and methods (like SEM) that are intrinsically less flexible and less powerful than our machine learning approach based on social media posts.
3. Dataset
In order to predict the visit intention as expressed on social media, we considered Twitter (now X) as the social media of choice. We extracted a large number of tweets using words related to tourism to filter out unrelated tweets. We focused on Italy as a country with a very large tourist base. We employed Twitter’s API to retrieve the tweets of interest [
57,
58], following the approach in [
59]. The query we used for the search was
Italy AND (visit OR holiday OR travel OR trip). We retrieved two datasets at different times using the same query. The two datasets differed largely in size and were labeled by different groups of examiners. The intent was to explore the performance of our method when fed with diverse inputs. In the following, we refer to those datasets as $D_S$ (the smaller one) and $D_L$ (the larger one). After collecting the data using Twitter’s API, we removed the duplicates (there were 169 duplicates in one dataset and 23 in the other).
The text of each tweet was subsequently retrieved and labeled as either showing the intention to visit the country or not. Manual labeling was adopted. Each tweet was submitted to two independent examiners (different for the two datasets). In the case of contrasting labels, the decision was reached by consensus, built through a discussion between the two examiners. In the case of persisting differences, an expert adjudicator was called in to decide. Cross-verification, where each examiner examines the labels provided by the other, was therefore not needed, as the examiners discussed the cases of contrasting labeling. The percentage of cases where consensus building was needed lay below 5% in both datasets. The accuracy of the process relied on the use of two examiners, which reduced the risk of mislabeling. In fact, if we denote the probabilities of mislabeling of the individual examiners as $p_1$ and $p_2$, the probability of obtaining a correct decision without resorting to consensus building was $(1-p_1)(1-p_2)$, and the probability of a straightforward error (i.e., both examiners erring) was $p_1 p_2$, which is much lower than the individual probabilities of error. We referred to the group showing the intention to visit as the
Visit class. Hence, the tweets concerning tourism but with no visit intentions were marked as the
NoVisit class. We stress the fact that we did not have any information about whether the people who posted the tweet actually visited Italy. We acted based on the text only and identified the sheer intention to visit.
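For illustration, a minimal sketch of how such a collection step could be performed with the Twitter/X API v2 through the tweepy library is shown below; the bearer token is a placeholder, and the actual retrieval code used for this study may differ.

```python
import tweepy

# Placeholder credentials: the actual tokens used for the study are not reported here.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# In the API query language, terms separated by spaces are implicitly AND-ed,
# so this corresponds to: Italy AND (visit OR holiday OR travel OR trip).
query = "Italy (visit OR holiday OR travel OR trip)"

tweets = []
for page in tweepy.Paginator(client.search_recent_tweets,
                             query=query,
                             tweet_fields=["id", "text"],
                             max_results=100,
                             limit=50):
    tweets.extend(page.data or [])

# Remove duplicate texts before submitting the tweets to the human examiners.
unique_texts = sorted({t.text for t in tweets})
```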
The datasets’ size and composition are shown in
Table 1. As can be seen, we have two datasets of quite different sizes, $D_S$ and $D_L$, with $D_L$ being roughly seven times as large as $D_S$. The two datasets are made of completely different tweets, i.e., there is no overlap. Furthermore, they exhibit quite different levels of balance. The composition of $D_S$ is largely imbalanced, with the Visit class representing just 4.2% of the total. Instead, $D_L$, though not perfectly balanced, exhibits a mild imbalance, again with the Visit class being the smaller one, representing 37.3% of the total.
Since an imbalanced dataset may lead to misleading results, we employed rebalancing techniques for the dataset $D_S$, which is heavily imbalanced. We refrained from using rebalancing techniques for the dataset $D_L$, since it is just mildly imbalanced.
For the dataset $D_S$, we employed downsampling to achieve a balanced dataset. Downsampling, like all rebalancing actions, aims at removing or mitigating the bias due to the larger presence of the majority class. Though downsampling could introduce a bias of its own if done incorrectly, here we employed random downsampling of the majority class. Specifically, we employed simple random sampling without replacement, where the size of the subsample obtained after downsampling is fixed, and each possible subsample has the same probability of being extracted [
60], thus preserving the original distribution. Since we were applying sampling to a single class (the majority one), class-proportional techniques to preserve the original probability distribution, as in [
61], were not needed. It has been shown, e.g., in [
62], that downsampling helps manage the skewness due to the majority class and achieves better classifier performance. In order to exploit the minority class (
Visit instances) as much as possible, we kept all the
Visit instances and randomly sampled an equal number of
NoVisit instances. Since the number of
Visit instances was 230, we collected 230
NoVisit instances and ended up with a dataset made of 460 tweets (230 per class). However, in order to fully exploit our imbalanced dataset, we employed cross-validation [
63]. Cross-validation involves dividing the available data into multiple folds (subsets), using one of these folds as a testing set, and training the model on the remaining folds. This process is repeated multiple times, each time using a different fold as the testing set. Finally, the results from each testing step are averaged to produce a more robust estimate of the model’s performance. As to the number of folds employed in cross-validation, most textbooks and libraries adopt a default value of 10. However, a recent experiment highlighted that the differences in performance due to the number of folds are quite limited, and the optimal number of folds lies between 10 and 20 [
64], but depends on the overall size of the dataset. We opted for the higher end of the range and chose to use 20 folds. It is to be noted that the use of cross-validation practically removes the loss of information that is typically associated with downsampling, as the construction of several training datasets ends up exploiting all the instances in the dataset. Since we had 4601
NoVisit tweets for $D_S$, we partitioned them into 20 groups of 230 tweets each to match the number of
Visit tweets. This left a single tweet out, so we practically exploited the entire original dataset to its full extent. We selected each of those 20 groups, in turn, to couple them with the single group of
Visit tweets, making 20 balanced datasets of 460 tweets each. Each balanced dataset was made of the unique group of 230
Visit tweets and one of the 20 groups of 230
NoVisit tweets. We carried out a 20-fold cross-validation, expecting something less than a 20-fold improvement in the variance, as all folds had half of their data in common, namely the
Visit tweets. We ran our classification algorithm for each derived balanced dataset, randomly splitting the dataset into two portions, devoted, respectively, to training and testing according to the fixed proportions 80% and 20%, respectively.
As for the dataset $D_L$, the imbalance was not so large as to call for rebalancing. In addition, the dataset was quite large. Hence, we employed it as it was, without any rebalancing or cross-validation.
4. Method
In this section, we describe the method we employed to identify the presence of a visit intention in posts on social media.
The task of distinguishing between Visit and NoVisit instances was modeled as a binary classification task. For the sake of avoiding confusion, we do not use the term positive for the Visit class (and, similarly, the term negative for the NoVisit class), since we will later use those terms in the contrastive approach that is embedded in SetFit. We did not distinguish whether the visit was actually a revisit, as we did not have information about the previous behavior of the people submitting the post. Revisit intentions were then labeled as visit intentions nonetheless, regardless of the fact that the would-be tourist had already visited the country. We stress, once again, that our purpose was not to predict whether the social media user would actually visit the country but rather to predict whether that user was showing the intention to visit the country in their post. In the absence of true, validated data about the social media user’s identity and a tracking system that links that user to their travel behavior (e.g., tracking their entry into a country or their purchase of tickets), predicting users’ actual visits could not be validated, and we were limited to predicting intentions.
In the following, we describe the methodology we employed to analyze the two datasets. As already mentioned, the two datasets exhibited quite a different degree of imbalance. For that reason, we employed different algorithms. Dataset $D_S$ needed a thorough rebalancing activity, while that was not the case for $D_L$. First, we describe the methodology for $D_S$. What follows refers exclusively to $D_S$, until we explicitly turn to $D_L$.
As described in
Section 3, the size of each training/testing dataset in the dataset $D_S$
was not very large by machine learning standards: each fold was made of 460 tweets, which we further subdivided into a training portion (made of 368 tweets) and a testing one (made of 92 tweets). Nowadays, datasets in excess of hundreds of terabytes may be available [
65]. The small-dataset problem is well known in machine learning. The capability of machine learning algorithms to recognize patterns is related to the size of the dataset: the smaller the dataset, the less accurate the algorithm [
66]. However, good results may also be obtained with datasets as small as hundreds of instances [
67]. The set of techniques to deal with small datasets in machine learning falls under the name of
few-shot learning [
34,
68], where typical examples exhibit a number of instances per class ranging from tens to slightly beyond one thousand (see Table 3 in [
68]).
The scarcity of labeled data meant that this dataset could be treated as a case of few-shot learning [
69]. For this reason, we resorted to the
SetFit method presented in [
70], which should allow us to obtain good accuracy even in the presence of a small number of labeled instances in our training dataset.
SetFit relies on sentence transformers and falls into the class of few-shot learning methods, which aim to achieve good classification performance from few labeled examples.
A graphical description of the flow of operation in
SetFit is shown in
Figure 1. It is based on two phases. In the first phase, a sentence transformer is fine-tuned after being fed with sentence pairs under a contrastive approach. In the second phase, a text classification head is trained using the embeddings produced by the fine-tuned sentence transformer as training data. We now describe the steps composing these two phases in
Figure 1 in detail.
The first step, as described in
Figure 1, is represented by feeding the algorithm with our documents. In our case, the dataset described in
Section 3 was fed as a
csv file, where each row contained a tweet and its label.
In the second step in
Figure 1, we generated the pairs of positive and negative sentences. As we will describe hereafter, our method employed a Siamese configuration. Hence, we had to feed the model with pairs of labeled tweets rather than directly with
Visit and
NoVisit instances. For that purpose, we randomly picked pairs of tweets and labeled them. Labeling was performed using a contrastive criterion, i.e., we labeled pairs from the same class (e.g., two tweets both in the
Visit class) as positive and pairs from different classes (e.g., one tweet belonging to the
Visit class and the other in the
NoVisit class) as negative. It is to be noted that labeling now applies to pairs of tweets rather than to single tweets. This contrastive fine-tuning approach allowed us to obtain a much larger training set. As shown in [
70], we obtained a roughly quadratic increase in the number of training instances. We now describe the sentence-pair generating process in more detail.
Formally, we had a dataset of $n$ tweets $x_1, \ldots, x_n$ and their respective labels $y_1, \ldots, y_n$, with $y_i \in \{\text{Visit}, \text{NoVisit}\}$. For each fold, we generated a set of 460 tweets, perfectly balanced between the Visit class and the NoVisit class (i.e., 230 instances for each class). We recall that the Visit instances were always the same, while a subdivision into twenty folds was made for the much larger NoVisit class. By stratified sampling (so as to maintain the balance), we extracted the training and testing datasets according to the 80/20 proportions, which led us to have a training dataset made of 368 tweets and a testing dataset made of 92 tweets. From the dataset, we randomly picked $R$ same-class pairs and $R$ different-class pairs. If the number of tweets in the Visit class is $v$ and that in the NoVisit class is $w$, the number of potential same-class pairs is $\binom{v}{2} + \binom{w}{2}$, where the first term represents all the pairs that we can form by coupling two Visit instances (which can be obtained as the number of combinations of $v$ instances two by two, i.e., $\binom{v}{2} = v(v-1)/2$), and the second term refers, instead, to the NoVisit instances. For the training dataset, since we must have a balanced dataset, in the end, we had $v = w = 184$ after downsampling from the original dataset of 4831 instances (we recall, once again, that downsampling actually refers to the NoVisit class only). The number of potential same-class pairs was then $2\binom{184}{2} = 33{,}672$. On the other hand, the number of potential different-class pairs was $v \cdot w$, where we coupled each Visit instance with each NoVisit instance. This formula is simply the cardinality of the Cartesian product of the set of Visit instances by the set of NoVisit instances. Again, in the case at hand, $v = w = 184$, so we had $184^2 = 33{,}856$ potential different-class pairs. We marked the same-class (positive) pairs with bit 1 and different-class (negative) pairs with bit 0. For both classes, we had two sets of pairs (positive and negative). For the class Visit, after adding the bit showing class concordance, we had the sets of triplets $T_p^{V} = \{(x_i, x_j, 1)\}$, where $y_i = y_j = $ Visit, and $T_n^{V} = \{(x_i, x_j, 0)\}$, where $y_i = $ Visit and $y_j = $ NoVisit. Similarly, for the class NoVisit, we had $T_p^{NV} = \{(x_i, x_j, 1)\}$, where $y_i = y_j = $ NoVisit, and $T_n^{NV} = \{(x_i, x_j, 0)\}$, where $y_i = $ NoVisit and $y_j = $ Visit. We generated the fine-tuning dataset by aggregating the positive and negative triplets across the two classes: $T = T_p^{V} \cup T_n^{V} \cup T_p^{NV} \cup T_n^{NV}$.
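A compact sketch of this pair-generation step is given below; the function name and the parameter R are illustrative, and the sampling details (e.g., drawing pairs without replacement from the enumerated candidates) are assumptions rather than a reproduction of the original code.

```python
import random
from itertools import combinations

def generate_contrastive_pairs(texts, labels, R, seed=0):
    """Draw R same-class (positive, bit 1) and R different-class
    (negative, bit 0) pairs from a labeled set of tweets."""
    random.seed(seed)
    visit = [t for t, y in zip(texts, labels) if y == "Visit"]
    novisit = [t for t, y in zip(texts, labels) if y == "NoVisit"]

    # All potential pairs: C(v,2) + C(w,2) same-class, v*w different-class.
    same_class = list(combinations(visit, 2)) + list(combinations(novisit, 2))
    diff_class = [(a, b) for a in visit for b in novisit]

    positives = [(a, b, 1) for a, b in random.sample(same_class, R)]
    negatives = [(a, b, 0) for a, b in random.sample(diff_class, R)]
    return positives + negatives
```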
The third step in
Figure 1 consists of fine-tuning a sentence transformer. Sentence transformers are derived from pre-trained transformer models that use Siamese and triplet network structures to obtain semantically meaningful sentence embeddings [
71]. By semantically meaningful, it is intended that the vector representations of two semantically similar sentences are close to each other in the vector space. In a Siamese (triplet) neural network, identical algorithms are applied to two (three) input vectors along two (three) parallel paths, with their outputs being compared at the end [
72,
73]. The sentence transformer we used is a modification of the BERT (Bidirectional Encoder Representations from Transformers) model [
74], namely
paraphrase-mpnet-base-v2 (see the model on the Hugging Face repository at
https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2 (accessed on 29 September 2024)). Fine-tuning the model was accomplished by feeding our Siamese configuration with the pairs of positive and negative instances. The semantic similarity was evaluated through the cosine similarity between the embedding vector representations of those tweets. Fine-tuning was carried out by maximizing the cosine similarity between the vectors representing positive pairs and minimizing it for negative pairs (the cosine similarity forming the basis of the loss function), as in [
70]. This optimization (minimization and maximization) task was accomplished through the AdamW optimizer [
75]. This optimizer has been shown in [
76] to yield models that generalize much better than Adam and to compete with the stochastic gradient descent optimizer while training much faster. The fine-tuned model was then used to encode the sentences (step 4 in
Figure 1) and obtain a single sentence embedding per training sample (step 5 in
Figure 1).
After obtaining the optimal encoding, we could use the embeddings now associated with the tweets as features for the classifier. Specifically, the components of each embedding vector, together with the associated label, were fed as input to the classifier. As a classifier, we employed logistic regression, adopting the same choice as in [
70]. The logistic classifier returned a probability value for each of the two classes. We chose the higher-probability class as the output of our classifier.
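The two phases described above, i.e., contrastive fine-tuning of the sentence transformer followed by training of the logistic-regression head, can be sketched with the sentence-transformers and scikit-learn libraries as follows; pairs, train_texts, train_labels, and test_texts are assumed to be available from the previous steps, and the hyperparameters shown (batch size, number of epochs) are illustrative rather than those actually used.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

# Contrastive pairs produced as described above: (text_a, text_b, 1 or 0).
examples = [InputExample(texts=[a, b], label=float(bit)) for a, b, bit in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # pushes positive pairs together, negative pairs apart

# Phase 1: fine-tune the sentence transformer (AdamW is the library default optimizer).
model.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=False)

# Phase 2: encode the training tweets and train the logistic-regression head.
train_emb = model.encode(train_texts)
clf = LogisticRegression(max_iter=1000).fit(train_emb, train_labels)

# At prediction time, the class with the higher probability is returned.
predictions = clf.predict(model.encode(test_texts))
```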
After this long description of the method employed for the dataset $D_S$, we can now describe the much simpler procedure for the dataset $D_L$. We did not need rebalancing. We did not need cross-validation. We did not need to adopt a few-shot learning approach. For $D_L$, we just encoded each tweet through the same sentence transformer employed for $D_S$, without any fine-tuning, and then fed those embeddings to the logistic classifier.
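For the larger dataset, the corresponding pipeline therefore reduces to plain encoding followed by classification, roughly as sketched below; the variable names and the 80/20 split shown here are our own assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

encoder = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")  # no fine-tuning

X = encoder.encode(tweets_large)  # tweets_large: the texts of the larger dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, labels_large, test_size=0.2,
                                          stratify=labels_large, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # overall accuracy on the held-out portion
```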
5. Results
In this section, we show the results of our classification task to predict the intention to visit a tourism destination, Italy in our case.
We employed the following well-established performance metrics:
Accuracy;
Precision;
Recall;
F1.
In order to precisely define these metrics, we now associate the term
Positive with
Visit and
Negative with
NoVisit, so that, e.g., True Positives (TPs) will represent the
Visit instances that were correctly classified as such, while False Negatives (FNs) will represent the
Visit instances that were incorrectly classified as
NoVisit ones. The labels assigned to the instances by the human experts mentioned in
Section 3 were considered as the ground truth.
Accuracy is defined as the percentage of correctly classified instances:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision is the percentage of truly Visit instances over all the instances that have been classified as Visit ones:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is the percentage of Visit instances that have been correctly classified as Visit, i.e., it is the accuracy over the Visit class:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Finally, F1 is the harmonic mean of precision and recall:
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
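Equivalently, these metrics can be computed directly with scikit-learn, as in the short sketch below; y_true and y_pred are illustrative names for the ground-truth and predicted labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: labels assigned by the human examiners; y_pred: classifier output.
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label="Visit")
recall = recall_score(y_true, y_pred, pos_label="Visit")
f1 = f1_score(y_true, y_pred, pos_label="Visit")
```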
We first report the results for the dataset $D_S$ and then for the dataset $D_L$.
As stated in
Section 3, for the dataset $D_S$, we adopted a cross-validation approach to make full use of our data. Since we obtained 20 training and testing datasets, we carried out a 20-fold cross-validation. For each fold, we computed all four performance metrics introduced earlier: accuracy, precision, recall, and F1. We iterated network training over ten epochs.
The average accuracy is over 90% up to five epochs and goes slightly below that (89.4%) at ten epochs. The minimum accuracy achieved over the 20 folds is 85%. The coefficient of variation (ratio of standard deviation to average value) for accuracy is never larger than 3.5%. Hence, the accuracy is very high and quite stable across the folds. We obtain similar performances when looking at the other metrics. The average precision is not lower than 86.7%, with a peak of 90% at epoch 1. Its standard deviation is a bit larger, but always around 5%. We obtain higher values for the recall metric, which exhibits an average value significantly higher than 90% and a coefficient of variation around 3%. As can be seen, all the metrics exhibit a very low standard deviation across the 20 folds. Hence, the performance of our algorithm appears very stable and reliable. Though the performance metrics exhibit very close values, recall is always found to be larger than precision: the algorithm is slightly generous in assigning instances to the Visit class, allowing us to capture more Visit instances at the risk of including some NoVisit instances in the Visit class.
Recall is found to be higher (on average) than precision across all epochs. The algorithm is definitely better at recognizing posts showing the intention to visit than at rejecting posts that do not show that intention. In a real scenario, where negatives largely outnumber positives (with roughly a 20:1 ratio in some datasets like $D_S$, as can be seen in Table 1), this behavior can lead to a considerably higher number of false positives than false negatives and make the algorithm err on the more tolerant side, labeling more posts as positive than would be right.
We can see how the estimated accuracy changes with the number of epochs. In
Figure 2, we have plotted the accuracy with three-sigma error bars vs. the number of epochs. Under the Gaussian assumption for the distribution of the sampling accuracy, the three-sigma interval ensures that accuracy values fall into this interval with 99.7% probability [
77]. Hence, the three-sigma interval can be considered a safe bracket for the accuracy values we can obtain on out-of-sample data. We can see that the average accuracy degrades by a tiny amount (1.8%), probably indicating slight overfitting. In order to provide a quantitative indication of the degree of overfitting, we compared the accuracies obtained on the training and testing datasets. Overfitting occurs when the ML model parameters fit the training dataset too closely at the expense of the capability to generalize and obtain good accuracy on the testing dataset. For that reason, we considered the following overfitting metric, where the accuracy gap is normalized to the average accuracy over the testing and training datasets:
$$\frac{A_{\text{train}} - A_{\text{test}}}{\left(A_{\text{train}} + A_{\text{test}}\right)/2}$$
where $A_{\text{train}}$ and $A_{\text{test}}$ denote the accuracy on the training and testing datasets, respectively. The value we obtained corresponds to an accuracy gap slightly larger than 2%, which can be considered quite low. Going back to Figure 2, the estimation interval width is quite steady, with a 3-$\sigma$ interval of roughly ±0.09 (i.e., roughly 10% of the average accuracy).
We can achieve a better understanding of the way our algorithm errs and where it may be improved by looking at the confusion matrix. We focused on the two extreme cases, i.e., the best-accuracy fold (fold 9 at epoch 1) and the worst-accuracy fold (fold 14 at epoch 1). We report the two confusion matrices in
Table 6 and
Table 7, respectively. The number of total instances is 92, i.e., 20% of the total dataset made of 460 instances. As can be seen, our algorithm is quite balanced, with type I errors (i.e., mistaking no-intention posts for intention ones or false positives) slightly prevailing in the best-performing fold and the reverse happening for the worst-performing fold (where false negatives slightly prevail). Of course, as already hinted, these values refer to our balanced dataset and would change in the presence of an imbalanced scenario that may occur in the actual application of the algorithm.
We can now examine the results for the dataset
$D_L$. In
Table 8, we show the confusion matrix. As stated in
Section 4, for the dataset $D_L$, we did not need to resort to rebalancing and cross-validation. Hence, there is no across-fold analysis. The overall accuracy is 90.2%, while the per-class accuracy values are, respectively, 89% for the
Visit class and 91% for the
NoVisit class. As can be seen, the performances of the two classes are very close, and the bias towards the majority class is extremely limited without resorting to rebalancing. If we look at the other performance metrics as we did for
$D_S$, we obtain a precision of 85.4% and a recall of 89%. Again, the two metrics are rather close, but recall is higher than precision. This result confirms what we observed for $D_S$: the algorithm tends to include more instances than necessary in the Visit class. Furthermore, the performance appears quite aligned for both the small, heavily imbalanced dataset $D_S$ and the larger, mildly imbalanced dataset $D_L$.
Finally, we analyzed the linguistic features associated with the users’ intentions. We first analyzed two sample tweets to examine how the sentence elements contribute to the algorithm’s decision in a concrete case, and we then considered the top features across the whole dataset. The two sample tweets were labeled, respectively, as
Visit and
NoVisit and correctly classified by our classifier. For both, we carried out an explainability analysis by adopting the LIME (Local Interpretable Model-agnostic Explanation) method [
78], using the words in the tweets as features (see [
79] for a recent survey of explainability methods for transformers). The LIME method provides a surrogate linear model, where the weights in the regression equation represent the relative importance of the associated features (each feature acts as a regressor in the linear model) in leading to the classifier’s decisions. In
Figure 3, we see that the major features justifying the Visit label mention Italy, and the verb wanna indicates a voluntary action. On the other hand, the major features for the NoVisit tweet in Figure 4 are, again, Italy and want as the main positive features, while the verb think and the personal name Daniel appear as the main negative features leading to the overall negative intention. In both cases, verbs appear as a major element in explaining the classifier’s decisions.
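A minimal sketch of how such a LIME explanation could be produced for a single tweet is shown below; model, clf, and sample_tweet are illustrative names for the fine-tuned sentence transformer, the logistic classifier, and the tweet being explained.

```python
from lime.lime_text import LimeTextExplainer

def classifier_fn(texts):
    """Map raw texts to class probabilities by chaining the sentence
    transformer and the logistic-regression classifier."""
    return clf.predict_proba(model.encode(list(texts)))

# class_names should follow the order of clf.classes_.
explainer = LimeTextExplainer(class_names=["NoVisit", "Visit"])
explanation = explainer.explain_instance(sample_tweet, classifier_fn, num_features=10)
print(explanation.as_list())  # (word, weight) pairs of the local surrogate linear model
```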
In
Figure 5, we see, instead, the ten features with the highest average weight, considering all the tweets. All the terms have been pre-converted to lowercase. We see two terms,
travel and
visit, that form the object of the intention expressed in the tweet. We see, again, the voluntary verb
wanna and the motion verb
go, associated with the motion preposition
to. A specific destination appears (
Taormina). The picture is completed by a reference to the users themselves (
myself) and
watercolour, related to the landscape sometimes associated with the destination.