1. Introduction
User modeling and personalization have been the cornerstone of many new services and products in the high-tech industry. The personalization of contents reduces information overload and improves both the efficiency of the marketing process and the user's overall satisfaction. This is especially relevant in e-commerce websites (e.g., Amazon) and Social Networking Services (SNSs), where users may read published opinions to gather a first impression of an item before purchasing it and express their opinions on products they like or dislike. Recommender systems use this information, transforming the way users interact with and discover products on the web. When users assess products, the website models how the assessments are made in order to recommend new products they may be interested in [1] or to identify users of similar taste [2]. Most existing recommendation systems fit into one of the following two categories: (i) content-based recommendation or (ii) collaborative filtering (CF) systems [3].
The first approach defines the user profile that best represents all the gathered personal information (such as tags, keywords, text comments, and likes/dislikes [4]) and recommends items based on each individual's characteristics. These recommendations are then generated by comparing items with user profiles [4]. The main benefit of this approach is the simplicity and ease of interpretation of the recommendations it provides. One of its main drawbacks is that it ignores user opinions on different products, taking only their preferences into account.
Figure 1 shows the metadata of a real text review that includes the user's opinion. Several works try to overcome this limitation by including text features of the user's reviews (e.g., frequencies of occurrence of words) in the model. In [5], the authors incorporate reviews, items, and text features into a three-dimensional tensor description to unveil the different sentiment effects that arise when the same word is used by different users in rating different items. These improvements lead to better ranking predictions when compared to previous models. Also, ref. [6] presents an extension of the user-based algorithm that uses the similarities between text reviews to measure the similarity between users, outperforming conventional algorithms that only use the ratings as inputs.
On the other hand, the CF category has achieved the most successful results, as shown in the Netflix Prize Challenge [7,8]. This method uses the similarities among users to discover the latent model that best describes them and retrieves predicted rankings for specific items. Modifications have later been proposed to properly address negative latent factors [9,10] as well as to gain interpretability [11]. In this last reference, the authors present a hidden factor model to understand why any two users may agree when reviewing one movie yet disagree when reviewing another: the fact that users may have similar preferences towards one genre, but opposite preferences for another, turns out to be of primary importance in this context. The same authors also propose in [1] the use of latent factors to achieve a better understanding of the connection between the rating dimensions and the intrinsic features of users and their tastes.
During the last decade, the proliferation of social media has gone hand in hand with the analysis of user opinions and sentiments [12], which has been applied successfully in a variety of fields such as social networks [13] or movies [14]. In [15], the authors use text reviews to describe user interests and sentiments, thereby improving the results obtained in the prediction of ratings. Text reviews are also used in [16,17] to guide the learned latent space by combining content-based filtering with collaborative filtering techniques. However, predicting the opinion or sentiment associated with non-existent reviews remains an open question that deserves further attention [18]. This article focuses on improving current recommendation systems by predicting the sentiment strength scores for a series of opinion keywords that describe each item.
This can be illustrated with the following text review (see Figure 1): "These are probably the best sounding guitar strings without breaking the bank. I have made them last up to 2 months …still obtaining decent sound". Its overall rating score is 4, and a natural language processing analysis provides information about the most representative opinion-words of that review (see Section 2 for more detail on this analysis). Here, the proposed dictionary for the musical instruments domain that this review belongs to is composed of the words {feel; guitar; string; price; quality; sound; tone}, and the extracted sentiments for the opinion keywords subject to prediction in this review are {guitar (+1.2); string (+0.3); sound (+2)} (sentiment strength scores are enclosed in parentheses). Similarly, the sentiment keywords for a second review example, "These strings sound great. I love the sustained bright tone. They work well with my cedar top guitar. Easy on the fingers too" (with an overall score of 5), are {string (+4); cedar (+1); finger (+2); guitar (+5); tone (+3)}.
The previous examples illustrate the usefulness of predicting sentiment strength scores in this context: they provide insights into the reasons why a user would like or dislike a product (recommendation explanation). These predictions are highly valuable for personalized product recommendation since they explain why the recommendation system 'thinks' users would like the recommended product. They also allow a double-check of the assumptions made by the system about the user's needs, making the system more robust. Furthermore, it is well known that emotions are important factors that influence overall human effectiveness, including rational tasks such as reasoning, decision making, communication, and interaction [19]. In this context, the prediction of sentiment keywords can be an intermediate step towards the prediction of user emotions (e.g., using Ekman's six basic emotions [20,21]) associated with the item under recommendation [22,23].
Our main contribution is to propose and describe a model that combines the use of latent spaces (connected with user tastes and product features), sentiment language analysis, and matrix factorization techniques to predict user opinions. Concretely, we generate a distinct vocabulary for each item based on previously submitted reviews and predict the sentiment strength a user would give to each opinion-word in that vocabulary. This article provides an extensive analysis (both qualitative and quantitative) showing the capabilities of the proposed method on the Amazon benchmark dataset [24]. This gives us a better understanding of the main reasons why a user would like or dislike a product.
We would like to stress that we are not interested in predicting the opinion-words themselves, but the strength of the sentiment associated with the item characteristics. Our approach naturally extends previous works [25,26], assuming the existence of a latent space that accurately represents user interests and tastes [11]. The approach is based on a two-step process: (i) setting up the opinion dictionary associated with the Amazon dataset under study, and (ii) predicting the sentiment scores that users would assign to keywords describing an item should they have the opportunity to review it, based on the hidden dimensions that represent their tastes and interests. It is well known that the input matrix sparsity (a sub-problem of the cold-start problem) of this type of dataset is typically very large (≈99%) [27]. This explains the need to reach a trade-off between the maximum number of opinion keywords that it is desirable to predict without enlarging the sparsity too much, and the minimum number to be predicted in order to ensure minimal functionality.
2. Latent Space Based Learning
The first step toward the creation of a model to predict the sentiment score of previously unseen item characteristics is to define a specific vocabulary for each product, since users' overall opinion orientations are based on the opinion-words they use to describe the items.
2.1. Opinion Dictionary Generation
Opinion-words are often referred to as sentiments in the literature and categorized as:
Rational sentiments, namely "rational reasoning, tangible beliefs, and utilitarian attitudes" [38]. An example of this category is given by the sentence "This camera is good", which does not involve emotions like happiness at all. In this case, the opinion-word (the adjective "good") fully reveals the user's opinion on the camera.
Emotional sentiments, described in [12] as "entities that go deep into people's psychological states of mind". For example, consider the sentence "I trust this camera". Here, the opinion-word "trust" clearly conveys the emotional state of the writer.
An opinion dictionary contains both the keywords that represent item characteristics and the sentiment strength scores associated with them. As in [39,40], these scores are determined using a word-similarity-based method that assumes that similar meanings imply similar sentiments, and they are then weighted based on the SentiWordNet sentiment corpus [41]. Table 1 shows an example of the obtained sentiment scores associated with several item characteristics of a phone review.
The dictionary can be of constant length (common to all products), containing the most commonly used opinion-words in the whole review dataset [42], or, as proposed here, of variable length. The latter choice avoids the following problems: (i) using a large set of words that are not shared by different products, thereby introducing unnecessary sparsity in the input matrix; (ii) the worsening of the prediction capabilities of the algorithm caused by this increased sparsity; and (iii) the fact that using only generic opinion-words as vocabulary allows the prediction of the overall opinion for a product but not of the sentiment associated with its characteristics. Since different products have truly different features, we must consider a distinctive set of opinion-words for each of them, reducing the number of features and the sparsity of the input matrix.
In this context, every user review of a particular item reduces to a (possibly sparse) array of sentiment scores associated with the item vocabulary. If an opinion keyword occurs more than once in the review, its final score is calculated simply as an average.
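For concreteness, the following Python sketch illustrates this step under simplifying assumptions: extract_opinions is a hypothetical stand-in for the sentence-level NLP service described next, returning (keyword, score) pairs for one review, and min_occurrences anticipates the frequency threshold discussed in Section 3.1.

```python
from collections import Counter, defaultdict

def build_item_dictionary(item_reviews, extract_opinions, min_occurrences=3):
    """Collect the opinion keywords used in the reviews of one item and keep
    only those appearing at least `min_occurrences` times (see Section 3.1)."""
    counts = Counter()
    for review in item_reviews:
        counts.update(keyword for keyword, _ in extract_opinions(review))
    return sorted(kw for kw, c in counts.items() if c >= min_occurrences)

def review_feature_vector(review, vocabulary, extract_opinions):
    """Map one review to a sparse {keyword: score} dict over the item vocabulary.
    If a keyword occurs several times in the review, its scores are averaged."""
    scores = defaultdict(list)
    for keyword, score in extract_opinions(review):
        if keyword in vocabulary:
            scores[keyword].append(score)
    return {kw: sum(vals) / len(vals) for kw, vals in scores.items()}
```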
These feature vectors were obtained using a solution similar to that described in [43] (we used the NLP services graciously provided to us by www.bibtex.com), which carries out sentiment analysis at the sentence level and can detect as many opinions as are contained in a sentence. This analysis identifies the opinion keyword (the what) for each opinion sentence and calculates a numerical sentiment strength score for it based on its associated sentiment text (the why). The analysis also enables us to syntactically analyze the texts and extract the simple (e.g., 'guitar') and compound (e.g., 'quality_interface') opinion keywords to include in the feature vectors. To further elaborate one of the examples provided above, three distinct sentiment opinions may be extracted from the review "This phone is awesome, but it was much too expensive and the screen is not big enough", namely: a positive opinion on the phone itself ('awesome'), a negative one on its price ('much too expensive'), and a negative one on the screen size ('not big enough').
It is worth noting that we validated the generated vocabularies by verifying that they follow the frequency distribution that characterizes the majority of natural languages, a power law known as Zipf's law [44,45]. This validation is shown in Figure 2, where the frequency of occurrence of the opinion keywords in Amazon's "Instant Video" dataset is plotted as a function of their rank $n$. The distribution that best fits the data is a power law of the form $f(n) \propto 1/n^{s}$, with $s$ the fitted exponent.
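As an illustration, the exponent $s$ can be estimated from the keyword frequency counts by an ordinary least-squares fit in log-log space. The sketch below is only illustrative, and the counts used in the example are made up.

```python
import numpy as np

def zipf_exponent(frequencies):
    """Fit frequency ~ C * rank^(-s) by ordinary least squares in log-log
    space and return the estimated exponent s."""
    freq = np.sort(np.asarray(frequencies, dtype=float))[::-1]  # descending
    ranks = np.arange(1, len(freq) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freq), 1)
    return -slope

# Example with made-up keyword counts: a roughly Zipfian profile yields s close to 1.
print(zipf_exponent([1000, 480, 330, 260, 210, 170, 150, 130, 115, 100]))
```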
2.2. Notation and Input Matrix
In this section, we explain the opinion prediction model based on matrix factorization in full detail and introduce the terminology and notation used throughout this article. One of the challenges in the design of the model lies in the use of a vocabulary rich enough to characterize the user viewpoints, but not so large (notice that without any filtering the number of possible words obtained for the used datasets is ≈10,000) as to impede numerical computations or the learning capabilities. The main difficulty in dealing with such models does not lie in their computational aspects, however, but in the sparsity and noise of the data. This sparsity is critically linked to the cold-start problem in recommendation systems. A typical online shopping website with SNS capabilities provides, for the purposes of this article, N users writing reviews on a set of M items. Generally, a given user will have scored and reviewed only a subset of these M items, thus making the website's database highly sparse.
Let $\mathcal{S}$ denote the set of user-item pairs $(u,i)$ for which written reviews do exist, and let $\mathbf{r}_{u,i}$ be their associated feature vectors. Our information domain consists therefore of triples of the form $(u, i, \mathbf{r}_{u,i})$. In general, each item $i$ is described by a different set of opinion topics (keywords) of size $L_i$. The vector $\mathbf{r}_{u,i} \in \mathbb{R}^{L_i}$ is populated with the sentiment strength scores given to the topics in the $(u,i)$-th review.
The website's 2-dimensional input matrix $\mathbf{S} \in \mathbb{R}^{N \times D}$ is set up by concatenating the $M$ sparse matrices containing the reviews for the products. Thus, $D = \sum_{i=1}^{M} L_i$ denotes the sum of the vocabulary sizes for the $M$ products.
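A minimal sketch of this concatenation step, assuming the per-review feature dictionaries produced in Section 2.1 and using SciPy sparse matrices (an implementation choice not specified in the article), could look as follows. Note that, as written, entries with a score of exactly zero are indistinguishable from missing ones.

```python
from scipy.sparse import lil_matrix, hstack

def build_input_matrix(n_users, item_vocabularies, reviews):
    """Concatenate one sparse block per item into the N x D input matrix,
    where D is the sum of the per-item vocabulary sizes.
    `reviews` maps (user, item) -> {keyword: score} feature dicts."""
    blocks = []
    for item, vocab in enumerate(item_vocabularies):
        index = {kw: k for k, kw in enumerate(vocab)}
        block = lil_matrix((n_users, len(vocab)))
        for (user, it), features in reviews.items():
            if it == item:
                for kw, score in features.items():
                    block[user, index[kw]] = score
        blocks.append(block)
    return hstack(blocks).tocsr()
```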
Let $s_j^{\min}$ and $s_j^{\max}$ be, respectively, the minimum and maximum sentiment scores contained in the $j$-th column of the input matrix $\mathbf{S}$. Each entry $s_{uj}$ is then replaced by its normalized value $\tilde{s}_{uj}$ defined as

$$\tilde{s}_{uj} = \begin{cases} s_{uj}/s_j^{\max} & \text{if } s_{uj} > 0, \\ s_{uj}/\lvert s_j^{\min}\rvert & \text{if } s_{uj} < 0, \\ 0 & \text{if } s_{uj} = 0. \end{cases}$$
Please note that two different normalization scales are used depending on the entry sign to prevent turning positive scores into negative ones and vice versa. In what follows, normalized scores will be denoted simply by $s_{uj}$.
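The following NumPy sketch implements this sign-preserving normalization column by column; the use of NaN to mark missing entries, and NumPy itself, are implementation assumptions made for illustration.

```python
import numpy as np

def normalize_columns(S):
    """Column-wise normalization that preserves the sign of each score:
    positive entries are divided by the column maximum, negative entries by
    the absolute value of the column minimum. Missing entries stay NaN."""
    S = np.array(S, dtype=float)
    out = np.full_like(S, np.nan)
    for j in range(S.shape[1]):
        col = S[:, j]
        pos, neg = col > 0, col < 0
        if pos.any():
            out[pos, j] = col[pos] / np.nanmax(col)
        if neg.any():
            out[neg, j] = col[neg] / abs(np.nanmin(col))
        out[col == 0, j] = 0.0
    return out
```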
Table 2 summarizes the terminology and notation used in this article. See also
Figure 3 for further clarification.
2.3. Prediction Model
We aim to obtain a prediction model that minimizes the reconstruction error of the sentiment strength scores over all users, items, and opinion-words. To minimize this error, our model uses the Alternating Least Squares (ALS) method to reconstruct the user opinions, that is, the feature vectors $\mathbf{r}_{u,i}$ not included in $\mathcal{S}$. This factorization collects patterns of taste among similar reviewers, generating automatic predictions for a given user.
Specifically, we subject the input matrix $\mathbf{S}$ to an ALS factorization [30] of the form

$$\mathbf{S} \approx \mathbf{X}\,\mathbf{Y}^{\top}$$

in order to estimate the missing reviews. Here, $\mathbf{X} \in \mathbb{R}^{N \times K}$ and $\mathbf{Y} \in \mathbb{R}^{D \times K}$, where $K$ is the number of latent factors or features [8], a predefined constant typically of the order of ten. Any sentiment score $s_{uj}$ can then be approximated by the scalar product $\hat{s}_{uj} = \mathbf{x}_u \cdot \mathbf{y}_j$ (recall that $\hat{s}_{uj}$ is the prediction), where $\mathbf{x}_u$ is the $u$-th row of $\mathbf{X}$ and $\mathbf{y}_j$ the $j$-th row of $\mathbf{Y}$. The procedure monotonically minimizes the following quadratic loss function until convergence is reached:

$$\mathcal{L}(\mathbf{X},\mathbf{Y}) = \sum_{(u,j)\,\in\,\Omega} \left(s_{uj} - \hat{s}_{uj}\right)^{2} + \lambda \left(\sum_{u} \lVert \mathbf{x}_u \rVert^{2} + \sum_{j} \lVert \mathbf{y}_j \rVert^{2}\right),$$

where $\Omega$ denotes the set of observed entries of $\mathbf{S}$. Here, $\lambda$ denotes the regularization parameter that prevents overfitting and plays a crucial role in balancing the training error and the size of the solution (see Section 4 for further details).
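As a reference, a minimal PySpark sketch of this factorization using Spark's RDD-based ALS (the library mentioned in Section 3.2) is shown below; the triples, the rank, the number of iterations, and the regularization value are placeholders, not the values tuned in Section 3.3.

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="opinion-als")

# Each observed entry of the input matrix becomes a (user, column, score) triple,
# where `column` indexes an (item, opinion-keyword) pair of the concatenated matrix.
observed = sc.parallelize([
    (0, 0, 0.6), (0, 2, -0.4), (1, 0, 1.0), (1, 1, 0.3), (2, 2, 0.8),
]).map(lambda t: Rating(t[0], t[1], t[2]))

K, n_iter, lam = 10, 20, 0.1          # placeholders, tuned by cross-validation
model = ALS.train(observed, rank=K, iterations=n_iter, lambda_=lam)

# Predict the sentiment score a user would give to a keyword not yet reviewed.
missing = sc.parallelize([(0, 1), (2, 0)])
predictions = model.predictAll(missing).map(lambda r: ((r.user, r.product), r.rating))
print(predictions.collect())
```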
3. Experiments
We tested our model using different Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/) processed as explained in Section 2.1 and Section 2.2. We performed an exhaustive number of analyses in order to optimize the internal parameters of the model, namely $K$, $\lambda$, and the number of ALS iterations, and report the resulting global performance for comparison and validation purposes.
3.1. Dataset
To conduct our experiments, we chose four 5-core Amazon datasets [24] of increasing size (it roughly doubles from one dataset to the next) and similar characteristics, especially regarding their ratios between positive and negative sentiment scores. These datasets belong to different domains (musical instruments, automotive, instant video, and digital music), and the union of the four is also used to measure the generalization capabilities of our method.
Table 3 lists the datasets and summarizes their main features.
To reduce sparsity, we selected those records with at least five reviews for each item and user, and a minimum of three occurrences was also required for the inclusion of a word in the opinion dictionaries, so as to retain only the most relevant opinion keywords while keeping the complexity of the problem manageable. Thus, only a term that occurred at least three times in the subset of reviews for an item was considered relevant and hence included in that item's dictionary. Recall that this same term may or may not be present in other item dictionaries. Table 3 shows the number of entries before and after this threshold was applied (see its fourth column).
Table 4 also shows some of the vocabularies generated in this way.
3.2. Technical Aspects
The implemented algorithms are scalable and may be executed over large datasets using any Hadoop-based cluster or distributed computational infrastructure. Our numerical computations were performed on a distributed system of two executors with 16 cores and 256 GB of RAM, on a Hadoop cluster with a total of 2.8 TB of RAM, 412 cores, and 256 GB of HDFS storage. We implemented our code using the Python 3.5 programming language and the RDD-based collaborative filtering algorithms provided by Apache Spark, both of which are solutions of well-known efficiency, robustness, and usability.
3.3. Experimental Design
We performed experiments on each of the four datasets (musical instruments, automotive, instant video, and digital music) to test the learning capabilities of the proposed method. Moreover, we performed an additional experiment to test the generalization capabilities by joining the four datasets to create a bigger, domain-independent one. This implies that the vocabulary is global, and the goal is to test how large the difference between the domain-dependent and domain-independent approaches is.
We evaluated our model performance and set the ALS parameters for each dataset by means of a 5-fold cross-validation. The performance was measured by analyzing the Mean Squared Error (MSE), accuracy, $F_1$-score, precision, recall, and Area Under the Curve (AUC) metrics. The MSE measures the error obtained by the prediction model during the cross-validated learning process and has been used to prevent overfitting. The accuracy provides an overall success criterion, $(TP+TN)/(TP+TN+FP+FN)$, without considering the data distribution. The precision, $TP/(TP+FP)$, provides the percentage of the predicted opinion keywords (and their sentiment) that are correct. The recall, $TP/(TP+FN)$, provides the percentage of the existing opinion keywords (and their sentiment) that are correctly detected, and the $F_1$-score, $2\,\mathrm{precision}\cdot\mathrm{recall}/(\mathrm{precision}+\mathrm{recall})$, has been used as the application success metric because it is equally important to detect an opinion keyword that exists as it is not to report one that does not. Finally, the AUC metric has been used to verify that the obtained model behavior is independent of the data distribution.
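A compact sketch of how these metrics could be computed from the true and predicted scores, using scikit-learn (the article does not state which implementation was used), is given below; the threshold of 0 separating the positive from the negative/neutral class is the binarization used in Section 4.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_squared_error)

def evaluate(y_true_scores, y_pred_scores, threshold=0.0):
    """MSE on the raw predicted scores, plus classification metrics on the
    binarized ones (positive vs. negative/neutral)."""
    y_true = [1 if s > threshold else 0 for s in y_true_scores]
    y_pred = [1 if s > threshold else 0 for s in y_pred_scores]
    return {
        "MSE": mean_squared_error(y_true_scores, y_pred_scores),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        # AUC is computed from the continuous predicted scores.
        "AUC": roc_auc_score(y_true, y_pred_scores),
    }

print(evaluate([0.8, -0.2, 0.5, -0.6], [0.6, 0.1, 0.4, -0.3]))
```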
To obtain the input matrix $\mathbf{S}$, we first processed the reviews using NLP and the opinion dictionary generation method explained in Section 2.1, and then obtained the optimal parameters by conducting the following experiments (a minimal sketch of the corresponding parameter search loop is given after this list):
Combined analysis of the $K$ and $\lambda$ parameters to probe whether they can be set independently for each dataset.
Analysis of the influence of the optimal $K$ value on the model performance for different datasets and $\lambda$ values.
Analysis of the number of iterations needed until convergence is reached.
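The parameter search itself can be organized as a simple cross-validated grid, as sketched below; train_als and f1_on_fold are hypothetical wrappers around the training and evaluation steps described above, and the candidate values of $K$ and $\lambda$ are illustrative.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def grid_search(entries, train_als, f1_on_fold,
                K_values=(5, 10, 20, 40), lambda_values=(0.01, 0.05, 0.1, 0.5)):
    """5-fold cross-validated search over the ALS rank K and regularization
    lambda. `entries` is the list of observed (user, column, score) triples."""
    folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(entries))
    results = {}
    for K, lam in itertools.product(K_values, lambda_values):
        scores = []
        for train_idx, test_idx in folds:
            model = train_als([entries[i] for i in train_idx], K, lam)
            scores.append(f1_on_fold(model, [entries[i] for i in test_idx]))
        results[(K, lam)] = np.mean(scores)
    best = max(results, key=results.get)
    return best, results
```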
We want to point out that, despite the use of a cluster, it was still very time-consuming to perform an exhaustive trial and error over all the possible parameter combinations of the model. Hence, when deciding on a choice for $K$ and $\lambda$, we had to reach a compromise between the model error and the running time of the ALS method, which grows rapidly with $K$ [46]. For this purpose, we followed an incremental approach. We first corroborated that $\lambda$ and $K$ are, in effect, unrelated parameters in our model (see Figure 4). Indeed, the $F_1$-score versus $\lambda$ curves have the same shape for different values of $K$, and all of them attain their maxima at the same $\lambda$. Consequently, $\lambda$ and $K$ may be optimized independently. Note also that the vertical distance between $F_1$-scores for a given $\lambda$ is negligible for sufficiently large values of $K$ (the additional reduction of the MSE becomes negligible beyond this point), and therefore a fixed value of $K$ in this regime is used throughout this article. Then, we evaluated the optimal number of iterations until convergence of the ALS method.
Figure 5 shows the MSE values as a function of the number of ALS iterations for fixed values of $K$ and $\lambda$. The number of iterations necessary for model convergence is expected to depend mostly on the size of the dataset (the larger the dataset, the lower the MSE once $\lambda$ and $K$ have been set in advance). The curves converge almost asymptotically towards low error values, reaching a plateau in which the model performance is virtually constant and the best learning results are attained; the number of iterations at which this plateau is reached was adopted for the subsequent experiments. The last set of experiments performed to configure the parameters of our model measures the $F_1$ and AUC metrics as functions of $\lambda$ in order to obtain the optimal regularization parameter for each dataset. Both Figure 6 and Figure 7 show a similar behavior (in particular, they attain their corresponding maxima at the same $\lambda$), although AUC has larger standard deviations for some datasets (see, for instance, Figure 7a). When needed, these curves were also used for clarification purposes, helping us to point out the optimal $K$ rank. The optimal values of $K$ and $\lambda$ obtained in this way are used in the remainder of the article.
4. Results
To extract the user's opinion (positive or neutral/negative) on the different item characteristics, once the ALS factorization of the input matrix is completed, we proceed to categorize the output data (i.e., the predicted sentiment scores) into two classes depending on whether the opinion for a specific topic is positive (the positive class) or negative or neutral (the negative/neutral class).
The model performance is assessed by means of confusion matrices and their associated metrics (accuracy, $F_1$-score, precision, and recall), which measure the percentage of correctly predicted values. These metrics provide insights into the goodness of the results but, due to the unbalanced positive/negative sample ratio and the equal importance of misclassifying a positive or a negative prediction, the AUC is recommended as the performance metric. Recall that the AUC considers the relationship between the sensitivity (true positive rate) and the fall-out (false positive rate), making it a more robust metric for our case.
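For illustration, the binarization and the confusion matrix can be obtained as follows (scikit-learn assumed; the scores are made up):

```python
from sklearn.metrics import confusion_matrix

# Predicted and true sentiment scores are mapped to the two classes used in the
# article: positive (> 0) vs. negative/neutral (<= 0).
true_scores = [0.7, -0.1, 0.0, 0.4, -0.5]
pred_scores = [0.5, 0.2, -0.3, 0.1, -0.2]
y_true = [int(s > 0) for s in true_scores]
y_pred = [int(s > 0) for s in pred_scores]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)
```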
The final results of the optimal model described above are shown in Table 5. It can be observed that there is consistency among the metrics for the different datasets. Overall, smaller datasets perform slightly better than larger ones in terms of the $F_1$, accuracy, precision, and recall metrics. The reason for this behavior is two-fold: first, the input matrix sparsity increases with the dataset size, which hinders the learning process; second, the positive score population is also patently larger in the smaller datasets (with a maximum at 87.1%), dominating over the negative/neutral class. This does not undermine the model foundation, however: this positive-negative score ratio is observed to be shared by most Amazon datasets and, in any case, we are equally interested in both classes of opinions.
In this sense, the AUC metric gives equal importance to the positive and negative/neutral classes, resulting in a more suitable metric to measure the model's prediction capabilities from an application perspective. Its best results are now obtained for the larger datasets, and it is also worth highlighting the relatively large AUC achieved for the combined dataset. These results follow from the existence of complementary user information in the intersecting datasets.
To be certain about the significance of the results using the AUC metric, it must be noticed that any random binary classifier is expected to yield an average AUC of 0.5. All our experiments were significantly better than this value when compared with the Monte Carlo simulated frequency distribution of the AUC computed over our datasets. The null distribution peaks at 0.5, as expected, and is relatively narrow, with almost all of the repeated measurements lying well below the experimental values. This demonstrates a negligible probability of generating the true experimental AUC values by chance (see Table 5) and the significance of the improvement obtained with our method.
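One way to obtain such a null AUC distribution is sketched below, scoring random predictions against the true labels; the label vector and the number of repetitions are illustrative, not the ones used in the article.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_null_distribution(y_true, n_repeats=10000, seed=0):
    """AUC of random scores against the true labels, repeated n_repeats times.
    The resulting distribution peaks at 0.5, and its upper tail indicates how
    likely a given experimental AUC is to arise by chance."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    return np.array([roc_auc_score(y_true, rng.random(len(y_true)))
                     for _ in range(n_repeats)])

null = auc_null_distribution([1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20)
print(null.mean(), np.quantile(null, 0.99))  # chance level and 99th percentile
```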
Finally, it is worth discussing an actual review instance of the full experimental procedure to spot its strengths and the aspects that call for improvement. Such an example is shown in Table 6. This particular review contains eight syntactic structures, which are shown independently (notice that the NLP analysis for obtaining the opinion keywords and their sentiment is performed at sentence level). Our sentiment analysis service detects five topics/opinion keywords and assesses them with unprocessed sentiment strength scores (third column). The frequencies of occurrence of two of these opinion keywords, namely 'weld' (at row 6) and 'quality' (at row 8), are lower than the threshold frequency explained in Section 3.1, and they are therefore discarded. The remaining three entries, normalized as explained in Section 2.2, were originally positive and therefore successfully predicted as such by our binary classifier.
5. Conclusions and Future Work
The fundamental hypothesis behind our model is that any user review encodes preferences, emotions, and tastes that may be captured by latent factors common to similar users. These emotions can then be predicted in the form of opinion keywords carrying a sentiment strength score for the item characteristics.
In our model, every product has its own vocabulary, which is rich enough to allow users to be expressive, but not so large as to increase the data sparsity, which burdens the learning process. A careful choice of opinion keywords for the domains under analysis proves to be crucial if we want to avoid topics of low descriptive value. We want to highlight that this article is, to the best of our knowledge, the first study that tries to predict the user's textual opinion by predicting the sentiment strength scores for item characteristics. The obtained results show that the proposed model is indeed able to learn, although the results are still far from being ready for deployment in real environments. Nonetheless, the model is still open to improvements, and we expect it to be useful as a baseline for comparative purposes in SNS solutions.
Our future work includes using larger datasets to verify the generalization capabilities of the system, since our experiments show a consistent reduction of the MSE values with dataset size, at least in domain-dependent scenarios, as well as reducing the sparsity of the dataset by using the current prediction system (which has an accuracy above 60%) to generate new samples and incrementally populate a new dataset following a best-candidate sampling strategy. We expect that this reduction of sparsity will improve the generalization capabilities of the system.