1. Introduction
Recommender systems (RS) are intended to provide online users with advice, reviews, and opinions from previous purchasers of products and services, mainly through methods such as collaborative filtering (CF) [1]. The main objective of an RS using CF is to persuade users to buy items or services they have not previously bought or seen, based on the buying patterns of others. This can be achieved by ranking either the item-to-item similarity or the user-to-user similarity and then predicting the top-scoring product that ought to appeal to the potential buyer. Unfortunately, CF has a number of limitations, such as the cold-start problem, i.e., generating reliable recommendations for users with few ratings or items. This issue can be alleviated to some extent by reusing pre-trained deep learning models and/or using contextual information [2]. Since CF is generally an open process, it can be vulnerable to biased or fake information [3,4]. Fake user profiles can easily manipulate recommendation results by giving the highest ratings to targeted items while rating other items in the same way as regular profiles. This behavior is called a “shilling attack” [5].
Initially launched in 2005, YouTube has seen exponential growth in submitted videos and is the most popular platform for viewing material that informs, educates, and entertains its users. YouTube is a free video sharing service that allows users to view online videos and also to develop and upload their own material to share with others [6,7]. For many YouTube contributors, however, the opportunity to earn money from their channel’s popularity is a great incentive. To earn money from YouTube, a contributor must have 1000 subscribers and at least 4000 watch hours in the past year; contributors can then apply to YouTube’s Partner Program and monetize their channel. However, YouTube keeps careful watch for any mechanism that artificially inflates the number of comments, views, or likes. Unscrupulous contributors often achieve increased rankings by using bots or automated systems, or even by presenting videos to unsuspecting viewers.
The objective of our work is to demonstrate that a recommendation engine can be used to provide users with reliable YouTube videos based on initial keyword searches. The topic of interest is global warming/climate change, but the system could be applied to any subject. The objectives are two-fold: first, once we can identify a user’s sentiment/opinions on global warming, we can provide them with authoritative videos with scientific credence that match their beliefs; second, we can present them with authoritative videos representing the opposite stance. The intent is to balance out the debate with evidence that they would perhaps not necessarily seek out. Our intention is not to change opinions, but to help users become more aware of the issues.
To achieve these objectives, we combine sentiment analysis and graph theory to provide deeper insights into YouTube recommendations. Rather than using different software platforms, we combine several R libraries into a unified system, making overall integration easier. The overall system workflow is shown in
Figure 1. An initial search topic is defined and fed into the APIs of the three platforms (Twitter, Reddit, and YouTube). The resulting posts are preprocessed and parsed, and the text data is then analyzed by graph theoretic measures that provide statistical metrics of user posts and how they interact. The sentiments of user posts are used to create topic maps that reflect common themes and ideas that these users have. Ratings of YouTube videos and the provenance of their sources are estimated to provide some indication of their validity and integrity.
The main contribution of this work is three-fold. First, we integrate sentiment mining with graph theory, providing statistical information on the posters and contributors. Second, we use up-votes and down-votes as a recommendation source. Finally, we create a logical structuring of the Twitter, YouTube, and Reddit data using topic maps. Topic modelling is necessary since most topics of interest will comprise a mixture of words and sentiments, which is a feature of human language. Some overlapping of concepts will therefore occur, so an unsupervised classification method is required. We use Latent Dirichlet allocation (LDA), which is commonly used for fitting a topic model [8].
The remainder of this paper is structured as follows:
Section 2 describes the related work and recent advances in recommender systems.
Section 3 outlines the social media data used in this study.
Section 4 describes the computational methods used.
Section 5 presents the experimental results and the discussion. Finally,
Section 6 presents the conclusions and future work.
5. Results
The flow of data and processing of the posts begins with the conversion of raw text from Twitter, Reddit, and YouTube into corpora, essentially creating term–document matrices (as described in Algorithm 1). Once the corpora are processed, we can extract topic models from them; this logical grouping of keywords aids our understanding of the posts.
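In the paper this conversion is performed with R text-mining libraries; the following minimal Python sketch, using toy posts and a hypothetical `build_dtm` helper, illustrates the idea of turning raw posts into a term–document matrix.

```python
from collections import Counter

def build_dtm(posts):
    """Build a toy term-document matrix: one row of term frequencies
    per document, over a shared sorted vocabulary."""
    docs = [Counter(p.lower().split()) for p in posts]
    vocab = sorted(set().union(*docs))
    # rows = documents, columns = vocabulary terms
    return [[d[t] for t in vocab] for d in docs], vocab

posts = ["climate change is real", "climate change hoax", "real change"]
dtm, vocab = build_dtm(posts)
```

A real pipeline would also lowercase, strip punctuation, remove stop words, and stem before counting; those steps are omitted here for brevity.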
For an example of sentiment analysis using the R package
sentimentr on Twitter posts, see
Figure 7. The posts are identified by a number (1–10) and ranked as positive (green), neutral (gray), or negative (red), each with a number denoting the strength of the sentiment. Three word-sentiment lookups are available (the Bing, NRC, and Afinn dictionaries), each rating a differing number of words and attaching differing sentiment values to each word. Scoring can be at the word level, the sentence level, or over the entire post (paragraph). As can be seen, the Twitter data shown here represent a range of opinions on the climate change debate.
We can see that comment 1 is rated at zero sentiment since the sentence is fairly neutral in its wording. In comment 2, the first sentence is neutral but the second sentence contains a positive sentiment word (optimistic) and is rated +0.082. Comment 3 is more negative because of the words scam, scammer, and hoax, and is rated at −0.147.
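The paper uses the R package sentimentr for these scores; as a rough illustration only (ignoring sentimentr's valence shifters and normalization), a dictionary-based sentence score can be sketched as follows, with a tiny hypothetical lexicon standing in for the Bing/NRC/Afinn dictionaries:

```python
import re

# Hypothetical mini-lexicon; the real dictionaries rate thousands of words.
LEXICON = {"optimistic": 1.0, "scam": -1.0, "scammer": -1.0, "hoax": -1.0}

def sentence_sentiment(sentence):
    """Average the lexicon scores over all words in the sentence;
    words not in the lexicon score 0 (neutral)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(LEXICON.get(w, 0.0) for w in words) / len(words)

s1 = sentence_sentiment("The report was fairly neutral")  # 0.0
s2 = sentence_sentiment("It is all a scam and a hoax")    # -0.25
```

Averaging over sentence length is why longer sentences with a single negative word receive small-magnitude scores, as seen in the figure.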
The next stage is to develop topic models holding keywords that are coherently related to key concepts; these will then be data mined for bigrams. The optimum number of topic models for each corpus is determined using the harmonic mean, which is described in Equation (5). In Figure 8, the optimum numbers are presented: 24 for Twitter, 44 for Reddit, and 31 for YouTube concepts and issues. We used LDA to generate the topic maps, with the candidate number of topics ranging from 10 up to 100. At the first iteration, 10 topic maps are fitted to describe the corpus, then 11, 12, 13, etc., until 100 topic maps have been generated. Adding more topic maps beyond a certain point simply degrades performance; the number of maps to use is the point at which the harmonic mean begins to decrease. The harmonic-mean method has some known instabilities, but it is in general sufficiently robust.
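The selection loop can be sketched as follows; the likelihood samples are placeholders for the values an LDA fit would return (the paper does this in R), and `pick_k` is a hypothetical helper that returns the candidate topic count whose harmonic mean is highest, i.e., the point just before the curve starts to decrease:

```python
def harmonic_mean(values):
    """Harmonic mean of strictly positive values."""
    return len(values) / sum(1.0 / v for v in values)

def pick_k(scores_by_k):
    """Hypothetical helper: given {k: [likelihood samples]}, choose
    the k with the highest harmonic mean."""
    return max(scores_by_k, key=lambda k: harmonic_mean(scores_by_k[k]))

# Placeholder likelihood samples for three candidate topic counts.
scores = {10: [0.2, 0.3], 24: [0.5, 0.6], 30: [0.4, 0.4]}
best = pick_k(scores)  # 24 in this toy example
```

In practice the harmonic mean is taken over posterior likelihood samples from each fitted model, which is where the method's known instabilities arise.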
In
Figure 9, 5 of the 24 Twitter topic maps are shown. The terms climate and change appear throughout several of the 24 topic maps. Topic map 2 is generally related to the energy consumption of fossil fuels such as oil and gas. Topic map 3 is concerned with public health and net zero. Topic map 4 has gathered words on environmental impact and statements issued by the Intergovernmental Panel on Climate Change (IPCC). Topic map 5 seems to have grouped human rights and social justice as key themes.
To augment the statistics and text mining, we also generated wordclouds, which perhaps give a better visualization and understanding of the main themes that dominate user posts. Individual word frequencies are used to highlight the important themes: the more frequent a word, the larger it is drawn. In Figure 10, the wordclouds for Twitter, Reddit, and YouTube are presented. Clearly, climate and change dominate the user posts for Twitter, while Reddit and YouTube have a wider range of concepts with more or less equal frequencies of occurrence. Only words with at least five occurrences are displayed.
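The five-occurrence cutoff amounts to a simple frequency filter; a minimal sketch over a toy token list, with the threshold as a parameter:

```python
from collections import Counter

def wordcloud_terms(tokens, min_count=5):
    """Keep only terms that occur at least min_count times."""
    freq = Counter(tokens)
    return {w: n for w, n in freq.items() if n >= min_count}

tokens = ["climate"] * 7 + ["change"] * 6 + ["hoax"] * 2
kept = wordcloud_terms(tokens)  # "hoax" falls below the threshold
```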
The next stage is to build graph theoretic models of bigrams of co-occurring words, building up a picture of the sentiment relating to each YouTube video. The graph models of Twitter and Reddit are also constructed to support the ratings/rankings of the videos in terms of the esteem/trust in which the video producers are held. In Table 3, the graph statistics for YouTube are shown for five users. The key variables are Betweenness and Hubness, which indicate the relative connectivity importance of each word. The other columns have identical values: the mod (modularity) column refers to the structure of the graph and can take a value from 0.0 to 1.0, indicating that there is structure rather than a random collection of connections between the nodes. Nedges indicates the number of connections in this small network, and nverts is the number of nodes in the network. The transit column refers to the transitivity or community strength: the probability that the network has interconnected adjacent nodes. As the graph is highly disconnected (bigrams linking to other bigrams), transitivity is zero for all entries. Degree refers to the average number of connections per node, which is around 2.0; diam is the length of the shortest path between the two most distant nodes. Connect refers to the full connectedness of the graph (which does not hold in this case). The closeness of a node measures its average distance to all other nodes; a high closeness score suggests short distances to all other nodes. Betweenness measures the influence a given node has over the flow of information in the graph. The density is the ratio between the edges present in a graph and the maximum number of edges the graph can contain. Hubness indicates nodes with a larger number of connections than the average node.
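The paper computes these measures with R graph libraries; a minimal Python sketch of two of them, mean degree and density, for a small undirected bigram graph (the edge list is illustrative):

```python
from collections import defaultdict

def graph_stats(edges):
    """Mean degree and density of a simple undirected graph given as
    a list of (u, v) edges with no self-loops or duplicates."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)          # number of vertices (nverts)
    e = len(edges)        # number of edges (nedges)
    mean_degree = 2.0 * e / n
    density = 2.0 * e / (n * (n - 1))
    return mean_degree, density

edges = [("climate", "change"), ("change", "hoax"), ("climate", "real")]
deg, dens = graph_stats(edges)  # mean degree 1.5, density 0.5
```

Density compares the edge count against the maximum n(n−1)/2 possible edges, which is why sparse bigram graphs score low.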
Table 4 shows the basic statistics of several YouTube videos. We collected data such as the ID of each video; e.g., in the first row, oJAbATJCugs would normally be used to select the video in a web browser via the string https://www.youtube.com/watch?v=oJAbATJCugs (accessed on 29 May 2023). The number of comments received for each video is collected, along with the average number of likes. We also collect the number of comments that had zero likes; note that this does not imply that the video was disliked, only that the commenter neglected to select like, irrespective of their feelings about the video. The number of unique posters posting comments is also recorded. Next, we perform sentiment analysis, examining the overall sentiment for the video and then breaking the comments down into neutral, negative, and positive sentiments. The totals for each sentiment (positive, negative, and neutral) are counted at the sentence level, and we therefore have more sentiments than the overall number of comments, i.e., the number of posts.
The YouTube bigrams are displayed in
Figure 11. We only show words with at least 100 co-occurrences, based on key topic map groupings. The bigrams for YouTube are more strongly linked to coherent topics and follow a logical pattern of subjects, with more linkages between bigrams. Generally, the comments on YouTube are calmer and more balanced, with some thought given to the subject of global warming.
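The underlying bigram counting and threshold filter can be sketched as follows (toy tokens and a low cutoff; the paper applies this in R over full corpora with a cutoff of 100):

```python
from collections import Counter

def bigrams(tokens, min_count=2):
    """Count adjacent word pairs and keep those at or above min_count."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return {p: n for p, n in pairs.items() if n >= min_count}

tokens = ["climate", "change", "climate", "change", "hoax"]
kept = bigrams(tokens)  # only ("climate", "change") survives the cutoff
```

The surviving pairs become the weighted edges of the co-occurrence graphs shown in the figures.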
In Figure 12, it can be seen that the situation for Twitter bigrams is more complicated. Because of the large number of posts, we require 200 word co-occurrences before a pair can appear on the plot. However, Twitter provides a richer source of data, illuminating the issues and concerns once the most frequently occurring words are revealed, based on key topic map groupings. The general trend for Twitter posts is to contain many off-topic issues such as legal aspects and gun violence. Another issue is the text limit on Tweets (280 characters), which may constrain posters in their dialog; the limit has been raised to 4000 characters for fee-paying subscribers.
In Figure 13, the Reddit bigrams are displayed. Again, for clarity, we only show words with at least 100 co-occurrences, based on key topic map groupings. Similar in tone and style to YouTube posts, the majority of Reddit posts are more objective and less inclined to be sensationalist. Reddit has very strict rules on posting, and any user breaching them may be banned. Other inducements for good behavior are karma awards and coins, which are given to a poster by other users.
Having gathered statistics from sentiment analysis of the topic maps, comments, and bigrams of paired common words, we now structure the data to build the recommendation engine. The difficulty we face is that the matrix of items (videos) and users is very sparse. This is alleviated to a certain extent by generic profiling of users from Twitter and Reddit data.
The rating matrix is formed from the YouTube user rankings of videos. These are normalized by centering, which removes possible rating bias by subtracting the row mean from all ratings in the row. In Table 5, we present normalized ratings data for a small fraction of the overall ratings matrix. Empty spaces represent missing values where no rating data exist; these are identified in our software as Not Available (NA). We did not attempt to impute the missing values.
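Row centering can be sketched as follows, with `None` standing in for NA and toy ratings in place of real data:

```python
def center_rows(matrix):
    """Subtract each row's mean (computed over observed ratings only)
    from its observed ratings; missing values (None) stay missing."""
    centered = []
    for row in matrix:
        observed = [r for r in row if r is not None]
        mean = sum(observed) / len(observed)
        centered.append([None if r is None else r - mean for r in row])
    return centered

ratings = [[5, 3, None, 4],
           [1, None, 1, 1]]
out = center_rows(ratings)  # row means 4.0 and 1.0 are subtracted
```

Centering removes per-user bias (habitually generous or harsh raters) so that the CF similarity computations compare relative preferences rather than raw scores.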
In Table 6, we present the model error for the test data. We evaluate the predictions by computing the deviation of each prediction from the true value; this is the Mean Absolute Error (MAE). We also use the Root Mean Square Error (RMSE), which penalizes larger errors more than MAE and is thus suitable for situations where smaller errors are expected. Here, UBCF is user-based collaborative filtering and IBCF is item-based collaborative filtering. We also report the mean squared error (MSE).
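The three error measures reduce to a few lines (toy predicted and true ratings shown):

```python
import math

def errors(pred, true):
    """MAE, MSE, and RMSE between predicted and true ratings."""
    diffs = [p - t for p, t in zip(pred, true)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse, math.sqrt(mse)

mae, mse, rmse = errors([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
# mae = 1.0; the squared terms make rmse larger than mae here
```

Because the errors are squared before averaging, a single large miss moves RMSE much more than MAE, which is the property exploited in the comparison.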
In Table 7, we display the confusion matrix, where n is the number of recommendations per list and TP, FP, FN, and TN are the counts of true positives, false positives, false negatives, and true negatives. The remaining columns contain precomputed performance measures. We report the average over all runs of four-fold cross validation.
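The precomputed measures follow directly from the four counts; a sketch with illustrative counts (not taken from the table):

```python
def rates(tp, fp, fn, tn):
    """Precision, TPR (recall), and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate
    return precision, tpr, fpr

precision, tpr, fpr = rates(tp=8, fp=2, fn=4, tn=86)
# precision = 0.8; tpr and fpr are the coordinates of one ROC point
```

Varying n, the recommendation-list length, traces out the ROC curves of TPR against FPR used to compare the methods.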
From the evaluation, the popular-items and user-based CF methods appear to have better accuracy and performance than the other methods. In Figure 14, we see that they provide better recommendations than the other methods, since for each length of the top-predictions list they have superior values of TPR and FPR. Thus, we have validated our model and are reasonably certain of its robustness.
When in operation, the recommender system makes suggestions for selected YouTube users based on their ratings of previous videos, their comments (if applicable), and related statistics. In Table 8, we highlight 20 suggestions for 5 users selected at random; each user may receive a different number of recommendations. Column one identifies the video; column two gives the user ID (selected at random); column three gives the YouTube video ID (which can be pasted into a browser); column four gives the title of the video; column five gives the number of views; and, finally, column six gives the recommender score. Where the videos stand in relation to climate change is obvious from the titles, with the exception of video 20, which appears to take a neutral stance. The score or ranking of a video is a value between 0.0 and 1.0, formed from the generated statistics and the YouTube recommendations. Experimentally, we determined that videos scoring below 0.5 are unlikely to be of interest, as these proved to be off-topic and only minimally related to global warming.