User Representation Learning for Social Networks: An Empirical Study
Abstract
1. Introduction
- (a) An embedding lookup such as Gensim's doc2vec_model['document_tag'] cannot be used to extract embeddings for new tweets that are not included in the corpus used to train the doc2vec model. In this case, after training the model, the infer_vector() function is used to infer vectors for new text. This inference is not fully deterministic, and subsequent calls on the same text produce different vectors; however, they remain close to each other in the latent space. The performance of the models is assessed when no information (tweets, comments, followers, etc.) about the users to be represented is included in the training data. See Appendix A for basic information on the Gensim framework; a minimal code sketch of this train-and-infer workflow follows this list.
- (b) It is not always feasible to have a gold-standard training set consisting of manually selected users. An important question is whether there are alternative resources for training user representation models. Therefore, the use of readily accessible textual resources such as Twitter and Wikipedia is examined, and combinations of these datasets are also evaluated in different scenarios.
- (c) A wide range of information (interests, opinions, sentiment, or topics) can be present in tweet text. NLP techniques are used to extract information from tweets, which consist of words, hashtags, mentions, smileys, etc. However, there is much more to leverage when representing a Twitter user. The Twitter handles of users who engaged with the tweets by retweeting, liking, and replying are collected into one dataset. Another dataset is created from the users' profile information: handle, name, description, and location. Several approaches are proposed for training doc2vec-based user representation models on these datasets, which embody both textual and non-textual information.
- (d) The performance of the PV-DM and PV-DBOW algorithms is compared, and the effect of the min_count parameter (min_count = 1 and min_count = 5) is investigated for each algorithm (see Appendix A for explanations of these parameters).
- (e) A traditional method, TF-IDF; pre-trained word embeddings (PWE) such as GloVe and FastText; and pre-trained sentence encoders (contextualized embeddings) such as BERT and ELMo are studied in various respects for obtaining document vectors.
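As mentioned in item (a), document vectors for unseen text must be inferred rather than looked up. The following is a minimal sketch of that workflow in Gensim (the toy corpus and tag names are illustrative only, not taken from the study):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each tweet becomes a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words=["bitcoin", "price", "rallies"], tags=["tweet_0"]),
    TaggedDocument(words=["new", "fashion", "week", "trends"], tags=["tweet_1"]),
]

# dm=1 selects PV-DM, dm=0 selects PV-DBOW; min_count filters out rare words.
model = Doc2Vec(corpus, vector_size=100, dm=1, min_count=1, epochs=20)

# Vectors of training documents can be looked up by tag...
v_train = model.dv["tweet_0"]  # model.docvecs["tweet_0"] in Gensim versions before 4.0

# ...but unseen text must be inferred. infer_vector() is stochastic, so repeated
# calls return slightly different (yet nearby) vectors for the same tokens.
v_new = model.infer_vector(["ethereum", "price", "drops"])
```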
2. Related Work
3. Datasets
- ○ Relevance: High degree of relevance to the specified category.
- ○ Popularity: High number of followers (at least a few thousand followers).
- ○ Posting behavior: Being an active Twitter user who tweets occasionally.
- There is no fixed number of users who retweeted/commented on/liked a tweet; the script collects only the handles of the users shown at that moment.
- A total of 3,599,588 distinct Twitter handles were collected. Their distributions over the different interaction types are shown in Figure 1. For example, it shows that:
- ○ for 38% of the tweets, up to five handles of users who liked the tweet were found;
- ○ for 17% of the tweets, 6–20 handles of users who retweeted the tweet were found.
- The number of handles for each group is shown in Table 2. For example, it shows that a total of 247,109 handles were collected from the retweet, like, and comment lists of the tweets of the Economy group's users.
- The users that appear in the user lists of all five groups (intersection of handles among groups).
4. Methodology
- Doc2Vec Approach
- Using non-textual features in Doc2Vec (enrichment)
- PWE Approaches
4.1. Doc2Vec Approach
4.2. Using Non-Textual Features in Doc2Vec (Enrichment)
4.3. Pre-Trained Word Embedding (PWE) Approaches
5. Experimental Results and Discussion
5.1. Doc2Vec and Doc2Vec Enrichment Experiments
5.2. Comparison of Different PWE Models
6. Conclusions and Future Work
- Regarding the proposed doc2vec models: a fine-grained tweet corpus like Dataset A and its enriched versions, Datasets D and E, are unlikely to be available in most cases. Therefore, we recommend using datasets similar to Dataset C. The performance of a model trained on Dataset C can be increased by adding more tweets from users of a specific group if there is an intended focus group. Our doc2vec model gives 97.5% overall accuracy; the best model uses the PV-DM algorithm with min_count = 5.
- We consider that efforts to incorporate a wide range of social media activity data such as comments, likes, retweets, mentions, follows, etc. into user representation learning are at an early stage, and there are still many challenges to be solved. Our model trained on the combination of Datasets A and E gives 97.0% accuracy.
- Although publicly available word vectors are very simple to use and can give acceptable accuracies, they do not perform as well as the doc2vec approaches. When using a PWE, one should consider the corpus it was trained on and the length of the vectors. Our word2vec model trained on Dataset A gives 84.5% accuracy; the GloVe model trained on the tweet corpus reaches 84.0%.
- BERT and ELMo do not outperform the PWE models. We believe they would perform better than the reported results if they were fine-tuned or trained from scratch. We plan to conduct broader experiments using contextualized embedding methods.
- ○ Download a publicly available PWE and fine-tune it for the specific task.
- ○ Train a language model from scratch using the algorithm of a PWE (GloVe, FastText, Google News, or NumberBatch).
- ○ A broad investigation of using contextualized word embeddings in user representation models; in this study, only their pre-trained multilingual versions are used.
- ○ User similarity: The users most similar to a given user in a network can be queried through the similarities of their embeddings (see the sketch after this list). With such a system, unknown users can be characterized automatically through other, similar users, without the need for manual analysis of their social media data. This approach can be useful in many social media analysis tasks such as trend analysis and reputation analysis.
- ○ Social media sentiment analysis: In classical sentiment analysis, models learn from previously expressed comments with the aim of determining user opinions, without considering who their authors are. We expect sentiment analysis methods to continue improving with new techniques that incorporate user information as well as the actual text.
- ○ Fake news detection: Early detection of whether a piece of information spreading on social media is true or fake is an important need in today's online world. Similar to the sentiment analysis example, a system can be designed in which a social media post is evaluated not only by its content, but also by the representations of the users who create and support it.
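A minimal sketch of the user-similarity query mentioned above, assuming user vectors have already been obtained (for instance, by averaging the inferred vectors of a user's tweets); the handles and vectors below are placeholders:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# user_vectors: mapping from Twitter handle to its embedding (placeholder data).
rng = np.random.default_rng(0)
user_vectors = {f"user_{i}": rng.normal(size=100) for i in range(1000)}

def most_similar(handle: str, k: int = 5):
    """Return the k users whose embeddings are closest to the given user's."""
    query = user_vectors[handle]
    scores = [(other, cosine(query, vec))
              for other, vec in user_vectors.items() if other != handle]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]

print(most_similar("user_42"))
```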
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Gensim Framework
Appendix B. Computing Platform and Its Performance
Dataset | Iteration Running Time (Minutes) | Inference Time (Minutes) |
---|---|---|
A | 0.5 | 256 |
B | 10.4 | 303 |
C | 1.5 | 282 |
A + B | 11.16 | 321 |
A + C | 2 | 286 |
B + C | 12.93 | 322 |
A + B + C | 13.46 | 330 |
A + E | 2.33 | 2152 |
A + B + E | 11.18 | 2521 |
A + C + E | 5.11 | 2189 |
A + B + C + E | 13.8 | 2534 |
Appendix C. Precision, Recall, and F1 Scores of Doc2vec Experiments
Case | Group | PV-DM (dm = 1), min_count = 1: P | R | F1 | PV-DM (dm = 1), min_count = 5: P | R | F1 | PV-DBOW (dm = 0), min_count = 1: P | R | F1 | PV-DBOW (dm = 0), min_count = 5: P | R | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | G1 | 0.923 | 0.900 | 0.911 | 0.973 | 0.900 | 0.935 | 0.971 | 0.825 | 0.892 | 0.968 | 0.750 | 0.845 |
G2 | 1.000 | 0.975 | 0.987 | 1.000 | 0.975 | 0.987 | 0.974 | 0.925 | 0.949 | 0.929 | 0.975 | 0.951 | |
G3 | 0.946 | 0.875 | 0.909 | 0.949 | 0.925 | 0.937 | 0.971 | 0.825 | 0.892 | 1.000 | 0.900 | 0.947 | |
G4 | 0.889 | 1.000 | 0.941 | 0.909 | 1.000 | 0.952 | 0.741 | 1.000 | 0.851 | 0.800 | 1.000 | 0.889 | |
G5 | 0.975 | 0.975 | 0.975 | 0.976 | 1.000 | 0.988 | 0.975 | 0.975 | 0.975 | 0.976 | 1.000 | 0.988 | |
2 | G1 | 0.892 | 0.825 | 0.857 | 0.938 | 0.750 | 0.833 | 0.919 | 0.850 | 0.883 | 0.923 | 0.900 | 0.911 |
G2 | 0.865 | 0.800 | 0.831 | 0.773 | 0.850 | 0.810 | 0.946 | 0.875 | 0.909 | 0.972 | 0.875 | 0.921 | |
G3 | 0.886 | 0.975 | 0.929 | 0.864 | 0.950 | 0.905 | 0.907 | 0.975 | 0.940 | 0.929 | 0.975 | 0.951 | |
G4 | 0.976 | 1.000 | 0.988 | 0.974 | 0.950 | 0.962 | 0.952 | 1.000 | 0.976 | 0.952 | 1.000 | 0.976 | |
G5 | 0.976 | 1.000 | 0.988 | 0.976 | 1.000 | 0.988 | 0.951 | 0.975 | 0.963 | 0.951 | 0.975 | 0.963 | |
3 | G1 | 1.000 | 0.800 | 0.889 | 1.000 | 0.850 | 0.919 | 0.971 | 0.825 | 0.892 | 0.970 | 0.800 | 0.877 |
G2 | 0.950 | 0.950 | 0.950 | 0.949 | 0.925 | 0.937 | 0.971 | 0.850 | 0.907 | 0.923 | 0.900 | 0.911 | |
G3 | 0.905 | 0.950 | 0.927 | 0.884 | 0.950 | 0.916 | 0.809 | 0.950 | 0.874 | 0.844 | 0.950 | 0.894 | |
G4 | 0.952 | 1.000 | 0.976 | 0.952 | 1.000 | 0.976 | 0.950 | 0.950 | 0.950 | 0.974 | 0.950 | 0.962 | |
G5 | 0.909 | 1.000 | 0.952 | 0.952 | 1.000 | 0.976 | 0.886 | 0.975 | 0.929 | 0.886 | 0.975 | 0.929 | |
4 | G1 | 0.972 | 0.875 | 0.921 | 0.971 | 0.850 | 0.907 | 0.917 | 0.825 | 0.868 | 0.969 | 0.775 | 0.861 |
G2 | 1.000 | 0.875 | 0.933 | 0.974 | 0.925 | 0.949 | 0.872 | 0.850 | 0.861 | 0.826 | 0.950 | 0.884 | |
G3 | 0.780 | 0.975 | 0.867 | 0.800 | 1.000 | 0.889 | 0.884 | 0.950 | 0.916 | 0.950 | 0.950 | 0.950 | |
G4 | 1.000 | 0.925 | 0.961 | 1.000 | 0.925 | 0.961 | 0.974 | 0.950 | 0.962 | 0.952 | 1.000 | 0.976 | |
G5 | 0.952 | 1.000 | 0.976 | 0.975 | 0.975 | 0.975 | 0.930 | 1.000 | 0.964 | 1.000 | 1.000 | 1.000 | |
5 | G1 | 0.906 | 0.725 | 0.806 | 0.972 | 0.875 | 0.921 | 0.889 | 0.800 | 0.842 | 0.971 | 0.825 | 0.892 |
G2 | 0.821 | 0.800 | 0.810 | 0.949 | 0.925 | 0.937 | 0.846 | 0.825 | 0.835 | 0.923 | 0.900 | 0.911 | |
G3 | 0.822 | 0.925 | 0.871 | 0.929 | 0.975 | 0.951 | 0.881 | 0.925 | 0.902 | 0.907 | 0.975 | 0.940 | |
G4 | 0.930 | 1.000 | 0.964 | 1.000 | 1.000 | 1.000 | 0.952 | 1.000 | 0.976 | 0.952 | 1.000 | 0.976 | |
G5 | 0.976 | 1.000 | 0.988 | 0.930 | 1.000 | 0.964 | 0.927 | 0.950 | 0.938 | 0.952 | 1.000 | 0.976 | |
6 | G1 | 0.974 | 0.950 | 0.962 | 0.974 | 0.950 | 0.962 | 0.970 | 0.800 | 0.877 | 0.971 | 0.850 | 0.907
G2 | 1.000 | 0.925 | 0.961 | 1.000 | 0.950 | 0.974 | 0.974 | 0.950 | 0.962 | 0.927 | 0.950 | 0.938 |
G3 | 0.929 | 0.975 | 0.951 | 0.951 | 0.975 | 0.963 | 0.816 | 1.000 | 0.899 | 0.864 | 0.950 | 0.905 |
G4 | 0.976 | 1.000 | 0.988 | 1.000 | 1.000 | 1.000 | 0.974 | 0.925 | 0.949 | 0.974 | 0.925 | 0.949 |
G5 | 0.976 | 1.000 | 0.988 | 0.952 | 1.000 | 0.976 | 0.927 | 0.950 | 0.938 | 0.929 | 0.975 | 0.951 |
7 | G1 | 0.972 | 0.875 | 0.921 | 0.974 | 0.950 | 0.962 | 0.914 | 0.800 | 0.853 | 0.973 | 0.900 | 0.935
G2 | 1.000 | 0.925 | 0.961 | 1.000 | 0.950 | 0.974 | 0.872 | 0.850 | 0.861 | 0.974 | 0.950 | 0.962 |
G3 | 0.830 | 0.975 | 0.897 | 0.951 | 0.975 | 0.963 | 0.870 | 1.000 | 0.930 | 0.951 | 0.975 | 0.963 |
G4 | 1.000 | 0.925 | 0.961 | 1.000 | 1.000 | 1.000 | 0.974 | 0.950 | 0.962 | 0.976 | 1.000 | 0.988 |
G5 | 0.930 | 1.000 | 0.964 | 0.952 | 1.000 | 0.976 | 0.951 | 0.975 | 0.963 | 0.952 | 1.000 | 0.976
Case | Group | PV-DM (dm = 1), min_count = 1: P | R | F1 | PV-DM (dm = 1), min_count = 5: P | R | F1 | PV-DBOW (dm = 0), min_count = 1: P | R | F1 | PV-DBOW (dm = 0), min_count = 5: P | R | F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | G1 | 0.902 | 0.925 | 0.914 | 0.895 | 0.850 | 0.872 | 1.000 | 0.625 | 0.769 | 0.900 | 0.675 | 0.771 |
G2 | 0.974 | 0.950 | 0.962 | 0.974 | 0.925 | 0.949 | 1.000 | 0.900 | 0.947 | 0.884 | 0.950 | 0.916 | |
G3 | 0.939 | 0.775 | 0.849 | 0.939 | 0.775 | 0.849 | 0.841 | 0.925 | 0.881 | 0.854 | 0.875 | 0.864 | |
G4 | 0.909 | 1.000 | 0.952 | 0.833 | 1.000 | 0.909 | 0.780 | 0.975 | 0.867 | 0.870 | 1.000 | 0.930 | |
G5 | 0.907 | 0.975 | 0.940 | 0.930 | 1.000 | 0.964 | 0.867 | 0.975 | 0.918 | 0.975 | 0.975 | 0.975 | |
2 | G1 | 0.884 | 0.950 | 0.916 | 0.950 | 0.950 | 0.950 | 0.897 | 0.875 | 0.886 | 0.750 | 0.750 | 0.750 |
G2 | 1.000 | 0.950 | 0.974 | 1.000 | 0.975 | 0.987 | 1.000 | 0.950 | 0.974 | 1.000 | 0.950 | 0.974 | |
G3 | 0.974 | 0.925 | 0.949 | 0.975 | 0.975 | 0.975 | 0.826 | 0.950 | 0.884 | 0.776 | 0.950 | 0.854 | |
G4 | 0.889 | 1.000 | 0.941 | 0.952 | 1.000 | 0.976 | 0.974 | 0.950 | 0.962 | 0.860 | 0.925 | 0.892 | |
G5 | 0.972 | 0.875 | 0.921 | 0.974 | 0.950 | 0.962 | 0.974 | 0.925 | 0.949 | 0.967 | 0.725 | 0.829 | |
3 | G1 | 0.917 | 0.825 | 0.868 | 0.946 | 0.875 | 0.909 | 0.925 | 0.925 | 0.925 | 0.907 | 0.975 | 0.940 |
G2 | 1.000 | 0.950 | 0.974 | 1.000 | 0.950 | 0.974 | 0.974 | 0.950 | 0.962 | 1.000 | 0.975 | 0.987 | |
G3 | 0.947 | 0.900 | 0.923 | 0.974 | 0.925 | 0.949 | 0.900 | 0.900 | 0.900 | 1.000 | 0.925 | 0.961 | |
G4 | 0.800 | 1.000 | 0.889 | 0.870 | 1.000 | 0.930 | 0.923 | 0.900 | 0.911 | 0.951 | 0.975 | 0.963 | |
G5 | 0.974 | 0.925 | 0.949 | 0.976 | 1.000 | 0.988 | 0.929 | 0.975 | 0.951 | 0.950 | 0.950 | 0.950 | |
4 | G1 | 0.927 | 0.950 | 0.938 | 0.951 | 0.975 | 0.963 | 0.895 | 0.850 | 0.872 | 0.884 | 0.950 | 0.916 |
G2 | 1.000 | 0.950 | 0.974 | 1.000 | 0.950 | 0.974 | 0.975 | 0.975 | 0.975 | 1.000 | 0.975 | 0.987 | |
G3 | 0.974 | 0.950 | 0.962 | 0.973 | 0.900 | 0.935 | 0.800 | 0.900 | 0.847 | 0.927 | 0.950 | 0.938 | |
G4 | 0.851 | 1.000 | 0.920 | 0.909 | 1.000 | 0.952 | 0.881 | 0.925 | 0.902 | 0.950 | 0.950 | 0.950 | |
G5 | 0.971 | 0.850 | 0.907 | 0.975 | 0.975 | 0.975 | 1.000 | 0.875 | 0.933 | 0.973 | 0.900 | 0.935 | |
5 | G1 | 0.881 | 0.925 | 0.902 | 0.921 | 0.875 | 0.897 | 0.837 | 0.900 | 0.867 | 0.902 | 0.925 | 0.914 |
G2 | 1.000 | 0.950 | 0.974 | 1.000 | 0.950 | 0.974 | 0.974 | 0.925 | 0.949 | 1.000 | 0.975 | 0.987 | |
G3 | 0.974 | 0.925 | 0.949 | 0.974 | 0.950 | 0.962 | 0.875 | 0.875 | 0.875 | 0.950 | 0.950 | 0.950 | |
G4 | 0.870 | 1.000 | 0.930 | 0.889 | 1.000 | 0.941 | 0.895 | 0.850 | 0.872 | 0.902 | 0.925 | 0.914 | |
G5 | 0.972 | 0.875 | 0.921 | 0.975 | 0.975 | 0.975 | 0.878 | 0.900 | 0.889 | 0.949 | 0.925 | 0.937 |
Number of Maximum Features | Dataset A: Acc | F1 | Dataset B: Acc | F1 | Dataset C: Acc | F1 | Datasets A, D: Acc | F1 | Datasets A, E: Acc | F1
---|---|---|---|---|---|---|---|---|---|---
100 | 0.615 | 0.616 | 0.580 | 0.588 | 0.495 | 0.482 | 0.595 | 0.572 | 0.490 | 0.478 |
1000 | 0.865 | 0.866 | 0.830 | 0.861 | 0.760 | 0.764 | 0.615 | 0.601 | 0.500 | 0.487 |
10,000 | 0.905 | 0.906 | 0.540 | 0.534 | 0.875 | 0.873 | 0.695 | 0.685 | 0.500 | 0.489 |
20,000 | 0.915 | 0.916 | 0.575 | 0.574 | 0.875 | 0.873 | 0.660 | 0.648 | 0.490 | 0.481 |
50,000 | 0.915 | 0.916 | 0.590 | 0.590 | 0.900 | 0.900 | 0.645 | 0.639 | 0.490 | 0.481 |
100,000 | 0.915 | 0.916 | 0.675 | 0.674 | 0.900 | 0.900 | 0.665 | 0.659 | 0.490 | 0.481 |
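For reference, a minimal sketch of a TF-IDF baseline with a capped vocabulary size, as varied in the table above; the use of scikit-learn and of logistic regression as the downstream classifier are assumptions, not details from the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Placeholder data: one concatenated document per user and its group label.
train_docs, train_labels = ["bitcoin rally today", "runway looks paris"], ["crypto", "fashion"]
test_docs, test_labels = ["ethereum price drops", "new fashion week"], ["crypto", "fashion"]

for max_feats in (100, 1000, 10_000):
    vec = TfidfVectorizer(max_features=max_feats)   # cap the vocabulary size
    x_train = vec.fit_transform(train_docs)
    x_test = vec.transform(test_docs)

    clf = LogisticRegression(max_iter=1000).fit(x_train, train_labels)
    preds = clf.predict(x_test)
    print(max_feats,
          accuracy_score(test_labels, preds),
          f1_score(test_labels, preds, average="macro"))
```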
References
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Koehl, D.; Davis, C.; Nair, U.; Ramachandran, R. Analogy-based Assessment of Domain-specific Word Embeddings. In Proceedings of the 2020 SoutheastCon, Raleigh, NC, USA, 28–29 March 2020; pp. 1–6. [Google Scholar] [CrossRef]
- Yang, H.; Sohn, E. Expanding Our Understanding of COVID-19 from Biomedical Literature Using Word Embedding. Int. J. Environ. Res. Public Health 2021, 18, 3005. [Google Scholar] [CrossRef] [PubMed]
- Zhao, J.; van Harmelen, F.; Tang, J.; Han, X.; Wang, Q.; Li, X. Knowledge Graph and Semantic Computing. Knowledge Computing and Language Understanding: Third China Conference, CCKS 2018, Tianjin, China, August 14–17, 2018, Revised Selected Papers; Springer: Berlin/Heidelberg, Germany, 2018; Volume 957. [Google Scholar]
- Akbik, A.; Blythe, D.; Vollgraf, R. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119. [Google Scholar]
- Firth, J.R. A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis; Longmans: London, UK, 1957. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014; Volume 14, pp. 1532–1543. [Google Scholar]
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2016, arXiv:1607.01759. [Google Scholar]
- Akbik, A.; Bergmann, T.; Blythe, D.; Rasul, K.; Schweter, S.; Vollgraf, R. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Speer, R.; Chin, J.; Havasi, C. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA, 4–9 February 2017. [Google Scholar]
- Speer, R.; Lowry-Duda, J. ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge. arXiv 2018, arXiv:1704.03560. [Google Scholar]
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
- Hallac, I.R.; Makinist, S.; Ay, B.; Aydin, G. user2Vec: Social Media User Representation Based on Distributed Document Embeddings. In Proceedings of the 2019 International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Turkey, 21–22 September 2019; pp. 1–5. [Google Scholar]
- Carrasco, S.S.; Rosillo, R.C. Word Embeddings, Cosine Similarity and Deep Learning for Identification of Professions & Occupations in Health-related Social Media. In Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task, Mexico City, Mexico, 10 June 2021; pp. 74–76. [Google Scholar]
- Samad, M.D.; Khounviengxay, N.D.; Witherow, M.A. Effect of Text Processing Steps on Twitter Sentiment Classification using Word Embedding. arXiv 2020, arXiv:2007.13027. [Google Scholar]
- Gallo, F.R.; Simari, G.I.; Martinez, M.V.; Falappa, M.A. Predicting user reactions to Twitter feed content based on personality type and social cues. Future Gener. Comput. Syst. 2020, 110, 918–930. [Google Scholar] [CrossRef]
- Liao, C.H.; Chen, L.X.; Yang, J.C.; Yuan, S.M. A photo post recommendation system based on topic model for improving facebook fan page engagement. Symmetry 2020, 12, 1105. [Google Scholar] [CrossRef]
- Carta, S.; Podda, A.S.; Recupero, D.R.; Saia, R.; Usai, G. Popularity prediction of instagram posts. Information 2020, 11, 453. [Google Scholar] [CrossRef]
- Chen, H.-H. Behavior2Vec: Generating distributed representations of users’ behaviors on products for recommender systems. ACM Trans. Knowl. Discov. Data (TKDD) 2018, 12, 1–20. [Google Scholar] [CrossRef]
- Mehrotra, R.; Yilmaz, E. Task embeddings: Learning query embeddings using task context. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2199–2202. [Google Scholar]
- Gupta, U.; Wu, C.J.; Wang, X.; Naumov, M.; Reagen, B.; Brooks, D.; Cottel, B.; Hazelwood, K.; Hempstead, M.; Jia, B.; et al. The architectural implications of facebook’s dnn-based personalized recommendation. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 22–26 February 2020. [Google Scholar]
- Chen, L.; Qian, T.; Zhu, P.; You, Z. Learning user embedding representation for gender prediction. In Proceedings of the 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), San Jose, CA, USA, 6–8 November 2016; pp. 263–269. [Google Scholar]
- Lay, A.; Ferwerda, B. Predicting users' personality based on their 'liked' images on Instagram. In Proceedings of the 23rd International Conference on Intelligent User Interfaces, Tokyo, Japan, 7–11 March 2018. [Google Scholar]
- Mairesse, F.; Walker, M.A.; Mehl, M.R.; Moore, R.K. Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Intell. Res. 2007, 30, 457–500. [Google Scholar] [CrossRef]
- Rajaraman, A.; Ullman, J.D. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
- Adomavicius, G.; Sankaranarayanan, R.; Sen, S.; Tuzhilin, A. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans. Inf. Syst. 2005, 23, 103–145. [Google Scholar] [CrossRef]
- Żołna, K.; Romański, B. User modeling using LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
- Pan, S.; Ding, T. Social media-based user embedding: A literature review. arXiv 2019, arXiv:1907.00725. [Google Scholar]
- Xing, L.; Paul, M.J. Incorporating Metadata into Content-Based User Embeddings. In Proceedings of the 3rd Workshop Noisy User-Generated Text, Copenhagen, Denmark, 7 September 2017; pp. 45–49. Available online: http://aclweb.org/anthology/W17-4406 (accessed on 10 June 2021).
- Littman, J.; Wrubel, L.; Kerchner, D.; Gaber, Y.B. News Outlet Tweet Ids. Harv. Dataverse 2017. [Google Scholar] [CrossRef]
- Binkley, P. Twarc-Report README.md. 2015. Available online: https://github.com/DocNow/twarc (accessed on 20 February 2021).
- Jaccard, P. Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat. 1908, 44, 223–270. [Google Scholar]
- Vijaymeena, M.K.; Kavitha, K. A survey on similarity measures in text mining. Mach. Learn. Appl. An Int. J. 2016, 3, 19–28. [Google Scholar]
- Hoff, P.D.; Raftery, A.E.; Handcock, M.S. Latent space approaches to social network analysis. J. Am. Stat. Assoc. 2002. [Google Scholar] [CrossRef]
- Dai, A.M.; Olah, C.; Le, Q.V. Document embedding with paragraph vectors. arXiv 2015, arXiv:1507.07998. [Google Scholar]
- Benton, A.; Dredze, M. Using Author Embeddings to Improve Tweet Stance Classification. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Brussels, Belgium, 1 November 2018. [Google Scholar] [CrossRef]
- Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Google-News Pre-trained Vectors (GoogleNews-Vectors-Negative300.bin.gz). Available online: https://code.google.com/archive/p/word2vec/ (accessed on 1 June 2021).
- Lau, J.H.; Baldwin, T. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv 2016, arXiv:1607.05368. [Google Scholar]
- van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Name | ID | Type | Source and Description | Size Information |
---|---|---|---|---|
300K-TW | A | Text corpus | Tweets of manually selected Twitter users for five pre-determined groups. This is the text-only dataset used in our previous study. | Approximately 300K tweets from 240 users.
1M-Wiki | B | Text corpus | English Wikipedia dump dated 20 June 2019, containing 4,672,834 Wikipedia entity pages. | 1M randomly selected documents with a minimum of 50 and a maximum of 1K characters.
1M-TW | C | Text corpus | News Outlet Tweet IDs [34]. It contains 39,695,156 tweet IDs. (*) | A total of 1M randomly selected tweets.
NL | D | Social network data | Dataset consisting of the username and location information of each user in the 300K-TW dataset. | Information of 240 users.
NLD-RLC | E | Text corpus and social network data | Description information is added to the NL dataset. This NLD dataset is merged with the lists of users who retweeted/liked/commented on the tweets (RLC) in 300K-TW. | Information of 300K tweets and 240 users.
Group Id | Group Name | Number of Unique Handles |
---|---|---|
G1 | Economy | 247,109 |
G2 | Crypto Economy | 340,653 |
G3 | Technology | 649,865 |
G4 | Fashion | 803,920 |
G5 | Politics | 1,278,024 |
∩ | G1 | G2 | G3 | G4 | G5 |
---|---|---|---|---|---|
G1 | 450,162 | 57,347 | 65,247 | 15,561 | 122,840
G2 |  | 439,652 | 44,633 | 6,782 | 25,580
G3 |  |  | 817,242 | 29,839 | 86,233
G4 |  |  |  | 887,604 | 57,264
G5 |  |  |  |  | 1,515,083
Jaccard Similarity | G1 | G2 | G3 | G4 | G5 | Overlap Coefficient | G1 | G2 | G3 | G4 | G5 |
---|---|---|---|---|---|---|---|---|---|---|---|
G1 | 1 | 0.069 | 0.054 | 0.012 | 0.067 | G1 | 1 | 0.130 | 0.145 | 0.035 | 0.273
G2 |  | 1 | 0.037 | 0.005 | 0.013 | G2 |  | 1 | 0.102 | 0.015 | 0.058
G3 |  |  | 1 | 0.018 | 0.038 | G3 |  |  | 1 | 0.037 | 0.106
G4 |  |  |  | 1 | 0.024 | G4 |  |  |  | 1 | 0.065
G5 |  |  |  |  | 1 | G5 |  |  |  |  | 1
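Both measures in the table above can be computed directly from the handle sets of two groups; a minimal sketch with toy sets (not the actual group data):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

g1 = {"user_a", "user_b", "user_c"}                 # handles interacting with G1 (toy data)
g2 = {"user_b", "user_c", "user_d", "user_e"}       # handles interacting with G2 (toy data)

print(jaccard(g1, g2), overlap_coefficient(g1, g2))  # 0.4 and ~0.667
```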
Case No. | Dataset | Scenario |
---|---|---|
1 | A | Use of tweets belonging to a limited number of users, selected according to the domains addressed in the problem. |
2 | B | Use of a big, general-purpose text corpus such as Wikipedia, News, etc. |
3 | C | Use of a large number of crawled tweets such as tweets from a specific time period. |
4 | A, B | Use of various datasets (A, B, C) that are put together in different combinations. |
5 | B, C | |
6 | A, C | |
7 | A, B, C |
Tweet Data | Sample |
---|---|
Tweet Id | tweet_id_1 |
User name | usr_name_1 |
Location | loc_1 |
Description | desc_1 |
Tweet-List of tokens | tkn_1, tkn_2, tkn_3, tkn_4, tkn_5 |
List of likes | Lusr_1, Lusr_2, Lusr_3, Lusr_4, Lusr_5, Lusr_6 |
List of retweets | Rusr_7, Rusr_8, Rusr_9, Rusr_10 |
List of comments | Cusr_11, Cusr_12 |
Enrichment | |
Enriched_Data_1 | tkn_1 tkn_2 tkn_3 tkn_4 tkn_5 enrtag enrtag enrtag enrtag enrtag nametag nametag nametag nametag nametag usr_name_1 nametag nametag nametag nametag nametag loctag loctag loctag loctag loctag loc_1 loctag loctag loctag loctag loctag desctag desctag desctag desctag desctag desc_1 desctag desctag desctag desctag desctag liketag liketag liketag liketag liketag Lusr_1 liketag liketag liketag liketag liketag Lusr_2 liketag liketag liketag liketag liketag Lusr_3 liketag liketag liketag liketag liketag Lusr_4 liketag liketag liketag liketag liketag Lusr_5 liketag liketag liketag liketag liketag Lusr_6 liketag liketag liketag liketag liketag rttag rttag rttag rttag rttag Rusr_7 rttag rttag rttag rttag rttag Rusr_8 rttag rttag rttag rttag rttag Rusr_9 rttag rttag rttag rttag rttag Rusr_10 rttag rttag rttag rttag rttag commtag commtag commtag commtag commtag Cusr_11 commtag commtag commtag commtag Cusr_12 |
Tagged Document | TaggedDocument ([Enriched Data_1], [tweet_id_1]) |
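A minimal sketch of how an enriched document like the Enriched_Data_1 sample above could be assembled into a Gensim TaggedDocument; the helper function and the fixed repetition count of five are illustrative assumptions inferred from the sample row:

```python
from gensim.models.doc2vec import TaggedDocument

def enrich(tokens, user_name, location, description, likes, retweets, comments,
           tag_repeat=5):
    """Surround each non-textual field with repeated marker tags, mirroring the
    Enriched_Data_1 sample (the repeat count of 5 reflects that example)."""
    words = list(tokens)
    words += ["enrtag"] * tag_repeat
    words += ["nametag"] * tag_repeat + [user_name] + ["nametag"] * tag_repeat
    words += ["loctag"] * tag_repeat + [location] + ["loctag"] * tag_repeat
    words += ["desctag"] * tag_repeat + [description] + ["desctag"] * tag_repeat
    for handle in likes:                              # liketag ... handle ... liketag
        words += ["liketag"] * tag_repeat + [handle]
    words += ["liketag"] * tag_repeat
    for handle in retweets:                           # rttag ... handle ... rttag
        words += ["rttag"] * tag_repeat + [handle]
    words += ["rttag"] * tag_repeat
    for handle in comments:                           # commtag ... handle
        words += ["commtag"] * tag_repeat + [handle]
    return words

doc = enrich(["tkn_1", "tkn_2"], "usr_name_1", "loc_1", "desc_1",
             ["Lusr_1", "Lusr_2"], ["Rusr_7"], ["Cusr_11"])
tagged = TaggedDocument(words=doc, tags=["tweet_id_1"])
```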
Case No. | Dataset | Scenario |
---|---|---|
1 | A, D | In the study in [33], user information is incorporated with tweets. To compare with that study, Datasets A and D are used. |
2 | A, E | Use of tweets, user information, and user activity data together. |
3 | A, B, E | Use of tweets, non-textual data, and a general-purpose corpus together. |
4 | A, C, E | |
5 | A, B, C, E |
Case No. | PWE Model | Scenario |
---|---|---|
1 | GloVe [8], FastText [41], Google News [42], NumberBatch [14] | Comparison of publicly available PWEs. |
2 | GloVe | Using the same PWE algorithm trained on different corpora, with different dimension sizes. |
3 | Word2Vec (ours) | Training a word2vec model on our own dataset, with different parameters (dimension size, minimum count). |
4 | BERT, ELMo | Use of pre-trained contextualized word representation models. |
5 | BERT, ELMo, GloVe | Use of stacked embeddings as presented in [10]; a sketch follows this table. |
6 | TF-IDF | Using a sparse representation technique as another baseline; TF-IDF is investigated with different numbers of features. |
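A minimal sketch of the stacked-embedding setup of case 5 using the FLAIR library [10]; the exact embedding identifiers and the mean-pooling step are assumptions (the paper reports using pre-trained multilingual versions), and ELMo support requires FLAIR's optional allennlp dependency:

```python
from flair.data import Sentence
from flair.embeddings import (WordEmbeddings, ELMoEmbeddings,
                              TransformerWordEmbeddings,
                              StackedEmbeddings, DocumentPoolEmbeddings)

# Stack GloVe, ELMo, and BERT word embeddings, then mean-pool them into a
# single document vector for each tweet.
stacked = StackedEmbeddings([
    WordEmbeddings("glove"),
    ELMoEmbeddings(),                              # needs `pip install allennlp`
    TransformerWordEmbeddings("bert-base-uncased"),
])
doc_embedder = DocumentPoolEmbeddings([stacked], pooling="mean")

sentence = Sentence("bitcoin price rallies after the announcement")
doc_embedder.embed(sentence)
vector = sentence.embedding            # concatenated, mean-pooled document vector
```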
Case No. (See Table 5) | Training Dataset | PV-DM (dm = 1), min_count = 1 | PV-DM (dm = 1), min_count = 5 | PV-DBOW (dm = 0), min_count = 1 | PV-DBOW (dm = 0), min_count = 5
---|---|---|---|---|---
1 | A | 0.945 | 0.960 | 0.910 | 0.925 |
2 | B | 0.920 | 0.900 | 0.900 | 0.900 |
3 | C | 0.940 | 0.945 | 0.910 | 0.915 |
4 | A + B | 0.930 | 0.935 | 0.915 | 0.935 |
5 | B + C | 0.890 | 0.955 | 0.900 | 0.940 |
6 | A + C | 0.970 | 0.975 | 0.925 | 0.930 |
7 | A + B + C | 0.940 | 0.975 | 0.915 | 0.965 |
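For context, a minimal sketch of how vectors inferred from a trained doc2vec model might feed the downstream group classification summarized above; the logistic-regression classifier, the toy data, and the per-user averaging of tweet vectors are assumptions, not details taken from the paper:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy per-user data: {handle: (list of tweets, group label)}.
users = {
    "u1": (["bitcoin price rallies", "ethereum gains"], "crypto"),
    "u2": (["paris fashion week", "new runway looks"], "fashion"),
    "u3": (["crypto market drops", "bitcoin halving soon"], "crypto"),
    "u4": (["street style trends", "fashion show tonight"], "fashion"),
}

# Train a small doc2vec model on all tweets (stand-in for the paper's models).
corpus = [TaggedDocument(t.split(), [f"{u}_{i}"])
          for u, (tweets, _) in users.items() for i, t in enumerate(tweets)]
model = Doc2Vec(corpus, vector_size=50, dm=1, min_count=1, epochs=40)

def user_vector(tweets):
    """Average the inferred vectors of a user's tweets (assumed aggregation)."""
    return np.mean([model.infer_vector(t.split()) for t in tweets], axis=0)

x = [user_vector(tweets) for tweets, _ in users.values()]
y = [label for _, label in users.values()]

# Train on the first three users, evaluate on the last one (illustration only).
clf = LogisticRegression(max_iter=1000).fit(x[:3], y[:3])
print(accuracy_score(y[3:], clf.predict(x[3:])))
```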
Case No. (See Table 7) | Training Dataset | PV-DM (dm = 1), min_count = 1 | PV-DM (dm = 1), min_count = 5 | PV-DBOW (dm = 0), min_count = 1 | PV-DBOW (dm = 0), min_count = 5
---|---|---|---|---|---
1 | A + D | 0.925 | 0.910 | 0.880 | 0.895 |
2 | A + E | 0.940 | 0.970 | 0.930 | 0.860 |
3 | A + B + E | 0.920 | 0.950 | 0.930 | 0.960 |
4 | A + C + E | 0.940 | 0.960 | 0.905 | 0.945 |
5 | A + B + C + E | 0.935 | 0.950 | 0.890 | 0.940 |
Exp. #1 | G1 | G2 | G3 | G4 | G5 | Exp. #2 | G1 | G2 | G3 | G4 | G5 |
---|---|---|---|---|---|---|---|---|---|---|---|
G1 | 38 | 0 | 1 | 0 | 1 | G1 | 38 | 0 | 1 | 0 | 1 |
G2 | 1 | 38 | 1 | 0 | 0 | G2 | 1 | 39 | 0 | 0 | 0 |
G3 | 0 | 0 | 39 | 0 | 1 | G3 | 0 | 0 | 39 | 1 | 0 |
G4 | 0 | 0 | 0 | 40 | 0 | G4 | 0 | 0 | 0 | 40 | 0 |
G5 | 0 | 0 | 0 | 0 | 40 | G5 | 1 | 0 | 0 | 1 | 38 |
PWE MODEL | Dimension | Accuracy | F1 |
---|---|---|---|
FastText | 300 | 0.810 | 0.809 |
NumberBatch | 300 | 0.810 | 0.807 |
Google News | 300 | 0.720 | 0.713 |
Corpus Source | Vocab Size | Number of Tokens | Dimension | Accuracy | F1 |
---|---|---|---|---|---|
Wikipedia |  |  | 100 | 0.835 | 0.828
 |  |  | 200 | 0.815 | 0.808
 |  |  | 300 | 0.800 | 0.797
Twitter |  |  | 25 | 0.800 | 0.799
 |  |  | 50 | 0.805 | 0.799
 |  |  | 100 | 0.835 | 0.830
 |  |  | 200 | 0.840 | 0.834
Min Count | Acc | F1 |
---|---|---|
1 | 0.790 | 0.786 |
3 | 0.830 | 0.827 |
5 | 0.845 | 0.842 |
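A minimal sketch of training a word2vec model on a tweet corpus and averaging word vectors into a document representation, as in the word2vec scenario (case 3) of the PWE experiments; the toy corpus and the averaging step are illustrative assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tweet corpus: each tweet is a list of tokens.
tweets = [["bitcoin", "price", "rallies"],
          ["new", "fashion", "week", "trends"],
          ["parliament", "votes", "on", "the", "new", "bill"]]

# min_count controls how rare a word may be and still receive a vector.
w2v = Word2Vec(tweets, vector_size=300, min_count=1, epochs=20)

def tweet_vector(tokens):
    """Average the vectors of in-vocabulary tokens (assumed aggregation)."""
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

print(tweet_vector(["bitcoin", "rallies"]).shape)
```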
Embedding Model | Dimension | Accuracy | F1 |
---|---|---|---|
BERT | 3072 | 0.815 | 0.814 |
ELMo | 3072 | 0.770 | 0.765 |
BERT-ELMo-GloVe (Stacked) | 3072 + 3072 + 100 | 0.815 | 0.812 |
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).