Detecting Social Media Bots with Variational AutoEncoder and k-Nearest Neighbor
Abstract
1. Introduction
- (1) First, a VAE encodes and decodes the sample features. After decoding, the features of normal samples remain close to their initial features, while the features of abnormal samples differ from theirs.
- (2) The original features and the decoded features are fused, and the anomaly detection method is then applied.
- (3) Our method takes into account that abnormal users are far fewer than normal users in a social network and that they are difficult to isolate during data collection. It addresses two shortcomings of existing social media bot detection methods: high labeling costs and the imbalance between positive and negative samples. By reducing the number of abnormal samples that participate in model training, it enables efficient detection of social media bots in social networks.
2. Related Work
2.1. Social Media Bots
2.2. Anomaly Detection Research
3. Social Media Bot Detection
3.1. Detection Framework
3.2. Feature Extraction
- (1) The average number of mentions in tweets: social media bots and normal users mention other users for particular purposes, which leads to different proportions of tweets containing @ among all tweets. The definition of this indicator is as follows:
- (2) The average number of emojis used in tweets: the writing style of social media bots differs markedly from that of normal users. Bots’ tweets are either full of emojis or contain none at all, whereas normal users’ use of emojis is less extreme. This indicator is defined as follows:
- (3) The average number of stop words in tweets: stop words are the most frequently used words in tweets, so they reflect an account’s writing style. Normal human users and social media bots differ somewhat in their use of stop words, as indicated by the following equation:
- (4) The average number of topics in tweets: the #xx form marks a specific topic on Twitter. Both normal users and social media bots follow certain topics and participate in discussions. Some social media bots participate in a large number of discussions to achieve their goals and improve their reputation. The average topic tag usage is defined as follows:
- (5) The average number of links in tweets: social media bots always post tweets for some purpose, such as spreading harmful links or advertisements, but social platforms limit the length of tweets, so a tweet cannot explain all of its content in detail. Bots therefore use hyperlinks to point to other platforms, which leads to a higher proportion of tweets containing links than for normal users. The link percentage is defined as follows:
- (6) The proportion of retweets: publishing tweets is the main activity on social platforms. Social media bots and normal users generally increase their popularity or participate in activities by retweeting others’ content and by continuously publishing original tweets in a certain field. We define the forwarding rate to observe the difference between malicious social media bots and real users. The forwarding rate is defined as follows:
- (7) The average similarity of tweets: social media bots publish tweets mechanically, so messages belonging to the same bot are very similar. Term Frequency-Inverse Document Frequency (TF-IDF) is used to weight each word, the cosine similarity is calculated for each pair of tweets, and the average of the resulting scores is taken as the feature. The cumulative distribution function is shown in Figure 2g. The content similarity of most normal users is very low, whereas the curve for social media bots rises sharply after 0.8. The highest similarity among normal users is 17.45%, and the highest among social media bots is 98.2%. This shows that social media bots often send similar tweets and may batch-publish identical ones, so the tweet similarity index helps distinguish social media bots from normal users.
- (8) The average length of original tweets: normal users and social media bots differ in writing style, and the lengths of their tweets, both original and forwarded, differ as well. Figure 2h shows the average tweet length for normal users and social media bots; the left side shows the average length of a user’s original tweets. The average length of normal users’ original tweets is much lower than that of social media bots, which usually add a lot of irrelevant information to their tweets for the purpose of dissemination.
- (9) The average length of forwarded tweets: in contrast to original tweets, the tweets forwarded by social media bots are much shorter than those forwarded by normal users, as shown in Figure 2h. This suggests that social media bots retweet without adding comments.
- (10) The average usage of seven punctuation marks, which contributes seven dimensions, one for each of the symbols “,”, “.”, “;”, “””, “!”, “(”, and “)”. Figure 2i shows the average usage of these symbols in tweets by normal users and social media bots. The figure shows great differences between the two groups in the usage of symbols.
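Several of the indicators above can be sketched with the Python standard library. The sketch below is illustrative only: the regular expressions, the "RT @" retweet convention, whitespace tokenization, and the smoothed IDF variant are all assumptions, and the function names are hypothetical. It counts occurrences per tweet, which may differ from the exact definitions used for each indicator.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Assumed surface patterns for mentions, topic tags, and links.
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
LINK = re.compile(r"https?://\S+")


def count_features(tweets):
    """Per-user averages over a non-empty list of tweet strings."""
    n = len(tweets)
    return {
        "mentions": sum(len(MENTION.findall(t)) for t in tweets) / n,
        "hashtags": sum(len(HASHTAG.findall(t)) for t in tweets) / n,
        "links": sum(len(LINK.findall(t)) for t in tweets) / n,
        # Assumption: a retweet starts with the conventional "RT @" prefix.
        "retweet_ratio": sum(t.startswith("RT @") for t in tweets) / n,
    }


def tfidf_vectors(tweets):
    """Sparse TF-IDF vectors using a smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    docs = [Counter(t.lower().split()) for t in tweets]
    df = Counter(tok for doc in docs for tok in doc)
    n = len(docs)
    return [
        {tok: tf * (math.log((1 + n) / (1 + df[tok])) + 1.0) for tok, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_similarity(tweets):
    """Mean cosine similarity over all pairs of one user's tweets."""
    pairs = list(combinations(tfidf_vectors(tweets), 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
```

An account that posts the same text repeatedly gets an average similarity close to 1, while an account whose tweets share no words gets 0, which mirrors the separation described for the similarity feature.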
3.3. Anomaly Detection
- (1) Approximate inference of the posterior distribution of the hidden variable z: the recognition model, an inference network, represents the process of inferring z from a known value of x.
- (2) Generation of the conditional distribution of the variables: this conditional distribution is modeled by a generative network.
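For reference, the standard VAE formulation of Kingma and Welling trains the two networks above jointly by maximizing a lower bound on the log-likelihood. The notation below ($\phi$ and $\theta$ for the parameters of the recognition and generative networks, a standard normal prior $p(z)$, and a diagonal Gaussian posterior with mean $\mu$ and standard deviation $\sigma$ over $J$ latent dimensions) is the conventional one and is assumed here, not taken from this text.

```latex
\begin{aligned}
\mathcal{L}(\theta, \phi; x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \\
z &= \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \\
D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
  &= \tfrac{1}{2} \sum_{j=1}^{J} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big).
\end{aligned}
```

The first term rewards accurate reconstruction of x, the second keeps the approximate posterior close to the prior, and the reparameterization in the middle line lets gradients flow through the sampling step.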
- (1) The distance between each sample in the test set and every sample in the training set is calculated once;
- (2) For each sample, the distances to its k nearest neighbors are averaged;
- (3) The samples are sorted in descending order of this average distance, and the first n points in the sorted list (n being the number of outliers in the test set) are taken as anomalous samples.
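The three steps above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance, which the steps do not specify; the function names and the toy data are hypothetical.

```python
import math


def knn_outlier_scores(train, test, k):
    """Score each test sample by its mean distance to its k nearest training samples."""
    scores = []
    for x in test:
        # Step 1: distance from this test sample to every training sample.
        dists = sorted(math.dist(x, t) for t in train)
        # Step 2: average the k smallest distances.
        scores.append(sum(dists[:k]) / k)
    return scores


def top_n_outliers(scores, n):
    """Step 3: sort by score, largest first, and return the indices of the first n."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Toy data: four training points on the unit square, one test point far away.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
test = [(0.5, 0.5), (5.0, 5.0), (0.0, 0.5)]
scores = knn_outlier_scores(train, test, k=2)
outliers = top_n_outliers(scores, n=1)
```

Here the second test point lies far from every training sample, so it receives the largest score and is the one returned in `outliers`.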
Algorithm 1. Specific description of the algorithm.
Input: Dataset X; network parameters; Mini-batch: batch; Epochs: epoch; Learning rate: lr.
Output: Detection results.
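As a rough illustration of how the stages named in the Introduction (reconstruction, fusion, anomaly detection) could fit together, a minimal sketch follows. It assumes that fusion means concatenating the original and decoded feature vectors and that detection uses the mean Euclidean distance to the k nearest training samples. `reconstruct` and `toy_reconstruct` are hypothetical stand-ins for a trained VAE's encode-then-decode pass, not the authors' model.

```python
import math


def fuse(x, x_hat):
    """Assumed fusion: concatenate original and decoded feature vectors."""
    return list(x) + list(x_hat)


def detect(train, test, reconstruct, k, n):
    """Return the indices of the n test samples with the highest anomaly scores.

    `reconstruct` is a hypothetical callable standing in for the trained VAE;
    it maps a feature vector to its reconstruction.
    """
    fused_train = [fuse(x, reconstruct(x)) for x in train]
    fused_test = [fuse(x, reconstruct(x)) for x in test]
    scores = []
    for f in fused_test:
        # Mean distance to the k nearest fused training samples.
        nearest = sorted(math.dist(f, g) for g in fused_train)[:k]
        scores.append(sum(nearest) / k)
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:n]


def toy_reconstruct(x):
    """Toy stand-in: reproduces values inside [0, 1] and clips everything else."""
    return [min(max(v, 0.0), 1.0) for v in x]


train = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15]]
test = [[0.12, 0.18], [3.0, 2.5]]
flagged = detect(train, test, toy_reconstruct, k=2, n=1)
```

The toy reconstruction returns in-range samples unchanged and distorts out-of-range ones, imitating the behavior expected of a VAE trained mostly on normal users; in practice the callable would wrap the trained model.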
4. Experiments
4.1. Data Description
4.2. Results and Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Dataset | Users | Maximum Length of Tweet |
---|---|---|
English training set | 2880 | 933 |
English test set | 1240 | 646 |
Spanish training set | 2080 | 932 |
Spanish test set | 920 | 876 |
Method | AUC | Precision | Recall | Time |
---|---|---|---|---|
VAE-KNN-1 | 0.9649 | 0.9108 | 0.9749 | 0.0709 |
VAE-KNN-2 | 0.8095 | 0.7589 | 0.5539 | 0.0658 |
VAE-KNN-3 | 0.9834 | 0.9379 | 0.9879 | 0.1396 |
Method | Precision | Recall | Time |
---|---|---|---|
ABOD | 0.8612 | 0.9788 | 3.2796 |
CBLOF | 0.8131 | 0.9628 | 2.8504 |
Feature Bagging | 0.7806 | 0.975 | 1.6655 |
HBOS | 0.6481 | 0.6937 | 3.0498 |
IForest | 0.7883 | 0.8874 | 0.6094 |
OCSVM | 0.7699 | 0.5408 | 0.386 |
AE | 0.7583 | 0.8602 | 22.3951 |
VAE | 0.7456 | 0.8549 | 31.032 |
KNN | 0.9379 | 0.9879 | 0.1396 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, X.; Zheng, Q.; Zheng, K.; Sui, Y.; Cao, S.; Shi, Y. Detecting Social Media Bots with Variational AutoEncoder and k-Nearest Neighbor. Appl. Sci. 2021, 11, 5482. https://doi.org/10.3390/app11125482