1.1. Background and Purpose
Social media provides a virtual space for sharing thoughts and information, engaging with others, and creating online communities [1,2,3]. It generates large volumes of user-generated data and provides unprecedented opportunities for computational social science researchers [4,5]. Thus, social profiling, which profiles users based on their social media data, has recently gained significant attention for social media studies on human dynamics [6,7]. Among the many social profile attributes available on social media, user demographics include age, gender, race, city, country, and occupation. Additionally, it is essential to understand patterns, differences, and trends through demographic analysis, as demographics usually determine online social and behavioral patterns [3].
Most prior studies that used demographic information have treated age as a principal variable to be explored in social profiling and human dynamics studies. Behavioral differences in social media are evident among people of different ages [8]. For example, teenagers generally pay less attention to privacy and tend to share personal information more carelessly on social networks [9]. In contrast, adult users are careful when posting and pay attention to who can read their comments. As a result, they more often write sentences that express positive emotions, with fewer negations and less slang [10].
In addition, there are typical behaviors among users of the same age group when the same topic is discussed [11]. For example, among the numerous topics that teenagers generally discuss in their daily lives, topics such as relationships, school, and friends are more frequent [12]. In contrast, when adults use social media to express their opinions, they often reveal classic identity markers of adulthood by discussing topics such as religion, ideology, politics, and work. Moreover, adult users tend to provide photos, videos, and URLs to complement their opinions [13].
However, prior studies on age information-based social profiling have been restricted because age information is not always available on social media [14]. In other words, most prior age research on social media has either used small datasets for which age information could be collected or relied on a simple but unreliable strategy to extract age information from social media (e.g., using descriptions that contain expressions like “12 years” and “I have 12 years”) [15]. Moreover, social media’s anonymity and privacy policies have made it more challenging to obtain demographic information [16]. Therefore, prior researchers have not even attempted to collect anonymous social media data with incomplete age information (i.e., data only partially open to the public) for social profiling studies.
Accordingly, acquiring age information on social media has been a concern, and efforts have been made to address it [11]. Most previous studies on age information-based social profiling have therefore focused on age predictions, as shown in Table 1. Nevertheless, challenging issues for age information-based social profiling still exist, as listed below.
First, according to this study’s literature review, most of the existing studies focused only on predictions and made little effort to link the age prediction results to the analysis of human dynamics. As age predictions make more age information available, more studies on human dynamics should make use of the predicted age information.
Second, most of the social media data that prior studies used were in English or Chinese, whereas social profiling has rarely been performed using Korean social media data. However, using social media data in various languages for age analysis will provide a richer understanding of various individuals and societies; therefore, efforts to acquire social media data in diverse languages are required for more effective social profiling studies.
To resolve the abovementioned problems and challenges, this study focused on naver.com, one of Korea’s major news portal sites, and pioneered its use for age information-based social profiling studies. Unlike other sites, naver.com provides age information (i.e., rates of people in their 10s, 20s, 30s, 40s, and ≥50s as age group distributions) of anonymous commenters on news articles. The age information of naver.com is reliable because users sign into the service via real-name authentication. Therefore, its age information can be used for age information-based social profiling studies of social media users in Korea. However, two problems must be solved before doing so, as described below.
First, naver.com has a policy of making the age group distribution of anonymous commenters on a news article open to the public only if the number of its direct news comments exceeds a specific number, i.e., 100. (The term “direct news comment” refers to a comment that replies directly to a news article, as distinguished from “news comments” on a news article, which include both direct comments and replies to other comments.) In other words, news articles with fewer than 100 direct news comments can be considered unlabeled news articles. Based on the data that this study collected, around 91% of news articles published on naver.com were unlabeled.
Second, as a result, naver.com has not yet been used to predict age information or analyze the social profiles of Korean users by using the age information that has been collected or predicted. Hence, methodologically there are several elements that are unclear: (i) how age information of anonymous commenters on a news article can be represented by extracted features; (ii) which prediction technique can give a better performance at predicting age information using the age information representation; and (iii) how the social profiles of commenters on news articles can be analyzed using the predicted age information.
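The labeling rule described in the first problem above can be illustrated with a minimal sketch. It assumes that an article counts as labeled when it has at least 100 direct comments (following the "fewer than 100" wording) and a public age group distribution; the `Article` class, its field names, and the sample values are hypothetical and are not taken from the data collected in this study.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: articles with at least this many direct comments expose
# their age group distribution; articles below it are unlabeled.
LABEL_THRESHOLD = 100


@dataclass
class Article:
    article_id: str
    direct_comment_count: int
    # Rates for the 10s, 20s, 30s, 40s, and >=50s groups; None when not public.
    age_distribution: Optional[List[float]] = None


def split_by_label(
    articles: List[Article], threshold: int = LABEL_THRESHOLD
) -> Tuple[List[Article], List[Article]]:
    """Separate labeled articles from unlabeled ones by direct comment count."""
    labeled, unlabeled = [], []
    for article in articles:
        has_label = (
            article.direct_comment_count >= threshold
            and article.age_distribution is not None
        )
        (labeled if has_label else unlabeled).append(article)
    return labeled, unlabeled


# Hypothetical sample: one labeled article and two unlabeled ones.
articles = [
    Article("a1", 250, [0.02, 0.18, 0.30, 0.30, 0.20]),
    Article("a2", 12),
    Article("a3", 99),
]
labeled, unlabeled = split_by_label(articles)
unlabeled_share = len(unlabeled) / len(articles)
```

The threshold is kept as a parameter because the text describes it only as "a specific number, i.e., 100".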
To address these questions, this study proposed a method for predicting the age group distribution of anonymous news commenters for news articles and then using this predicted age information to investigate and understand differences in topics of interest between age groups as human dynamics. To be specific, it adopted a machine learning approach for predicting the age group distribution of anonymous commenters on unlabeled news articles. In this approach, each news article was represented by textual characteristics based on its comments. Labeled news articles (i.e., those whose age group distributions were open to the public) were used to evaluate machine learning techniques for predicting age information, and the best prediction technique was selected and used to perform age information predictions for unlabeled news articles. Consequently, all collected news articles could be labeled with age information. Thereafter, using the section information as a cue for topics of interest to age groups, the fuzzy differences in topics of interest among age groups were explored and compared.
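The evaluate, select, and predict steps of this approach can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain k-nearest-neighbors regressor (k-NN is one of the techniques this study lists in Table 2), compares candidate values of k by mean absolute error on held-out labeled articles (the error measure is an assumption), and works on hypothetical two-dimensional feature vectors rather than real comment features.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, age group distribution over 10s, 20s, 30s, 40s, >=50s)
Sample = Tuple[List[float], List[float]]


def knn_predict(train: Sequence[Sample], x: Sequence[float], k: int) -> List[float]:
    """Average the age distributions of the k training articles nearest to x."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    n_groups = len(nearest[0][1])
    return [sum(s[1][g] for s in nearest) / len(nearest) for g in range(n_groups)]


def mean_absolute_error(train: Sequence[Sample], valid: Sequence[Sample], k: int) -> float:
    """Mean absolute error over all age groups of all validation articles."""
    errors = []
    for x, y in valid:
        pred = knn_predict(train, x, k)
        errors.extend(abs(p - t) for p, t in zip(pred, y))
    return sum(errors) / len(errors)


def select_k(train: Sequence[Sample], valid: Sequence[Sample], candidates: Sequence[int]) -> int:
    """Pick the candidate k with the lowest validation error."""
    return min(candidates, key=lambda k: mean_absolute_error(train, valid, k))


# Hypothetical labeled articles, split into training and validation parts.
train = [
    ([0.0, 0.0], [0.5, 0.2, 0.1, 0.1, 0.1]),
    ([0.1, 0.1], [0.3, 0.4, 0.1, 0.1, 0.1]),
    ([1.0, 1.0], [0.0, 0.1, 0.2, 0.2, 0.5]),
    ([0.9, 0.9], [0.0, 0.1, 0.2, 0.4, 0.3]),
]
valid = [
    ([0.04, 0.04], [0.4, 0.3, 0.1, 0.1, 0.1]),
    ([0.96, 0.96], [0.0, 0.1, 0.2, 0.3, 0.4]),
]

best_k = select_k(train, valid, candidates=[1, 2, 4])

# Predict the age group distribution of a hypothetical unlabeled article.
prediction = knn_predict(train, [0.05, 0.0], best_k)
```

Because each prediction is an average of distributions that sum to one, the predicted rates also sum to one; other regressors would not guarantee this without an extra normalization step.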
1.2. Reviews on Related Works
User profiling refers to the process of collecting, cleaning, and presenting an individual’s characteristics that are related to demographics and behavior [17]. These attributes often include basic information (e.g., age, gender, location, education, and occupation). Additionally, user profiles can include elements that reflect more complex aspects, such as preferences, interests, behaviors, and personality traits. Recently, user profiling has evolved into social profiling, which leverages social information to generate user profiles (e.g., social actions, such as clicks and likes on social media) [18]. Prior social profiling studies are summarized in Table 1.
As shown in Table 1, social profiling can be classified into individual and group profiling [3]. Individual profiling involves learning about a person based on demographics (e.g., age, gender, and location) and psychographics (e.g., behavior, personality traits, and interests) by directly asking questions or tracking behaviors online and offline. Group profiling is a process used to represent individuals who share common attributes and may or may not be identified as a group. It mainly includes community detection and subsequent analysis of communities. In relation to the taxonomy of social profiling, this study focused on age, one of the demographic characteristics of individual profiling, but extended to the perspective of group profiling by using the distribution of age groups among commenters on a news article.
According to this study’s purpose, prior works related to social profiling can be divided into three categories: predicting social profile attributes (also known as digital footprints) [19], using collected or generated social profiles to analyze human dynamics, and performing both in a sequence. However, most of the prior studies aimed at predicting the attributes of social profiles (e.g., personality traits and demographics). Based on the taxonomies in Table 1, this study’s purpose can be classified as examining both age prediction and human dynamics, using the predicted age information and interests of groups as two social profile attributes. As age differences were analyzed at different topic levels using section information, this study contributes to the related literature.
Table 1.
Prior social profiling studies.
Previous Work | Description | Social Profile Attributes (Individual) | Social Profile Attributes (Group) | Purpose (Prediction) | Purpose (Human Dynamics Studies) |
---|---|---|---|---|---|
Lima and de Castro [20] | Personality traits prediction | Personality traits | | √ | |
Segalin, Cheng, and Cristani [17] | Personality traits prediction | Personality traits | Personality traits | √ | |
Wang et al. [21] | Joint gender and age prediction | Gender, Age | | √ | |
Guimarães, Rosa, Gaetano, Rodríguez, and Bressan [11] | Age group classification of people in their 10s to adults, analyzing the importance of age information to the precision of a sentiment metric | | Age, sentiment, and relationship between age and sentiment | √ | √ |
Wang et al. [22] | Demographics prediction | Gender, Age, and Location | | √ | |
Chen et al. [23] | Age prediction | Age | | √ | |
Lee and Ryu [24] | Gender difference analysis | | Gender, Interest | | √ |
Fang et al. [25] | Prediction of age and gender | Age, Gender | | √ | |
López-Santillán, Montes-Y-Gómez, González-Gurrola, Ramírez-Alonso, and Prieto-Ordaz [7] | Prediction of age, gender, and personality traits | Age, Gender, Personality traits | | √ | |
Han et al. [26] | Personality traits prediction | Personality traits | Personality traits | √ | |
Romanov et al. [27] | Age Prediction | Age | Age | √ | |
Figueroa, Peralta, and Nicolis [15] | Age Prediction | Age | Age | √ | |
Kamalesh and B [28] | Personality traits prediction | Personality traits | | √ | |
Khorrami et al. [29] | Personality traits prediction | Personality traits | Personality | √ | |
Zhou et al. [30] | Personality traits prediction | Personality traits | | √ | |
This study | Age prediction and fuzzy age differences in terms of interesting topics | | Age, Interest | √ | √ |
Table 1 shows that most of the prior studies focused on predicting social profiles. Table 2 provides additional details on the prior studies in terms of their machine learning approaches. The findings emerging from Table 2 are as follows:
First, representative social media, such as Twitter, Facebook, Instagram, and Weibo, were used as data sources, and most of the text data was in English or Chinese. Second, various features were used as footprints on social media, representing target social profile attributes. They could be grouped into text, images, social relations, and social behaviors, and the type of footprint used depended on the type of data source (e.g., text or social features for Twitter and Weibo, and images for Flickr and Instagram). Third, for the type of machine learning, supervised learning (e.g., classification and regression) was mostly adopted. Prediction techniques varied from traditional machine learning techniques (e.g., support vector machine (SVM) and multilayer perceptron (MLP)) to recent deep learning techniques (e.g., convolutional neural network (CNN) and recurrent neural network (RNN)).
Compared to the previous studies summarized in Table 2, this study used news articles and their comments from the Korean social media website www.naver.com, which has rarely been studied because of the anonymity and incompleteness of its social profile data. Regarding the type of footprint used in social media, this study generated and used word embeddings from text to represent age groups (i.e., the distribution of commenters on a news article by age group). In particular, it used the word2vec approach [31] for word embeddings to overcome difficulties in text representation (e.g., news comments are very short, although their lengths vary, and many of them are grammatically incorrect, either accidentally or intentionally).
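One common way to turn word embeddings into a fixed-length representation of a news article is to average the vectors of the words in its comments. The sketch below illustrates that idea only; the averaging scheme is an assumption and is not stated in the text above, the three-dimensional `TOY_EMBEDDINGS` table is a hypothetical stand-in for vectors learned by word2vec, and whitespace splitting stands in for the tokenization that Korean text would actually require.

```python
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional embeddings standing in for vectors that a
# word2vec model trained on news comments would provide.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "school": [1.0, 0.0, 0.0],
    "exam": [0.8, 0.2, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "tax": [0.0, 0.8, 0.2],
}


def mean_vector(vectors: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Element-wise mean of the vectors, or a zero vector if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def article_vector(
    comments: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the comment vectors; each comment vector averages its known words."""
    comment_vectors = []
    for comment in comments:
        # Unknown tokens (typos, slang, punctuation) are simply skipped.
        word_vectors = [embeddings[t] for t in comment.lower().split() if t in embeddings]
        if word_vectors:
            comment_vectors.append(mean_vector(word_vectors, dim))
    return mean_vector(comment_vectors, dim)


features = article_vector(["School exam", "politics !!", "zzz"], TOY_EMBEDDINGS, dim=3)
```

Averaging per comment first gives short and long comments equal weight, which is one possible way to handle the varied comment lengths mentioned above.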
Regarding the prediction techniques in Table 2, more complicated machine learning models, such as the deep learning models CNN or RNN, may achieve better performance than classical models when massive training datasets are available. However, the scale of a personality dataset is usually small owing to the high cost of collecting sample data, and classical models can achieve comparable results for such a small dataset [26]. Consequently, this study did not involve deep learning models. Instead, it focused on regression tasks for its supervised learning-based approach (i.e., learning from labeled datasets to predict the distribution of age groups for a news article in unlabeled datasets, whose size was much larger than that of the labeled datasets).
Table 2.
Prior studies using a machine learning approach for age predictions of social profiling.
Prior Studies | Social Media Data Source | Social Media Data Language | Types of Footprints Used | Types of Machine Learning | Prediction Techniques |
---|---|---|---|---|---|
Lima and de Castro [20] | Twitter | English | Text, Social | Classification | Naïve Bayes (NB), Support Vector Machine (SVM), Multilayer Perceptron (MLP) |
Wang, Li, Chen, and Li [21] | Sina Weibo | Chinese | Text | Classification, Regression | SVM, CNN, Multi-task Convolutional Neural Network (MTCNN) |
Guimarães, Rosa, Gaetano, Rodríguez, and Bressan [11] | Twitter | English | Text, Social | Classification | MLP, DeepCNN, Decision Tree (DT), Random Forest (RF), SVM |
Segalin, Cheng, and Cristani [17] | Flickr | | Images | Classification, Regression | CNN |
Wang, Ma, and Zhang [22] | Sina Weibo | Chinese | Text | Classification | SVM |
Chen, Cheng, Yang, Liang, Quan, and Li [23] | Sina Weibo | Chinese | Text, Social | Classification, Regression | SVM, MLP, CNN, Long Short-Term Memory (LSTM) |
Fang, Yuan, Lu, and Feng [25] | Flickr | | Images | Classification, Regression | CNN |
López-Santillán, Montes-Y-Gómez, González-Gurrola, Ramírez-Alonso, and Prieto-Ordaz [7] | Social media outlets, blogs, Twitter, and hotel reviews | English, Arabic, Spanish, Portuguese, Dutch, Italian | Text | Classification, Regression | SVM, RF, Extra Trees (ET), k-Nearest Neighbors (k-NN) |
Han, Huang, and Tang [26] | Sina Weibo | Chinese | Text | Classification | Logistic Regression (LR), SVM, RF |
Romanov, Kurtukova, Sobolev, Shelupanov, and Fedotova [27] | vk.com | Russian | Text | Classification | The hybrid of CNN and RNN (CRNN) |
Figueroa, Peralta, and Nicolis [15] | Yahoo! Answers | English | Text, Images, Social | Classification | FastText, CNN, Bidirectional RNN (B-RNN), Attention-based Bidirectional RNN (AB-RNN), Recurrent Convolutional Neural Network (RCNN) |
Kamalesh and B [28] | Facebook, Twitter, Instagram | English | | Classification | Maximum Entropy Classifier (MEC) |
Khorrami, Khorrami, and Farhangi [29] | Instagram | - | Images | Regression | ET, Gradient Boosted Trees (GBT), RF |
Zhou, Zhang, Zhao, and Yang [30] | Facebook | English | Text | Classification | SVM, RF, k-NN, Attention-based Bidirectional LSTM (AB-LSTM) |
This study | naver.com | Korean | Text | Regression | Multiple Linear Regression (MLR), MLP, DT, SVM, k-NN, RF |