Multi-Label Prediction-Based Fuzzy Age Difference Analysis for Social Profiling of Anonymous Social Media

Suh, Jong Hwan

doi:10.3390/app14020790


by

Jong Hwan Suh

Department of Management Information Systems & BERI, Gyeongsang National University, 501 Jinjudae-ro, Jinju-si 52828, Republic of Korea

Appl. Sci. 2024, 14(2), 790; https://doi.org/10.3390/app14020790

Submission received: 14 November 2023 / Revised: 9 January 2024 / Accepted: 15 January 2024 / Published: 17 January 2024


Abstract

Age is an essential piece of demographic information for social profiling, as different social and behavioral characteristics are age-related. To acquire age information, most of the previously conducted social profiling studies have predicted age information. However, age predictions in social profiling have been very limited, because it is difficult or impossible to obtain age information from social media. Moreover, age-prediction results have rarely been used to study human dynamics. In these circumstances, this study focused on naver.com, a nationwide social media website in Korea. Although the social profiles of news commenters on naver.com can be analyzed and used, the age information is incomplete (i.e., partially open to the public) owing to anonymity and privacy protection policies. Therefore, no prior research has used naver.com for age predictions or subsequent analyses based on the predicted age information. To address this research gap, this study proposes a method that uses a machine learning approach to predict the age information of anonymous commenters on unlabeled (i.e., with age information hidden) news articles on naver.com. Furthermore, the predicted age information was fused with the section information of the collected news articles, and fuzzy differences between age groups were analyzed for topics of interest, using the proposed correlation–similarity matrix and fuzzy sets of age differences. Thus, differentiated from the previous social profiling studies, this study expands the literature on social profiling and human dynamics studies. Consequently, it revealed differences between age groups from anonymous and incomplete Korean social media that can help in understanding age differences and ease related intergenerational conflicts to help reach a sustainable South Korea.

Keywords:

age predictions; anonymous news commenters; word embedding; machine learning; age difference analysis; fuzzy sets

1. Introduction

1.1. Background and Purpose

Social media provides a virtual space for sharing thoughts and information, engaging with others, and creating online communities [1,2,3]. It generates large volumes of user-generated data and provides unprecedented opportunities for computational social science researchers [4,5]. Thus, social profiling, which profiles users based on their social media data, has recently gained significant attention for social media studies on human dynamics [6,7]. Among the many social profile attributes available on social media, user demographics include age, gender, race, city, country, and occupation. Additionally, it is essential to understand patterns, differences, and trends through demographic analysis, as demographics usually determine online social and behavioral patterns [3].

Most prior studies that used demographic information have mainly considered age as one of the principal and mandatory variables to be explored in social profiling and human dynamics studies. Behavioral differences in social media are evident among people of different ages [8]. For example, teenagers generally pay less attention to privacy, and tend to share personal information more carelessly on social networks [9]. In contrast, adult users are careful when posting, and pay attention to who can read their comments. Therefore, more frequently, they craft sentences with positive emotions, minimizing negations, and reducing slang usage [10].

In addition, there are typical behaviors among users of the same age group when the same topic is discussed [11]. For example, among the numerous topics that teenagers generally discuss in their daily lives, topics such as relationships, school, and friends are more frequent [12]. Contrastingly, when adults use social media to express their opinions, they often leave their classic identity markers of adulthood, through the discussion of topics such as religion, ideology, politics, and work. Moreover, adult users tend to provide photos, videos, and URLs to complement their opinions [13].

However, prior studies on age information-based social profiling have been restricted because age information is not always available on social media [14]. In other words, most prior age research on social media has either used small datasets for which age information could be collected or relied on a simple but unreliable strategy to extract age information from social media (e.g., using descriptions that contain expressions like “12 years” and “I have 12 years”) [15]. Moreover, social media’s anonymity and privacy policies have made it more challenging to obtain demographic information [16]. Therefore, prior researchers have not even attempted to collect anonymous social media data with incomplete age information (i.e., partially open to the public) for social profiling studies.

Accordingly, acquiring age information on social media has been a concern, and efforts have been made to obtain it [11]. Therefore, most previous studies on age information-based social profiling have focused on age predictions, as shown in Table 1. Nevertheless, challenging issues for age information-based social profiling still exist, as listed below.

First, according to this study’s literature review, most of the existing studies focused only on predictions, and made little effort to link the age prediction results to the analysis of human dynamics using the obtained age information. If more age information is provided through age predictions, more studies on human dynamics need to be considered with the predicted age information.

Second, most of the social media data that prior studies used were in English or Chinese, whereas social profiling has rarely been performed using Korean social media data. However, using social media data in various languages for age analysis will provide a richer understanding of various individuals and societies; therefore, efforts to acquire social media data in diverse languages are required for more effective social profiling studies.

To resolve the abovementioned problems and challenges, this study selected as its focus one of Korea’s major news portal sites, naver.com, and pioneered using it for age information-based social profiling studies. Unlike other sites, naver.com provides age information (i.e., rates of people in their 10s, 20s, 30s, 40s, and ≥50s as age group distributions) of anonymous commenters on news articles. The age information of naver.com is reliable as users sign into the service via real name authentication. Therefore, its age information can be used for age information-based social profiling studies of social media users in Korea. However, to do so, there are still problems to be solved as below.

First, naver.com has a policy of making the age group distribution of anonymous commenters on a news article open to the public only if the number of its direct news comments exceeds a specific number, i.e., 100. (The term “direct news comment” has been used for news comments that replied directly to a news article and to distinguish it from “news comments” on a news article, which included both direct and indirect replies to news comments). In other words, news articles with fewer than 100 direct news comments can be considered as unlabeled news articles. Around 91% of news articles published on naver.com were found to be unlabeled news articles based on the data that this study collected.

Second, as a result, naver.com has not yet been used to predict age information or analyze the social profiles of Korean users by using the age information that has been collected or predicted. Hence, methodologically there are several elements that are unclear: (i) how age information of anonymous commenters on a news article can be represented by extracted features; (ii) which prediction technique can give a better performance at predicting age information using the age information representation; and (iii) how the social profiles of commenters on news articles can be analyzed using the predicted age information.

To address these questions, this study proposed a method for predicting the age group distribution of anonymous news commenters for news articles and then using this predicted age information to investigate and understand differences in topics of interest between age groups as human dynamics. To be specific, it adopted a machine learning approach for predicting the age group distribution of anonymous commenters on unlabeled news articles. In this approach, each news article was represented by textual characteristics based on its comments. The labeled news articles were used to evaluate machine learning techniques for predicting age information, and the best prediction technique was selected and used to perform age information predictions for unlabeled news articles. Consequently, all collected news articles could be labeled by age information. Thereafter, using the section information as a cue for topics of interest to age groups, the fuzzy differences in topics of interest among age groups were explored and compared.

1.2. Reviews on Related Works

User profiling refers to the process of collecting, cleaning, and presenting an individual’s characteristics that are related to demographics and behavior [17]. These attributes often include basic information (e.g., age, gender, location, education, and occupation). Additionally, user profiles can include elements that reflect more complex aspects, such as preferences, interests, behaviors, and personality traits. Recently, user profiling has evolved into social profiling, which leverages social information to generate user profiles (e.g., social actions, such as clicks and likes on social media) [18]. Prior social profiling studies are summarized in Table 1.

As shown in Table 1, social profiling can be classified into individual and group profiling [3]. Individual profiling involves learning about a person based on demographics (e.g., age, gender, and location) and psychographics (e.g., behavior, personality traits, and interests) by directly asking questions or tracking behaviors online and offline. Group profiling is a process used to represent individuals who share common attributes and may or may not be identified as a group. It mainly includes community detection and subsequent analysis of communities. In relation to the taxonomy of social profiling, this study focused on age, one of the demographic characteristics of individual profiling, but extended to the perspective of group profiling by using the distribution of age groups among commenters on a news article.

According to this study’s purpose, prior works related to social profiling can be divided into three categories: predicting social profile attributes (also known as digital footprints) [19], using collected or generated social profiles to analyze human dynamics, and performing both in a sequence. However, most of the prior studies aimed at predicting the attributes of social profiles (e.g., personality traits and demographics). Based on the taxonomies in Table 1, this study’s purpose can be classified as examining both age prediction and human dynamics, using the predicted age information and interests of groups as two social profile attributes. As age differences were analyzed at different topic levels using section information, this study contributes to the related literature.

Table 1. Prior social profiling studies.

Previous Work	Description	Social Profile Attributes (Individual)	Social Profile Attributes (Group)	Purpose (Prediction)	Purpose (Human Dynamics Studies)
Lima and de Castro [20]	Personality traits prediction	Personality traits		√
Segalin, Cheng, and Cristani [17]	Personality traits prediction	Personality traits	Personality traits	√
Wang et al. [21]	Joint gender and age prediction	Gender, Age		√
Guimarães, Rosa, Gaetano, Rodríguez, and Bressan [11]	Age group classification of people in their 10s to adults, analyzing the importance of age information to the precision of a sentiment metric		Age, sentiment, and relationship between age and sentiment	√	√
Wang et al. [22]	Demographics prediction	Gender, Age, and Location		√
Chen et al. [23]	Age prediction	Age		√
Lee and Ryu [24]	Gender difference analysis		Gender, Interest		√
Fang et al. [25]	Prediction of age and gender	Age, Gender		√
López-Santillán, Montes-Y-Gómez, González-Gurrola, Ramírez-Alonso, and Prieto-Ordaz [7]	Prediction of age, gender, and personality traits	Age, Gender, Personality traits		√
Han et al. [26]	Personality traits prediction	Personality traits	Personality traits	√
Romanov et al. [27]	Age Prediction	Age	Age	√
Figueroa, Peralta, and Nicolis [15]	Age Prediction	Age	Age	√
Kamalesh and B [28]	Personality traits prediction	Personality traits		√
Khorrami et al. [29]	Personality traits prediction	Personality traits	Personality	√
Zhou et al. [30]	Personality traits prediction	Personality traits		√
This study	Age prediction and fuzzy age differences in terms of interesting topics		Age, Interest	√	√

Table 1 shows that most of the prior studies focused on predicting social profiles. Table 2 provides additional details on the prior studies in terms of their machine learning approaches. The findings emerging from Table 2 are as follows:

First, representative social media, such as Twitter, Facebook, Instagram, and Weibo, were used as data sources, and most of the text data was in English or Chinese. Second, various features were used as footprints on social media, representing target social profile attributes. They could be grouped into text, images, social relations, and social behaviors, and the type of footprint used depended on the type of data source (e.g., text or social features for Twitter and Weibo, and images for Flickr and Instagram). Third, for the type of machine learning, supervised learning (e.g., classification and regression) was mostly adopted. Prediction techniques varied from traditional machine learning techniques (e.g., support vector machine (SVM) and multilayer perceptron (MLP)) to recent deep learning techniques (e.g., convolutional neural network (CNN) and recurrent neural network (RNN)).

Compared to the previous studies summarized in Table 2, this study used news articles and their comments from the Korean social media website www.naver.com, which has rarely been studied because of anonymity and incompleteness of its social profile data. Regarding the type of footprint used in social media, this study generated and used word embeddings from text to represent age groups (i.e., the distribution of commenters on a news article by age group). In particular, it used the word2vec approach [31] for word embeddings to overcome difficulties in text representation (e.g., news comments are very short, although their lengths are varied, and many of them are grammatically incorrect, either accidentally or intentionally).

Regarding the prediction techniques in Table 2, deep learning models such as CNN or RNN are more complicated machine learning models and may achieve better performance than classical models in the case of massive training datasets. However, the scale of a personality dataset is usually small owing to the high cost of collecting sample data, and classical models can also achieve comparable results for such a small dataset [26]. Consequently, this study did not involve deep learning models. Instead, it focused on regression tasks for its supervised learning-based approach (i.e., predicting the distribution of age groups for a news article in unlabeled datasets, whose size was much larger than labeled datasets).

Table 2. Prior studies using a machine learning approach for age predictions of social profiling.

Prior Studies	Social Media Data Used (Source)	Social Media Data Used (Language)	Types of Footprints Used	Types of Machine Learning	Prediction Techniques
Lima and de Castro [20]	Twitter	English	Text, Social	Classification	Naïve Bayes (NB), Support Vector Machine (SVM), Multilayer Perceptron (MLP)
Wang, Li, Chen, and Li [21]	Sina Weibo	Chinese	Text	Classification, Regression	SVM, CNN, Multi-task Convolutional Neural Network (MTCNN)
Guimarães, Rosa, Gaetano, Rodríguez, and Bressan [11]	Twitter	English	Text, Social	Classification	MLP, DeepCNN, Decision Tree (DT), Random Forest (RF), SVM
Segalin, Cheng, and Cristani [17]	Flickr		Images	Classification, Regression	CNN
Wang, Ma, and Zhang [22]	Sina Weibo	Chinese	Text	Classification	SVM
Chen, Cheng, Yang, Liang, Quan, and Li [23]	Sina Weibo	Chinese	Text, Social	Classification, Regression	SVM, MLP, CNN, Long Short-Term Memory (LSTM)
Fang, Yuan, Lu, and Feng [25]	Flickr		Images	Classification, Regression	CNN
López-Santillán, Montes-Y-Gómez, González-Gurrola, Ramírez-Alonso, and Prieto-Ordaz [7]	Social media outlets, blogs, Twitter, and hotel reviews	English, Arabic, Spanish, Portuguese, Dutch, Italian	Text	Classification, Regression	SVM, RF, Extra Trees (ET), k-Nearest Neighbors (k-NN)
Han, Huang, and Tang [26]	Sina Weibo	Chinese	Text	Classification	Logistic Regression (LR), SVM, RF
Romanov, Kurtukova, Sobolev, Shelupanov, and Fedotova [27]	vk.com	Russian	Text	Classification	The hybrid of CNN and RNN (CRNN)
Figueroa, Peralta, and Nicolis [15]	Yahoo! Answers	English	Text, Images, Social	Classification	FastText, CNN, Bidirectional RNN (B-RNN), Attention-based Bidirectional RNN (AB-RNN), Recurrent Convolutional Neural Network (RCNN)
Kamalesh and B [28]	Facebook, Twitter, Instagram	English		Classification	Maximum Entropy Classifier (MEC)
Khorrami, Khorrami, and Farhangi [29]	Instagram	-	Images	Regression	ET, Gradient Boosted Trees (GBT), RF
Zhou, Zhang, Zhao, and Yang [30]	Facebook	English	Text	Classification	SVM, RF, k-NN, Attention-based Bidirectional LSTM (AB-LSTM)
This study	naver.com	Korean	Text	Regression	Multiple Linear Regression (MLR), MLP, DT, SVM, k-NN, RF

1.3. Organization of This Paper

The rest of this paper comprises three sections. Section 2 describes the research framework, which proposed using a machine learning-based approach for predicting the age group distribution of anonymous commenters on news articles. Section 3 demonstrates the results of applying the research framework to the anonymous Korean news portal site naver.com. Thereafter, it discusses and compares the performance of different prediction techniques. It selected the best prediction technique and used it to predict the age group distribution of anonymous commenters on each unlabeled news article. Moreover, it used the collected or predicted age information to analyze fuzzy differences between age groups with respect to their topics of interest by using the collected section information. Finally, Section 4 concludes the paper by summarizing the results and implications of this study and its limitations for future research.

2. Materials and Methods

This study proposed a machine learning-based methodology for fuzzy age difference analysis of anonymous commenters with news article data, of which age information was mostly unlabeled. Figure 1 summarizes this study’s overall research method, which comprises four steps, and the subsequent subsections explain its details.

2.1. Acquire the News Data from Anonymous Korean Social Media

This study acquired data comprising 167,533 news articles and their comments from 1 October to 30 November 2022, covering all sections and subsections of naver.com, an anonymous Korean news portal site. Only 15,080 (approximately 9.0%) of the acquired news articles were given the age information of their commenters according to the policy of naver.com, wherein age information of a news article is disclosed to the public only if it has more than 100 direct comments. Such articles were considered as labeled news articles, and approximately 86% of the collected comments belonged to the labeled news articles category.

The age information for a news article $n$ represents the age group distribution, which is composed of the rates of five age groups: 10s, 20s, 30s, 40s, and ≥50s. The rate of an age group $g$ was defined as:

$$\mathrm{agerate}_{\mathrm{news}}(n, g) = \text{the rate of news commenters belonging to an age group } g \text{ for a news article } n, \tag{1}$$

where $g$ is an age group, $g \in \{10\mathrm{s}, 20\mathrm{s}, 30\mathrm{s}, 40\mathrm{s}, \geq 50\mathrm{s}\}$. Figure 2 illustrates the distributions of $\mathrm{agerate}_{\mathrm{news}}(n, g)$ given for the labeled news articles by using the kernel density estimation (KDE) plot, and Table 3 provides descriptive statistics on the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ of the labeled news articles.

In addition to age information, this study collected section and subsection information about the news articles of the collected data and denoted a section as $S \in \mathrm{SECTION}$ and a subsection of the section $S$ as $s \in \mathrm{SUBSECTION}(S)$. The section and subsection information about the collected news articles that were merged with age information enabled this study to analyze the differences between age groups with respect to the topics of their interest. Table 4 shows the descriptive statistics in terms of 6 sections and 48 subsections for the labeled news articles and their $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values.

2.2. Represent the Distribution of Age Groups among Commenters on a News Article Using word2vec

This study used the news2vec model to represent a news article n, i.e., the age group distribution of the news article. The news2vec vectors for the collected news articles were generated by following five steps, which can be summarized as: (i) extracting unigrams from news comments; (ii) removing extremely long unigrams; (iii) generating 300-dimensional word2vec embeddings for the unigrams, i.e., unigram2vec; (iv) generating feature vectors for news comments, i.e., comment2vec; and (v) generating feature vectors for news articles, i.e., news2vec.

Here, to generate unigram2vec vectors, a news comment $c$ was considered as a sentence and represented by its unigrams, considered as words in the sentence. Thereafter, it was used as an input to train the word2vec model, using Gensim’s word2vec module with default settings. Moreover, to generate comment2vec vectors, unigram2vec vectors were aggregated for related news comments; the resulting comment2vec vectors were then aggregated for related news articles to generate news2vec vectors. The comment2vec and news2vec are defined as follows:

$$\mathrm{comment2vec}(c) = \frac{1}{n(\mathrm{UNIGRAM}(c))} \sum_{u \in \mathrm{UNIGRAM}(c)} \mathrm{unigram2vec}(u), \tag{2}$$

where $\mathrm{UNIGRAM}(c)$ is a set of unigrams appearing in a news comment $c$ and $n(\cdot)$ denotes the number of elements in a set.

$$\mathrm{news2vec}(n) = \frac{1}{n(\mathrm{COMMENT}(n))} \sum_{c \in \mathrm{COMMENT}(n)} \mathrm{comment2vec}(c), \tag{3}$$

where $\mathrm{COMMENT}(n)$ is a set of news comments on a news article $n$.
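The two averaging steps in Equations (2) and (3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names, the toy two-dimensional embeddings, and the choice to count each distinct unigram once per comment (reading UNIGRAM(c) literally as a set) are ours, and every unigram is assumed to have an embedding.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors (the 1/n * sum in Eqs. (2)-(3))."""
    if not vectors:
        raise ValueError("cannot average an empty collection of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def comment2vec(unigrams: Sequence[str], unigram2vec: Dict[str, Vector]) -> Vector:
    """Eq. (2): average the embeddings of the distinct unigrams in one comment."""
    distinct = sorted(set(unigrams))  # assumption: UNIGRAM(c) is a set, duplicates count once
    return mean_vector([unigram2vec[u] for u in distinct])


def news2vec(comments: Sequence[Sequence[str]], unigram2vec: Dict[str, Vector]) -> Vector:
    """Eq. (3): average the comment2vec vectors of all comments on one article."""
    return mean_vector([comment2vec(c, unigram2vec) for c in comments])


# Toy 2-dimensional embeddings (hypothetical values; the study uses 300 dimensions).
toy_embeddings = {"tax": [1.0, 0.0], "school": [0.0, 1.0], "exam": [0.0, 3.0]}
toy_comments = [["tax", "school"], ["school", "exam"]]
article_vector = news2vec(toy_comments, toy_embeddings)
```

In practice the `toy_embeddings` dictionary would be replaced by the vectors of the trained word2vec model, and unigrams missing from its vocabulary would need to be skipped before averaging.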

In addition, for the 15,080 labeled news articles, each age group rate of a news article $n$, i.e., $\mathrm{agerate}_{\mathrm{news}}(n, g)$, was transformed into log odds to avoid having a negative rate in the prediction, and the log odds was given by

$$\mathrm{agescore}_{\mathrm{news}}(n, g) = \log\left(\frac{\mathrm{agerate}_{\mathrm{news}}(n, g)}{1 - \mathrm{agerate}_{\mathrm{news}}(n, g)}\right). \tag{4}$$

Additionally, for the labeled news article $n$, the $\mathrm{agescore}_{\mathrm{news}}(n, g)$ values of all five age groups were obtained and used as a multi-label target, which was defined as

$$\mathrm{agescore}_{\mathrm{news}}(n) = \begin{bmatrix} \mathrm{agescore}_{\mathrm{news}}(n, 10\mathrm{s}) \\ \vdots \\ \mathrm{agescore}_{\mathrm{news}}(n, \geq 50\mathrm{s}) \end{bmatrix}. \tag{5}$$

Then, the multi-label target, $\mathrm{agescore}_{\mathrm{news}}(n)$, was represented by $\mathrm{news2vec}(n)$ as features. Finally, two types of datasets were prepared for this study: (i) the labeled dataset, which contained 15,080 instances with the news2vec vectors of 300 dimensions as features and a multi-label target, and (ii) the unlabeled dataset, which had unlabeled news articles with only news2vec vectors as features.
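As a small worked illustration of Equations (4) and (5), the following sketch converts one article's age group rates into the five-element log-odds target. The example distribution is hypothetical, and the `eps` clipping of rates equal to 0 or 1 is an assumption of this sketch, since the text does not say how such boundary values were handled.

```python
import math
from typing import Dict, List

AGE_GROUPS = ["10s", "20s", "30s", "40s", ">=50s"]


def age_score(rate: float, eps: float = 1e-6) -> float:
    """Eq. (4): log odds of an age-group rate.

    eps is an assumption of this sketch: it clips rates of exactly 0 or 1,
    for which the log odds are undefined.
    """
    clipped = min(max(rate, eps), 1.0 - eps)
    return math.log(clipped / (1.0 - clipped))


def age_score_vector(rates: Dict[str, float]) -> List[float]:
    """Eq. (5): stack the five log odds into one multi-label target."""
    return [age_score(rates[g]) for g in AGE_GROUPS]


# Hypothetical age-group distribution of one labeled article.
example_rates = {"10s": 0.02, "20s": 0.18, "30s": 0.30, "40s": 0.30, ">=50s": 0.20}
target = age_score_vector(example_rates)
```

Because every rate in the example is below 0.5, every entry of `target` is negative; the regressors are therefore free to predict any real number, and the rate recovered by inverting Equation (4) always lies strictly between 0 and 1.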

2.3. Predict Age Information of the Unlabeled Dataset

This study used a machine learning approach to obtain an age predictor from the labeled dataset and used it to predict the age group distribution of a news article’s commenters in the unlabeled dataset. As described in Figure 3, the approach consists of four steps: (1) the same experiments were performed for different prediction techniques; (2) prediction techniques were evaluated and the best one was identified; (3) the predictor for the unlabeled dataset was trained using the best prediction technique; and (4) using the trained predictor, the multi-label targets of the unlabeled dataset were predicted and normalized to unity. Details are explained in the subsections below.

2.3.1. Perform Experiments of Age Predictions with Labeled Datasets

This study’s multi-label target, $\mathrm{agescore}_{\mathrm{news}}(n)$, was a numerical vector, so it considered machine learning techniques for regression that had been commonly used in prior studies. The six regression models are: Multiple Linear Regression (MLR), Neural Network Regression (NNR) (in this study, NNR is a prediction technique for regression using MLP), Decision Tree Regression (DTR), Support Vector Regression (SVR), k-Nearest Neighbors Regression (k-NNR), and Random Forest Regression (RFR).

Moreover, this study required a regression model that could deal with a multi-label target (i.e., multi-output regression). Out of these six regression models, only NNR and k-NNR could be used directly as prediction techniques for multi-label targets, so this study also adopted a strategy that fits one regressor per target. Specifically, for its multi-label target task, it applied the multi-output regression module in the machine learning package scikit-learn (https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression (accessed on 14 November 2023)) to the six prediction techniques. Therefore, for predicting the $\mathrm{agescore}_{\mathrm{news}}(n)$, this study considered eight prediction techniques in total, which can be grouped into two categories: (1) single output predictors, denoted as NNR^single and k-NNR^single, and (2) multiple output predictors, denoted as MLR^multi, NNR^multi, DTR^multi, SVR^multi, k-NNR^multi, and RFR^multi.
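The distinction between the two categories can be sketched with scikit-learn as follows. This is a minimal illustration, assuming that the "single" variants are one model fitted on the whole target matrix and the "multi" variants are base regressors wrapped in `MultiOutputRegressor`; the synthetic data, the reduced feature dimension, and the hyperparameters are placeholders rather than the study's settings.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))   # stand-in for news2vec features (300-d in the study)
W = rng.normal(size=(8, 5))
Y = X @ W                      # stand-in for the five agescore targets

# "Single" style: one estimator that accepts a 2-D target directly.
knn_single = KNeighborsRegressor(n_neighbors=3).fit(X, Y)

# "Multi" style: the wrapper clones the base estimator and fits one copy per target column.
svr_multi = MultiOutputRegressor(SVR(kernel="linear")).fit(X, Y)

pred_single = knn_single.predict(X[:4])
pred_multi = svr_multi.predict(X[:4])
```

Both predictors return one row per article and one column per age group, so they can be evaluated with the same error measures.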

These eight prediction techniques were used to perform age prediction experiments with the labeled datasets after their hyperparameters were optimized via the grid search method, which used 10-fold cross-validation. Each experiment consisted of 10-fold cross-validation for each regressor, configured with its optimized hyperparameters, and the experiment was repeated 50 times. Here, a different random seed was used for each repetition, but the random seed was kept identical across the prediction techniques within the same repetition [2,32,33,34].
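The seeding scheme described above, in which the seed changes between repetitions but is shared by all techniques within a repetition, can be sketched with the standard library alone. The function names and the `base_seed + repetition` rule are assumptions made for illustration; the text does not state how the seeds were chosen.

```python
import random
from typing import List


def kfold_indices(n_samples: int, n_splits: int, seed: int) -> List[List[int]]:
    """Shuffle the sample indices with a fixed seed and deal them into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def repeated_splits(n_samples: int, n_splits: int = 10, n_repeats: int = 50,
                    base_seed: int = 0) -> List[List[List[int]]]:
    """One 10-fold partition per repetition; repetition r uses seed base_seed + r."""
    return [kfold_indices(n_samples, n_splits, base_seed + r) for r in range(n_repeats)]


# Every prediction technique is evaluated on the same sequence of partitions.
splits_for_technique_a = repeated_splits(n_samples=100)
splits_for_technique_b = repeated_splits(n_samples=100)
```

Sharing the partitions in this way means that, for a given repetition, any difference in error between two techniques cannot be attributed to a different train/test split.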

2.3.2. Evaluate the Performances of Prediction Techniques

To find the best prediction technique, the eight techniques were evaluated by measuring their mean absolute error (MAE) and root mean square error (RMSE) from the experimental results. These performance measures have been previously adopted when target variables had continuous values, and their values are smaller if the prediction error is lower, indicating better performance [32,34]. Hence, the effect of using a prediction technique on both performance measures was statistically investigated, using pairwise t-tests between different prediction techniques. Thereafter, the best prediction technique was selected by referring to the pairwise t-test results.
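The two error measures and a t statistic for comparing two techniques can be written out as below. This is a hedged sketch: the text says only "pairwise t tests", so treating the comparison as a paired test over repetitions (natural when the repetitions share random seeds) is our assumption, and all numbers are made up.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |true - predicted|."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic of the per-repetition differences (assumed paired design)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


example_mae = mae([0.0, 1.0, 2.0], [0.0, 2.0, 4.0])
example_rmse = rmse([0.0, 1.0, 2.0], [0.0, 2.0, 4.0])

# Hypothetical MAE values of two techniques over four shared repetitions.
t_stat = paired_t_statistic([0.30, 0.32, 0.31, 0.29], [0.25, 0.26, 0.27, 0.24])
```

A large positive `t_stat` here would indicate that the first technique has a consistently higher error than the second across repetitions; a p-value would then be read from the t distribution with one fewer degree of freedom than the number of repetitions.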

2.3.3. Train the Age Predictor

In this step, the best prediction technique, identified in the previous subsection, was used to train a prediction model (i.e., the age predictor) from the labeled dataset. Then, the multi-label target of each news article, $\mathrm{agescore}_{\mathrm{news}}(n)$, in the labeled dataset was predicted using the learned age predictor. The predicted $\mathrm{agescore}_{\mathrm{news}}(n, g)$ values of the estimated multi-label target were transformed into the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values of the labeled news article through two substeps: first, applying the inverse of Equation (4), and second, normalizing to unity (i.e., the sum across all age groups of a news article $n$ becomes 1). Lastly, the obtained $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values were compared with the corresponding true values of the labeled dataset.
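The two substeps can be illustrated as follows. Inverting Equation (4) gives the logistic function, $\mathrm{agerate} = 1 / (1 + e^{-\mathrm{agescore}})$, and the normalization divides each recovered rate by their sum. The function names and the predicted scores are hypothetical.

```python
import math
from typing import List


def inverse_age_score(score: float) -> float:
    """Inverse of Eq. (4): map a log-odds score back to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def normalize_to_unity(rates: List[float]) -> List[float]:
    """Rescale the five rates so that they sum to 1 for one news article."""
    total = sum(rates)
    return [r / total for r in rates]


# Hypothetical predicted agescore values for 10s, 20s, 30s, 40s, and >=50s.
predicted_scores = [-3.0, -1.5, -0.8, -0.9, -1.2]
raw_rates = [inverse_age_score(s) for s in predicted_scores]
predicted_rates = normalize_to_unity(raw_rates)
```

The normalization is needed because the five targets are predicted separately, so the rates recovered from them are not guaranteed to sum to 1 on their own.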

2.3.4. Predict and Explore the Age Information of Unlabeled Datasets

Here, the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values of age groups for a news article in the unlabeled dataset, $\mathrm{NEWS}_{\mathrm{Unlabeled}}$, were obtained using the learned age predictor. In this way, this study could deal with the unknown age information problem of anonymous commenters on the collected news articles; therefore, it enabled age difference analysis of all the collected news articles, $\mathrm{NEWS}_{\mathrm{Total}}$.

The prediction results for the unlabeled dataset could not be evaluated by the MAE or RMSE because true target values were unavailable in the unlabeled dataset. Instead, this study explored the prediction results and compared them with the labeled and mixed datasets. Here, the mixed dataset was obtained by combining the labeled and unlabeled datasets. Figure 4 illustrates the three types of datasets.

Specifically, the predicted $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values of the unlabeled news articles were visualized by drawing a histogram, which was compared with the distributions from the labeled and mixed datasets. Next, to investigate differences between these three datasets, this study measured their descriptive statistics, such as mean and standard deviation, as well as normality, and compared them between the three datasets.

Additionally, to compare the three datasets, an investigation was conducted on sections. For this sectional investigation, the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values were averaged for news articles belonging to a section $S$, i.e., $n \in \mathrm{NEWS}_{\mathrm{Section}}(S)$, which was defined as

$$\mathrm{agerate}_{\mathrm{section}}(S, g) = \frac{1}{O(S)} \sum_{n \in \mathrm{NEWS}_{\mathrm{Section}}(S)} \mathrm{agerate}_{\mathrm{news}}(n, g), \tag{6}$$

where $O(S) = n(\mathrm{NEWS}_{\mathrm{Section}}(S))$. The statistic for the investigation with respect to subsections was given by

$$\mathrm{agerate}_{\mathrm{subsection}}(s, g) = \frac{1}{P(s)} \sum_{n \in \mathrm{NEWS}_{\mathrm{Subsection}}(s)} \mathrm{agerate}_{\mathrm{news}}(n, g), \tag{7}$$

where $s$ is a subsection, $\mathrm{NEWS}_{\mathrm{Subsection}}(s)$ is a set of news articles belonging to the subsection $s$, i.e., $n \in \mathrm{NEWS}_{\mathrm{Subsection}}(s)$, and $P(s) = n(\mathrm{NEWS}_{\mathrm{Subsection}}(s))$.
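Equations (6) and (7) are the same group-wise average taken at two levels, which the sketch below makes explicit. The record layout, the section and subsection names, and the restriction to two age groups are assumptions made to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List


def average_agerate(articles: List[dict], level: str) -> Dict[str, Dict[str, float]]:
    """Eqs. (6)-(7): mean agerate_news(n, g) over articles sharing a section or subsection."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for article in articles:
        key = article[level]
        counts[key] += 1  # accumulates O(S) for sections or P(s) for subsections
        for group, rate in article["agerate"].items():
            sums[key][group] += rate
    return {
        key: {group: total / counts[key] for group, total in groups.items()}
        for key, groups in sums.items()
    }


# Hypothetical articles, truncated to two age groups for brevity.
toy_articles = [
    {"section": "Politics", "subsection": "Assembly", "agerate": {"20s": 0.2, "40s": 0.8}},
    {"section": "Politics", "subsection": "Assembly", "agerate": {"20s": 0.4, "40s": 0.6}},
    {"section": "Society", "subsection": "Education", "agerate": {"20s": 0.5, "40s": 0.5}},
]
by_section = average_agerate(toy_articles, "section")
by_subsection = average_agerate(toy_articles, "subsection")
```

Passing `"section"` yields the quantity in Equation (6) and passing `"subsection"` yields the quantity in Equation (7).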

Then, using the $\mathrm{agerate}_{\mathrm{subsection}}(s, g)$ values, the rank of age group $g$ across all five age groups within subsection $s$ was measured for the labeled and mixed datasets. This resulted in two metrics: $\mathrm{rank}_{\mathrm{labeled}}(s, g)$ and $\mathrm{rank}_{\mathrm{mixed}}(s, g)$. Subsequently, these two ranks for subsection $s$ and age group $g$ were considered as data points $(x, y)$ and displayed via a scatter diagram to investigate the difference between the labeled and mixed datasets.
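The rank comparison for a single subsection might look like the sketch below. The rank direction (rank 1 for the highest rate) and the tie-breaking by input order are assumptions, as the text does not specify them, and both sets of rates are invented.

```python
from typing import Dict, Tuple


def rank_within_subsection(rates: Dict[str, float]) -> Dict[str, int]:
    """Rank the age groups of one subsection; rank 1 is the highest rate (assumed)."""
    ordered = sorted(rates, key=rates.get, reverse=True)
    return {group: position + 1 for position, group in enumerate(ordered)}


# Hypothetical agerate_subsection(s, g) values for one subsection s.
labeled_rates = {"10s": 0.02, "20s": 0.15, "30s": 0.28, "40s": 0.33, ">=50s": 0.22}
mixed_rates = {"10s": 0.03, "20s": 0.17, "30s": 0.30, "40s": 0.29, ">=50s": 0.21}

rank_labeled = rank_within_subsection(labeled_rates)
rank_mixed = rank_within_subsection(mixed_rates)

# One scatter point (x, y) = (rank_labeled, rank_mixed) per age group.
points: Dict[str, Tuple[int, int]] = {
    g: (rank_labeled[g], rank_mixed[g]) for g in labeled_rates
}
```

Points on the diagonal, where both ranks agree, indicate age groups whose relative standing in the subsection is unchanged once the predicted labels are added.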

2.4. Discover Fuzzy Age Differences from Anonymous News Comments

In the previous subsection, the unlabeled news articles could be labeled using predictions. In this component, the fuzzy differences between age groups were analyzed with respect to topics of interest by using a mixed dataset, in which all the collected news articles had labels that were collected or predicted. Figure 5 summarizes the overall steps to discover fuzzy differences between age groups through anonymous news comments.

2.4.1. Represent Topics of Interest for Age Groups

To represent topics of interest for age groups, the topic of interest for an age group at the news level was first represented using the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values of the mixed dataset, as given by

$$\mathrm{topic}_{\mathrm{news}}(g) = \begin{bmatrix} \mathrm{agerate}_{\mathrm{news}}(n_{1}, g) \\ \vdots \\ \mathrm{agerate}_{\mathrm{news}}(n_{M}, g) \end{bmatrix}, \tag{8}$$

where $n_{i}$ is a news article in $\mathrm{NEWS}_{\mathrm{Total}}$ and $M$ is the number of news articles in $\mathrm{NEWS}_{\mathrm{Total}}$, i.e., $M = n(\mathrm{NEWS}_{\mathrm{Total}})$. Moreover, using the section information of the collected news articles, topics of interest for an age group at the section level were defined as

$$\mathrm{topic}_{\mathrm{section}}(S, g) = \begin{bmatrix} \mathrm{agerate}_{\mathrm{news}}(n_{1}, g) \\ \vdots \\ \mathrm{agerate}_{\mathrm{news}}(n_{O(S)}, g) \end{bmatrix}, \tag{9}$$

where $n_{i} \in \mathrm{NEWS}_{\mathrm{Section}}(S)$. Similarly, topics of interest for an age group at the subsection level were given by

$$\mathrm{topic}_{\mathrm{subsection}}(s, g) = \begin{bmatrix} \mathrm{agerate}_{\mathrm{news}}(n_{1}, g) \\ \vdots \\ \mathrm{agerate}_{\mathrm{news}}(n_{P(s)}, g) \end{bmatrix}, \tag{10}$$

where $n_{i} \in \mathrm{NEWS}_{\mathrm{Subsection}}(s)$.
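The three topic vectors in Equations (8)–(10) differ only in which news articles are selected, which the following sketch makes explicit. The article layout (keys `section`, `subsection`, `agerate`) and the example values are assumptions made for illustration.

```python
def topic_vector(articles, group, section=None, subsection=None):
    """Collect agerate_news(n_i, g) over the selected articles, in input order.

    With no filter this mirrors Eq. (8); with `section` it mirrors Eq. (9);
    with `subsection` it mirrors Eq. (10).
    """
    selected = [
        article for article in articles
        if (section is None or article["section"] == section)
        and (subsection is None or article["subsection"] == subsection)
    ]
    return [article["agerate"][group] for article in selected]


# Made-up example values, for illustration only.
articles = [
    {"section": "World", "subsection": "Asia", "agerate": {"10s": 0.1, "20s": 0.4}},
    {"section": "World", "subsection": "Europe", "agerate": {"10s": 0.2, "20s": 0.3}},
    {"section": "Economy", "subsection": "Real Estate", "agerate": {"10s": 0.0, "20s": 0.1}},
]
topic_news_10s = topic_vector(articles, "10s")
topic_world_20s = topic_vector(articles, "20s", section="World")
```

Because every age group is evaluated over the same ordered list of articles, the vectors of two age groups at the same level always have equal length and aligned entries, which is what the later similarity and correlation measures require.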

These representations of an age group’s topics of interest were used to analyze the differences between the five age groups of anonymous news commenters. The age difference analysis was conducted at various levels, from news articles to subsections, and led to a correlation–similarity matrix after recognizing similarities and correlations between the age groups. Then, based on the results of the analyses, the degree of uncertainty of the age difference was quantified by adopting the concept of fuzzy sets.

2.4.2. Use Similarities between Age Groups for Age Difference Analysis

In this step, the age groups were first paired, resulting in ₅C₂ = 10 age group pairs, and each age group pair was used to represent the relationship between the two age groups. Then, the cosine similarity of each age group pair was measured to investigate similarities between the two age groups. Specifically, the similarity between two age groups $g_{i}$ and $g_{j}$ was defined as

$$\mathrm{similarity}_{\mathrm{news}}(g_{i}, g_{j}) = \frac{\mathrm{topic}_{\mathrm{news}}(g_{i}) \cdot \mathrm{topic}_{\mathrm{news}}(g_{j})}{\lVert \mathrm{topic}_{\mathrm{news}}(g_{i}) \rVert \, \lVert \mathrm{topic}_{\mathrm{news}}(g_{j}) \rVert}. \tag{11}$$

Next, the similarity between the two age groups $g_{i}$ and $g_{j}$ was measured at the section and subsection levels as

$$\mathrm{similarity}_{\mathrm{section}}(S, g_{i}, g_{j}) = \frac{\mathrm{topic}_{\mathrm{section}}(S, g_{i}) \cdot \mathrm{topic}_{\mathrm{section}}(S, g_{j})}{\lVert \mathrm{topic}_{\mathrm{section}}(S, g_{i}) \rVert \, \lVert \mathrm{topic}_{\mathrm{section}}(S, g_{j}) \rVert}, \tag{12}$$

$$\mathrm{similarity}_{\mathrm{subsection}}(s, g_{i}, g_{j}) = \frac{\mathrm{topic}_{\mathrm{subsection}}(s, g_{i}) \cdot \mathrm{topic}_{\mathrm{subsection}}(s, g_{j})}{\lVert \mathrm{topic}_{\mathrm{subsection}}(s, g_{i}) \rVert \, \lVert \mathrm{topic}_{\mathrm{subsection}}(s, g_{j}) \rVert}. \tag{13}$$
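The pairing and the cosine similarity of Equations (11)–(13) can be illustrated as follows. The function name is hypothetical, and raising an error for a zero vector is a design choice of this sketch, not something specified in the study.

```python
import math
from itertools import combinations

AGE_GROUPS = ["10s", "20s", "30s", "40s", ">=50s"]

# All unordered pairs of the five age groups: C(5, 2) = 10.
AGE_GROUP_PAIRS = list(combinations(AGE_GROUPS, 2))


def cosine_similarity(u, v):
    """Cosine similarity between two topic vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Since the entries of a topic vector are rates and therefore non-negative, the cosine similarity of two topic vectors always lies between 0 and 1; it cannot indicate an opposing relationship, which is one reason to complement it with a correlation measure.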

Using the $\mathrm{similarity}_{\mathrm{news}}(g_{i}, g_{j})$ values, the most and least similar age group pairs were identified and used to examine the overall differences between age groups. Moreover, from the $\mathrm{similarity}_{\mathrm{section}}(S, g_{i}, g_{j})$ values, the similarity within a section was investigated for each age group pair and compared with other sections. In terms of subsections, the top 5 and bottom 5 subsections were explored for each age group pair based on the $\mathrm{similarity}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ values.

2.4.3. Use Correlations between Age Groups for Age Difference Analysis

Correlations between age groups were assessed using the Pearson correlation coefficient. As with the similarity, the correlation coefficient between the two age groups in each age group pair was obtained, i.e., $\mathrm{correlation}(\mathrm{topic}(g_{i}), \mathrm{topic}(g_{j}))$. For the different levels, i.e., news, sections, and subsections, three types of correlation coefficients were obtained as:

$$\mathrm{correlation}_{\mathrm{news}}(g_{i}, g_{j}) = \mathrm{correlation}(\mathrm{topic}_{\mathrm{news}}(g_{i}), \mathrm{topic}_{\mathrm{news}}(g_{j})), \tag{14}$$

$$\mathrm{correlation}_{\mathrm{section}}(S, g_{i}, g_{j}) = \mathrm{correlation}(\mathrm{topic}_{\mathrm{section}}(S, g_{i}), \mathrm{topic}_{\mathrm{section}}(S, g_{j})), \tag{15}$$

$$\mathrm{correlation}_{\mathrm{subsection}}(s, g_{i}, g_{j}) = \mathrm{correlation}(\mathrm{topic}_{\mathrm{subsection}}(s, g_{i}), \mathrm{topic}_{\mathrm{subsection}}(s, g_{j})). \tag{16}$$
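For reference, the Pearson correlation coefficient used in Equations (14)–(16) can be computed as in the sketch below. The function name is hypothetical, and the error raised for constant vectors is a choice of this example.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two topic vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length with at least 2 entries")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]  # mean-centered entries
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_x * var_y)
```

The Pearson coefficient equals the cosine similarity of the mean-centered vectors, so unlike the cosine similarity of the raw rates it can be negative. Two age groups can therefore have a fairly high similarity and still a negative correlation, when one group's share tends to rise on articles where the other's falls.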

Based on the $\mathrm{correlation}_{\mathrm{news}}(g_{i}, g_{j})$ values, the most and least correlated age groups were identified. Similarly, the most and least correlated sections for each age group pair were analyzed using $\mathrm{correlation}_{\mathrm{section}}(S, g_{i}, g_{j})$, and both the top and bottom 5 subsections for each age group pair were investigated in terms of $\mathrm{correlation}_{\mathrm{subsection}}(s, g_{i}, g_{j})$.

2.4.4. Integrate Both Similarity and Correlation for Age Difference Analysis

This step combined the similarity and correlation perspectives within a subsection to evaluate an age group pair by building a correlation–similarity matrix, modified from business portfolio models. First, each age group pair was mapped onto the x-y plane according to its similarity and correlation values within a subsection. For example, as shown in Figure 6, a pair of two age groups $(g_{i}, g_{j})$ within a subsection $s$ was positioned as a blue data point in the x-y plane by setting $\mathrm{similarity}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ on the x-axis and $\mathrm{correlation}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ on the y-axis.

Next, the boundary values that divide the ranges of the $\mathrm{similarity}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ and $\mathrm{correlation}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ values into high and low categories, i.e., $\mathrm{similarity}^{*}$ and $\mathrm{correlation}^{*}$, were calculated by averaging the similarities and correlations between all age group pairs. Thereafter, using the $\mathrm{correlation}^{*}$ and $\mathrm{similarity}^{*}$ values, the x-y plane in the matrix was divided into four areas, which were classified into three types, as shown in Figure 6: (i) area ① contains subsections considered as topics with relatively low differences between age groups; (ii) area ② contains subsections whose differences between age groups are neither relatively high nor relatively low; and (iii) area ③ includes subsections considered as topics with relatively high differences between age groups.
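The division of the plane can be sketched as below. Figure 6 is not reproduced here, so the quadrant-to-area mapping is an assumption of this example: high similarity with high correlation is taken as area ①, low similarity with low correlation as area ③, and the two mixed quadrants as area ②. Treating values equal to a boundary as "high" and averaging over all (subsection, pair) points are likewise assumptions, and all names are hypothetical.

```python
from statistics import mean


def classify_area(similarity, correlation, similarity_star, correlation_star):
    """Return the assumed area type (1, 2, or 3) of one data point."""
    high_similarity = similarity >= similarity_star
    high_correlation = correlation >= correlation_star
    if high_similarity and high_correlation:
        return 1  # relatively low age difference (assumed quadrant)
    if not high_similarity and not high_correlation:
        return 3  # relatively high age difference (assumed quadrant)
    return 2      # neither relatively high nor relatively low


def build_matrix(points):
    """Classify every point of a correlation-similarity matrix.

    `points` maps a (subsection, age_group_pair) key to a tuple
    (similarity, correlation); the boundaries are the overall means.
    """
    similarity_star = mean(sim for sim, _ in points.values())
    correlation_star = mean(corr for _, corr in points.values())
    return {
        key: classify_area(sim, corr, similarity_star, correlation_star)
        for key, (sim, corr) in points.items()
    }
```

Using the means as boundaries makes the classification relative to the dataset at hand: a pair is "high" or "low" only in comparison with the other pairs, which matches the "relatively high/low" wording used for the three area types.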

2.4.5. Quantify the Uncertain Degree of Age Difference with Fuzzy Sets

In the proposed correlation–similarity matrix, the subsections for an age group pair could be spread over two or more areas, which made it difficult to clearly define the difference between the two age groups in the pair as one of the three area types. To handle this uncertainty in the relationship between two age groups, this study adopted the concept of a fuzzy set and accordingly defined three fuzzy sets of age difference: (i) $D_{\mathrm{High}}$, which included the age group pairs related to subsections in area ③; (ii) $D_{\mathrm{Middle}}$, whose elements comprised the age group pairs related to subsections in area ②; and (iii) $D_{\mathrm{Low}}$, which contained the age group pairs of subsections in area ①.

In addition, the membership value of an age group pair in a fuzzy set, $\mu_{D_{q}}(g_{i}, g_{j})$, was obtained by calculating the ratio of the pair's subsections that fall in the fuzzy set's area. For example, as illustrated in Figure 6, if 36 of the 48 subsections of an age group pair $(g_{i}, g_{j})$ are positioned in area ③, then the ratio 36/48 is the degree of membership of the age group pair in the fuzzy set $D_{\mathrm{High}}$, i.e., $\mu_{D_{\mathrm{High}}}(g_{i}, g_{j}) = 0.7500$. Likewise, the degree of membership of the age group pair in the other fuzzy sets can be calculated as 4/48 for $D_{\mathrm{Middle}}$, i.e., $\mu_{D_{\mathrm{Middle}}}(g_{i}, g_{j}) = 0.0833$, and 8/48 for $D_{\mathrm{Low}}$, i.e., $\mu_{D_{\mathrm{Low}}}(g_{i}, g_{j}) = 0.1667$. Thus, the degree of membership of an age group pair in a fuzzy set was interpreted as the probability that the corresponding degree of age difference exists, and the uncertainty of differences between age groups could be quantified.
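The membership calculation in this example can be reproduced with a short sketch. The function name and the encoding of the areas as the integers 1, 2, and 3 are assumptions of this illustration.

```python
from collections import Counter

# Area type of the correlation-similarity matrix -> fuzzy set of age difference.
FUZZY_SET_BY_AREA = {1: "Low", 2: "Middle", 3: "High"}


def membership(areas):
    """Degrees of membership of one age group pair in the three fuzzy sets.

    `areas` lists the area type (1, 2, or 3) of each subsection of the pair.
    """
    if not areas:
        raise ValueError("at least one subsection is required")
    counts = Counter(areas)
    return {
        name: counts[area] / len(areas)
        for area, name in FUZZY_SET_BY_AREA.items()
    }


# The worked example from the text: 36, 4, and 8 of 48 subsections.
example = membership([3] * 36 + [2] * 4 + [1] * 8)
```

Because every subsection falls in exactly one area, the three membership values of a pair always sum to 1, which is what allows them to be read as probabilities.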

3. Results and Discussions

3.1. Evaluation of Results of Prediction Techniques

For each prediction technique, a 10-fold cross-validation using the labeled dataset was repeated 50 times, and the experimental results were evaluated using two performance measures: MAE and RMSE. Table 5 reports the descriptive statistics of the evaluation results of the repeated experiments and shows that SVR^multi was the best among the eight prediction techniques with respect to both performance metrics.
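As a reminder of what the two measures capture, the sketch below computes MAE and RMSE for multi-label targets. Pooling the errors over all news articles and all age groups is an assumption of this example; the study may aggregate them differently.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error over all articles and all age-group labels."""
    errors = [
        abs(t - p)
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    ]
    return sum(errors) / len(errors)


def rmse(y_true, y_pred):
    """Root mean squared error over all articles and all age-group labels."""
    squared = [
        (t - p) ** 2
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

RMSE penalizes large individual errors more heavily than MAE, so a technique that is best on both measures, as reported for SVR^multi, is better on average and also less prone to occasional large misses.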

In addition, Table 6 shows the results of pairwise t tests between the eight prediction techniques, performed to statistically investigate the effect of using a prediction technique on performance measures. The hypotheses in Table 6 were generated to represent comparisons between different prediction techniques. For example, a hypothesis “A > B” implies that the prediction technique A has a bigger prediction error than B; therefore, B is a better prediction technique.

The comparison results in Table 6 were interpreted and used to determine whether the corresponding hypotheses were supported by the repeated experiments in terms of both performance measures. For example, the first hypothesis in Table 6, i.e., “NNR^single > k-NNR^single”, is supported by the repeated experiments because its t-value > 0 and p-value < α, and it also implies that NNR^single has a bigger prediction error; therefore, k-NNR^single is a better prediction technique. On the other hand, if t-value < 0 and p-value < α, it means that the opposite hypothesis, i.e., “NNR^single < k-NNR^single”, is supported, and it indicates that NNR^single is better.

Consequently, based on the performance differences that were found statistically significant at a significance level of 0.05, the eight prediction techniques could be arranged in descending order of either MAE or RMSE, as “MLR^multi > DTR^multi > NNR^single > k-NNR^multi > k-NNR^single = NNR^multi = RFR^multi > SVR^multi”. Additionally, given that the smallest values of the performance metrics indicate the best prediction technique, it is statistically evident from Table 6 that SVR^multi was the best prediction technique for this study. Moreover, all the hypotheses related to SVR^multi, which are highlighted in red font in Table 6, supported the superiority of SVR^multi. Therefore, SVR^multi was selected to train an age predictor with the labeled dataset to predict the age group distribution of anonymous commenters on unlabeled news articles.

3.2. Results of Age Prediction for Labeled Datasets and Comparison

In this step, the best prediction technique identified in the previous subsection was used to train a prediction model from the labeled dataset, i.e., the age predictor. Thereafter, the multi-label target of each news article, $\mathrm{agescore}_{\mathrm{news}}(n)$, in the labeled dataset was predicted using the age predictor, and its $\mathrm{agescore}_{\mathrm{news}}(n, g)$ values were transformed into the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values of the labeled news article.

Using SVR^multi as the best prediction technique, the age group distribution of commenters was predicted for news articles in the labeled dataset. Figure 7 describes the distribution of the $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values obtained from the $\mathrm{agescore}_{\mathrm{news}}(n, g)$ values predicted by the learned SVR^multi. When compared, the predicted $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values for the labeled news articles had almost the same distribution as their corresponding true $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values.

3.3. Results of Predicting Age Information for Unlabeled Datasets and Exploration

The unknown $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values for news articles in the unlabeled dataset were estimated using the age predictor, learned from the labeled dataset with the selected best prediction technique, i.e., SVR^multi. Figure 8 shows the distribution of the predicted $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values for the unlabeled news articles and its comparison with the labeled and mixed datasets. Table 7 shows more detailed comparison results across the three types of $\mathrm{agerate}_{\mathrm{news}}(n, g)$ values. Figure 8 implies that the age group distributions of the unlabeled and labeled news articles are different. Moreover, when they were mixed, the distribution of the mixed values was closer to that of the predicted values of the unlabeled news articles than to that of the true values of the labeled news articles.

Table 8 contains descriptive statistics of the three types of $\mathrm{agerate}_{\mathrm{section}}(S, g)$. Furthermore, the Z-test results indicated whether the differences between the true values of the labeled dataset and the values of the mixed dataset were statistically significant when they were compared in terms of sections. Only the 40s age group in the “IT/Science” section had an insignificant Z-test result (highlighted in red font in Table 8), meaning that the hypothesis of equal population means for the labeled and mixed datasets could not be rejected in that case.
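The exact form of the Z-test is not spelled out in this section. As a rough illustration only, the sketch below assumes a two-sided, two-sample Z-test on the means of two datasets, computed from their summary statistics; the function name and the input values in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_z_test(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Two-sided two-sample Z-test for a difference in means.

    Returns the z statistic and the two-sided p-value, treating the sample
    standard deviations as known (reasonable for large samples).
    """
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    z = (mean_a - mean_b) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up summary statistics, for illustration only.
z, p_value = two_sample_z_test(0.30, 0.10, 1000, 0.25, 0.10, 1000)
significant = p_value < 0.05
```

With large numbers of news articles, even small differences in means yield small p-values, which is consistent with nearly all section and age group combinations being significant and motivates the complementary rank-based comparison.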

In addition, Figure 9 shows scatter diagrams of the relationships between the labeled and mixed datasets when the proposed ranks, $\mathrm{rank}_{\mathrm{labeled}}(s, g)$ and $\mathrm{rank}_{\mathrm{mixed}}(s, g)$, were used to compare the two datasets. Taken together, the scatter diagrams indicated that although Table 8 showed differences between the labeled and mixed datasets, similarities could be found between the two datasets when the proposed ranks were used to compare them. Thus, the age prediction results for the unlabeled and labeled datasets could be considered similar from a different perspective (i.e., other than ordinary statistical tests).

3.4. Results of Fuzzy Age Difference Analysis

3.4.1. On the Similarities between Age Groups

Figure 10 describes the obtained similarities between age groups, i.e., $\mathrm{similarity}_{\mathrm{news}}(g_{i}, g_{j})$, and shows that each age group was most similar to an adjacent age group (i.e., 20s were most similar to 10s, 30s to 20s, 20s to 30s, 30s to 40s, and 40s to ≥50s).

Table 9 shows similarities within a section measured for each age group pair, with their maximum and minimum values for similarities highlighted in red and blue fonts, respectively. For example, (10s, 20s) was the most similar in the “World” section, but least similar in the “Economy” section. Overall, (20s, 30s) had the greatest similarity in the “World” section, while (10s, ≥50s) had the lowest similarity in the “Economy” section. The largest and smallest values of all age group pairs are marked in bold in Table 9.

The similarities in Table 9 were investigated within each section and they demonstrated that four sections, “Politics,” “Lifestyle/Culture,” “World,” and “IT/Science” had the greatest similarity in (20s, 30s). In comparison, the other two sections “Economy” and “Society” had the greatest similarity in (30s, 40s). However, all sections had the lowest similarity in (10s, ≥50s). These results indicate that age differences in terms of topics of interest were small in (20s, 30s) and (30s, 40s), whereas they were large in (10s, ≥50s).

A more thorough analysis of the similarity of topics of interest between age groups was performed using the obtained $\mathrm{similarity}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ values. Its results are shown in Supplementary A. To summarize the results, both the top and bottom five subsections for $\mathrm{similarity}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ were identified for each age group pair, as shown in Table 10. The data in Table 10 can also help to examine and discuss the more detailed reasons for the similarities and differences in topics of interest revealed between the age groups. For example, according to Table 9, overall, the greatest similarity in topics of interest existed in (20s, 30s) in the “World” section, but a closer look at Table 10 revealed that the two age groups had the greatest similarity in the “Weather” subsection of the “Lifestyle/Culture” section. Moreover, Table 9 shows that overall, (10s, ≥50s) showed the lowest similarity relating to topics of interest in the “Economy” section. However, based on Table 10, the “Society General” subsection of the “Society” section had the lowest similarity relating to topics of interest for (10s, ≥50s).

3.4.2. On the Correlations between Age Groups

Regarding correlations between age groups, Figure 11 shows the overall correlations between age groups based on the measured $\mathrm{correlation}_{\mathrm{news}}(g_{i}, g_{j})$ values. Overall, the correlation between 10s and 20s was the largest, whereas 30s and ≥50s showed the smallest correlation. Considering all age group pairs, the most correlated age group for each age group was as follows: 20s to 10s, 10s to 20s, 40s to 30s, ≥50s to 40s, and 40s to ≥50s. This means that the correlation was largest between adjacent age groups, which agrees with the results in Figure 10, where the similarity between adjacent age groups was the highest.

Furthermore, Figure 12 shows the correlation within a section measured for each age group pair. Figure 12 indicates that the correlation regarding topics of interest differed between age group pairs. The results in all sections are as follows. Overall, (10s, 20s) showed the greatest positive correlations in all sections. Among them, the correlation was the largest in the “Society” section. Similarly, all sections showed positive correlations for (20s, 30s), although the correlations were smaller than those for (10s, 20s). In contrast, (30s, ≥50s) had the greatest negative correlations across all sections; in particular, in the “Politics” section, (30s, ≥50s) showed the lowest correlation. These findings, based on the $\mathrm{correlation}_{\mathrm{section}}(S, g_{i}, g_{j})$ values, imply that the age difference in terms of topic correlations is small for (10s, 20s) and (20s, 30s), but large for (30s, ≥50s). Thus, the small or large topic similarities between age groups identified above could have contributed to conflicts and social problems between different age groups.

In addition, the tables in Supplementary B show the results when the correlation between age groups was analyzed more closely using the $\mathrm{correlation}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ values. The results can be condensed as shown in Table 11, which lists the top five and bottom five subsections in terms of $\mathrm{correlation}_{\mathrm{subsection}}(s, g_{i}, g_{j})$ for each age group pair. Table 11 allows more detailed reasons for the correlations regarding topics of interest between age groups to be examined and discussed. For example, the correlation between 10s and 20s in the “Society” section was the largest of all, as revealed in Figure 12, but when investigated more closely, Table 11 showed that the two age groups had the largest correlation in the “Real Estate” subsection, belonging to the “Economy” section. Moreover, (30s, ≥50s) in the “Politics” section showed the lowest correlation among all age group pairs in Figure 12, and Table 11 showed that the corresponding subsection within the “Politics” section was the “President’s Office” subsection.

3.4.3. On the Quantification of Age Difference with Fuzzy Sets

Figure 13 illustrates a correlation–similarity matrix, which mapped each age group pair in a subsection using both similarity and correlation values. Using the correlation–similarity matrix, the uncertain differences between age groups could be roughly classified into three areas. For instance, Figure 13 shows that (10s, 20s) belonged to area ①, implying that they were relatively close to each other. Conversely, (20s, ≥50s) were classified into area ③, showing a relatively large age difference. Lastly, (10s, 30s) were classified into area ②, which indicates that their difference was neither relatively high nor relatively low.

Using the results of the correlation–similarity matrix, this study compiled Table 12, which shows the membership of each age group pair in the three fuzzy sets of age differences. Using Table 12, the difference between the age groups could be measured by considering the uncertainty in their relationships, and the following results were obtained. In the case of (10s, 20s), the difference between them was not considered large, as they belonged to the fuzzy set $D_{\mathrm{Low}}$ with a 100% degree of membership, i.e., $\mu_{D_{\mathrm{Low}}}(10\mathrm{s}, 20\mathrm{s}) = 1.0$. However, judging from the large degree of membership in the fuzzy set $D_{\mathrm{High}}$, 10s had clear differences from the other age groups, i.e., 40s and ≥50s. In addition, the results $\mu_{D_{\mathrm{Low}}}(20\mathrm{s}, 30\mathrm{s}) = 1.0$ and $\mu_{D_{\mathrm{High}}}(20\mathrm{s}, {\geq}50\mathrm{s}) = 1.0$ show that there was little difference between the 20s and 30s, while there was a clear difference between the 20s and ≥50s. The difference between the 30s and ≥50s was evident, like that of the 20s and ≥50s, but no difference was found between the 40s and ≥50s, according to $\mu_{D_{\mathrm{Low}}}(40\mathrm{s}, {\geq}50\mathrm{s}) = 1.0$.

4. Conclusions

Social profiling research requires demographic information, of which age information is essential. Previous studies have revealed that differences exist in social and behavioral characteristics according to age. It has also been found that such age differences are prominent on specific topics. Therefore, to acquire such age information for social profiling, prior studies have primarily focused on predicting age information on social media. However, age predictions on social media have been limited because it is difficult or impossible to obtain age information from social media owing to its anonymity and privacy policies. Moreover, these predictions have focused mainly on English and Chinese social media and are yet to be connected to studying human dynamics. Furthermore, for the kind of data acquired in this study, how age information should be represented and best predicted had not been clearly defined. To address these problems, this study proposed a method that predicts age information from naver.com after selecting the best prediction technique and used the predicted age information for difference analysis between age groups regarding topics of interest.

Regarding the machine learning prediction techniques, this study evaluated and compared eight models using the labeled dataset. Consequently, the multi-label regression of SVM, i.e., SVR^multi, gave the best results for age prediction with labeled data, outperforming the other prediction techniques in a statistically significant manner. Hence, it was used to predict the age information of the unlabeled data, which were combined with the labeled dataset, resulting in a mixed dataset. Subsequently, when general statistical approaches were employed to compare the two datasets, this study found differences between the labeled and mixed datasets. However, similarities existed between the two datasets when they were compared using a different method (i.e., ranks for an age group $g$ within a subsection $s$). Thus, even though the age information of the labeled and mixed datasets could not be considered strictly identical, the age prediction results of the unlabeled dataset still need to be included and used for social profiling. Moreover, using the predicted age information, this study performed a fuzzy difference analysis between age groups with respect to their topics of interest, using the section information of the collected news articles. The highlights of the obtained difference analysis results are as follows. The measured similarities between age groups showed that the age difference in terms of topics of interest was small in (20s, 30s) and (30s, 40s), whereas it was large in (10s, ≥50s). Regarding correlations between age groups, this study also showed that the difference in terms of topic correlations was small in (10s, 20s) and (20s, 30s), but large in (30s, ≥50s). When uncertainty was considered using the correlation–similarity matrix and the fuzzy sets of age difference, the difference in (10s, 20s), (20s, 30s), and (40s, ≥50s) was shown to be small, while (10s, 40s), (10s, ≥50s), (20s, ≥50s), and (30s, ≥50s) showed obvious distinctions between age groups.

As such, this study was able to uncover differences in topics of interest between age groups which may have caused intergenerational conflicts and social problems between age groups in Korean society. However, it is necessary to be cautious in determining differences between generations based on the results of this study, and more in-depth studies should be conducted. This is because different conclusions can be reached depending on the type or quality of data acquired. In addition, this study assumed that people in the same age group have similar topics of interest due to the nature of the data used, but it should be remembered that even people in the same age group can have different thoughts and preferences. However, data on individual age information is difficult to acquire. Moreover, the topics of interest for an age group may change over time and vary according to different events. To overcome these limitations of this study, further research needs to be undertaken.

Furthermore, this study used word embeddings to represent the distribution of age groups across commenters on a news article. Future research can investigate methods to measure and use other features (e.g., psychological/cognitive and social/behavioral characteristics of news commenters as a group and their changes over time). The usefulness of such features for the age prediction of commenters on a news article can be examined because different features can result in different prediction performances. Additionally, future studies could explore a more sophisticated topic analysis for identifying topics in news articles and finding differences in mindsets between age groups. While this study used predicted age information of anonymous commenters on unlabeled news articles, a more advanced exploration of human dynamics using predicted age information needs to be undertaken in the future.

This study makes valuable contributions, summarized as follows: First, unlike previous works, it introduced naver.com, which contains anonymous news comments written mainly in Korean, and thus, it could extend social profiling literature in terms of language diversity. It can, therefore, be referred to as a new data provider for future age information-based social profiling studies on Korean social media. Second, it contributed to social profiling literature by providing age information-based social profiling studies with incomplete age information for anonymous social media. Moreover, the proposed machine learning approach can be applied to deal with the incompleteness of other social profile attributes and will provide new opportunities for social profiling studies. Third, by proposing a correlation–similarity matrix and fuzzy sets of age differences, this study was able to capture uncertain differences between age groups from the integrated viewpoint of similarities and correlations on their topics of interest. Eventually, it will be helpful for measuring and monitoring intergenerational conflicts and solving such problems for sustainable societies.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14020790/s1, Supplementary A. Similarities between Age Groups in Subsections. Supplementary B. Correlations between Age Groups in Subsections.

Funding

This study was supported by National Research Foundation of Korea Grant, funded by the Korean Government (No. 2021R1F1A1063681). The APC was funded by the Gyeongsang National University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from www.naver.com and are available from the author with the permission of www.naver.com.

Conflicts of Interest

The author declares no conflicts of interest.

Fang, J.; Yuan, Y.; Lu, X.; Feng, Y. Muti-stage learning for gender and age prediction. Neurocomputing 2019, 334, 114–124. [Google Scholar] [CrossRef]
Han, S.; Huang, H.; Tang, Y. Knowledge of words: An interpretable approach for personality recognition from social media. Knowl.-Based Syst. 2020, 194, 105550. [Google Scholar] [CrossRef]
Romanov, A.S.; Kurtukova, A.V.; Sobolev, A.A.; Shelupanov, A.A.; Fedotova, A.M. Determining the Age of the Author of the Text Based on Deep Neural Network Models. Information 2020, 11, 589. [Google Scholar] [CrossRef]
Kamalesh, M.D.; Bharathi, B. Personality prediction model for social media using machine learning Technique. Comput. Electr. Eng. 2022, 100, 107852. [Google Scholar] [CrossRef]
Khorrami, M.; Khorrami, M.; Farhangi, F. Evaluation of tree-based ensemble algorithms for predicting the big five personality traits based on social media photos: Evidence from an Iranian sample. Personal. Individ. Differ. 2022, 188, 111479. [Google Scholar] [CrossRef]
Zhou, L.; Zhang, Z.; Zhao, L.; Yang, P. Attention-based BiLSTM models for personality recognition from user-generated content. Inf. Sci. 2022, 596, 460–471. [Google Scholar] [CrossRef]
García-Díaz, J.A.; Cánovas-García, M.; Colomo-Palacios, R.; Valencia-García, R. Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Gener. Comput. Syst. 2021, 114, 506–518. [Google Scholar] [CrossRef]
Choi, B.; Suh, J.H. Forecasting Spare Parts Demand of Military Aircraft: Comparisons of Data Mining Techniques and Managerial Features from the Case of South Korea. Sustainability 2020, 12, 6045. [Google Scholar] [CrossRef]
Suh, J.H. Comparing writing style feature-based classification methods for estimating user reputations in social media. SpringerPlus 2016, 5, 261. [Google Scholar] [CrossRef]
Suh, J.H. Machine-Learning-Based Gender Distribution Prediction from Anonymous News Comments: The Case of Korean News Portal. Sustainability 2022, 14, 9939. [Google Scholar] [CrossRef]

Figure 1. Research framework proposed by this study.

Figure 2. The distributions of $agerate_{news}(n, g)$ for the labeled news articles.

Figure 3. The four steps of a machine learning approach to predict age information of the unlabeled dataset.

Figure 4. An illustration of how to obtain the mixed dataset.

Figure 5. Steps to discover fuzzy age differences from anonymous news comments.

Figure 6. An example of the correlation–similarity matrix. * indicates that it is an average value.

Figure 7. Histogram comparisons between the true and predicted values of $agerate_{news}(n, g)$ for the labeled news articles.

Figure 8. Kernel density estimation plots of the distribution of $agerate_{news}(n, g)$ for the different datasets.

Figure 9. Scatter diagrams of the relationships between the labeled and mixed datasets in terms of the two measured ranks. Red lines indicate where the two measured ranks are equal, i.e., $rank_{labeled}(s, g) = rank_{mixed}(s, g)$.

Figure 10. The measured similarities between age groups.

Figure 11. The measured correlations between age groups. *** indicates the significance level of p < 0.01.

Figure 12. The measured correlation values within a section between different age groups. *** indicates the significance level of p < 0.01.

Figure 13. The obtained correlation–similarity matrix. * indicates that it is an average value.

Table 3. Descriptive statistics on $agerate_{news}(n, g)$ for news articles in the labeled dataset.

Statistics	10s	20s	30s	40s	≥50s
Mean	0.0173	0.1403	0.2833	0.3135	0.2455
S.D.	0.0239	0.0891	0.0868	0.0711	0.1198

Table 4. Descriptive statistics on the 15,080 labeled news articles and their $agerate_{news}(n, g)$ by sections and subsections.

Section S	Subsection s	Labeled News Articles (%)	$agerate_{news}(n, g)$
			10s		20s		30s		40s		≥50s
			Mean	S.D.	Mean	S.D.	Mean	S.D.	Mean	S.D.	Mean	S.D.
Politics	President’s Office	4.4870%	0.0111	0.0099	0.1322	0.0543	0.2544	0.0654	0.2950	0.0448	0.3077	0.1009
	National Assembly/Political Party	8.9869%	0.0112	0.0120	0.0996	0.0579	0.2322	0.0713	0.3300	0.0582	0.3271	0.1096
	Administration	0.6271%	0.0146	0.0196	0.1497	0.1107	0.2602	0.0796	0.2975	0.0750	0.2798	0.1348
	National Defense/Diplomacy	3.3297%	0.0156	0.0147	0.1508	0.0705	0.2391	0.0617	0.2848	0.0486	0.3101	0.1079
	North Korea	3.4848%	0.0132	0.0118	0.1406	0.0538	0.2369	0.0551	0.2856	0.0423	0.3241	0.0955
	Politics General	15.2906%	0.0119	0.0142	0.0996	0.0653	0.2286	0.0673	0.3381	0.0608	0.3221	0.1034
Economy	Stock	1.3707%	0.0073	0.0307	0.0937	0.0529	0.2699	0.0678	0.3506	0.0573	0.2795	0.0856
	Finance	1.0797%	0.0069	0.0091	0.1044	0.0541	0.3032	0.0705	0.3410	0.0538	0.2443	0.0862
	Real Estate	1.8879%	0.0050	0.0066	0.0630	0.0454	0.2771	0.0731	0.3717	0.0542	0.2842	0.0808
	Industry/Business	3.4267%	0.0092	0.0119	0.1130	0.0547	0.2898	0.0719	0.3341	0.0576	0.2539	0.0921
	Global Economy	0.3815%	0.0124	0.0102	0.1124	0.0505	0.2686	0.0548	0.3564	0.0489	0.2512	0.0646
	Economy General	6.9050%	0.0093	0.0135	0.1139	0.0627	0.2896	0.0720	0.3365	0.0573	0.2506	0.0866
	Living Economy	0.4784%	0.0108	0.0182	0.1404	0.0885	0.3323	0.0758	0.3226	0.0704	0.1943	0.0834
	Small and Mid-sized Businesses/Start-ups	0.2198%	0.0068	0.0053	0.1059	0.0464	0.2853	0.0840	0.3321	0.0567	0.2715	0.1052
Society	Case/Accident	5.0171%	0.0208	0.0293	0.1599	0.1122	0.3090	0.0952	0.3054	0.0790	0.2048	0.1200
	Education	0.8340%	0.0398	0.0436	0.1864	0.1199	0.2664	0.1037	0.3459	0.1336	0.1627	0.0888
	Labor	1.2866%	0.0122	0.0165	0.1484	0.0765	0.3279	0.0804	0.3035	0.0669	0.2057	0.0944
	Environment	0.5690%	0.0214	0.0203	0.1724	0.0566	0.3515	0.0725	0.2818	0.0522	0.1703	0.0805
	The Press	0.1681%	0.0162	0.0188	0.1400	0.1163	0.2635	0.0966	0.3358	0.0897	0.2469	0.1264
	Food/Medical	0.4978%	0.0253	0.0232	0.2248	0.1356	0.3277	0.0923	0.2705	0.0800	0.1506	0.1036
	Region	5.4051%	0.0202	0.0255	0.1531	0.0893	0.3213	0.0830	0.3148	0.0760	0.1897	0.0961
	Society General	18.1419%	0.0253	0.0369	0.1735	0.1170	0.3172	0.0946	0.2987	0.0853	0.1841	0.1176
	Character	0.0711%	0.0191	0.0176	0.1855	0.0785	0.3364	0.0802	0.3145	0.0898	0.1473	0.0553
	Human Rights/Welfare	0.4396%	0.0166	0.0240	0.1540	0.0796	0.3479	0.0731	0.3062	0.0590	0.1760	0.0831
Lifestyle/Culture	Travel/Leisure	0.2134%	0.0221	0.0195	0.1670	0.0606	0.3852	0.0630	0.2785	0.0703	0.1464	0.0586
	Food/Restaurant	0.0517%	0.0100	0.0076	0.2062	0.0457	0.3712	0.0671	0.2775	0.0231	0.1350	0.0746
	Car/Test Drive	0.2651%	0.0100	0.0095	0.1254	0.0466	0.3685	0.0708	0.3220	0.0533	0.1754	0.0797
	Road/Traffic	0.1099%	0.0118	0.0101	0.1053	0.0511	0.2865	0.0775	0.3241	0.0537	0.2735	0.0962
	Health Information	0.3362%	0.0187	0.0114	0.1904	0.0761	0.3590	0.0581	0.2831	0.0646	0.1494	0.0675
	Performance/Exhibition	0.2198%	0.0221	0.0271	0.1712	0.1363	0.2874	0.0717	0.3174	0.0956	0.2012	0.1005
	Book	0.1616%	0.0432	0.0409	0.2100	0.1112	0.2660	0.0870	0.2724	0.0842	0.2072	0.1195
	Religion	0.4655%	0.0225	0.0198	0.1582	0.0848	0.3033	0.0944	0.3040	0.0643	0.2118	0.1348
	Lifestyle/Culture General	2.3469%	0.0249	0.0257	0.1876	0.0903	0.3416	0.0760	0.2851	0.0770	0.1611	0.0903
	Weather	2.2112%	0.0280	0.0189	0.2132	0.0490	0.3658	0.0564	0.2474	0.0437	0.1461	0.0565
	Fashion/Beauty	0.0388%	0.0217	0.0147	0.1567	0.0653	0.3833	0.0383	0.2900	0.0352	0.1483	0.0504
World	Asia/Australia	3.2392%	0.0215	0.0167	0.1561	0.0603	0.2796	0.0592	0.3145	0.0596	0.2285	0.0845
	USA/Latin America	1.3771%	0.0192	0.0137	0.1588	0.0545	0.2815	0.0698	0.2975	0.0515	0.2429	0.0961
	Europe	0.5302%	0.0246	0.0184	0.1900	0.0783	0.3161	0.0706	0.2865	0.0599	0.1830	0.0996
	Middle East/Africa	0.2651%	0.0300	0.0201	0.2215	0.0720	0.3385	0.0619	0.2805	0.0578	0.1298	0.0504
	World General	0.6724%	0.0240	0.0163	0.1489	0.0600	0.2954	0.0606	0.3080	0.0518	0.2237	0.0914
IT/Science	Internet/SNS	0.2263%	0.0417	0.0382	0.2677	0.1282	0.3320	0.0756	0.2389	0.0871	0.1206	0.0663
	Communications/New Media	0.3039%	0.0264	0.0235	0.1864	0.0637	0.3387	0.0696	0.3053	0.0711	0.1426	0.0585
	Science General	0.3685%	0.0367	0.0233	0.1988	0.0705	0.3295	0.0663	0.2839	0.0631	0.1516	0.0597
	Games/Reviews	0.0323%	0.0640	0.0385	0.4180	0.1585	0.3260	0.0643	0.1380	0.0823	0.0540	0.0391
	IT General	1.7327%	0.0316	0.0235	0.1924	0.0682	0.3351	0.0605	0.2962	0.0672	0.1446	0.0616
	Computer	0.0517%	0.0262	0.0200	0.1888	0.0761	0.3425	0.1011	0.2962	0.0566	0.1475	0.1038
	Mobile	0.3750%	0.0381	0.0250	0.2045	0.0595	0.3355	0.0568	0.2997	0.0617	0.1229	0.0531
	Security/Hacking	0.0194%	0.0367	0.0115	0.2000	0.0200	0.4033	0.0153	0.2633	0.0115	0.1000	0.0200

Table 5. Evaluation results for different prediction techniques.

Type	Prediction Technique	MAE		RMSE
Type	Prediction Technique	Mean	S.D.	Mean	S.D.
Single output	NNR^single	0.0285	0.0003	0.0396	0.0004
Single output	k-NNR^single	0.0267	0.0001	0.0379	0.0003
Multiple output	MLR^multi	0.0360	0.0003	0.0805	0.0010
Multiple output	NNR^multi	0.0264	0.0003	0.0383	0.0010
Multiple output	DTR^multi	0.0357	0.0001	0.0511	0.0003
Multiple output	SVR^multi	0.0246	0.0001	0.0357	0.0003
Multiple output	k-NNR^multi	0.0270	0.0001	0.0381	0.0003
Multiple output	RFR^multi	0.0252	0.0001	0.0359	0.0003

Note: The best (lowest) mean for each performance metric, obtained by SVR^multi for both MAE and RMSE, was shown in bold in the original table.
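
For reference, the two metrics reported in Table 5 can be sketched as follows. This is a minimal illustration, assuming the errors are pooled over every (article, age group) cell; the function names and the toy values are hypothetical and are not taken from the study.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error pooled over all (article, age group) cells."""
    diffs = [abs(t - p)
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    return sum(diffs) / len(diffs)


def rmse(y_true, y_pred):
    """Root mean squared error pooled over all (article, age group) cells."""
    squares = [(t - p) ** 2
               for row_t, row_p in zip(y_true, y_pred)
               for t, p in zip(row_t, row_p)]
    return sqrt(sum(squares) / len(squares))


# Hypothetical data: two articles, five age groups (10s, 20s, 30s, 40s, >=50s).
y_true = [[0.02, 0.14, 0.28, 0.31, 0.25],
          [0.01, 0.10, 0.23, 0.33, 0.33]]
y_pred = [[0.01, 0.15, 0.29, 0.30, 0.25],
          [0.02, 0.12, 0.22, 0.32, 0.32]]

print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE squares each error before averaging, it is never smaller than MAE and penalizes large misses more heavily, which is why a technique can rank differently on the two metrics.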

Table 6. Comparison results of the different prediction techniques.

Hypothesis	MAE		RMSE		Supported
Hypothesis	t	p-Value	t	p-Value	Supported
NNR^single > k-NNR^single	38.5257	0.0000 ***	24.3066	0.0000 ***	Yes
NNR^single > MLR^multi	−134.3845	0.0000 ***	−276.1103	0.0000 ***	Yes (opposite)
NNR^single > NNR^multi	33.6470	0.0000 ***	8.2965	0.0000 ***	Yes
NNR^single > DTR^multi	−155.2202	0.0000 ***	−164.6572	0.0000 ***	Yes (opposite)
NNR^single > SVR^multi	85.1937	0.0000 ***	55.1925	0.0000 ***	Yes
NNR^single > k-NNR^multi	33.1459	0.0000 ***	20.8643	0.0000 ***	Yes
NNR^single > RFR^multi	72.8377	0.0000 ***	52.4563	0.0000 ***	Yes
k-NNR^single > MLR^multi	−229.1540	0.0000 ***	−299.1235	0.0000 ***	Yes (opposite)
k-NNR^single > NNR^multi	5.9086	0.0000 ***	−3.4503	0.0008 ***	No
k-NNR^single > DTR^multi	−338.7725	0.0000 ***	−231.9032	0.0000 ***	Yes (opposite)
k-NNR^single > SVR^multi	83.1917	0.0000 ***	37.6899	0.0000 ***	Yes
k-NNR^single > k-NNR^multi	−9.1403	0.0000 ***	−3.9958	0.0001 ***	Yes (opposite)
k-NNR^single > RFR^multi	60.8346	0.0000 ***	34.4006	0.0000 ***	Yes
MLR^multi > NNR^multi	166.1897	0.0000 ***	218.8488	0.0000 ***	Yes
MLR^multi > DTR^multi	6.2753	0.0000 ***	206.2436	0.0000 ***	Yes
MLR^multi > SVR^multi	289.3370	0.0000 ***	314.9457	0.0000 ***	Yes
MLR^multi > k-NNR^multi	222.4004	0.0000 ***	296.9896	0.0000 ***	Yes
MLR^multi > RFR^multi	275.3113	0.0000 ***	313.2876	0.0000 ***	Yes
NNR^multi > DTR^multi	−190.7444	0.0000 ***	−90.2455	0.0000 ***	Yes (opposite)
NNR^multi > SVR^multi	37.5580	0.0000 ***	18.5418	0.0000 ***	Yes
NNR^multi > k-NNR^multi	−10.9507	0.0000 ***	1.8082	0.0736	No
NNR^multi > RFR^multi	25.7457	0.0000 ***	17.3319	0.0000 ***	Yes
DTR^multi > SVR^multi	448.2618	0.0000 ***	273.8899	0.0000 ***	Yes
DTR^multi > k-NNR^multi	327.3412	0.0000 ***	225.4759	0.0000 ***	Yes
DTR^multi > RFR^multi	427.0787	0.0000 ***	268.2875	0.0000 ***	Yes
SVR^multi > k-NNR^multi	−92.3828	0.0000 ***	−41.3531	0.0000 ***	Yes (opposite)
SVR^multi > RFR^multi	−24.6128	0.0000 ***	−2.9794	0.0036 ***	Yes (opposite)
k-NNR^multi > RFR^multi	70.2213	0.0000 ***	38.0674	0.0000 ***	Yes

Note: *** indicates the significance level of p < 0.01. “Yes (opposite)” means the opposite of a hypothesis is supported.
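
The sign convention in Table 6 can be illustrated with a small sketch. This section does not state which t-test variant or how many repeated runs were used, so the code below assumes Welch's unequal-variance t statistic on per-run error scores; the function name and the per-run values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(errors_a, errors_b):
    """Welch's t statistic for the hypothesis mean(errors_a) > mean(errors_b)."""
    var_a = variance(errors_a)  # sample variance (n - 1 denominator)
    var_b = variance(errors_b)
    std_err = sqrt(var_a / len(errors_a) + var_b / len(errors_b))
    return (mean(errors_a) - mean(errors_b)) / std_err


# Hypothetical per-run MAE values for two techniques A and B.
mae_a = [0.0285, 0.0288, 0.0282, 0.0286]
mae_b = [0.0267, 0.0266, 0.0268, 0.0267]

t_ab = welch_t(mae_a, mae_b)  # positive: A has the larger error
t_ba = welch_t(mae_b, mae_a)  # negative: the opposite hypothesis holds
print(t_ab, t_ba)
```

Under this reading, a significant positive t supports the hypothesis "A > B" (A has the larger error, so B predicts better), while a significant negative t corresponds to the "Yes (opposite)" rows.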

Table 7. Descriptive statistics on the three types of $agerate_{news}(n, g)$.

Age Group g	True Values of the Labeled		Predictions for the Unlabeled		Values of the Mixed
Age Group g	Mean	S.D.	Mean	S.D.	Mean	S.D.
10s	0.0173	0.0239	0.0119	0.0184	0.0123	0.0190
20s	0.1403	0.0891	0.1497	0.1249	0.1489	0.1222
30s	0.2833	0.0868	0.2847	0.1430	0.2846	0.1390
40s	0.3135	0.0711	0.2742	0.1180	0.2776	0.1153
≥50s	0.2455	0.1198	0.2795	0.1965	0.2765	0.1913

Table 8. Descriptive statistics on the three types of $agerate_{section}(S, g)$ values and Z-test results for comparisons.

Section S	Age Group g	True Values of the Labeled		Predictions for the Unlabeled		Values of the Mixed		Z-Test
Section S	Age Group g	Mean	S.D.	Mean	S.D.	Mean	S.D.	Z	p-Value
Politics	10s	0.0121	0.0132	0.0089	0.0123	0.0093	0.0124	15.5958	0.0000
	20s	0.1132	0.0659	0.1402	0.1193	0.1365	0.1139	−14.9812	0.0000
	30s	0.2350	0.0673	0.2306	0.1337	0.2312	0.1267	2.1891	0.0286
	40s	0.3201	0.0601	0.2528	0.1085	0.2619	0.1058	40.2355	0.0000
	≥50s	0.3199	0.1053	0.3675	0.2104	0.3610	0.2000	−15.1047	0.0000
Economy	10s	0.0085	0.0148	0.0070	0.0133	0.0071	0.0134	5.1138	0.0000
	20s	0.1059	0.0608	0.1166	0.1082	0.1160	0.1060	−4.6772	0.0000
	30s	0.2881	0.0724	0.2950	0.1418	0.2946	0.1387	−2.2952	0.0217
	40s	0.3417	0.0583	0.3034	0.1199	0.3057	0.1175	15.0378	0.0000
	≥50s	0.2560	0.0883	0.2779	0.1800	0.2766	0.1760	−5.737	0.0000
Society	10s	0.0233	0.0333	0.0122	0.0185	0.0132	0.0206	31.2967	0.0000
	20s	0.1677	0.1103	0.1601	0.1338	0.1608	0.1318	3.5588	0.0004
	30s	0.3167	0.0926	0.3077	0.1461	0.3086	0.1420	3.9659	0.0001
	40s	0.3034	0.0836	0.2701	0.1171	0.2732	0.1148	18.1868	0.0000
	≥50s	0.1879	0.1124	0.2499	0.1897	0.2441	0.1847	−21.1782	0.0000
Lifestyle/Culture	10s	0.0247	0.0226	0.0208	0.0267	0.0211	0.0265	4.271	0.0000
	20s	0.1897	0.0801	0.1787	0.1265	0.1794	0.1243	2.5924	0.0095
	30s	0.3464	0.0744	0.3113	0.1363	0.3134	0.1337	7.6971	0.0000
	40s	0.2761	0.0685	0.2647	0.1191	0.2654	0.1167	2.863	0.0042
	≥50s	0.1633	0.0864	0.2244	0.1716	0.2207	0.1683	−10.6728	0.0000
World	10s	0.0219	0.0165	0.0184	0.0223	0.0186	0.0220	4.4732	0.0000
	20s	0.1617	0.0633	0.1762	0.1236	0.1752	0.1205	−3.3929	0.0007
	30s	0.2875	0.0648	0.2541	0.1306	0.2564	0.1274	7.4201	0.0000
	40s	0.3060	0.0579	0.2630	0.1123	0.2659	0.1100	11.0682	0.0000
	≥50s	0.2230	0.0916	0.2883	0.1966	0.2838	0.1920	−9.6401	0.0000
IT/Science	10s	0.0335	0.0254	0.0176	0.0230	0.0183	0.0233	13.9764	0.0000
	20s	0.2018	0.0795	0.1810	0.1257	0.1819	0.1241	3.4913	0.0005
	30s	0.3351	0.0634	0.3213	0.1292	0.3219	0.1270	2.2588	0.0239
	40s	0.2900	0.0711	0.2852	0.1293	0.2854	0.1273	0.7872	0.4312
	≥50s	0.1397	0.0621	0.1949	0.1613	0.1925	0.1586	−7.2767	0.0000

Notes: Z-tests were conducted for a comparison between the true values of the labeled dataset and the values of the mixed dataset.
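
A two-sample Z-test of this kind can be sketched from summary statistics. This is only an illustration, assuming the standard unpooled Z statistic and a two-sided p-value; the sample sizes below are hypothetical because the per-section article counts are not listed in this table, and the function names are not from the study.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided p-value of a Z statistic under the standard normal."""
    return erfc(abs(z) / sqrt(2))


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample Z statistic (group 1 minus group 2) and its p-value."""
    std_err = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / std_err
    return z, two_sided_p(z)


# Means and S.D.s from the Politics / 10s row; n1 and n2 are hypothetical.
z, p = two_sample_z(0.0121, 0.0132, 5000, 0.0093, 0.0124, 50000)
print(z, p)

# The two-sided conversion agrees with the (Z, p) pairs reported in Table 8.
print(round(two_sided_p(2.1891), 4))  # Politics / 30s row
print(round(two_sided_p(0.7872), 4))  # IT/Science / 40s row
```

A positive Z means the labeled mean exceeds the mixed mean, which matches the signs in the table (for example, positive for Politics 10s and negative for Politics 20s).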

Table 9. The measured similarities between two age groups in different sections, i.e., $similarity_{section}(S, g_i, g_j)$.

Age Group Pairs (g_i, g_j)	Section S
Age Group Pairs (g_i, g_j)	Politics		Economy		Society		Lifestyle/Culture		World		IT/Science
(10s, 20s)	0.7400		0.6761		0.7243		0.7498		0.7631		0.7600
(10s, 30s)	0.5779		0.4480		0.4790		0.5371		0.5942		0.5587
(10s, 40s)	0.4802		0.3333		0.3909		0.4747		0.5296		0.4574
(10s, ≥50s)	0.4008	∨	0.2796	∨	0.2841	∨	0.3761	∨	0.3893	∨	0.3378	∨
(20s, 30s)	0.8182	∧	0.7863		0.7683		0.8076	∧	0.8347	∧	0.8221	∧
(20s, 40s)	0.5766		0.5243		0.5463		0.5893		0.6321		0.5885
(20s, ≥50s)	0.4843		0.4367		0.4080		0.4769		0.4915		0.4614
(30s, 40s)	0.8019		0.8004	∧	0.8023	∧	0.8059		0.8010		0.8062
(30s, ≥50s)	0.5628		0.5694		0.5209		0.5443		0.5371		0.5400
(40s, ≥50s)	0.7834		0.7866		0.7507		0.7427		0.7508		0.7164

Note: “∧” indicates that the similarity is the largest within a section, while “∨” represents the smallest similarity within the section.

Table 10. The top 5 and bottom 5 subsections for age group pairs in terms of $similarity_{subsection}(s, g_i, g_j)$.

(a) Top 5 Subsections
Age Group Pairs (g_i, g_j)	Rank
Age Group Pairs (g_i, g_j)	1	2	3	4	5
(10s, 20s)	Middle East/Africa	Europe	Mobile	Road/Traffic	Science General
(10s, 30s)	Middle East/Africa	Mobile	Europe	National Defense/Diplomacy	USA/Latin America
(10s, 40s)	Middle East/Africa	Religion	USA/Latin America	Europe	National Defense/Diplomacy
(10s, ≥50s)	Religion	Middle East/Africa	North Korea	Food/Restaurant	National Defense/Diplomacy
(20s, 30s)	Weather	Middle East/Africa	Mobile	North Korea	USA/Latin America
(20s, 40s)	Middle East/Africa	Science General	Europe	Asia/Australia	Weather
(20s, ≥50s)	North Korea	National Defense/Diplomacy	Human Rights/Welfare	Security/Hacking	Asia/Australia
(30s, 40s)	Human Rights/Welfare	Middle East/Africa	Food/Restaurant	Mobile	Road/Traffic
(30s, ≥50s)	Human Rights/Welfare	Security/Hacking	Real Estate	Health Information	Food/Restaurant
(40s, ≥50s)	Real Estate	Human Rights/Welfare	North Korea	Economy General	Finance
(b) Bottom 5 Subsections
Age Group Pairs (g_i, g_j)	Rank
Age Group Pairs (g_i, g_j)	1	2	3	4	5
(10s, 20s)	Stock	Labor	Industry/Business	Security/Hacking	Small and Mid-sized Businesses/Start-ups
(10s, 30s)	Labor	Small and Mid-sized Businesses/Start-ups	Stock	Living Economy	Finance
(10s, 40s)	Finance	Stock	Real Estate	Small and Mid-sized Businesses/Start-ups	Economy General
(10s, ≥50s)	Society General	Finance	Small and Mid-sized Businesses/Start-ups	Economy General	Industry/Business
(20s, 30s)	Real Estate	Food/Medical	Case/Accident	Society General	Character
(20s, 40s)	Real Estate	Stock	Finance	Character	Society General
(20s, ≥50s)	Society General	Case/Accident	Finance	Real Estate	The Press
(30s, 40s)	The Press	Performance/Exhibition	Stock	Education	Administration
(30s, ≥50s)	Book	Society General	Communications/New Media	Case/Accident	Middle East/Africa
(40s, ≥50s)	Communications/New Media	Education	Mobile	Internet/SNS	IT General

Table 11. The top 5 and bottom 5 subsections for age group pairs in terms of $correlation_{subsection}(s, g_i, g_j)$.

(a) Top 5 Subsections
Age Group Pairs (g_i, g_j)	Rank
Age Group Pairs (g_i, g_j)	1	2	3	4	5
(10s, 20s)	Real Estate	Road/Traffic	Food/Medical	The Press	Case/Accident
(10s, 30s)	National Assembly/Political Party	President’s Office	North Korea	The Press	Politics General
(10s, 40s)	National Defense/Diplomacy	North Korea	USA/Latin America	World General	Middle East/Africa
(10s, ≥50s)	Stock	Car/Test Drive	Human Rights/Welfare	Labor	Fashion/Beauty
(20s, 30s)	National Assembly/Political Party	The Press	President’s Office	Politics General	Global Economy
(20s, 40s)	North Korea	National Defense/Diplomacy	National Assembly/Political Party	USA/Latin America	President’s Office
(20s, ≥50s)	Car/Test Drive	Human Rights/Welfare	Real Estate	Food/Restaurant	Living Economy
(30s, 40s)	National Defense/Diplomacy	North Korea	Middle East/Africa	President’s Office	Religion
(30s, ≥50s)	Education	Fashion/Beauty	Mobile	Games/Reviews	Performance/Exhibition
(40s, ≥50s)	Games/Reviews	Fashion/Beauty	Weather	Human Rights/Welfare	Europe
(b) Bottom 5 Subsections
Age Group Pairs (g_i, g_j)	Rank
Age Group Pairs (g_i, g_j)	1	2	3	4	5
(10s, 20s)	Environment	Security/Hacking	National Defense/Diplomacy	Education	Weather
(10s, 30s)	Fashion/Beauty	Games/Reviews	Human Rights/Welfare	Internet/SNS	Education
(10s, 40s)	Real Estate	Road/Traffic	Mobile	Internet/SNS	The Press
(10s, ≥50s)	National Defense/Diplomacy	North Korea	World General	The Press	President’s Office
(20s, 30s)	Human Rights/Welfare	Fashion/Beauty	Games/Reviews	Food/Medical	Internet/SNS
(20s, 40s)	Fashion/Beauty	Mobile	Games/Reviews	Internet/SNS	Car/Test Drive
(20s, ≥50s)	North Korea	National Defense/Diplomacy	USA/Latin America	President’s Office	National Assembly/Political Party
(30s, 40s)	Mobile	Communications/New Media	Car/Test Drive	IT General	Road/Traffic
(30s, ≥50s)	President’s Office	National Assembly/Political Party	Politics General	North Korea	Real Estate
(40s, ≥50s)	Politics General	National Assembly/Political Party	National Defense/Diplomacy	North Korea	USA/Latin America

Table 12. The obtained degree of membership for an age group pair $(g_i, g_j)$ in a fuzzy set of age difference.

Age Group Pairs $(g_i, g_j)$	Number of Subsections in a Fuzzy Set / Number of Total Subsections			Degree of Membership in a Fuzzy Set, $\mu_{D_q}(g_i, g_j)$
Age Group Pairs $(g_i, g_j)$	$D_{High}$	$D_{Middle}$	$D_{Low}$	$D_{High}$	$D_{Middle}$	$D_{Low}$
(10s, 20s)	0/48	0/48	48/48	0.0000	0.0000	1.0000
(10s, 30s)	2/48	40/48	6/48	0.0417	0.8333	0.1250
(10s, 40s)	45/48	3/48	0/48	0.9375	0.0625	0.0000
(10s, ≥50s)	46/48	2/48	0/48	0.9583	0.0417	0.0000
(20s, 30s)	0/48	0/48	48/48	0.0000	0.0000	1.0000
(20s, 40s)	32/48	16/48	0/48	0.6667	0.3333	0.0000
(20s, ≥50s)	48/48	0/48	0/48	1.0000	0.0000	0.0000
(30s, 40s)	0/48	32/48	16/48	0.0000	0.6667	0.3333
(30s, ≥50s)	47/48	1/48	0/48	0.9792	0.0208	0.0000
(40s, ≥50s)	0/48	0/48	48/48	0.0000	0.0000	1.0000
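
The right-hand columns of Table 12 follow from the left-hand ones by dividing each count by the 48 subsections. A minimal sketch of that arithmetic is given below; the function and variable names are hypothetical, and how subsections are assigned to each fuzzy set is not reproduced here.

```python
TOTAL_SUBSECTIONS = 48  # total number of subsections in Table 12


def membership_degrees(counts, total=TOTAL_SUBSECTIONS):
    """Degree of membership per fuzzy set: subsections in the set / total."""
    return {fuzzy_set: round(count / total, 4)
            for fuzzy_set, count in counts.items()}


# Counts for the (10s, 30s) pair, taken from Table 12.
counts_10s_30s = {"High": 2, "Middle": 40, "Low": 6}
degrees = membership_degrees(counts_10s_30s)
print(degrees)  # {'High': 0.0417, 'Middle': 0.8333, 'Low': 0.125}
```

Because every subsection falls into exactly one of the three sets in the table, the three degrees for a pair sum to 1 (up to rounding).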

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

