Article

Multi-Label Prediction-Based Fuzzy Age Difference Analysis for Social Profiling of Anonymous Social Media

Department of Management Information Systems & BERI, Gyeongsang National University, 501 Jinjudae-ro, Jinju-si 52828, Republic of Korea
Appl. Sci. 2024, 14(2), 790; https://doi.org/10.3390/app14020790
Submission received: 14 November 2023 / Revised: 9 January 2024 / Accepted: 15 January 2024 / Published: 17 January 2024

Abstract

Age is an essential piece of demographic information for social profiling, as different social and behavioral characteristics are age-related. To acquire age information, most of the previously conducted social profiling studies have predicted age information. However, age predictions in social profiling have been very limited, because it is difficult or impossible to obtain age information from social media. Moreover, age-prediction results have rarely been used to study human dynamics. In these circumstances, this study focused on naver.com, a nationwide social media website in Korea. Although the social profiles of news commenters on naver.com can be analyzed and used, the age information is incomplete (i.e., partially open to the public) owing to anonymity and privacy protection policies. Therefore, no prior research has used naver.com for age predictions or subsequent analyses based on the predicted age information. To address this research gap, this study proposes a method that uses a machine learning approach to predict the age information of anonymous commenters on unlabeled (i.e., with age information hidden) news articles on naver.com. Furthermore, the predicted age information was fused with the section information of the collected news articles, and fuzzy differences between age groups were analyzed for topics of interest, using the proposed correlation–similarity matrix and fuzzy sets of age differences. Thus, differentiated from the previous social profiling studies, this study expands the literature on social profiling and human dynamics studies. Consequently, it revealed differences between age groups from anonymous and incomplete Korean social media that can help in understanding age differences and ease related intergenerational conflicts to help reach a sustainable South Korea.

1. Introduction

1.1. Background and Purpose

Social media provides a virtual space for sharing thoughts and information, engaging with others, and creating online communities [1,2,3]. It generates large volumes of user-generated data and provides unprecedented opportunities for computational social science researchers [4,5]. Thus, social profiling, which profiles users based on their social media data, has recently gained significant attention for social media studies on human dynamics [6,7]. Among the many social profile attributes available on social media, user demographics include age, gender, race, city, country, and occupation. Additionally, it is essential to understand patterns, differences, and trends through demographic analysis, as demographics usually determine online social and behavioral patterns [3].
Most prior studies that used demographic information have mainly considered age as one of the principal and mandatory variables to be explored in social profiling and human dynamics studies. Behavioral differences in social media are evident among people of different ages [8]. For example, teenagers generally pay less attention to privacy, and tend to share personal information more carelessly on social networks [9]. In contrast, adult users are careful when posting, and pay attention to who can read their comments. Therefore, they more often craft sentences with positive emotions, fewer negations, and less slang [10].
In addition, there are typical behaviors among users of the same age group when the same topic is discussed [11]. For example, among the numerous topics that teenagers generally discuss in their daily lives, topics such as relationships, school, and friends are more frequent [12]. Contrastingly, when adults use social media to express their opinions, they often leave their classic identity markers of adulthood, through the discussion of topics such as religion, ideology, politics, and work. Moreover, adult users tend to provide photos, videos, and URLs to complement their opinions [13].
However, prior studies on age information-based social profiling have been restricted because age information is not always available on social media [14]. In other words, most prior age research on social media has either used small datasets for which age information could be collected, or relied on a simple but unreliable strategy to extract age information from social media (e.g., using descriptions that contain expressions like “12 years” and “I have 12 years”) [15]. Moreover, social media’s anonymity and privacy policies have made it more challenging to obtain demographic information [16]. Therefore, prior researchers have not even attempted to collect anonymous social media data with incomplete age information (i.e., partially open to the public) for social profiling studies.
As such, acquiring age information on social media has been a concern, and efforts have been made to address it [11]. Therefore, most previous studies on age information-based social profiling have focused on age predictions, as shown in Table 1. Nevertheless, challenging issues for age information-based social profiling still exist, as listed below.
First, according to this study’s literature review, most of the existing studies focused only on predictions, and made little effort to link the age prediction results to the analysis of human dynamics using the obtained age information. If more age information is provided through age predictions, more studies on human dynamics need to be considered with the predicted age information.
Second, most of the social media data that prior studies used were in English or Chinese, whereas social profiling has rarely been performed using Korean social media data. However, using social media data in various languages for age analysis will provide a richer understanding of various individuals and societies; therefore, efforts to acquire social media data in diverse languages are required for more effective social profiling studies.
To resolve the abovementioned problems and challenges, this study selected as its focus one of Korea’s major news portal sites, naver.com, and pioneered using it for age information-based social profiling studies. Unlike other sites, naver.com provides age information (i.e., rates of people in their 10s, 20s, 30s, 40s, and ≥50s as age group distributions) of anonymous commenters on news articles. The age information of naver.com is reliable as users sign into the service via real name authentication. Therefore, its age information can be used for age information-based social profiling studies of social media users in Korea. However, to do so, there are still problems to be solved as below.
First, naver.com has a policy of making the age group distribution of anonymous commenters on a news article open to the public only if the number of its direct news comments exceeds a specific number, i.e., 100. (The term “direct news comment” is used for a news comment that replies directly to a news article, to distinguish it from “news comments” on a news article, which include both direct replies and indirect replies to other news comments.) In other words, news articles with fewer than 100 direct news comments can be considered as unlabeled news articles. Around 91% of news articles published on naver.com were found to be unlabeled news articles based on the data that this study collected.
Second, as a result, naver.com has not yet been used to predict age information or analyze the social profiles of Korean users by using the age information that has been collected or predicted. Hence, methodologically there are several elements that are unclear: (i) how age information of anonymous commenters on a news article can be represented by extracted features; (ii) which prediction technique can give a better performance at predicting age information using the age information representation; and (iii) how the social profiles of commenters on news articles can be analyzed using the predicted age information.
To address these questions, this study proposed a method for predicting the age group distribution of anonymous news commenters for news articles and then using this predicted age information to investigate and understand differences in topics of interest between age groups as human dynamics. To be specific, it adopted a machine learning approach for predicting the age group distribution of anonymous commenters on unlabeled news articles. In this approach, each news article was represented by textual characteristics based on its comments. The labeled news articles were used to evaluate machine learning techniques for predicting age information, and the best prediction technique was selected and used to perform age information predictions for unlabeled news articles. Consequently, all collected news articles could be labeled by age information. Thereafter, using the section information as a cue for topics of interest to age groups, the fuzzy differences of interesting topics among age groups were explored and compared.

1.2. Reviews on Related Works

User profiling refers to the process of collecting, cleaning, and presenting an individual’s characteristics that are related to demographics and behavior [17]. These attributes often include basic information (e.g., age, gender, location, education, and occupation). Additionally, user profiles can include elements that reflect more complex aspects, such as preferences, interests, behaviors, and personality traits. Recently, user profiling has evolved into social profiling, which leverages social information to generate user profiles (e.g., social actions, such as clicks and likes on social media) [18]. Prior social profiling studies are summarized in Table 1.
As shown in Table 1, social profiling can be classified into individual and group profiling [3]. Individual profiling involves learning about a person based on demographics (e.g., age, gender, and location) and psychographics (e.g., behavior, personality traits, and interests) by directly asking questions or tracking behaviors online and offline. Group profiling is a process used to represent individuals who share common attributes and may or may not be identified as a group. It mainly includes community detection and subsequent analysis of communities. In relation to the taxonomy of social profiling, this study focused on age, one of the demographic characteristics of individual profiling, but extended to the perspective of group profiling by using the distribution of age groups among commenters on a news article.
According to this study’s purpose, prior works related to social profiling can be divided into three categories: predicting social profile attributes (also known as digital footprints) [19], using collected or generated social profiles to analyze human dynamics, and performing both in a sequence. However, most of the prior studies aimed at predicting the attributes of social profiles (e.g., personality traits and demographics). Based on the taxonomies in Table 1, this study’s purpose can be classified as examining both age prediction and human dynamics, using the predicted age information and interests of groups as two social profile attributes. As age differences were analyzed at different topic levels using section information, this study contributes to the related literature.
Table 1. Prior social profiling studies.
Previous Work | Description | Types of Social Profile Attributes (Individual; Group) | Types of Purpose (Prediction; Human Dynamics Studies)
Lima and de Castro [20] | Personality traits prediction | Personality traits |
Segalin, Cheng, and Cristani [17] | Personality traits prediction | Personality traits; Personality traits |
Wang et al. [21] | Joint gender and age prediction | Gender, Age |
Guimarães, Rosa, Gaetano, Rodríguez, and Bressan [11] | Age group classification of people in their 10s to adults, analyzing the importance of age information to the precision of a sentiment metric | Age, sentiment, and relationship between age and sentiment |
Wang et al. [22] | Demographics prediction | Gender, Age, and Location |
Chen et al. [23] | Age prediction | Age |
Lee and Ryu [24] | Gender difference analysis | Gender, Interest |
Fang et al. [25] | Prediction of age and gender | Age, Gender |
López-Santillán, Montes-Y-Gómez, González-Gurrola, Ramírez-Alonso, and Prieto-Ordaz [7] | Prediction of age, gender, and personality traits | Age, Gender, Personality traits |
Han et al. [26] | Personality traits prediction | Personality traits; Personality traits |
Romanov et al. [27] | Age prediction | Age; Age |
Figueroa, Peralta, and Nicolis [15] | Age prediction | Age; Age |
Kamalesh and B [28] | Personality traits prediction | Personality traits |
Khorrami et al. [29] | Personality traits prediction | Personality traits; Personality |
Zhou et al. [30] | Personality traits prediction | Personality traits |
This study | Age prediction and fuzzy age differences in terms of interesting topics | Age, Interest |
Table 1 shows that most of the prior studies focused on predicting social profiles. Table 2 provides additional details on the prior studies in terms of their machine learning approaches. The findings emerging from Table 2 are as follows:
First, representative social media, such as Twitter, Facebook, Instagram, and Weibo were used as data sources, and most of the text data was in English or Chinese. Second, various features were used as footprints on social media, representing target social profile attributes. They could be grouped into text, images, social relations, and social behaviors, whereas the type of footprint used on social media depended on the types of data sources (e.g., text or social features for Twitter and Weibo, and images for Flickr and Instagram). Third, for the type of machine learning, supervised learning (e.g., classification and regression) was mostly adopted. Prediction techniques varied from traditional machine learning techniques (e.g., support vector machine (SVM) and multilayer perceptron (MLP)) to the recent deep learning techniques (e.g., convolutional neural network (CNN) and recurrent neural network (RNN)).
Compared to the previous studies summarized in Table 2, this study used news articles and their comments from the Korean social media website www.naver.com, which has rarely been studied because of anonymity and incompleteness of its social profile data. Regarding the type of footprint used in social media, this study generated and used word embeddings from text to represent age groups (i.e., the distribution of commenters on a news article by age group). In particular, it used the word2vec approach [31] for word embeddings to overcome difficulties in text representation (e.g., news comments are very short, although their lengths are varied, and many of them are grammatically incorrect, either accidentally or intentionally).
Regarding the prediction techniques in Table 2, deep learning models such as CNN or RNN are more complex than classical models and may achieve better performance when massive training datasets are available. However, the scale of a personality dataset is usually small owing to the high cost of collecting sample data, and classical models can achieve comparable results for such a small dataset [26]. Consequently, this study did not involve deep learning models. Instead, it focused on regression tasks in a supervised learning-based approach, in which regressors trained on the labeled dataset were used to predict the distribution of age groups for news articles in the unlabeled dataset, whose size was much larger than that of the labeled dataset.
Table 2. Prior studies using a machine learning approach for age predictions of social profiling.
Prior Studies | Social Media Data Used: Source | Social Media Data Used: Language | Types of Footprints Used | Types of Machine Learning | Prediction Techniques
Lima and de Castro [20] | Twitter | English | Text, Social | Classification | Naïve Bayes (NB), Support Vector Machine (SVM), Multilayer Perceptron (MLP)
Wang, Li, Chen, and Li [21] | Sina Weibo | Chinese | Text | Classification, Regression | SVM, CNN, Multi-task Convolutional Neural Network (MTCNN)
Guimarães, Rosa, Gaetano, Rodríguez, and Bressan [11] | Twitter | English | Text, Social | Classification | MLP, DeepCNN, Decision Tree (DT), Random Forest (RF), SVM
Segalin, Cheng, and Cristani [17] | Flickr | - | Images | Classification, Regression | CNN
Wang, Ma, and Zhang [22] | Sina Weibo | Chinese | Text | Classification | SVM
Chen, Cheng, Yang, Liang, Quan, and Li [23] | Sina Weibo | Chinese | Text, Social | Classification, Regression | SVM, MLP, CNN, Long Short-Term Memory (LSTM)
Fang, Yuan, Lu, and Feng [25] | Flickr | - | Images | Classification, Regression | CNN
López-Santillán, Montes-Y-Gómez, González-Gurrola, Ramírez-Alonso, and Prieto-Ordaz [7] | Social media outlets, blogs, Twitter, and hotel reviews | English, Arabic, Spanish, Portuguese, Dutch, Italian | Text | Classification, Regression | SVM, RF, Extra Trees (ET), k-Nearest Neighbors (k-NN)
Han, Huang, and Tang [26] | Sina Weibo | Chinese | Text | Classification | Logistic Regression (LR), SVM, RF
Romanov, Kurtukova, Sobolev, Shelupanov, and Fedotova [27] | vk.com | Russian | Text | Classification | The hybrid of CNN and RNN (CRNN)
Figueroa, Peralta, and Nicolis [15] | Yahoo! Answers | English | Text, Images, Social | Classification | FastText, CNN, Bidirectional RNN (B-RNN), Attention-based Bidirectional RNN (AB-RNN), Recurrent Convolutional Neural Network (RCNN)
Kamalesh and B [28] | Facebook, Twitter, Instagram | English | - | Classification | Maximum Entropy Classifier (MEC)
Khorrami, Khorrami, and Farhangi [29] | Instagram | - | Images | Regression | ET, Gradient Boosted Trees (GBT), RF
Zhou, Zhang, Zhao, and Yang [30] | Facebook | English | Text | Classification | SVM, RF, k-NN, Attention-based Bidirectional LSTM (AB-LSTM)
This study | naver.com | Korean | Text | Regression | Multiple Linear Regression (MLR), MLP, DT, SVM, k-NN, RF

1.3. Organization of This Paper

The rest of this paper comprises three sections. Section 2 describes the research framework, which proposed using a machine learning-based approach for predicting the age group distribution of anonymous commenters on news articles. Section 3 demonstrates the results of applying the research framework to the anonymous Korean news portal site naver.com. Thereafter, it discusses and compares the performance of different prediction techniques. It selected the best prediction technique and used it to predict the age group distribution of anonymous commenters on each unlabeled news article. Moreover, it used the collected or predicted age information to analyze fuzzy differences between age groups with respect to their topics of interest by using the collected section information. Finally, Section 4 concludes the paper by summarizing the results and implications of this study and its limitations for future research.

2. Materials and Methods

This study proposed a machine learning-based methodology for fuzzy age difference analysis of anonymous commenters with news article data, of which age information was mostly unlabeled. Figure 1 summarizes this study’s overall research method, which comprises four steps, and the subsequent subsections explain its details.

2.1. Acquire the News Data from Anonymous Korean Social Media

This study acquired data comprising 167,533 news articles and their comments from 1 October to 30 November 2022, covering all sections and subsections of naver.com, an anonymous Korean news portal site. Only 15,080 (9.0001%) of the acquired news articles were given the age information of their commenters according to the policy of naver.com, wherein age information of a news article is disclosed to the public only if it has more than 100 direct comments. Such articles were considered as labeled news articles, and approximately 86% of the collected comments belonged to the labeled news articles category.
The age information for a news article n represents the age group distribution, which is composed of the rates of five age groups: 10s, 20s, 30s, 40s, and ≥50s. The rate of an age group g was defined as:
$$\mathit{agerate}_{\mathit{news}}(n, g) = \text{the rate of news commenters belonging to an age group } g \text{ for a news article } n, \tag{1}$$
where $g$ is an age group, $g \in \{10\text{s}, 20\text{s}, 30\text{s}, 40\text{s}, {\geq}50\text{s}\}$. Figure 2 illustrates the distributions of $\mathit{agerate}_{\mathit{news}}(n, g)$ given for the labeled news articles by using the kernel density estimation (KDE) plot, and Table 3 provides descriptive statistics on the $\mathit{agerate}_{\mathit{news}}(n, g)$ of the labeled news articles.
In addition to age information, this study collected section and subsection information about the news articles of the collected data and denoted a section as $S \in \mathit{SECTION}$, and a subsection of the section $S$ as $s \in \mathit{SUBSECTION}(S)$. The section and subsection information about the collected news articles that were merged with age information enabled this study to analyze the differences between age groups with respect to the topics of their interest. Table 4 shows the descriptive statistics in terms of 6 sections and 48 subsections for the labeled news articles and their $\mathit{agerate}_{\mathit{news}}(n, g)$ values.

2.2. Represent the Distribution of Age Groups among Commenters on a News Article Using word2vec

This study used the news2vec model to represent a news article n, i.e., the age group distribution of the news article. The news2vec vectors for the collected news articles were generated by following five steps, which can be summarized as: (i) extracting unigrams from news comments; (ii) removing extremely long unigrams; (iii) generating 300-dimensional word2vec embeddings for the unigrams, i.e., unigram2vec; (iv) generating feature vectors for news comments, i.e., comment2vec; and (v) generating feature vectors for news articles, i.e., news2vec.
Here, to generate unigram2vec vectors, a news comment c was considered as a sentence and represented by its unigrams, considered as words in the sentence. Thereafter, it was used as an input to train the word2vec model, using Gensim’s word2vec module with default settings. Moreover, to generate comment2vec vectors, unigram2vec vectors were aggregated for related news comments, which were then aggregated for related news articles. The comment2vec and news2vec are defined as follows:
$$\mathit{comment2vec}(c) = \frac{1}{n(\mathit{UNIGRAM}(c))} \sum_{u \in \mathit{UNIGRAM}(c)} \mathit{unigram2vec}(u), \tag{2}$$
where $\mathit{UNIGRAM}(c)$ is a set of unigrams appearing in a news comment $c$.
$$\mathit{news2vec}(n) = \frac{1}{n(\mathit{COMMENT}(n))} \sum_{c \in \mathit{COMMENT}(n)} \mathit{comment2vec}(c), \tag{3}$$
where $\mathit{COMMENT}(n)$ is a set of news comments on a news article $n$.
In addition, for the 15,080 labeled news articles, each age group rate of a news article $n$, i.e., $\mathit{agerate}_{\mathit{news}}(n, g)$, was transformed into log odds to avoid having a negative value in the prediction, and the log odds was given by
$$\mathit{agescore}_{\mathit{news}}(n, g) = \log \frac{\mathit{agerate}_{\mathit{news}}(n, g)}{1 - \mathit{agerate}_{\mathit{news}}(n, g)}. \tag{4}$$
Additionally, for the labeled news article $n$, the $\mathit{agescore}_{\mathit{news}}(n, g)$ values of all five age groups were obtained and used as a multi-label target, which was defined as
$$\mathit{agescore}_{\mathit{news}}(n) = \begin{bmatrix} \mathit{agescore}_{\mathit{news}}(n, 10\text{s}) \\ \vdots \\ \mathit{agescore}_{\mathit{news}}(n, {\geq}50\text{s}) \end{bmatrix}. \tag{5}$$
Then, the multi-label target, $\mathit{agescore}_{\mathit{news}}(n)$, was represented by $\mathit{news2vec}(n)$ as features. Finally, two types of datasets were prepared for this study: (i) the labeled dataset, which contained 15,080 instances with the news2vec vectors of 300 dimensions as features and a multi-label target, and (ii) the unlabeled dataset, which had unlabeled news articles with only news2vec vectors as features.
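The two averaging steps and the log-odds target can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the three-dimensional toy vectors stand in for the 300-dimensional word2vec embeddings, whitespace splitting stands in for the unigram extraction, and the names `UNIGRAM2VEC`, `mean_vector`, `comment2vec`, `news2vec`, and `agescore` are hypothetical helpers introduced only for this example.

```python
import math

# Toy stand-in for trained word2vec embeddings (the study used 300 dimensions).
UNIGRAM2VEC = {
    "tax": [1.0, 0.0, 2.0],
    "cut": [3.0, 2.0, 0.0],
    "exam": [0.0, 4.0, 4.0],
}


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def comment2vec(comment):
    """Average the embeddings of the distinct unigrams in one comment."""
    unigrams = sorted(set(comment.split()))  # UNIGRAM(c) is treated as a set
    return mean_vector([UNIGRAM2VEC[u] for u in unigrams])


def news2vec(comments):
    """Average the comment vectors of one news article."""
    return mean_vector([comment2vec(c) for c in comments])


def agescore(rate):
    """Log odds of an age-group rate strictly between 0 and 1."""
    return math.log(rate / (1.0 - rate))


article_comments = ["tax cut", "exam"]
features = news2vec(article_comments)

# Hypothetical age-group rates for 10s, 20s, 30s, 40s, and >=50s.
rates = [0.05, 0.15, 0.30, 0.30, 0.20]
target = [agescore(r) for r in rates]
```

With these toy values, the comment "tax cut" averages to [2.0, 1.0, 1.0], and the article vector `features` is the mean of that vector and the "exam" vector. A rate of exactly 0 or 1 has no finite log odds, so a real pipeline would need to decide how to handle such cases.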

2.3. Predict Age Information of the Unlabeled Dataset

This study used a machine learning approach to obtain an age predictor from the labeled dataset and used it to predict the age group distribution of a news article’s commenters in the unlabeled dataset. As described in Figure 3, the approach consists of four steps: (1) the same experiments were performed for different prediction techniques; (2) prediction techniques were evaluated and the best one was identified; (3) the predictor for the unlabeled dataset was trained using the best prediction technique; and (4) using the trained predictor, the multi-label targets of the unlabeled dataset were predicted and normalized to unity. Details are explained in the subsections below.

2.3.1. Perform Experiments of Age Predictions with Labeled Datasets

This study’s multi-label target, $\mathit{agescore}_{\mathit{news}}(n)$, was a numerical vector, so it considered machine learning techniques for regression that had been commonly used in prior studies. The six regression models are: Multiple Linear Regression (MLR), Neural Network Regression (NNR) (in this study, NNR is a prediction technique for regression using MLP), Decision Tree Regression (DTR), Support Vector Regression (SVR), k-Nearest Neighbors Regression (k-NNR), and Random Forest Regression (RFR).
Moreover, this study required a regression model that could deal with a multi-label target (i.e., multi-output regression). Because only NNR and k-NNR among these six regression models could be used directly as prediction techniques for multi-label targets, this study adopted a strategy that fits one regressor per target. To this end, it applied the multi-output regression module of the machine learning package scikit-learn (https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression (accessed on 14 November 2023)) to the six prediction techniques. Consequently, for predicting $\mathit{agescore}_{\mathit{news}}(n)$, this study considered eight prediction techniques in total, which can be grouped into two categories: (1) single output predictors, denoted as NNRsingle and k-NNRsingle, and (2) multiple output predictors, denoted as MLRmulti, NNRmulti, DTRmulti, SVRmulti, k-NNRmulti, and RFRmulti.
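The one-regressor-per-target strategy can be illustrated without any library. The sketch below is not scikit-learn's implementation and not the study's code: `NearestNeighborRegressor` and `PerTargetRegressor` are hypothetical classes that only mimic the fit/predict pattern, with a toy 1-nearest-neighbor learner standing in for a real single-output technique, and the training values are invented for the example.

```python
class NearestNeighborRegressor:
    """Single-output 1-nearest-neighbor regressor (toy base learner)."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        predictions = []
        for x in X:
            nearest = min(range(len(self.X_)),
                          key=lambda i: squared_distance(x, self.X_[i]))
            predictions.append(self.y_[nearest])
        return predictions


class PerTargetRegressor:
    """Fit one independent single-output regressor per target column."""

    def __init__(self, make_regressor):
        self.make_regressor = make_regressor

    def fit(self, X, Y):
        columns = list(zip(*Y))  # one tuple of values per target
        self.models_ = [self.make_regressor().fit(X, col) for col in columns]
        return self

    def predict(self, X):
        per_target = [model.predict(X) for model in self.models_]
        return [list(row) for row in zip(*per_target)]


# Two features per article; five targets (one log-odds score per age group).
X_train = [[0.0, 0.0], [1.0, 1.0]]
Y_train = [[-3.0, -2.0, -1.0, -0.5, -1.5],
           [-2.5, -1.0, -0.8, -1.2, -2.0]]
model = PerTargetRegressor(NearestNeighborRegressor).fit(X_train, Y_train)
predicted = model.predict([[0.9, 0.8]])
```

Each of the five fitted models sees the same features but only its own target column, so the targets are predicted independently of one another. In scikit-learn, this wrapping role is played by `sklearn.multioutput.MultiOutputRegressor`.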
These eight prediction techniques were used to perform age prediction experiments with the labeled datasets after their hyperparameters were optimized via the grid search method, which used 10-fold cross-validation. As an experiment, it performed 10-fold cross-validation for each regressor, set by its optimized hyperparameters, and the same experiment was repeated 50 times. Here, a different random seed was used for experimental repetition, but the random seed was kept identical for the same iteration of different prediction techniques [2,32,33,34].
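The seeding scheme described above, with a different seed per repetition but the same seed for every technique within a repetition, can be sketched as follows. This is an assumption-laden illustration: the helper names `kfold_indices` and `repeated_cv_plan`, the rule `base_seed + r`, and the sample size of 200 are hypothetical and are not taken from the study.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Shuffle indices with a seed and deal them into n_splits test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_splits] for fold in range(n_splits)]


def repeated_cv_plan(n_samples, n_splits=10, n_repeats=50, base_seed=0):
    """Return {repeat: folds}; repetition r always uses seed base_seed + r."""
    return {r: kfold_indices(n_samples, n_splits, base_seed + r)
            for r in range(n_repeats)}


techniques = ["MLRmulti", "SVRmulti", "RFRmulti"]
# Every technique receives identical folds in a given repetition.
plans = {name: repeated_cv_plan(n_samples=200) for name in techniques}
```

Because the folds of repetition r are identical across techniques, the error measures of two techniques in the same repetition are computed on the same splits and can be compared pairwise.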

2.3.2. Evaluate the Performances of Prediction Techniques

To find the best prediction technique, the eight techniques were evaluated by measuring their mean absolute error (MAE) and root mean square error (RMSE) from the experimental results. These performance measures have been previously adopted when target variables had continuous values, and their values are smaller if the prediction error is lower, indicating better performance [32,34]. Hence, the effect of using a prediction technique on both performance measures was statistically investigated, using pairwise t tests between different prediction techniques. Thereafter, the best prediction technique was identified by referring to the results of the pairwise t tests.
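The two error measures and a t statistic for comparing two techniques can be written in a few lines. This is a minimal sketch with invented numbers: the function names are hypothetical, and a paired form of the t test is assumed here because the experiments use matched random seeds, whereas the text only states that pairwise t tests were run.

```python
import math
import statistics


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return statistics.mean(differences) / standard_error


y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 2.0, 2.0, 1.0]
error_mae = mae(y_true, y_pred)    # (0 + 1 + 0 + 2) / 4
error_rmse = rmse(y_true, y_pred)  # sqrt((0 + 1 + 0 + 4) / 4)

# Hypothetical MAE values of two techniques over the same four repetitions.
mae_technique_a = [0.30, 0.32, 0.31, 0.33]
mae_technique_b = [0.28, 0.29, 0.30, 0.29]
t_value = paired_t_statistic(mae_technique_a, mae_technique_b)
```

A positive `t_value` here means technique A has the larger error on average; turning the statistic into a p-value requires the t distribution with one fewer degree of freedom than the number of repetitions.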

2.3.3. Train the Age Predictor

In this step, the best prediction technique, identified in the previous subsection, was used to train a prediction model from the labeled dataset (i.e., the age predictor). Then, the multi-label target of each news article, $\mathit{agescore}_{\mathit{news}}(n)$, in the labeled dataset was predicted using the learned age predictor. The predicted $\mathit{agescore}_{\mathit{news}}(n, g)$ values of the estimated multi-label target were transformed into the $\mathit{agerate}_{\mathit{news}}(n, g)$ values of the labeled news article through two substeps: first, they were passed through the inverse of Equation (4), and second, they were normalized to unity (i.e., so that the sum across all age groups of a news article $n$ becomes 1). Lastly, the obtained $\mathit{agerate}_{\mathit{news}}(n, g)$ values were compared with the corresponding true values of the labeled dataset.
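The two substeps, inverting the log odds and normalizing to unity, can be sketched as follows. The function names and the predicted scores are hypothetical and serve only to illustrate the transformation.

```python
import math


def inverse_log_odds(score):
    """Map a log-odds score back to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def to_age_rates(scores):
    """Invert each log-odds score, then rescale so the rates sum to 1."""
    raw = [inverse_log_odds(s) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical predicted scores for 10s, 20s, 30s, 40s, and >=50s.
predicted_scores = [-3.0, -1.5, -0.8, -0.9, -1.2]
rates = to_age_rates(predicted_scores)
```

The normalization is needed because the five scores are predicted separately, so their inverted rates are each valid but do not necessarily sum to 1.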

2.3.4. Predict and Explore the Age Information of Unlabeled Datasets

Here, the $\mathit{agerate}_{\mathit{news}}(n, g)$ values of age groups for a news article in the unlabeled dataset, $\mathit{NEWS}_{\mathit{Unlabeled}}$, were obtained using the learned age predictor. In this way, this study could deal with the unknown age information problem of anonymous commenters on the collected news articles, which enabled age difference analysis of all the collected news articles, $\mathit{NEWS}_{\mathit{Total}}$.
The prediction results for the unlabeled dataset could not be evaluated by the MAE or RMSE because true target values were unavailable in the unlabeled dataset. Instead, this study explored the prediction results and compared them with the labeled and mixed datasets. Here, the mixed dataset was obtained by combining the labeled and unlabeled datasets. Figure 4 illustrates the three types of datasets.
Specifically, the predicted $\mathit{agerate}_{\mathit{news}}(n, g)$ values of the unlabeled news articles were visualized by drawing a histogram, which was compared with the distributions from the labeled and mixed datasets. Next, to investigate differences between these three datasets, this study measured their descriptive statistics, such as the mean and standard deviation, together with their normality, and compared them across the three datasets.
Additionally, to compare the three datasets, an investigation was conducted on sections. For this sectional investigation, the $\mathit{agerate}_{\mathit{news}}(n, g)$ values were averaged for news articles belonging to a section $S$, i.e., $n \in \mathit{NEWS}_{\mathit{Section}}(S)$, which was defined as
$$\mathit{agerate}_{\mathit{section}}(S, g) = \frac{1}{O(S)} \sum_{n \in \mathit{NEWS}_{\mathit{Section}}(S)} \mathit{agerate}_{\mathit{news}}(n, g), \tag{6}$$
where $O(S) = n(\mathit{NEWS}_{\mathit{Section}}(S))$. The statistic for the investigation with respect to subsections was given by
$$\mathit{agerate}_{\mathit{subsection}}(s, g) = \frac{1}{P(s)} \sum_{n \in \mathit{NEWS}_{\mathit{Subsection}}(s)} \mathit{agerate}_{\mathit{news}}(n, g), \tag{7}$$
where $s$ is a subsection, $\mathit{NEWS}_{\mathit{Subsection}}(s)$ is a set of news articles belonging to the subsection $s$, i.e., $n \in \mathit{NEWS}_{\mathit{Subsection}}(s)$, and $P(s) = n(\mathit{NEWS}_{\mathit{Subsection}}(s))$.
Then, using the $\mathit{agerate}_{\mathit{subsection}}(s, g)$ values, the rank of age group $g$ across all five age groups within subsection $s$ was measured for the labeled and mixed datasets. This resulted in two metrics: $\mathit{rank}_{\mathit{labeled}}(s, g)$ and $\mathit{rank}_{\mathit{mixed}}(s, g)$. Subsequently, these two ranks for subsection $s$ and age group $g$ were considered as data points $(x, y)$ and displayed via a scatter diagram to investigate the difference between the labeled and mixed datasets.
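The subsection-level averaging and the ranking of age groups within a subsection can be sketched as below. The rates are invented, the helper names are hypothetical, and the convention that rank 1 denotes the highest rate is an assumption, since the text does not state the ranking direction.

```python
AGE_GROUPS = ["10s", "20s", "30s", "40s", ">=50s"]


def mean_rates(articles):
    """Average the age-group rates over a list of articles (one row each)."""
    count = len(articles)
    return [sum(column) / count for column in zip(*articles)]


def rank_age_groups(rates):
    """Rank 1 = highest rate (assumed); ties are broken by age-group order."""
    order = sorted(range(len(rates)), key=lambda i: -rates[i])
    ranks = [0] * len(rates)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Hypothetical rates for two articles in one subsection.
subsection_articles = [
    [0.02, 0.10, 0.28, 0.35, 0.25],
    [0.04, 0.20, 0.30, 0.27, 0.19],
]
subsection_rates = mean_rates(subsection_articles)
ranks = dict(zip(AGE_GROUPS, rank_age_groups(subsection_rates)))
```

Computing `ranks` once from the labeled articles and once from the mixed articles of the same subsection gives the two coordinates of one point per age group in the scatter diagram.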

2.4. Discover Fuzzy Age Differences from Anonymous News Comments

In the previous subsection, the unlabeled news articles could be labeled using predictions. In this component, the fuzzy differences between age groups were analyzed with respect to topics of interest by using a mixed dataset, in which all the collected news articles had labels that were collected or predicted. Figure 5 summarizes the overall steps to discover fuzzy differences between age groups through anonymous news comments.

2.4.1. Represent Topics of Interest for Age Groups

To represent topics of interest for age groups, first the topic of interest for an age group at the news level was represented by using the $\mathit{agerate}_{\mathit{news}}(n, g)$ values of the mixed dataset, as given by
$$\mathit{topic}_{\mathit{news}}(g) = \begin{bmatrix} \mathit{agerate}_{\mathit{news}}(n_1, g) \\ \vdots \\ \mathit{agerate}_{\mathit{news}}(n_M, g) \end{bmatrix}, \tag{8}$$
where $n_i$ is a news article in $\mathit{NEWS}_{\mathit{Total}}$ and $M$ is the number of news articles in $\mathit{NEWS}_{\mathit{Total}}$, i.e., $M = n(\mathit{NEWS}_{\mathit{Total}})$. Moreover, using the section information of the collected news articles, topics of interest for an age group at the section level were defined as
$$\mathit{topic}_{\mathit{section}}(S, g) = \begin{bmatrix} \mathit{agerate}_{\mathit{news}}(n_1, g) \\ \vdots \\ \mathit{agerate}_{\mathit{news}}(n_{O(S)}, g) \end{bmatrix}, \tag{9}$$
where $n_i \in \mathit{NEWS}_{\mathit{Section}}(S)$. Similarly, topics of interest for an age group at the subsection level were given by
$$\mathit{topic}_{\mathit{subsection}}(s, g) = \begin{bmatrix} \mathit{agerate}_{\mathit{news}}(n_1, g) \\ \vdots \\ \mathit{agerate}_{\mathit{news}}(n_{P(s)}, g) \end{bmatrix}, \tag{10}$$
where $n_i \in \mathit{NEWS}_{\mathit{Subsection}}(s)$.
These representations of an age group’s topics of interest were used to analyze the differences between the five age groups of anonymous news commenters. The age difference analysis was conducted at various levels, from news articles to subsections, and led to a correlation–similarity matrix after recognizing similarities and correlations between the age groups. Then, based on the results of the analyses, the degree of uncertainty of the age difference was quantified by adopting the concept of fuzzy sets.
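As a rough illustration of these representations, the following Python sketch builds the news-, section-, and subsection-level topic vectors by filtering a list of articles. The record layout, the function name `topic_vector`, and the toy values are hypothetical; only two age groups are shown for brevity.

```python
def topic_vector(articles, age_group, section=None, subsection=None):
    """Collect agerate_news(n, g) for one age group over the selected articles.

    With no filter the result corresponds to topic_news(g); with `section` or
    `subsection` set it corresponds to the section- or subsection-level vector.
    The article order is kept fixed so that vectors of different age groups
    stay aligned element by element.
    """
    selected = [
        a for a in articles
        if (section is None or a["section"] == section)
        and (subsection is None or a["subsection"] == subsection)
    ]
    return [a["agerate"][age_group] for a in selected]

# Toy articles (hypothetical values and partly hypothetical subsection names).
articles = [
    {"section": "Politics", "subsection": "National Assembly",
     "agerate": {"20s": 0.10, "40s": 0.40}},
    {"section": "Economy", "subsection": "Real Estate",
     "agerate": {"20s": 0.30, "40s": 0.20}},
    {"section": "Economy", "subsection": "Finance",
     "agerate": {"20s": 0.25, "40s": 0.35}},
]

topic_news_20s = topic_vector(articles, "20s")
topic_economy_20s = topic_vector(articles, "20s", section="Economy")
topic_real_estate_40s = topic_vector(articles, "40s", subsection="Real Estate")
```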

2.4.2. Use Similarities between Age Groups for Age Difference Analysis

In this step, the age groups were first paired, resulting in ${}_5C_2 = 10$ age group pairs, and each age group pair was used to represent the relationship between its two age groups. Then, the cosine similarity of each age group pair was measured to investigate similarities between the two age groups. Specifically, the similarity between two age groups $g_i$ and $g_j$ was defined as
$$similarity_{news}(g_i, g_j) = \frac{topic_{news}(g_i) \cdot topic_{news}(g_j)}{\lVert topic_{news}(g_i) \rVert \, \lVert topic_{news}(g_j) \rVert}.$$
Next, the similarity between the two age groups $g_i$ and $g_j$ was measured at the section and subsection levels as
$$similarity_{section}(S, g_i, g_j) = \frac{topic_{section}(S, g_i) \cdot topic_{section}(S, g_j)}{\lVert topic_{section}(S, g_i) \rVert \, \lVert topic_{section}(S, g_j) \rVert},$$
$$similarity_{subsection}(s, g_i, g_j) = \frac{topic_{subsection}(s, g_i) \cdot topic_{subsection}(s, g_j)}{\lVert topic_{subsection}(s, g_i) \rVert \, \lVert topic_{subsection}(s, g_j) \rVert}.$$
Using the $similarity_{news}(g_i, g_j)$ values, the most and least similar age group pairs were identified and used to examine the overall differences between age groups. Moreover, from the $similarity_{section}(S, g_i, g_j)$ values, the similarity within a section was investigated for each age group pair and compared with other sections. In terms of subsections, the top 5 and bottom 5 subsections were explored for each age group pair based on the $similarity_{subsection}(s, g_i, g_j)$ values.
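The pairing and cosine similarity computation can be sketched in plain Python as follows. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not the code used in the study.

```python
import math
from itertools import combinations

AGE_GROUPS = ["10s", "20s", "30s", "40s", "≥50s"]

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long topic vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# All unordered age group pairs: 5 choose 2 = 10.
pairs = list(combinations(AGE_GROUPS, 2))

# Toy topic vectors over three articles (hypothetical values).
topic_20s = [0.30, 0.20, 0.10]
topic_30s = [0.10, 0.20, 0.30]
sim = cosine_similarity(topic_20s, topic_30s)
print(len(pairs), round(sim, 4))
```

Because the age rates are non-negative, the cosine similarity of two topic vectors lies between 0 and 1, so values close to 1 indicate age groups whose commenting shares are distributed over articles in nearly the same proportions.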

2.4.3. Use Correlations between Age Groups for Age Difference Analysis

Correlations between age groups were assessed using the Pearson correlation coefficient. As with the similarity, the correlation coefficient between the two age groups in each age group pair was obtained, i.e., $correlation(topic(g_i), topic(g_j))$. For the different elements, i.e., news, sections, and subsections, three types of correlation coefficients were obtained as
$$correlation_{news}(g_i, g_j) = correlation(topic_{news}(g_i), topic_{news}(g_j)),$$
$$correlation_{section}(S, g_i, g_j) = correlation(topic_{section}(S, g_i), topic_{section}(S, g_j)),$$
$$correlation_{subsection}(s, g_i, g_j) = correlation(topic_{subsection}(s, g_i), topic_{subsection}(s, g_j)).$$
Based on the $correlation_{news}(g_i, g_j)$ values, the most and least correlated age groups were identified. Similarly, the most and least correlated sections for each age group pair were analyzed using $correlation_{section}(S, g_i, g_j)$, and both the top and bottom 5 subsections for each age group pair were investigated in terms of $correlation_{subsection}(s, g_i, g_j)$.
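A minimal sketch of the Pearson correlation between two topic vectors is given below, written in plain Python with a hypothetical helper name `pearson` and toy values. The example also hints at why correlation complements cosine similarity: the two toy vectors have a fairly high cosine similarity (about 0.71, since all entries are positive) but a perfectly negative correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy topic vectors over three articles (hypothetical values).
topic_gi = [0.30, 0.20, 0.10]
topic_gj = [0.10, 0.20, 0.30]
corr = pearson(topic_gi, topic_gj)
print(round(corr, 4))
```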

2.4.4. Integrate Both Similarity and Correlation for Age Difference Analysis

This step used both the similarity and correlation perspectives within a subsection to evaluate an age group pair by building a correlation–similarity matrix, modified from business portfolio models. First, each age group pair was mapped onto the x-y plane according to its similarity and correlation values within a subsection. For example, as shown in Figure 6, a pair of two age groups $(g_i, g_j)$ within a subsection $s$ was positioned as a blue data point in the x-y plane by setting $similarity_{subsection}(s, g_i, g_j)$ on the x-axis and $correlation_{subsection}(s, g_i, g_j)$ on the y-axis.
Next, the boundary values that divide the ranges of the $similarity_{subsection}(s, g_i, g_j)$ and $correlation_{subsection}(s, g_i, g_j)$ values into high and low categories, i.e., $\overline{correlation}$ and $\overline{similarity}$, were calculated by averaging the similarities and correlations over all age group pairs. Thereafter, using the $\overline{correlation}$ and $\overline{similarity}$ values, the x-y plane in the matrix was divided into four areas, which were classified into three types, as shown in Figure 6: (i) area ① contains subsections considered as topics with relatively low differences between age groups; (ii) area ② contains subsections whose differences between age groups are neither relatively high nor relatively low; and (iii) area ③ contains subsections considered as topics with relatively high differences between age groups.
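The division of the plane can be sketched as follows. This is only an illustration under stated assumptions: the mapping of quadrants to area types (high similarity and high correlation as area ①, low similarity and low correlation as area ③, and the two mixed quadrants as area ②) is inferred from the description rather than read from Figure 6, and treating values equal to a boundary as "high" is an arbitrary choice. The function names and toy points are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def classify_area(similarity, correlation, sim_boundary, corr_boundary):
    """Return the assumed area type (1, 2, or 3) of one (similarity, correlation) point."""
    high_sim = similarity >= sim_boundary
    high_corr = correlation >= corr_boundary
    if high_sim and high_corr:
        return 1  # relatively low age difference
    if not high_sim and not high_corr:
        return 3  # relatively high age difference
    return 2      # neither relatively high nor relatively low

# Toy (similarity, correlation) points, one per (age group pair, subsection).
points = [(0.9, 0.8), (0.5, -0.4), (0.8, -0.1), (0.4, 0.5)]

# Boundary values: averages over all points, as described in the text.
sim_boundary = mean([s for s, _ in points])
corr_boundary = mean([c for _, c in points])

areas = [classify_area(s, c, sim_boundary, corr_boundary) for s, c in points]
print(areas)
```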

2.4.5. Quantify the Uncertain Degree of Age Difference with Fuzzy Sets

In the proposed correlation–similarity matrix, the subsections for an age group pair could be spread over more than one area, which made it difficult to clearly define the difference between the two age groups in the pair as a single one of the three area types. To surmount such uncertainty in the relationship between two age groups, this study adopted the concept of a fuzzy set and accordingly defined three fuzzy sets of age difference: (i) $D_{High}$, which included the age group pairs related to subsections in area ③; (ii) $D_{Middle}$, whose elements comprised the age group pairs related to subsections in area ②; and (iii) $D_{Low}$, which contained the age group pairs of subsections in area ①.
In addition, the membership value of an age group pair in a fuzzy set $D_q$, $\mu_{D_q}(g_i, g_j)$, was obtained by calculating the ratio of the pair's subsections that fall in the fuzzy set's area. For example, as illustrated in Figure 6, if 36 of the 48 subsections of an age group pair $(g_i, g_j)$ are positioned in area ③, then the ratio 36/48 is the degree of membership of the age group pair in the fuzzy set $D_{High}$, i.e., $\mu_{D_{High}}(g_i, g_j) = 0.7500$. Likewise, the degree of membership of the age group pair in the other fuzzy sets can be calculated as 4/48 for $D_{Middle}$, i.e., $\mu_{D_{Middle}}(g_i, g_j) = 0.0833$, and 8/48 for $D_{Low}$, i.e., $\mu_{D_{Low}}(g_i, g_j) = 0.1667$. Thus, the degree of membership of an age group pair in a fuzzy set was interpreted as the probability that the corresponding level of age difference exists, and the uncertainty of differences between age groups could be quantified.
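The worked example above (36, 4, and 8 of 48 subsections) can be reproduced with a short Python sketch. The function name `membership` and the encoding of areas ①, ②, and ③ as the integers 1, 2, and 3 are assumptions made for illustration.

```python
from collections import Counter

# Assumed encoding: area 1 -> D_Low, area 2 -> D_Middle, area 3 -> D_High.
FUZZY_SETS = {1: "D_Low", 2: "D_Middle", 3: "D_High"}

def membership(areas):
    """Membership of one age group pair in each fuzzy set, computed as the
    share of its subsections that fall in the corresponding area."""
    if not areas:
        raise ValueError("at least one subsection is required")
    counts = Counter(areas)
    total = len(areas)
    return {name: counts.get(area, 0) / total for area, name in FUZZY_SETS.items()}

# Area labels of the 48 subsections of one pair, as in the example in the text.
areas = [3] * 36 + [2] * 4 + [1] * 8
mu = membership(areas)
print({name: round(value, 4) for name, value in mu.items()})
```

By construction the three membership values of a pair sum to 1, which is consistent with reading them as shares of subsections.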

3. Results and Discussions

3.1. Evaluation of Results of Prediction Techniques

For each prediction technique, a 10-fold cross-validation using the labeled dataset was repeated 50 times, and the experimental results were evaluated using two performance measures: the mean absolute error (MAE) and the root mean squared error (RMSE). Table 5 presents the descriptive statistics of the evaluation results of the repeated experiments and shows that SVRmulti was the best among the eight prediction techniques with respect to both performance metrics.
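For reference, the two error measures can be computed for multi-label targets as in the following sketch. How the errors were aggregated over the five age group labels is not stated here, so pooling all article-label entries before averaging is an assumption, as are the function names and toy values.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error pooled over all (article, label) entries."""
    errors = [abs(t - p)
              for row_t, row_p in zip(y_true, y_pred)
              for t, p in zip(row_t, row_p)]
    return sum(errors) / len(errors)

def rmse(y_true, y_pred):
    """Root mean squared error pooled over all (article, label) entries."""
    squared = [(t - p) ** 2
               for row_t, row_p in zip(y_true, y_pred)
               for t, p in zip(row_t, row_p)]
    return math.sqrt(sum(squared) / len(squared))

# Toy targets: two articles, two labels each (hypothetical values).
y_true = [[0.1, 0.9], [0.5, 0.5]]
y_pred = [[0.2, 0.8], [0.5, 0.3]]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(round(mae_value, 4), round(rmse_value, 4))
```

RMSE is never smaller than MAE on the same errors and penalizes large individual errors more strongly, which is one reason for reporting both measures.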
In addition, Table 6 shows the results of pairwise t tests between the eight prediction techniques, performed to statistically investigate the effect of using a prediction technique on performance measures. The hypotheses in Table 6 were generated to represent comparisons between different prediction techniques. For example, a hypothesis “A > B” implies that the prediction technique A has a bigger prediction error than B; therefore, B is a better prediction technique.
The comparison results in Table 6 were interpreted and used to determine whether the corresponding hypotheses were supported by the repeated experiments in terms of both performance measures. For example, the first hypothesis in Table 6, i.e., “NNRsingle > k-NNRsingle”, is supported by the repeated experiments because its t-value > 0 and p-value < α, which implies that NNRsingle has a bigger prediction error; therefore, k-NNRsingle is a better prediction technique. On the other hand, if t-value < 0 and p-value < α, the opposite hypothesis, i.e., “NNRsingle < k-NNRsingle”, is supported, which indicates that NNRsingle is better.
Consequently, based on the performance differences that were found to be statistically significant at the α = 0.05 level, the eight prediction techniques could be arranged in descending order of either MAE or RMSE, as “MLRmulti > DTRmulti > NNRsingle > k-NNRmulti > k-NNRsingle = NNRmulti = RFRmulti > SVRmulti”. Additionally, given that the smallest values of the performance metrics indicate the best prediction technique, it is statistically evident from Table 6 that SVRmulti was the best prediction technique for this study. Moreover, all the hypotheses related to SVRmulti supported the superiority of SVRmulti, as highlighted in red font in Table 6. Therefore, SVRmulti was selected to train an age predictor with the labeled dataset to predict the age group distribution of anonymous commenters on unlabeled news articles.

3.2. Results of Age Prediction for Labeled Datasets and Comparison

In this step, the best prediction technique identified in the previous subsection was used to train a prediction model from the labeled dataset, i.e., the age predictor. Thereafter, the multi-label target of each news article in the labeled dataset, $agescore_{news}(n)$, was predicted by using the age predictor, and its $agescore_{news}(n, g)$ values were transformed into the $agerate_{news}(n, g)$ values of the labeled news article.
Using SVRmulti as the best prediction technique, the age group distribution of commenters was predicted for news articles in the labeled dataset. Figure 7 describes the distribution of the $agerate_{news}(n, g)$ values obtained from the $agescore_{news}(n, g)$ values predicted by the learned SVRmulti. When compared, the predicted $agerate_{news}(n, g)$ values for the labeled news articles were shown to have almost the same distribution as their corresponding true $agerate_{news}(n, g)$ values.

3.3. Results of Predicting Age Information for Unlabeled Datasets and Exploration

The unknown $agerate_{news}(n, g)$ values for news articles in the unlabeled dataset were estimated using the age predictor, which was learned from the labeled dataset with the selected best prediction technique, i.e., SVRmulti. Figure 8 shows the distribution of the predicted $agerate_{news}(n, g)$ values for the unlabeled news articles and its comparison with the labeled and mixed datasets. Table 7 shows more detailed comparison results across the three types of $agerate_{news}(n, g)$ values. Figure 8 implies that the age group distributions of the unlabeled and labeled news articles are different. Moreover, when they were mixed, the distribution of the mixed values was more like that of the predicted values of the unlabeled news articles than that of the true values of the labeled news articles.
Table 8 contains descriptive statistics of the three types of $agerate_{section}(S, g)$ values. Furthermore, the Z-test results indicated whether the differences between the true values of the labeled dataset and the values of the mixed dataset were statistically significant when they were compared in terms of sections. Only people in their 40s in the “IT/Science” section had an insignificant Z-test result, indicating that in this case the labeled and mixed datasets statistically had the same population mean, as highlighted in red font in Table 8.
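The exact form of the Z-test is not specified here, so the following is only a sketch of one common variant: a two-sided, two-sample Z-test for a difference in means that uses sample variances and treats the two samples as independent. That independence is an assumption that may not hold when one dataset contains the other. The function name and toy values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z_test(sample_a, sample_b):
    """Two-sided Z-test for equal means; returns (z, p_value).

    Assumes independent samples that are large enough for the normal
    approximation to be reasonable.
    """
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (mean(sample_a) - mean(sample_b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Toy age rates of one age group in one section (hypothetical values).
labeled = [0.30, 0.32, 0.31, 0.29, 0.33]
mixed = [0.20, 0.22, 0.21, 0.19, 0.23]

z, p_value = two_sample_z_test(labeled, mixed)
significant = p_value < 0.05
```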
In addition, Figure 9 shows scatter diagrams of the relationships between the labeled and mixed datasets when the proposed ranks, $rank_{labeled}(s, g)$ and $rank_{mixed}(s, g)$, were used to compare the two datasets. Taken together, the scatter diagrams indicated that although Table 8 showed that differences existed between the labeled and mixed datasets, similarities could be found between the two datasets when the proposed ranks were used for comparing them. Thus, the age prediction results for the unlabeled and labeled datasets could be considered similar from a different perspective (i.e., other than ordinary statistical tests).

3.4. Results of Fuzzy Age Difference Analysis

3.4.1. On the Similarities between Age Groups

Figure 10 describes the obtained similarities between age groups, i.e., $similarity_{news}(g_i, g_j)$, and shows that each age group was most similar to its closest age group (i.e., 20s were similar to 10s, 30s to 20s, 20s to 30s, 30s to 40s, and 40s to ≥50s).
Table 9 shows similarities within a section measured for each age group pair, with their maximum and minimum values for similarities highlighted in red and blue fonts, respectively. For example, (10s, 20s) was the most similar in the “World” section, but least similar in the “Economy” section. Overall, (20s, 30s) had the greatest similarity in the “World” section, while (10s, ≥50s) had the lowest similarity in the “Economy” section. The largest and smallest values of all age group pairs are marked in bold in Table 9.
The similarities in Table 9 were investigated within each section and they demonstrated that four sections, “Politics,” “Lifestyle/Culture,” “World,” and “IT/Science” had the greatest similarity in (20s, 30s). In comparison, the other two sections “Economy” and “Society” had the greatest similarity in (30s, 40s). However, all sections had the lowest similarity in (10s, ≥50s). These results indicate that age differences in terms of topics of interest were small in (20s, 30s) and (30s, 40s), whereas they were large in (10s, ≥50s).
A more thorough analysis of the similarity of topics of interest between age groups was performed using the obtained $similarity_{subsection}(s, g_i, g_j)$ values. Its results are shown in Supplementary A. To summarize the results, both the top and bottom five subsections for $similarity_{subsection}(s, g_i, g_j)$ were identified for each age group pair, as shown in Table 10. The data in Table 10 can also help to examine and discuss the more detailed reasons for the similarities and differences in topics of interest revealed between the age groups. For example, according to Table 9, overall, the greatest similarity in topics of interest existed in (20s, 30s) in the “World” section, but a closer look at Table 10 revealed that the two age groups had the greatest similarity in the “Weather” subsection of the “Lifestyle/Culture” section. Moreover, Table 9 shows that overall, (10s, ≥50s) showed the lowest similarity relating to topics of interest in the “Economy” section. However, based on Table 10, the “Society General” subsection of the “Society” section had the lowest similarity relating to topics of interest for (10s, ≥50s).

3.4.2. On the Correlations between Age Groups

Regarding correlations between age groups, Figure 11 shows the overall correlations between age groups based on the measured $correlation_{news}(g_i, g_j)$ values. Overall, the correlation between 10s and 20s was the largest, whereas 30s and ≥50s showed the smallest correlation. Considering all age group pairs, the most correlated age group for each age group was as follows: 20s to 10s, 10s to 20s, 40s to 30s, ≥50s to 40s, and 40s to ≥50s. This means that the correlation between close age groups was the largest, which agrees with the results of Figure 10, where the similarity between close age groups was the highest.
Furthermore, Figure 12 shows the correlation within a section measured for each age group pair. From Figure 12, it can be seen that with respect to common topics of interest between age groups, differences in correlation existed between them. The results in all sections are as follows. Overall, (10s, 20s) showed the greatest positive correlations in all sections. Among them, the correlation was the largest in the “Society” section. Similarly, all sections showed positive correlations for (20s, 30s), although the correlations were smaller than those for (10s, 20s). Contrastingly, overall, it was shown that (30s, ≥50s) had the greatest negative correlations across all sections; in particular, in the “Politics” section, (30s, ≥50s) showed the lowest correlation. These findings, based on the $correlation_{section}(S, g_i, g_j)$ values, imply that the age difference in terms of topic correlations is small for (10s, 20s) and (20s, 30s), but large in (30s, ≥50s). Thus, it is also expected that the above identified small or large topic similarities between age groups could have caused more conflicts and social problems between different age groups.
In addition, the tables in Supplementary B show the results when the correlation between age groups was analyzed more carefully by obtaining and using the $correlation_{subsection}(s, g_i, g_j)$ values. The results can be condensed as shown in Table 11, by listing the identified top five and bottom five subsections in terms of $correlation_{subsection}(s, g_i, g_j)$ for each age group pair. Table 11 reveals more detailed reasons for the correlations regarding topics of interest between age groups that can be examined and discussed. For example, the correlation between 10s and 20s in the “Society” section was the largest of all, as revealed in Figure 12, but when investigated more closely, it was seen in Table 11 that the two age groups had the largest correlation in the “Real Estate” subsection, belonging to the “Economy” section. Moreover, (30s, ≥50s) in the “Politics” section showed the lowest correlation among all age group pairs in Figure 12, and it was found from Table 11 that the corresponding subsection in the “Politics” section was the “President’s Office” subsection.

3.4.3. On the Quantification of Age Difference with Fuzzy Sets

Figure 13 illustrates a correlation–similarity matrix, which mapped each age group pair in a subsection using both similarity and correlation values. Moreover, using the correlation–similarity matrix, the uncertain differences between age groups could be roughly classified into three areas. For instance, Figure 13 shows that (10s, 20s) belonged to area ①, implying that they were relatively close to each other. Conversely, (20s, ≥50s) were classified as area ③, showing a relatively large age difference. Lastly, (10s, 30s) were classified into area ②, which indicates that their age difference was neither relatively high nor low.
Using the results of the correlation–similarity matrix, this study compiled Table 12, which shows the membership of an age group pair in the three fuzzy sets of age differences. Using Table 12, the difference between the age groups could be measured by considering the uncertainty in their relationships, and the following results were obtained. In the case of (10s, 20s), the difference between them was not considered large, as they belonged to the fuzzy set $D_{Low}$ with a 100% degree of membership, i.e., $\mu_{D_{Low}}(10s, 20s) = 1.0$. However, referring to the large degree of membership in the fuzzy set $D_{High}$, it was found that 10s had clear differences from the other age groups, i.e., 40s and ≥50s. In addition, the results $\mu_{D_{Low}}(20s, 30s) = 1.0$ and $\mu_{D_{High}}(20s, \geq 50s) = 1.0$ show that there was little difference between the 20s and 30s, while there was a clear difference between the 20s and ≥50s. The difference between the 30s and ≥50s was evident, like that of the 20s and ≥50s, but no difference between the 40s and ≥50s was found, according to $\mu_{D_{Low}}(40s, \geq 50s) = 1.0$.

4. Conclusions

Social profiling research requires demographic information, of which age information is essential. Previous studies have revealed that differences exist in social and behavioral characteristics, according to age. It has also been found that such age differences are prominent on specific topics. Therefore, to acquire such age information for social profiling, prior studies have primarily focused on predicting age information on social media. However, age predictions on social media have been limited because it is difficult or impossible to obtain age information from social media owing to its anonymity and privacy policies. Moreover, these predictions have focused mainly on English and Chinese social media and are yet to be connected to studying human dynamics. In addition, for the data acquired in this study, it had not been clearly defined how age information can be represented and best predicted. To address these problems, this study proposed a method that predicts age information from naver.com after selecting the best prediction technique and used the predicted age information for difference analysis between age groups regarding topics of interest.
Regarding the machine learning prediction techniques, this study evaluated and compared eight models using the labeled dataset. Consequently, the multi-label regression of SVM, i.e., SVRmulti, gave the best results for age prediction with labeled data, showing better performance than the other prediction techniques in a statistically significant manner. Hence, it was used to predict the age information of the unlabeled data, which were combined with the labeled dataset, resulting in a mixed dataset. Subsequently, when general statistical approaches were employed to compare the two datasets, this study found differences between the labeled and mixed datasets. However, similarities existed between the two datasets when they were compared using a different method (i.e., ranks for an age group g within a subsection s). Thus, even though the age information of the labeled and mixed datasets could not be considered strictly identical, the age prediction results of the unlabeled dataset still need to be included and used for social profiling. Moreover, using the predicted age information, this study performed a fuzzy difference analysis between age groups with respect to their topics of interest, using the section information of the collected news articles. The highlights of the obtained difference analysis results are as follows. The measured similarities between age groups showed that the age difference in terms of topics of interest was small in (20s, 30s) and (30s, 40s), whereas it was large in (10s, ≥50s). In addition, the correlations between age groups showed that the difference in terms of topic correlations was small in (10s, 20s) and (20s, 30s), but large in (30s, ≥50s).
When uncertainty was considered using the correlation–similarity matrix and defining the fuzzy sets of age difference, the difference in (10s, 20s), (20s, 30s), and (40s, ≥50s) was shown to be small, while (10s, 40s), (10s, ≥50s), (20s, ≥50s) and (30s, ≥50s) showed obvious distinctions between age groups.
As such, this study was able to uncover differences in topics of interest between age groups which may have caused intergenerational conflicts and social problems between age groups in Korean society. However, it is necessary to be cautious in determining differences between generations based on the results of this study, and more in-depth studies should be conducted. This is because different conclusions can be reached depending on the type or quality of data acquired. In addition, this study assumed that people in the same age group have similar topics of interest due to the nature of the data used, but it should be remembered that even people in the same age group can have different thoughts and preferences. However, data on individual age information is difficult to acquire. Moreover, the topics of interest for an age group may change over time and vary according to different events. To overcome these limitations of this study, further research needs to be undertaken.
Furthermore, this study used word embeddings to represent the distribution of age groups across commenters on a news article. Future research can investigate methods to measure and use other features (e.g., psychological/cognitive and social/behavioral characteristics of news commenters as a group and their changes over time). The usefulness of such features for the age prediction of commenters on a news article can be examined because different features can result in different prediction performances. Additionally, future studies could explore a more sophisticated topic analysis for identifying topics in news articles and finding differences in mindsets between age groups. While this study used predicted age information of anonymous commenters on unlabeled news articles, a more advanced exploration of human dynamics using predicted age information needs to be undertaken in the future.
This study makes valuable contributions, summarized as follows: First, unlike previous works, it introduced naver.com, which contains anonymous news comments written mainly in Korean, and thus, it could extend social profiling literature in terms of language diversity. It can, therefore, be referred to as a new data provider for future age information-based social profiling studies on Korean social media. Second, it contributed to social profiling literature by providing age information-based social profiling studies with incomplete age information for anonymous social media. Moreover, the proposed machine learning approach can be applied to deal with the incompleteness of other social profile attributes and will provide new opportunities for social profiling studies. Third, by proposing a correlation–similarity matrix and fuzzy sets of age differences, this study was able to capture uncertain differences between age groups from the integrated viewpoint of similarities and correlations on their topics of interest. Eventually, it will be helpful for measuring and monitoring intergenerational conflicts and solving such problems for sustainable societies.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14020790/s1, Supplementary A. Similarities between Age Groups in Subsections. Supplementary B. Correlations between Age Groups in Subsections.

Funding

This study was supported by National Research Foundation of Korea Grant, funded by the Korean Government (No. 2021R1F1A1063681). The APC was funded by the Gyeongsang National University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from www.naver.com and are available from the author with the permission of www.naver.com.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Suh, J.H.; Park, C.H.; Jeon, S.H. Applying text and data mining techniques to forecasting the trend of petitions filed to e-People. Expert Syst. Appl. 2010, 37, 7255–7268. [Google Scholar] [CrossRef]
  2. Suh, J.H. Forecasting the daily outbreak of topic-level political risk from social media using hidden Markov model-based techniques. Technol. Forecast. Soc. Chang. 2015, 94, 115–132. [Google Scholar] [CrossRef]
  3. Bilal, M.; Gani, A.; Lali, M.I.U.; Marjani, M.; Malik, N. Social Profiling: A Review, Taxonomy, and Challenges. Cyberpsychol. Behav. Soc. Netw. 2019, 22, 433–450. [Google Scholar] [CrossRef] [PubMed]
  4. Suh, J.H. SocialTERM-Extractor: Identifying and Predicting Social-Problem-Specific Key Noun Terms from a Large Number of Online News Articles Using Text Mining and Machine Learning Techniques. Sustainability 2019, 11, 196. [Google Scholar] [CrossRef]
  5. Bazzaz Abkenar, S.; Haghi Kashani, M.; Mahdipour, E.; Jameii, S.M. Big data analytics meets social media: A systematic review of techniques, open issues, and future directions. Telemat. Inform. 2021, 57, 101517. [Google Scholar] [CrossRef] [PubMed]
  6. Hirt, R.; Kühl, N.; Satzger, G. Cognitive computing for customer profiling: Meta classification for gender prediction. Electron. Mark. 2019, 29, 93–106. [Google Scholar] [CrossRef]
  7. López-Santillán, R.; Montes-Y-Gómez, M.; González-Gurrola, L.C.; Ramírez-Alonso, G.; Prieto-Ordaz, O. Richer Document Embeddings for Author Profiling tasks based on a heuristic search. Inf. Process. Manag. 2020, 57, 102227. [Google Scholar] [CrossRef]
  8. Sawyer, S.M.; Afifi, R.A.; Bearinger, L.H.; Blakemore, S.-J.; Dick, B.; Ezeh, A.C.; Patton, G.C. Adolescence: A foundation for future health. Lancet 2012, 379, 1630–1640. [Google Scholar] [CrossRef]
  9. Utz, S.; Krämer, N.C. The privacy paradox on social network sites revisited: The role of individual characteristics and group norms. Cyberpsychol. J. Psychosoc. Res. Cyberspace 2009, 3, 2. [Google Scholar]
  10. Schwartz, H.A.; Eichstaedt, J.C.; Kern, M.L.; Dziurzynski, L.; Ramones, S.M.; Agrawal, M.; Shah, A.; Kosinski, M.; Stillwell, D.; Seligman, M.E.P.; et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE 2013, 8, e73791. [Google Scholar] [CrossRef]
  11. Guimarães, R.G.; Rosa, R.L.; Gaetano, D.D.; Rodríguez, D.Z.; Bressan, G. Age Groups Classification in Social Network Using Deep Learning. IEEE Access 2017, 5, 10805–10816.
  12. Huffaker, D.A.; Calvert, S.L. Gender, Identity, and Language Use in Teenage Blogs. J. Comput.-Mediat. Commun. 2005, 10, JCMC10211.
  13. Pempek, T.A.; Yermolayeva, Y.A.; Calvert, S.L. College students’ social networking experiences on Facebook. J. Appl. Dev. Psychol. 2009, 30, 227–238.
  14. Wu, C.; Wu, F.; Qi, T.; Liu, J.; Huang, Y.; Xie, X. Neural Gender Prediction in Microblogging with Emotion-aware User Representation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2401–2404.
  15. Figueroa, A.; Peralta, B.; Nicolis, O. Coming to Grips with Age Prediction on Imbalanced Multimodal Community Question Answering Data. Information 2021, 12, 48.
  16. Reddy, T.R.; Vardhan, B.V.; Reddy, P.V. N-Gram Approach for Gender Prediction. In Proceedings of the 2017 IEEE 7th International Advance Computing Conference (IACC), Hyderabad, India, 5–7 January 2017; pp. 860–865.
  17. Segalin, C.; Cheng, D.S.; Cristani, M. Social profiling through image understanding: Personality inference using convolutional neural networks. Comput. Vis. Image Underst. 2017, 156, 34–50.
  18. Chen, E.; Zeng, G.; Luo, P.; Zhu, H.; Tian, J.; Xiong, H. Discerning individual interests and shared interests for social user profiling. World Wide Web 2017, 20, 417–435.
  19. Azucar, D.; Marengo, D.; Settanni, M. Predicting the Big 5 personality traits from digital footprints on social media: A meta-analysis. Personal. Individ. Differ. 2018, 124, 150–159.
  20. Lima, A.C.E.S.; de Castro, L.N. A multi-label, semi-supervised classification approach applied to personality prediction in social media. Neural Netw. 2014, 58, 122–130.
  21. Wang, L.; Li, Q.; Chen, X.; Li, S. Multi-task Learning for Gender and Age Prediction on Chinese Microblog. In Proceedings of the Natural Language Understanding and Intelligent Applications: 5th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2016, and 24th International Conference on Computer Processing of Oriental Languages, ICCPOL 2016, Kunming, China, 2–6 December 2016; Springer: Cham, Switzerland, 2016; pp. 189–200.
  22. Wang, Q.; Ma, S.; Zhang, C. Predicting users’ demographic characteristics in a Chinese social media network. Electron. Libr. 2017, 35, 758–769.
  23. Chen, J.; Cheng, L.; Yang, X.; Liang, J.; Quan, B.; Li, S. Joint Learning with both Classification and Regression Models for Age Prediction. J. Phys. Conf. Ser. 2019, 1168, 032016.
  24. Lee, S.Y.; Ryu, M.H. Exploring characteristics of online news comments and commenters with machine learning approaches. Telemat. Inform. 2019, 43, 101249.
  25. Fang, J.; Yuan, Y.; Lu, X.; Feng, Y. Muti-stage learning for gender and age prediction. Neurocomputing 2019, 334, 114–124.
  26. Han, S.; Huang, H.; Tang, Y. Knowledge of words: An interpretable approach for personality recognition from social media. Knowl.-Based Syst. 2020, 194, 105550.
  27. Romanov, A.S.; Kurtukova, A.V.; Sobolev, A.A.; Shelupanov, A.A.; Fedotova, A.M. Determining the Age of the Author of the Text Based on Deep Neural Network Models. Information 2020, 11, 589.
  28. Kamalesh, M.D.; Bharathi, B. Personality prediction model for social media using machine learning Technique. Comput. Electr. Eng. 2022, 100, 107852.
  29. Khorrami, M.; Khorrami, M.; Farhangi, F. Evaluation of tree-based ensemble algorithms for predicting the big five personality traits based on social media photos: Evidence from an Iranian sample. Personal. Individ. Differ. 2022, 188, 111479.
  30. Zhou, L.; Zhang, Z.; Zhao, L.; Yang, P. Attention-based BiLSTM models for personality recognition from user-generated content. Inf. Sci. 2022, 596, 460–471.
  31. García-Díaz, J.A.; Cánovas-García, M.; Colomo-Palacios, R.; Valencia-García, R. Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Gener. Comput. Syst. 2021, 114, 506–518.
  32. Choi, B.; Suh, J.H. Forecasting Spare Parts Demand of Military Aircraft: Comparisons of Data Mining Techniques and Managerial Features from the Case of South Korea. Sustainability 2020, 12, 6045.
  33. Suh, J.H. Comparing writing style feature-based classification methods for estimating user reputations in social media. SpringerPlus 2016, 5, 261.
  34. Suh, J.H. Machine-Learning-Based Gender Distribution Prediction from Anonymous News Comments: The Case of Korean News Portal. Sustainability 2022, 14, 9939.
Figure 1. Research framework proposed by this study.
Figure 2. The distributions of $agerate_{news}(n, g)$ for the labeled news articles.
Figure 3. The four steps of a machine learning approach to predict age information of the unlabeled dataset.
Figure 4. An illustration of how to obtain the mixed dataset.
Figure 5. Steps to discover fuzzy age differences from anonymous news comments.
Figure 6. An example of the correlation–similarity matrix. * indicates that it is an average value.
Figure 7. Histogram comparisons between the true and predicted values of $agerate_{news}(n, g)$ for the labeled news articles.
Figure 8. Kernel density estimation plots of the distribution of $agerate_{news}(n, g)$ for the different datasets.
Figure 9. Scatter diagrams of the relationships between the labeled and mixed datasets in terms of the two measured ranks. Red lines mark where the two measured ranks are equal, i.e., $rank_{labeled}(s, g) = rank_{mixed}(s, g)$.
Figure 10. The measured similarities between age groups.
Figure 11. The measured correlations between age groups. *** indicates the significance level of p < 0.01.
Figure 12. The measured correlation values within a section between different age groups. *** indicates the significance level of p < 0.01.
Figure 13. The obtained correlation–similarity matrix. * indicates that it is an average value.
Table 3. Descriptive statistics on $agerate_{news}(n, g)$ for news articles in the labeled dataset.

| Statistics | 10s | 20s | 30s | 40s | ≥50s |
|---|---|---|---|---|---|
| Mean | 0.0173 | 0.1403 | 0.2833 | 0.3135 | 0.2455 |
| S.D. | 0.0239 | 0.0891 | 0.0868 | 0.0711 | 0.1198 |
Table 4. Descriptive statistics on the 15,080 labeled news articles and their $agerate_{news}(n, g)$ by sections and subsections.

| Section S | Subsection s | Labeled News Articles (%) | 10s Mean | 10s S.D. | 20s Mean | 20s S.D. | 30s Mean | 30s S.D. | 40s Mean | 40s S.D. | ≥50s Mean | ≥50s S.D. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Politics | President’s Office | 4.4870% | 0.0111 | 0.0099 | 0.1322 | 0.0543 | 0.2544 | 0.0654 | 0.2950 | 0.0448 | 0.3077 | 0.1009 |
| | National Assembly/Political Party | 8.9869% | 0.0112 | 0.0120 | 0.0996 | 0.0579 | 0.2322 | 0.0713 | 0.3300 | 0.0582 | 0.3271 | 0.1096 |
| | Administration | 0.6271% | 0.0146 | 0.0196 | 0.1497 | 0.1107 | 0.2602 | 0.0796 | 0.2975 | 0.0750 | 0.2798 | 0.1348 |
| | National Defense/Diplomacy | 3.3297% | 0.0156 | 0.0147 | 0.1508 | 0.0705 | 0.2391 | 0.0617 | 0.2848 | 0.0486 | 0.3101 | 0.1079 |
| | North Korea | 3.4848% | 0.0132 | 0.0118 | 0.1406 | 0.0538 | 0.2369 | 0.0551 | 0.2856 | 0.0423 | 0.3241 | 0.0955 |
| | Politics General | 15.2906% | 0.0119 | 0.0142 | 0.0996 | 0.0653 | 0.2286 | 0.0673 | 0.3381 | 0.0608 | 0.3221 | 0.1034 |
| Economy | Stock | 1.3707% | 0.0073 | 0.0307 | 0.0937 | 0.0529 | 0.2699 | 0.0678 | 0.3506 | 0.0573 | 0.2795 | 0.0856 |
| | Finance | 1.0797% | 0.0069 | 0.0091 | 0.1044 | 0.0541 | 0.3032 | 0.0705 | 0.3410 | 0.0538 | 0.2443 | 0.0862 |
| | Real Estate | 1.8879% | 0.0050 | 0.0066 | 0.0630 | 0.0454 | 0.2771 | 0.0731 | 0.3717 | 0.0542 | 0.2842 | 0.0808 |
| | Industry/Business | 3.4267% | 0.0092 | 0.0119 | 0.1130 | 0.0547 | 0.2898 | 0.0719 | 0.3341 | 0.0576 | 0.2539 | 0.0921 |
| | Global Economy | 0.3815% | 0.0124 | 0.0102 | 0.1124 | 0.0505 | 0.2686 | 0.0548 | 0.3564 | 0.0489 | 0.2512 | 0.0646 |
| | Economy General | 6.9050% | 0.0093 | 0.0135 | 0.1139 | 0.0627 | 0.2896 | 0.0720 | 0.3365 | 0.0573 | 0.2506 | 0.0866 |
| | Living Economy | 0.4784% | 0.0108 | 0.0182 | 0.1404 | 0.0885 | 0.3323 | 0.0758 | 0.3226 | 0.0704 | 0.1943 | 0.0834 |
| | Small and Mid-sized Businesses/Start-ups | 0.2198% | 0.0068 | 0.0053 | 0.1059 | 0.0464 | 0.2853 | 0.0840 | 0.3321 | 0.0567 | 0.2715 | 0.1052 |
| Society | Case/Accident | 5.0171% | 0.0208 | 0.0293 | 0.1599 | 0.1122 | 0.3090 | 0.0952 | 0.3054 | 0.0790 | 0.2048 | 0.1200 |
| | Education | 0.8340% | 0.0398 | 0.0436 | 0.1864 | 0.1199 | 0.2664 | 0.1037 | 0.3459 | 0.1336 | 0.1627 | 0.0888 |
| | Labor | 1.2866% | 0.0122 | 0.0165 | 0.1484 | 0.0765 | 0.3279 | 0.0804 | 0.3035 | 0.0669 | 0.2057 | 0.0944 |
| | Environment | 0.5690% | 0.0214 | 0.0203 | 0.1724 | 0.0566 | 0.3515 | 0.0725 | 0.2818 | 0.0522 | 0.1703 | 0.0805 |
| | The Press | 0.1681% | 0.0162 | 0.0188 | 0.1400 | 0.1163 | 0.2635 | 0.0966 | 0.3358 | 0.0897 | 0.2469 | 0.1264 |
| | Food/Medical | 0.4978% | 0.0253 | 0.0232 | 0.2248 | 0.1356 | 0.3277 | 0.0923 | 0.2705 | 0.0800 | 0.1506 | 0.1036 |
| | Region | 5.4051% | 0.0202 | 0.0255 | 0.1531 | 0.0893 | 0.3213 | 0.0830 | 0.3148 | 0.0760 | 0.1897 | 0.0961 |
| | Society General | 18.1419% | 0.0253 | 0.0369 | 0.1735 | 0.1170 | 0.3172 | 0.0946 | 0.2987 | 0.0853 | 0.1841 | 0.1176 |
| | Character | 0.0711% | 0.0191 | 0.0176 | 0.1855 | 0.0785 | 0.3364 | 0.0802 | 0.3145 | 0.0898 | 0.1473 | 0.0553 |
| | Human Rights/Welfare | 0.4396% | 0.0166 | 0.0240 | 0.1540 | 0.0796 | 0.3479 | 0.0731 | 0.3062 | 0.0590 | 0.1760 | 0.0831 |
| Lifestyle/Culture | Travel/Leisure | 0.2134% | 0.0221 | 0.0195 | 0.1670 | 0.0606 | 0.3852 | 0.0630 | 0.2785 | 0.0703 | 0.1464 | 0.0586 |
| | Food/Restaurant | 0.0517% | 0.0100 | 0.0076 | 0.2062 | 0.0457 | 0.3712 | 0.0671 | 0.2775 | 0.0231 | 0.1350 | 0.0746 |
| | Car/Test Drive | 0.2651% | 0.0100 | 0.0095 | 0.1254 | 0.0466 | 0.3685 | 0.0708 | 0.3220 | 0.0533 | 0.1754 | 0.0797 |
| | Road/Traffic | 0.1099% | 0.0118 | 0.0101 | 0.1053 | 0.0511 | 0.2865 | 0.0775 | 0.3241 | 0.0537 | 0.2735 | 0.0962 |
| | Health Information | 0.3362% | 0.0187 | 0.0114 | 0.1904 | 0.0761 | 0.3590 | 0.0581 | 0.2831 | 0.0646 | 0.1494 | 0.0675 |
| | Performance/Exhibition | 0.2198% | 0.0221 | 0.0271 | 0.1712 | 0.1363 | 0.2874 | 0.0717 | 0.3174 | 0.0956 | 0.2012 | 0.1005 |
| | Book | 0.1616% | 0.0432 | 0.0409 | 0.2100 | 0.1112 | 0.2660 | 0.0870 | 0.2724 | 0.0842 | 0.2072 | 0.1195 |
| | Religion | 0.4655% | 0.0225 | 0.0198 | 0.1582 | 0.0848 | 0.3033 | 0.0944 | 0.3040 | 0.0643 | 0.2118 | 0.1348 |
| | Lifestyle/Culture General | 2.3469% | 0.0249 | 0.0257 | 0.1876 | 0.0903 | 0.3416 | 0.0760 | 0.2851 | 0.0770 | 0.1611 | 0.0903 |
| | Weather | 2.2112% | 0.0280 | 0.0189 | 0.2132 | 0.0490 | 0.3658 | 0.0564 | 0.2474 | 0.0437 | 0.1461 | 0.0565 |
| | Fashion/Beauty | 0.0388% | 0.0217 | 0.0147 | 0.1567 | 0.0653 | 0.3833 | 0.0383 | 0.2900 | 0.0352 | 0.1483 | 0.0504 |
| World | Asia/Australia | 3.2392% | 0.0215 | 0.0167 | 0.1561 | 0.0603 | 0.2796 | 0.0592 | 0.3145 | 0.0596 | 0.2285 | 0.0845 |
| | USA/Latin America | 1.3771% | 0.0192 | 0.0137 | 0.1588 | 0.0545 | 0.2815 | 0.0698 | 0.2975 | 0.0515 | 0.2429 | 0.0961 |
| | Europe | 0.5302% | 0.0246 | 0.0184 | 0.1900 | 0.0783 | 0.3161 | 0.0706 | 0.2865 | 0.0599 | 0.1830 | 0.0996 |
| | Middle East/Africa | 0.2651% | 0.0300 | 0.0201 | 0.2215 | 0.0720 | 0.3385 | 0.0619 | 0.2805 | 0.0578 | 0.1298 | 0.0504 |
| | World General | 0.6724% | 0.0240 | 0.0163 | 0.1489 | 0.0600 | 0.2954 | 0.0606 | 0.3080 | 0.0518 | 0.2237 | 0.0914 |
| IT/Science | Internet/SNS | 0.2263% | 0.0417 | 0.0382 | 0.2677 | 0.1282 | 0.3320 | 0.0756 | 0.2389 | 0.0871 | 0.1206 | 0.0663 |
| | Communications/New Media | 0.3039% | 0.0264 | 0.0235 | 0.1864 | 0.0637 | 0.3387 | 0.0696 | 0.3053 | 0.0711 | 0.1426 | 0.0585 |
| | Science General | 0.3685% | 0.0367 | 0.0233 | 0.1988 | 0.0705 | 0.3295 | 0.0663 | 0.2839 | 0.0631 | 0.1516 | 0.0597 |
| | Games/Reviews | 0.0323% | 0.0640 | 0.0385 | 0.4180 | 0.1585 | 0.3260 | 0.0643 | 0.1380 | 0.0823 | 0.0540 | 0.0391 |
| | IT General | 1.7327% | 0.0316 | 0.0235 | 0.1924 | 0.0682 | 0.3351 | 0.0605 | 0.2962 | 0.0672 | 0.1446 | 0.0616 |
| | Computer | 0.0517% | 0.0262 | 0.0200 | 0.1888 | 0.0761 | 0.3425 | 0.1011 | 0.2962 | 0.0566 | 0.1475 | 0.1038 |
| | Mobile | 0.3750% | 0.0381 | 0.0250 | 0.2045 | 0.0595 | 0.3355 | 0.0568 | 0.2997 | 0.0617 | 0.1229 | 0.0531 |
| | Security/Hacking | 0.0194% | 0.0367 | 0.0115 | 0.2000 | 0.0200 | 0.4033 | 0.0153 | 0.2633 | 0.0115 | 0.1000 | 0.0200 |
Table 5. Evaluation results for different prediction techniques.

| Type | Prediction Technique | MAE Mean | MAE S.D. | RMSE Mean | RMSE S.D. |
|---|---|---|---|---|---|
| Single output | NNRsingle | 0.0285 | 0.0003 | 0.0396 | 0.0004 |
| | k-NNRsingle | 0.0267 | 0.0001 | 0.0379 | 0.0003 |
| Multiple output | MLRmulti | 0.0360 | 0.0003 | 0.0805 | 0.0010 |
| | NNRmulti | 0.0264 | 0.0003 | 0.0383 | 0.0010 |
| | DTRmulti | 0.0357 | 0.0001 | 0.0511 | 0.0003 |
| | SVRmulti | **0.0246** | 0.0001 | **0.0357** | 0.0003 |
| | k-NNRmulti | 0.0270 | 0.0001 | 0.0381 | 0.0003 |
| | RFRmulti | 0.0252 | 0.0001 | 0.0359 | 0.0003 |

Note: The best (lowest) mean value for each performance metric is shown in bold.
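As a reading aid for the two metrics reported in Table 5, the sketch below computes MAE and RMSE for multi-output predictions, where each row holds the five age-group rates of one news article. It is a minimal illustration, assuming that errors are pooled over all articles and age groups before averaging; the study's exact aggregation (for example, per age group or per repetition) may differ, and the function names and toy values are hypothetical.

```python
import math


def _pooled_errors(y_true, y_pred):
    """Collect per-cell errors from two row-aligned tables (articles x age groups)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    errors = []
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each pair of rows must have the same length")
        errors.extend(t - p for t, p in zip(true_row, pred_row))
    return errors


def mae(y_true, y_pred):
    """Mean absolute error pooled over every article and age group."""
    errors = _pooled_errors(y_true, y_pred)
    return sum(abs(e) for e in errors) / len(errors)


def rmse(y_true, y_pred):
    """Root mean squared error pooled over every article and age group."""
    errors = _pooled_errors(y_true, y_pred)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Toy example (hypothetical values): one article, five age groups.
true_rates = [[0.0, 0.1, 0.3, 0.3, 0.3]]
predicted_rates = [[0.0, 0.2, 0.3, 0.3, 0.2]]
print(mae(true_rates, predicted_rates), rmse(true_rates, predicted_rates))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is consistent with MLRmulti showing a much wider gap between its RMSE and MAE than the other techniques.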
Table 6. Comparison results of the different prediction techniques.

| Hypothesis | MAE t | MAE p-Value | RMSE t | RMSE p-Value | Supported |
|---|---|---|---|---|---|
| NNRsingle > k-NNRsingle | 38.5257 | 0.0000 *** | 24.3066 | 0.0000 *** | Yes |
| NNRsingle > MLRmulti | −134.3845 | 0.0000 *** | −276.1103 | 0.0000 *** | Yes (opposite) |
| NNRsingle > NNRmulti | 33.6470 | 0.0000 *** | 8.2965 | 0.0000 *** | Yes |
| NNRsingle > DTRmulti | −155.2202 | 0.0000 *** | −164.6572 | 0.0000 *** | Yes (opposite) |
| NNRsingle > SVRmulti | 85.1937 | 0.0000 *** | 55.1925 | 0.0000 *** | Yes |
| NNRsingle > k-NNRmulti | 33.1459 | 0.0000 *** | 20.8643 | 0.0000 *** | Yes |
| NNRsingle > RFRmulti | 72.8377 | 0.0000 *** | 52.4563 | 0.0000 *** | Yes |
| k-NNRsingle > MLRmulti | −229.1540 | 0.0000 *** | −299.1235 | 0.0000 *** | Yes (opposite) |
| k-NNRsingle > NNRmulti | 5.9086 | 0.0000 *** | −3.4503 | 0.0008 *** | No |
| k-NNRsingle > DTRmulti | −338.7725 | 0.0000 *** | −231.9032 | 0.0000 *** | Yes (opposite) |
| k-NNRsingle > SVRmulti | 83.1917 | 0.0000 *** | 37.6899 | 0.0000 *** | Yes |
| k-NNRsingle > k-NNRmulti | −9.1403 | 0.0000 *** | −3.9958 | 0.0001 *** | Yes (opposite) |
| k-NNRsingle > RFRmulti | 60.8346 | 0.0000 *** | 34.4006 | 0.0000 *** | Yes |
| MLRmulti > NNRmulti | 166.1897 | 0.0000 *** | 218.8488 | 0.0000 *** | Yes |
| MLRmulti > DTRmulti | 6.2753 | 0.0000 *** | 206.2436 | 0.0000 *** | Yes |
| MLRmulti > SVRmulti | 289.3370 | 0.0000 *** | 314.9457 | 0.0000 *** | Yes |
| MLRmulti > k-NNRmulti | 222.4004 | 0.0000 *** | 296.9896 | 0.0000 *** | Yes |
| MLRmulti > RFRmulti | 275.3113 | 0.0000 *** | 313.2876 | 0.0000 *** | Yes |
| NNRmulti > DTRmulti | −190.7444 | 0.0000 *** | −90.2455 | 0.0000 *** | Yes (opposite) |
| NNRmulti > SVRmulti | 37.5580 | 0.0000 *** | 18.5418 | 0.0000 *** | Yes |
| NNRmulti > k-NNRmulti | −10.9507 | 0.0000 *** | 1.8082 | 0.0736 | No |
| NNRmulti > RFRmulti | 25.7457 | 0.0000 *** | 17.3319 | 0.0000 *** | Yes |
| DTRmulti > SVRmulti | 448.2618 | 0.0000 *** | 273.8899 | 0.0000 *** | Yes |
| DTRmulti > k-NNRmulti | 327.3412 | 0.0000 *** | 225.4759 | 0.0000 *** | Yes |
| DTRmulti > RFRmulti | 427.0787 | 0.0000 *** | 268.2875 | 0.0000 *** | Yes |
| SVRmulti > k-NNRmulti | −92.3828 | 0.0000 *** | −41.3531 | 0.0000 *** | Yes (opposite) |
| SVRmulti > RFRmulti | −24.6128 | 0.0000 *** | −2.9794 | 0.0036 *** | Yes (opposite) |
| k-NNRmulti > RFRmulti | 70.2213 | 0.0000 *** | 38.0674 | 0.0000 *** | Yes |

Note: *** indicates the significance level of p < 0.01. “Yes (opposite)” means that the opposite of the hypothesis is supported.
Table 7. Descriptive statistics on the three types of $agerate_{news}(n, g)$.

| Age Group g | True Values of the Labeled Dataset: Mean | S.D. | Predictions for the Unlabeled Dataset: Mean | S.D. | Values of the Mixed Dataset: Mean | S.D. |
|---|---|---|---|---|---|---|
| 10s | 0.0173 | 0.0239 | 0.0119 | 0.0184 | 0.0123 | 0.0190 |
| 20s | 0.1403 | 0.0891 | 0.1497 | 0.1249 | 0.1489 | 0.1222 |
| 30s | 0.2833 | 0.0868 | 0.2847 | 0.1430 | 0.2846 | 0.1390 |
| 40s | 0.3135 | 0.0711 | 0.2742 | 0.1180 | 0.2776 | 0.1153 |
| ≥50s | 0.2455 | 0.1198 | 0.2795 | 0.1965 | 0.2765 | 0.1913 |
Table 8. Descriptive statistics on the three types of $agerate_{section}(S, g)$ values and Z-test results for comparisons.

| Section S | Age Group g | True Values of the Labeled Dataset: Mean | S.D. | Predictions for the Unlabeled Dataset: Mean | S.D. | Values of the Mixed Dataset: Mean | S.D. | Z | p-Value |
|---|---|---|---|---|---|---|---|---|---|
| Politics | 10s | 0.0121 | 0.0132 | 0.0089 | 0.0123 | 0.0093 | 0.0124 | 15.5958 | 0.0000 |
| | 20s | 0.1132 | 0.0659 | 0.1402 | 0.1193 | 0.1365 | 0.1139 | −14.9812 | 0.0000 |
| | 30s | 0.2350 | 0.0673 | 0.2306 | 0.1337 | 0.2312 | 0.1267 | 2.1891 | 0.0286 |
| | 40s | 0.3201 | 0.0601 | 0.2528 | 0.1085 | 0.2619 | 0.1058 | 40.2355 | 0.0000 |
| | ≥50s | 0.3199 | 0.1053 | 0.3675 | 0.2104 | 0.3610 | 0.2000 | −15.1047 | 0.0000 |
| Economy | 10s | 0.0085 | 0.0148 | 0.0070 | 0.0133 | 0.0071 | 0.0134 | 5.1138 | 0.0000 |
| | 20s | 0.1059 | 0.0608 | 0.1166 | 0.1082 | 0.1160 | 0.1060 | −4.6772 | 0.0000 |
| | 30s | 0.2881 | 0.0724 | 0.2950 | 0.1418 | 0.2946 | 0.1387 | −2.2952 | 0.0217 |
| | 40s | 0.3417 | 0.0583 | 0.3034 | 0.1199 | 0.3057 | 0.1175 | 15.0378 | 0.0000 |
| | ≥50s | 0.2560 | 0.0883 | 0.2779 | 0.1800 | 0.2766 | 0.1760 | −5.7370 | 0.0000 |
| Society | 10s | 0.0233 | 0.0333 | 0.0122 | 0.0185 | 0.0132 | 0.0206 | 31.2967 | 0.0000 |
| | 20s | 0.1677 | 0.1103 | 0.1601 | 0.1338 | 0.1608 | 0.1318 | 3.5588 | 0.0004 |
| | 30s | 0.3167 | 0.0926 | 0.3077 | 0.1461 | 0.3086 | 0.1420 | 3.9659 | 0.0001 |
| | 40s | 0.3034 | 0.0836 | 0.2701 | 0.1171 | 0.2732 | 0.1148 | 18.1868 | 0.0000 |
| | ≥50s | 0.1879 | 0.1124 | 0.2499 | 0.1897 | 0.2441 | 0.1847 | −21.1782 | 0.0000 |
| Lifestyle/Culture | 10s | 0.0247 | 0.0226 | 0.0208 | 0.0267 | 0.0211 | 0.0265 | 4.2710 | 0.0000 |
| | 20s | 0.1897 | 0.0801 | 0.1787 | 0.1265 | 0.1794 | 0.1243 | 2.5924 | 0.0095 |
| | 30s | 0.3464 | 0.0744 | 0.3113 | 0.1363 | 0.3134 | 0.1337 | 7.6971 | 0.0000 |
| | 40s | 0.2761 | 0.0685 | 0.2647 | 0.1191 | 0.2654 | 0.1167 | 2.8630 | 0.0042 |
| | ≥50s | 0.1633 | 0.0864 | 0.2244 | 0.1716 | 0.2207 | 0.1683 | −10.6728 | 0.0000 |
| World | 10s | 0.0219 | 0.0165 | 0.0184 | 0.0223 | 0.0186 | 0.0220 | 4.4732 | 0.0000 |
| | 20s | 0.1617 | 0.0633 | 0.1762 | 0.1236 | 0.1752 | 0.1205 | −3.3929 | 0.0007 |
| | 30s | 0.2875 | 0.0648 | 0.2541 | 0.1306 | 0.2564 | 0.1274 | 7.4201 | 0.0000 |
| | 40s | 0.3060 | 0.0579 | 0.2630 | 0.1123 | 0.2659 | 0.1100 | 11.0682 | 0.0000 |
| | ≥50s | 0.2230 | 0.0916 | 0.2883 | 0.1966 | 0.2838 | 0.1920 | −9.6401 | 0.0000 |
| IT/Science | 10s | 0.0335 | 0.0254 | 0.0176 | 0.0230 | 0.0183 | 0.0233 | 13.9764 | 0.0000 |
| | 20s | 0.2018 | 0.0795 | 0.1810 | 0.1257 | 0.1819 | 0.1241 | 3.4913 | 0.0005 |
| | 30s | 0.3351 | 0.0634 | 0.3213 | 0.1292 | 0.3219 | 0.1270 | 2.2588 | 0.0239 |
| | 40s | 0.2900 | 0.0711 | 0.2852 | 0.1293 | 0.2854 | 0.1273 | 0.7872 | 0.4312 |
| | ≥50s | 0.1397 | 0.0621 | 0.1949 | 0.1613 | 0.1925 | 0.1586 | −7.2767 | 0.0000 |

Note: Z-tests were conducted to compare the true values of the labeled dataset with the values of the mixed dataset.
Table 9. The measured similarities between two age groups in different sections, i.e., $similarity_{section}(S, g_i, g_j)$.

| Age Group Pair $(g_i, g_j)$ | Politics | Economy | Society | Lifestyle/Culture | World | IT/Science |
|---|---|---|---|---|---|---|
| (10s, 20s) | 0.7400 | 0.6761 | 0.7243 | 0.7498 | 0.7631 | 0.7600 |
| (10s, 30s) | 0.5779 | 0.4480 | 0.4790 | 0.5371 | 0.5942 | 0.5587 |
| (10s, 40s) | 0.4802 | 0.3333 | 0.3909 | 0.4747 | 0.5296 | 0.4574 |
| (10s, ≥50s) | 0.4008 ∨ | 0.2796 ∨ | 0.2841 ∨ | 0.3761 ∨ | 0.3893 ∨ | 0.3378 ∨ |
| (20s, 30s) | 0.8182 ∧ | 0.7863 | 0.7683 | 0.8076 ∧ | 0.8347 ∧ | 0.8221 ∧ |
| (20s, 40s) | 0.5766 | 0.5243 | 0.5463 | 0.5893 | 0.6321 | 0.5885 |
| (20s, ≥50s) | 0.4843 | 0.4367 | 0.4080 | 0.4769 | 0.4915 | 0.4614 |
| (30s, 40s) | 0.8019 | 0.8004 ∧ | 0.8023 ∧ | 0.8059 | 0.8010 | 0.8062 |
| (30s, ≥50s) | 0.5628 | 0.5694 | 0.5209 | 0.5443 | 0.5371 | 0.5400 |
| (40s, ≥50s) | 0.7834 | 0.7866 | 0.7507 | 0.7427 | 0.7508 | 0.7164 |

Note: “∧” indicates that the similarity is the largest within a section, while “∨” indicates the smallest similarity within the section.
Table 10. The top 5 and bottom 5 subsections for age group pairs in terms of $similarity_{subsection}(s, g_i, g_j)$.

(a) Top 5 Subsections

| Age Group Pair $(g_i, g_j)$ | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---|---|---|---|---|---|
| (10s, 20s) | Middle East/Africa | Europe | Mobile | Road/Traffic | Science General |
| (10s, 30s) | Middle East/Africa | Mobile | Europe | National Defense/Diplomacy | USA/Latin America |
| (10s, 40s) | Middle East/Africa | Religion | USA/Latin America | Europe | National Defense/Diplomacy |
| (10s, ≥50s) | Religion | Middle East/Africa | North Korea | Food/Restaurant | National Defense/Diplomacy |
| (20s, 30s) | Weather | Middle East/Africa | Mobile | North Korea | USA/Latin America |
| (20s, 40s) | Middle East/Africa | Science General | Europe | Asia/Australia | Weather |
| (20s, ≥50s) | North Korea | National Defense/Diplomacy | Human Rights/Welfare | Security/Hacking | Asia/Australia |
| (30s, 40s) | Human Rights/Welfare | Middle East/Africa | Food/Restaurant | Mobile | Road/Traffic |
| (30s, ≥50s) | Human Rights/Welfare | Security/Hacking | Real Estate | Health Information | Food/Restaurant |
| (40s, ≥50s) | Real Estate | Human Rights/Welfare | North Korea | Economy General | Finance |

(b) Bottom 5 Subsections

| Age Group Pair $(g_i, g_j)$ | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---|---|---|---|---|---|
| (10s, 20s) | Stock | Labor | Industry/Business | Security/Hacking | Small and Mid-sized Businesses/Start-ups |
| (10s, 30s) | Labor | Small and Mid-sized Businesses/Start-ups | Stock | Living Economy | Finance |
| (10s, 40s) | Finance | Stock | Real Estate | Small and Mid-sized Businesses/Start-ups | Economy General |
| (10s, ≥50s) | Society General | Finance | Small and Mid-sized Businesses/Start-ups | Economy General | Industry/Business |
| (20s, 30s) | Real Estate | Food/Medical | Case/Accident | Society General | Character |
| (20s, 40s) | Real Estate | Stock | Finance | Character | Society General |
| (20s, ≥50s) | Society General | Case/Accident | Finance | Real Estate | The Press |
| (30s, 40s) | The Press | Performance/Exhibition | Stock | Education | Administration |
| (30s, ≥50s) | Book | Society General | Communications/New Media | Case/Accident | Middle East/Africa |
| (40s, ≥50s) | Communications/New Media | Education | Mobile | Internet/SNS | IT General |
Table 11. The top 5 and bottom 5 subsections for age group pairs in terms of $correlation_{subsection}(s, g_i, g_j)$.

(a) Top 5 Subsections

| Age Group Pair $(g_i, g_j)$ | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---|---|---|---|---|---|
| (10s, 20s) | Real Estate | Road/Traffic | Food/Medical | The Press | Case/Accident |
| (10s, 30s) | National Assembly/Political Party | President’s Office | North Korea | The Press | Politics General |
| (10s, 40s) | National Defense/Diplomacy | North Korea | USA/Latin America | World General | Middle East/Africa |
| (10s, ≥50s) | Stock | Car/Test Drive | Human Rights/Welfare | Labor | Fashion/Beauty |
| (20s, 30s) | National Assembly/Political Party | The Press | President’s Office | Politics General | Global Economy |
| (20s, 40s) | North Korea | National Defense/Diplomacy | National Assembly/Political Party | USA/Latin America | President’s Office |
| (20s, ≥50s) | Car/Test Drive | Human Rights/Welfare | Real Estate | Food/Restaurant | Living Economy |
| (30s, 40s) | National Defense/Diplomacy | North Korea | Middle East/Africa | President’s Office | Religion |
| (30s, ≥50s) | Education | Fashion/Beauty | Mobile | Games/Reviews | Performance/Exhibition |
| (40s, ≥50s) | Games/Reviews | Fashion/Beauty | Weather | Human Rights/Welfare | Europe |

(b) Bottom 5 Subsections

| Age Group Pair $(g_i, g_j)$ | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---|---|---|---|---|---|
| (10s, 20s) | Environment | Security/Hacking | National Defense/Diplomacy | Education | Weather |
| (10s, 30s) | Fashion/Beauty | Games/Reviews | Human Rights/Welfare | Internet/SNS | Education |
| (10s, 40s) | Real Estate | Road/Traffic | Mobile | Internet/SNS | The Press |
| (10s, ≥50s) | National Defense/Diplomacy | North Korea | World General | The Press | President’s Office |
| (20s, 30s) | Human Rights/Welfare | Fashion/Beauty | Games/Reviews | Food/Medical | Internet/SNS |
| (20s, 40s) | Fashion/Beauty | Mobile | Games/Reviews | Internet/SNS | Car/Test Drive |
| (20s, ≥50s) | North Korea | National Defense/Diplomacy | USA/Latin America | President’s Office | National Assembly/Political Party |
| (30s, 40s) | Mobile | Communications/New Media | Car/Test Drive | IT General | Road/Traffic |
| (30s, ≥50s) | President’s Office | National Assembly/Political Party | Politics General | North Korea | Real Estate |
| (40s, ≥50s) | Politics General | National Assembly/Political Party | National Defense/Diplomacy | North Korea | USA/Latin America |
Table 12. The obtained degree of membership for an age group pair $(g_i, g_j)$ in a fuzzy set of age difference.

| Age Group Pair $(g_i, g_j)$ | Subsections in $D_{High}$ / Total Subsections | Subsections in $D_{Middle}$ / Total Subsections | Subsections in $D_{Low}$ / Total Subsections | $\mu_{D_{High}}(g_i, g_j)$ | $\mu_{D_{Middle}}(g_i, g_j)$ | $\mu_{D_{Low}}(g_i, g_j)$ |
|---|---|---|---|---|---|---|
| (10s, 20s) | 0/48 | 0/48 | 48/48 | 0.0000 | 0.0000 | 1.0000 |
| (10s, 30s) | 2/48 | 40/48 | 6/48 | 0.0417 | 0.8333 | 0.1250 |
| (10s, 40s) | 45/48 | 3/48 | 0/48 | 0.9375 | 0.0625 | 0.0000 |
| (10s, ≥50s) | 46/48 | 2/48 | 0/48 | 0.9583 | 0.0417 | 0.0000 |
| (20s, 30s) | 0/48 | 0/48 | 48/48 | 0.0000 | 0.0000 | 1.0000 |
| (20s, 40s) | 32/48 | 16/48 | 0/48 | 0.6667 | 0.3333 | 0.0000 |
| (20s, ≥50s) | 48/48 | 0/48 | 0/48 | 1.0000 | 0.0000 | 0.0000 |
| (30s, 40s) | 0/48 | 32/48 | 16/48 | 0.0000 | 0.6667 | 0.3333 |
| (30s, ≥50s) | 47/48 | 1/48 | 0/48 | 0.9792 | 0.0208 | 0.0000 |
| (40s, ≥50s) | 0/48 | 0/48 | 48/48 | 0.0000 | 0.0000 | 1.0000 |
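The degrees of membership in Table 12 follow directly from the counts: each is the number of subsections assigned to a fuzzy set divided by the 48 subsections in total. The sketch below reproduces that arithmetic, taking the High/Middle/Low counts from the table as given; how each subsection is assigned to a fuzzy set is not reproduced here, and the function name is hypothetical.

```python
TOTAL_SUBSECTIONS = 48  # total number of subsections used in Table 12


def membership_degrees(n_high, n_middle, n_low, total=TOTAL_SUBSECTIONS):
    """Return the degree of membership of one age group pair in each fuzzy set.

    Each degree is the share of subsections falling in that fuzzy set,
    rounded to four decimals as in Table 12.
    """
    if n_high + n_middle + n_low != total:
        raise ValueError("counts must sum to the total number of subsections")
    return {
        "High": round(n_high / total, 4),
        "Middle": round(n_middle / total, 4),
        "Low": round(n_low / total, 4),
    }


# Counts copied from Table 12 for two age group pairs.
print(membership_degrees(2, 40, 6))   # pair (10s, 30s)
print(membership_degrees(32, 16, 0))  # pair (20s, 40s)
```

For the pair (10s, 30s) this gives 0.0417, 0.8333, and 0.125, matching the corresponding row of the table. Because the three counts sum to 48 for every pair, the three degrees of each pair sum to 1 up to rounding.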
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Suh, J.H. Multi-Label Prediction-Based Fuzzy Age Difference Analysis for Social Profiling of Anonymous Social Media. Appl. Sci. 2024, 14, 790. https://doi.org/10.3390/app14020790

AMA Style

Suh JH. Multi-Label Prediction-Based Fuzzy Age Difference Analysis for Social Profiling of Anonymous Social Media. Applied Sciences. 2024; 14(2):790. https://doi.org/10.3390/app14020790

Chicago/Turabian Style

Suh, Jong Hwan. 2024. "Multi-Label Prediction-Based Fuzzy Age Difference Analysis for Social Profiling of Anonymous Social Media" Applied Sciences 14, no. 2: 790. https://doi.org/10.3390/app14020790

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.