1. Introduction
Keyword data analysis has been actively performed in various big data fields [1,2,3]. This is because a significant portion of big data consists of text-based data. To carry out a text data analysis, we preprocess the text data, such as documents, and extract keywords from the preprocessed text using text mining techniques [4,5]. In general, we construct a document–keyword matrix, with documents and keywords corresponding to its rows and columns [1,5,6,7]. Each element of the matrix is the frequency with which a keyword occurs in a document.
Figure 1 shows a document–keyword matrix [7]. Even when a large amount of text big data is collected, the size of the data set is reduced through preprocessing to build structured data that can be analyzed by statistics and machine learning. Additionally, as can be seen in Figure 1, the preprocessed text data contain many zero values, because a keyword that is included in only one document is still assigned to its own column of the document–keyword matrix, leaving zero values for all the other documents.
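To make this structure concrete, the following minimal Python sketch builds a document–keyword frequency matrix from a tiny invented corpus using scikit-learn's CountVectorizer and reports the proportion of zero cells; the documents and resulting keywords are purely illustrative and are not the data used in this paper.

```python
# Minimal sketch: building a document–keyword frequency matrix and
# measuring its zero inflation (illustrative corpus, not the paper's data).
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

documents = [
    "battery charging system for electric vehicle",
    "neural network model for battery life prediction",
    "solar cell efficiency improvement method",
]

vectorizer = CountVectorizer()              # simple tokenization-based keyword extraction
X = vectorizer.fit_transform(documents)     # rows = documents, columns = keywords
matrix = X.toarray()

print(vectorizer.get_feature_names_out())   # extracted keywords (columns)
print(matrix)                               # document–keyword frequency matrix
print("zero ratio:", np.mean(matrix == 0))  # proportion of zero cells (zero inflation)
```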
In previous research attempting to overcome the zero-inflation problem in text data analysis, Uhm and Jun (2022) proposed a generative model based on statistics using synthpop to address the zero-inflation problem in patent data analysis [6]. Jun (2023) studied generative adversarial networks (GANs) and statistical modeling [1], comparing the performance of a GAN with that of a statistical model without a GAN. Park and Jun (2023) used compound Poisson models to perform zero-inflated patent data analysis [7]; this was not a generative model but an exponential dispersion model. In this paper, to solve the problems of zero inflation as well as data shortage, we propose a keyword data analysis using generative models based on statistics and machine learning. Using these generative models, we generate synthetic data from the original data and add the synthetic data to the original training and test data sets for keyword data analysis by various machine learning algorithms such as deep learning, linear modeling, and Bayesian neural networks.
Our paper consists of the following sections. In Section 2, we outline the background to our research, such as keyword data analysis and generative modeling. We propose the generative models based on statistics and machine learning for keyword data analysis in Section 3. In the next section, we carry out experiments using simulation and patent document data to show the performance results for the compared models. We explain the conclusions and contributions of our paper in the last section.
3. Proposed Method
Much big data is in text form. Therefore, we have to extract keywords from text data and analyze them by methods such as constructing document–keyword matrices, N-grams and correlation analysis between keywords, sentiment analysis, topic modeling, etc. [4,5]. The first task to be performed in keyword data analysis is to collect text documents on a given topic. As explained in Figure 2, the collected document data are preprocessed using text mining and natural language processing techniques. The preprocessed text document data set has a frequency matrix structure in which the rows and columns are documents and terms, respectively, as shown in Figure 1. This matrix is called the document–term matrix [5]. Next, we extract the keywords from the document–term matrix and construct the document–keyword matrix, as shown in Table 1.
In Table 1, $Frequency_{ij}$ is the frequency value of $Keyword_j$ occurring in $Document_i$. Through the data preprocessing, the size of the initially collected document data set gradually decreases, and when the document–keyword matrix is finally constructed, the data set sometimes becomes so small that it is difficult to analyze. In addition, many elements of the matrix are zero values, as shown in Figure 1. Not only the data shortage but also the zero-inflation problem must be solved in keyword data analysis [1,6,7,25,26,28,30]. To solve these problems, we use generative models based on statistics and machine learning. First, we consider the synthpop package of the R language to generate synthetic data [31,32]. This is a generative model based on statistics. In the synthpop modeling, the original input data are represented as follows.
$$X = (x_1, x_2, \ldots, x_p) \qquad (2)$$

In the data of (2), $p$ is the number of input variables, and each $x_j$ is a keyword column of the document–keyword matrix. The synthpop model uses classification and regression trees (CART) to generate the synthetic data [16,31]. We generate the synthetic data for the current variable using the previous variables as follows: we start with the second variable and exclude the first variable, which is synthesized by sampling from its observed values [31]. We generate $x_j^{syn}$ by growing a CART model of $x_j$ on $(x_1, x_2, \ldots, x_{j-1})$ with the original data. That is, we sample $x_j^{syn}$ from the fitted CART model $\hat{f}\left(x_j \mid x_1, x_2, \ldots, x_{j-1}\right)$ [16].
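As a conceptual illustration of this sequential CART synthesis, the following Python sketch uses scikit-learn decision trees in place of synthpop's CART models; synthpop itself is an R package (typically invoked through its syn() function with CART methods), so this is only an illustration of the idea. Resampling residuals at each step is a simple stand-in for synthpop's sampling of donor values from the terminal nodes, and the toy data are illustrative only.

```python
# Conceptual sketch of synthpop-style sequential CART synthesis (not the R package itself).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def synthesize(original: pd.DataFrame) -> pd.DataFrame:
    cols = list(original.columns)
    synth = pd.DataFrame(index=original.index)

    # First keyword: sample from its observed values (its marginal distribution).
    synth[cols[0]] = rng.choice(original[cols[0]].to_numpy(), size=len(original))

    # Remaining keywords: fit a CART of x_j on x_1..x_{j-1} with the original data,
    # then generate x_j^syn from the already-synthesized columns.
    for j in range(1, len(cols)):
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
        tree.fit(original[cols[:j]], original[cols[j]])
        pred = tree.predict(synth[cols[:j]])
        residuals = original[cols[j]].to_numpy() - tree.predict(original[cols[:j]])
        values = pred + rng.choice(residuals, size=len(original))
        synth[cols[j]] = np.clip(np.round(values), 0, None)  # keep frequencies as non-negative integers
    return synth

# Toy document–keyword frequencies (illustrative only).
original = pd.DataFrame(rng.poisson(1.0, size=(100, 4)),
                        columns=["keyword1", "keyword2", "keyword3", "keyword4"])
synthetic = synthesize(original)
```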
Figure 4 explains the method of synthetic data generation by the synthpop in our keyword data analysis.
In Figure 4, we fit statistical models of the keywords to the original data and generate synthetic data, which are completely new keyword data. We denote the original and synthetic data as $X = (x_1, x_2, \ldots, x_p)$ and $X^{syn} = (x_1^{syn}, x_2^{syn}, \ldots, x_p^{syn})$, respectively. Using the original data, we find the joint probability distribution of the keywords. We represent the probability distribution of $X$ as Equation (3):

$$P(x_1, x_2, \ldots, x_p) = P(x_1)\,P(x_2 \mid x_1)\cdots P(x_p \mid x_1, x_2, \ldots, x_{p-1}) \qquad (3)$$

In Equation (3), we generate $x_j^{syn}$ from the conditional probability distribution of $x_j$ given $x_1, x_2, \ldots, x_{j-1}$. The synthpop begins by estimating the probability distribution of the first keyword, $P(x_1)$. Next, we generate new (synthetic) data $x_1^{syn}$ that resemble the (original) data $x_1$ by sampling from $P(x_1)$. That is, the synthetic data $x_1^{syn}$ represent the original data $x_1$. Using this result, we build the conditional distribution $P(x_2 \mid x_1)$ and generate the synthetic data $x_2^{syn}$ from this conditional distribution. In this way, we generate the final synthetic data $x_p^{syn}$.
Unlike the synthpop, the generative model based on machine learning is performed by deep neural networks [9,33]. In this paper, we use a GAN as the generative model for document–keyword data generation. A GAN is a machine learning model which generates new synthetic data that resemble the given original data [9,11,12,13,17]. GAN performs an adversarial training process involving two neural networks, a generator and a discriminator. The formula of GAN is defined as follows [9]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (4)$$

where $G$ and $D$ represent the generator and discriminator, respectively. $x$ and $z$ are the input data and random noise, where the noise follows a normal distribution. In Equation (4), $z$ is also the latent representation of $x$; $G(z)$ and $z$ correspond to the generative and latent models. The discriminator wants to maximize $V(D, G)$, and on the other side, the generator tries to minimize $V(D, G)$. In this paper, the document–keyword matrix is used as $x$ in Equation (4).
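To make the adversarial objective of Equation (4) concrete, the following minimal PyTorch sketch trains a generator and discriminator on a placeholder document–keyword matrix; the layer sizes, noise dimension, learning rates, and number of training steps are illustrative assumptions, not the configuration used in our experiments.

```python
# Minimal GAN sketch (PyTorch) for generating a synthetic document–keyword matrix.
import torch
import torch.nn as nn

n_keywords, latent_dim = 50, 16

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_keywords), nn.ReLU(),        # keyword frequencies are non-negative
)
discriminator = nn.Sequential(
    nn.Linear(n_keywords, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),              # probability that the input is "real"
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.rand(256, n_keywords)          # placeholder for the original matrix

for step in range(200):
    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(real_data.size(0), latent_dim)
    fake = generator(z).detach()                 # generator weights stay fixed here
    d_loss = bce(discriminator(real_data), torch.ones(real_data.size(0), 1)) + \
             bce(discriminator(fake), torch.zeros(real_data.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: only the generator weights are updated.
    z = torch.randn(real_data.size(0), latent_dim)
    g_loss = bce(discriminator(generator(z)), torch.ones(real_data.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic_matrix = generator(torch.randn(500, latent_dim)).detach().numpy()
```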
Figure 5 illustrates the process of generating a synthetic document–keyword matrix using GAN.
In Figure 5, the generator creates a new synthetic document–keyword matrix from the latent space and random noise, and it tries to make the synthetic data as similar as possible to the original data. The latent space is a learned low-dimensional space for representing the sample data; samples that are similar to each other are located close to each other in the latent space. We select an initial data point from the latent space and add random noise to it, and in this way we generate the synthetic document–keyword matrix from the latent space and random noise. The discriminator predicts whether the input data are real (original) or fake (not original). Document–keyword data are randomly sampled from the original and synthetic data sets and combined, and the discriminator uses these data to learn to classify real and fake accurately. The generator is trained so that the discriminator judges the synthetic data to be original. Ultimately, the generator aims to generate synthetic data to the extent that the discriminator cannot distinguish whether the synthetic data are original or not. When training the combined model, only the weights of the generator are updated and the weights of the discriminator are kept fixed, so that synthetic data with good performance are generated. Next, we combine the results of synthpop and GAN.
Figure 6 shows the synthetic data generation.
Our third generative model combines the two data sets generated by synthpop and GAN. In Figure 6, synthpop and GAN generate synthetic document–keyword data sets based on statistics and machine learning, respectively. Therefore, we use the three generative models to analyze the keyword data.
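Concretely, the third (combined) data set can be formed as the row-wise union of the two synthetic document–keyword matrices, assuming they share the same keyword columns; a minimal sketch with invented values:

```python
import pandas as pd

# Stand-ins for the synthpop- and GAN-generated synthetic matrices (same keyword columns).
synth_stat = pd.DataFrame({"battery": [2, 0, 1], "sensor": [0, 1, 0]})
synth_gan = pd.DataFrame({"battery": [1, 3, 0], "sensor": [2, 0, 1]})

# Third generative model: stack the rows of the two synthetic data sets.
combined = pd.concat([synth_stat, synth_gan], ignore_index=True)
print(combined)
```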
In our keyword data analysis, we build the linear model shown in Equation (5):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \qquad (5)$$

where $x_1, x_2, \ldots, x_k$ are the $k$ explanatory keywords and $y$ is the response keyword. Each variable of (5) represents the frequency value of a keyword. To evaluate the performance of the generative models, in this paper, we divide the given data into training (70%) and test (30%) data sets. Using the training data, we calculate the Akaike information criterion (AIC) and use this value to compare the explanatory power of the linear model [34]. AIC is a measure used in predictive modeling to verify the goodness of fit of a model, as shown in Equation (6) [35,36]:
$$AIC = 2k - 2\ln L(\hat{\theta} \mid x) \qquad (6)$$

where $\theta$ and $x$ represent the model parameter and data, respectively. $k$ is the number of model parameters, and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$. The better the fitting performance of the model, the smaller the AIC value. We use another measure, the mean squared error (MSE), to evaluate the predictive performance of the compared models. MSE is defined as shown in Equation (7) [34,35,36,37]:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (7)$$

where $y_i$ and $\hat{y}_i$ are the real and predicted values, respectively, and $n$ is the size of the given data. We calculate the MSE value of each model using the test data. The smaller the MSE value of the model, the better its predictive performance. In the next section, we perform a performance comparison between the compared generative models using MSE and AIC.
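To make this evaluation protocol concrete, the following Python sketch fits the linear model of Equation (5) on a 70% training split of randomly generated keyword frequencies, reads the AIC of Equation (6) from the training fit, and computes the MSE of Equation (7) on the 30% test split; the data, keyword names, and library choices (statsmodels, scikit-learn) are illustrative assumptions rather than our actual experimental setup.

```python
# Sketch of the evaluation protocol of Equations (5)–(7): fit the linear model on
# 70% of a document–keyword data set, read AIC from the training fit, and compute
# MSE on the held-out 30%. The keyword frequencies below are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.poisson(1.0, size=(200, 4)),
                    columns=["x1", "x2", "x3", "y"])   # illustrative keyword frequencies

train, test = train_test_split(data, test_size=0.3, random_state=1)

X_train = sm.add_constant(train[["x1", "x2", "x3"]])
fit = sm.OLS(train["y"], X_train).fit()                # linear model of Equation (5)
print("AIC (training fit):", fit.aic)                  # Equation (6)

X_test = sm.add_constant(test[["x1", "x2", "x3"]])
pred = fit.predict(X_test)
mse = np.mean((test["y"] - pred) ** 2)                 # Equation (7)
print("MSE (test data):", mse)
```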
5. Discussion
In this paper, we tried to solve the zero-inflation problem that occurs in the process of keyword data analysis. We carried out two experiments using simulation data and practical data. From the experimental results, we found that the performances of the synthpop and GAN models were better than that of the original model. This means that using the generative models based on statistics and machine learning is better than not using them. Also, the performance of the GAN model was better than that of synthpop. For example, all the AIC values of GAN were smaller than those of the synthpop model in the simulation data analysis of Figure 8. This was similar to the results for the AIC values when comparing models in the practical data analysis of Table 6. The GAN model had the best performance due to its characteristics as a generative model. When we generate synthetic data using the GAN model, we generally perform random sampling from a normal distribution representing the latent space, so most of the generated data values are distributed around the mean. Because the variance of the synthetic data generated by GAN is relatively small compared with the other generative models, the performance of the compared linear model is stable and shows excellent explanatory power.
Currently, generative models are actively used in various machine learning domains. In this study, we used such models to solve the zero-inflation problem that occurs during the analysis of keyword data extracted from text documents. Generative models are best known for their excellent performance in the image data field, but we showed their utility and improved performance in the field of numerical data containing many zero values. We generated synthetic data from the generative models and analyzed them by statistical methods such as linear regression. In this way, we overcame the zero-inflation problem using generative models for keyword data analysis. We expect that our research will be used more broadly to solve data sparsity problems, including the zero-inflation problem. Finally, we found that using simulation data to evaluate the performance of generative models can lead to over-optimistic conclusions; therefore, we conclude that experiments using more diverse practical data are necessary to evaluate the performance of generative models efficiently.
6. Conclusions
The aim of our study was to generate synthetic data and analyze it for prediction. Most studies related to generative models are interested in the generative model itself that creates the synthetic data, but our focus is on constructing a predictive model by analyzing data sampled from the constructed generative models. For this reason, AIC and MSE were used to evaluate model performance in this paper, and we generated and analyzed additional synthetic data. In the process of keyword data analysis, we face the problems of a lack of data and zero inflation. These problems degrade the performance of machine learning models for keyword data analysis. To solve them, we proposed the use of generative models. We considered generative models based on statistics and machine learning, such as synthpop, GAN and synthGAN, in this paper. Also, we compared the model performance between the original and synthetic data sets using the measures of AIC and MSE. From the experimental results using simulation data and practical patent documents, we verified the better performance of the generative modeling by synthpop compared with the other models. We also found that the AIC values of the GAN synthetic data were the smallest among the compared models; however, its MSE values were not the smallest but the largest. From these results, we confirmed the difficulty of analyzing keyword data using the synthetic data generated by GAN.
In this paper, we conclude that the synthpop generative model is the best method for generating synthetic data for keyword data analysis. The generative model using GAN is also an excellent model from the AIC perspective, but its performance is poor from the MSE perspective. Therefore, in our future work, we will study new methods to improve the MSE of GAN synthetic data for keyword data analysis. We will add Bayesian learning or various probability distributions to the traditional GAN to improve the performance of GAN-based generative models in keyword data analysis. Our research contributes a method for creating new synthetic data in text big data analysis. This is necessary because the data size is reduced during the preprocessing of text data. In particular, a sufficient amount of data is required to perform large-scale machine learning such as deep learning. Therefore, synthetic data from generative models will contribute to solving the lack of original data for performing machine learning.