Article

Applying Deep Generative Neural Networks to Data Augmentation for Consumer Survey Data with a Small Sample Size

1 Department of Marketing, Faculty of Commerce, University of Marketing and Distribution Sciences, Kobe 651-2188, Japan
2 Frontier Research Institute for Small and Medium-sized Organizations, Hiroshima Business and Management School, Prefectural University of Hiroshima, Hiroshima 734-8558, Japan
3 Faculty of Commerce, Chuo University, Hachioji 192-0393, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9030; https://doi.org/10.3390/app14199030
Submission received: 25 August 2024 / Revised: 25 September 2024 / Accepted: 3 October 2024 / Published: 6 October 2024

Abstract

Questionnaire consumer survey research is primarily used for marketing research. To obtain credible results, collecting responses from numerous participants is necessary. However, two crucial challenges prevent marketers from conducting large-sample-size surveys. The first is cost, as organizations with limited marketing budgets struggle to gather sufficient data. The second involves rare population groups, for which it is difficult to obtain representative samples. Furthermore, the increasing awareness of privacy and security concerns has made it challenging to ask sensitive and personal questions, further complicating respondent recruitment. To address these challenges, we augmented small-sized data with synthesized data generated using deep generative neural networks (DGNNs). Three types of DGNNs (CTGAN, TVAE, and CopulaGAN) generated synthesized data from seed data. For validation, 11 datasets were prepared: real data (original and seed), synthesized data (CTGAN, TVAE, and CopulaGAN), and augmented data (original + CTGAN, original + TVAE, original + CopulaGAN, seed + CTGAN, seed + TVAE, and seed + CopulaGAN). The large-sample-sized data, termed "original data", served as the benchmark, whereas the small-sample-sized data acted as the foundation for synthesizing additional data. These datasets were evaluated using machine learning algorithms, focusing particularly on classification tasks. Conclusively, augmenting and synthesizing consumer survey data have shown potential in enhancing predictive performance, irrespective of the dataset's size. Nonetheless, the challenge remains to minimize discrepancies between the original data and other datasets concerning the values and orders of feature importance. Although the efficacy of all three approaches should be improved in future work, CopulaGAN grasps the dependencies between the variables in tabular data more accurately than the other two DGNNs. The results provide cues for augmenting data with dependencies between variables in various fields.

1. Introduction

Questionnaire survey research is one of the most significant approaches utilized in marketing research. Survey research methods are generally carried out online or offline. Respondents rate questionnaire items on five- or seven-point Likert scales covering products, advertisements, and their demographic and psychographic information. According to the European Society for Opinion and Marketing Research (ESOMAR) [1], quantitative methods, including questionnaire survey research, are major marketing research approaches that account for more than 60% of marketing research sales volume. Although survey research methods play a crucial role in marketing management and represent the prevailing approach toward gathering consumer opinions, several significant issues related to these methods must be resolved. The first problem is cost. The larger the sample size, the closer the acquired survey data are to the population. However, the cost of executing survey research increases with the sample size. This implies that enterprises with a small marketing research budget cannot obtain accurate results from survey research. In many cases, small-sized enterprises (e.g., start-up companies and those operating in local areas) cannot afford a sufficient budget to execute survey research with a large sample size [2]. As a result, these companies struggle to obtain consumer insights when developing new products and services and are eventually obliged to execute marketing initiatives restricted to their internal knowledge. Previous studies on small business firms have pointed out that lacking a marketing attitude leads to business failure [3,4]. This suggests that the marketing initiatives of these businesses might amount to gambling that disregards consumer demands. This marketing research issue in small business firms has remained unsolved for a long time. The second issue concerns rare populations [5,6]. Marketing strategies often target limited markets, including specialists in particular industries (e.g., surgeons, artists, and craftsmen), income classes (e.g., businesspeople with high salaries), and people with specific attributes (e.g., patients with intractable diseases and luxury brand customers). Executing survey research on these rare populations is challenging. This issue is hard to solve through budgeting alone and remains unresolved in the marketing research area. The third issue is the difficulty of analyzing small-sample-sized data with statistical models, such as predictive models and some statistical testing methods. When predicting outcomes using customer data, predictive models fitted to small samples can yield low-accuracy results, and when performing structural equation model analysis with small-sample-sized data, several parameters in the model might not be estimable. Driver analysis is a useful analytical approach for making marketing decisions based on consumer survey research. Machine learning models, such as linear discriminant analysis and regression models, can be employed for driver analysis to identify variables that drive the purchase of and preference for products and brands. The results of driver analysis on small-sample-sized data may be inconsistent with those obtained from a large sample. Therefore, analyses based on small-sample-sized data can be unreliable for making marketing decisions.
Several techniques for augmenting small-sample-sized data have been presented to overcome the small-sample-size issue in the image-processing field. Data augmentation techniques are classified into two categories [7]. The first category comprises traditional data augmentation techniques such as cropping, flipping, noise transformation, and random erasing, which create various patterns from seed data by editing parts of it according to certain rules. Mixup is a more sophisticated technique in this category [8]: it randomly selects pairs of samples from the existing dataset and interpolates them, with the mixing weight sampled from a beta distribution. Fast AutoAugment (Fast AA) and population-based augmentation (PBA) likewise sample data based on existing data according to learned augmentation policies [9,10]. The second category comprises deep learning-based data augmentation techniques, such as transfer learning [11] and synthesized data generated by deep generative neural networks (DGNNs), e.g., the generative adversarial network (GAN) [12,13] and the variational autoencoder (VAE) [14]. Aside from effectively augmenting data, augmentation methods based on DGNN-synthesized data are also useful for security applications [15,16]. The security issue is significant when analyzing survey research with small sample sizes, including rare populations: the respondents are few, the characteristics of their population can easily be inferred, and attackers could therefore identify individual respondents. Data augmentation using DGNN-synthesized data is thus a promising approach, and in this study we focus on this technique.
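For concreteness, the following minimal Python sketch illustrates the Mixup sampling step described above; the survey-style vectors and the Beta parameter alpha are illustrative assumptions, not values from this study.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng()):
    """Blend a randomly paired sample per Mixup [8]: the mixing weight
    lambda is drawn from a Beta(alpha, alpha) distribution, and the
    features and (one-hot) labels of the pair are interpolated."""
    lam = rng.beta(alpha, alpha)
    x_new = lam * x1 + (1.0 - lam) * x2
    y_new = lam * y1 + (1.0 - lam) * y2
    return x_new, y_new

# Example: mix two hypothetical survey-response vectors.
x_a, y_a = np.array([3.0, 7.0, 5.0]), np.array([1.0, 0.0])
x_b, y_b = np.array([9.0, 2.0, 6.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b)
```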
The effectiveness of data augmentation techniques using synthesized data generated by DGNNs has been confirmed for a wide variety of computer vision tasks [7,11]. These techniques have been modified and extensively applied to tabular data. When addressing a small-sample-size problem such as imbalanced data, oversampling methods are usually used to augment the underrepresented data. The synthetic minority oversampling technique (SMOTE) is the primary approach for oversampling small-sample-sized tabular data, and several modifications have been developed (e.g., adaptive synthetic sampling and borderline SMOTE) [17]. The base method augments minority-class data by interpolating between a minority sample and its nearest minority-class neighbors. The usefulness of these oversampling approaches has been demonstrated in various areas, such as business and physiological data [18,19]. Although these oversampling methods play a crucial role in solving small-sample-size problems, we focus on DGNN-based augmentation techniques as data augmentation methods for tabular data, rather than as a remedy for imbalanced data.
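As a point of reference for the oversampling family discussed above, a minimal sketch using the imbalanced-learn implementation of SMOTE is shown below; the feature matrix and class labels are synthetic placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced tabular data: 90 majority rows, 10 minority rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 4)), rng.normal(2, 1, (10, 4))])
y = np.array([0] * 90 + [1] * 10)

# SMOTE interpolates between a minority sample and its nearest
# minority-class neighbors to create synthetic minority rows.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # classes are now balanced: [90 90]
```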
DGNN-based augmentation techniques for the tabular data format may be appropriate for addressing issues related to small-sized consumer survey data because consumer survey data are generally organized in a tabular format. Several promising algorithms have been developed. Medical GAN (medGAN) applies DGNN-synthesized data to patient data [20]. Table GAN (TGAN) [21], which adapts the deep convolutional GAN [13] to tabular data, is a general-purpose data augmentation approach that is not limited to specific applications. Conditional table GAN (CTGAN) modifies TGAN to handle imbalanced data, while the tabular variational autoencoder (TVAE) is a VAE adjusted for the tabular data format [22]. CTGAN and TVAE, like TGAN, are not limited to specific purposes. Moreover, Xu et al. [22] demonstrated that CTGAN and TVAE outperform other GAN-based synthesizers. They are likely to be effective for this study because they have proven effective for augmenting tabular data in various fields besides consumer survey data [23,24,25]. Generally, most tabular data comprise multiple data types, such as numerical and categorical columns, and CTGAN and TVAE adapt conventional GAN-based techniques to such mixed tabular data [22,26]. However, they do not explicitly consider the dependencies between variables. Several studies have reported that a GAN preprocessed by a copula function, known as CopulaGAN, is an appropriate technique for synthesizing data with dependencies among variables [27,28]. Concretely, CopulaGAN is a modification of CTGAN that handles dependencies between variables by applying a copula function within the CTGAN pipeline [27]. Because the copula function was introduced to handle the dependencies of price fluctuations between risk assets, such as stocks and bonds, it is applied in the financial field to a market risk management method called Value-at-Risk [29]. Several studies have attempted to generate synthesized asset data using DGNNs [30,31,32]. Corluy [32] demonstrated that the synthesized data generated by CopulaGAN reduced volatilities compared with CTGAN and TVAE. Structural equation modeling (SEM) is a well-known statistical analysis method in marketing research [33] that is used to clarify data structures by discerning relationships among variables. Therefore, consumer survey data synthesized using DGNNs should preserve the dependencies between variables in the survey data while handling mixed tabular data. This study assesses the abilities of three DGNN-based data synthesizer models (CTGAN, TVAE, and CopulaGAN) on tabular data with mixed variable types and with dependencies between variables. The characteristics of the three DGNNs are summarized in Table 1.
Although most questionnaire survey research data are in a tabular format, no studies have yet reported on their augmentation using DGNNs in marketing research. Moreover, the application of DGNNs in marketing contexts has been limited to the analysis of scanner panel databases [34] and the generation of product designs [35]; no study has yet applied DGNNs to augment questionnaire-based consumer survey data. Augmenting small-sample-sized consumer survey data with DGNN-synthesized data might simultaneously address the small-sample-size-related issues (poor marketing decision-making due to limited budgets, rare populations, and the estimation of statistical models) and privacy issues. In particular, the results of this study might contribute to planning marketing strategies for small-sized organizations with limited marketing research budgets. Even with a limited budget, augmenting small-sample-sized consumer survey data might enable such organizations to plan marketing quantitatively and objectively, similar to blue-chip companies such as P&G and IBM. In this study, we attempted to address these issues by employing a data augmentation method using synthesized data generated by DGNNs. Thus, this study can contribute to enhancing and expanding the use of DGNN-based data synthesis techniques in several marketing research fields.

2. Methods

2.1. DGNNs

Noncopula-based (CTGAN and TVAE) and copula-based (CopulaGAN) DGNN data synthesizers can address the problems associated with small-sized questionnaire survey data. The DGNNs assessed in this study are described below.

2.1.1. CTGAN

Like other GAN algorithms, the CTGAN architecture primarily comprises two modules: a generator and a discriminator (critic) (Figure 1A). Here, $g(\cdot)$ denotes the generator responsible for producing fake data, whereas $C(\cdot)$ represents the critic tasked with distinguishing between fake and real input data. The loss function is denoted by $L(\cdot)$, and $z$ signifies noise generated from a standard normal distribution ($N(0, I)$). Before being fed into the critic module, the seed data undergo mode-specific normalization.
Both $\alpha_{i,j}$ and $\beta_{i,j}$ correspond to values associated with the continuous columns of the seed data, whereas $d_{i,j}$ is a one-hot vector denoting the discrete value of the $i$-th discrete column in the $j$-th row of the seed data. These values, namely $\alpha_{i,j}$, $\beta_{i,j}$, and $d_{i,j}$, are produced by the generator, which is constructed as a neural network. The network architecture is detailed below (Equations (1)–(5)).
$$h_1 = (z \oplus \text{cond}) \oplus \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC}_{u \to v}(z \oplus \text{cond}))), \quad u = |\text{cond}| + |z|,\; v = 256 \tag{1}$$
$$h_2 = h_1 \oplus \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC}_{u \to v}(h_1))), \quad u = |\text{cond}| + |z| + 256,\; v = 256 \tag{2}$$
$$\hat{\alpha}_i = \tanh(\mathrm{FC}_{u \to v}(h_2)), \quad 1 \le i \le N_c,\; u = |\text{cond}| + |z| + 512,\; v = 1 \tag{3}$$
$$\hat{\beta}_i = \mathrm{gumbel}_\tau(\mathrm{FC}_{u \to v}(h_2)), \quad 1 \le i \le N_c,\; u = |\text{cond}| + |z| + 512,\; v = m_i \tag{4}$$
$$\hat{d}_i = \mathrm{gumbel}_\tau(\mathrm{FC}_{u \to v}(h_2)), \quad 1 \le i \le N_d,\; u = |\text{cond}| + |z| + 512,\; v = |D_i| \tag{5}$$
The term $h_i$ refers to a hidden layer within the neural network. Concerning the layer types, BN and FC denote batch normalization and fully connected layers, respectively, and the subscripts $u$ and $v$ indicate the numbers of input and output neurons of a layer. "cond" denotes the conditional vector. $\hat{\alpha}_i$, $\hat{\beta}_i$, and $\hat{d}_i$ are the outputs of the final fully connected layers, and ReLU, tanh, and gumbel denote the activation functions (Rectified Linear Unit, hyperbolic tangent, and Gumbel softmax, respectively). The parameter $\tau$ is the temperature of the Gumbel softmax. $N_c$ and $N_d$ represent the numbers of continuous and discrete columns, respectively. $D_i$ is the $i$-th discrete column, and $m_i$, which corresponds to the $i$-th mask vector consisting of 0s and 1s, is the number of modes of the $i$-th continuous column. Unlike conventional GANs, CTGAN uses the Wasserstein loss function [36], which mitigates the mode-dropping phenomenon. The adaptive moment estimation (Adam) optimizer is used to optimize the Wasserstein loss (Equation (6)).
$$L(\cdot) = \mathbb{E}_{x \sim X,\, z \sim N(0, I)}\left[\, C(x) - C(g(z)) \,\right] + \text{regularization term} \tag{6}$$
Here, $x$ denotes a sample from the seed data $X$, and $z$ represents a random vector drawn from $N(0, I)$. $L(\cdot)$ is the loss function, $C(\cdot)$ is the critic, and $g(\cdot)$ is the generator.
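To make the architecture of Equations (1)–(6) concrete, the following PyTorch sketch mirrors the generator's layer bookkeeping. It is an illustration rather than the SDV implementation, and z_dim, cond_dim, n_modes, and n_categories are assumed placeholder sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTGANGenerator(nn.Module):
    """Sketch of the generator in Equations (1)-(5). n_modes lists the
    mode counts m_i of the continuous columns and n_categories lists
    the category counts |D_i| of the discrete columns; all sizes here
    are illustrative placeholders."""
    def __init__(self, z_dim=128, cond_dim=10, n_modes=(5, 5),
                 n_categories=(11,), tau=0.2):
        super().__init__()
        self.tau = tau
        d0 = z_dim + cond_dim                  # |z| + |cond|
        self.fc1 = nn.Linear(d0, 256)          # Equation (1)
        self.bn1 = nn.BatchNorm1d(256)
        self.fc2 = nn.Linear(d0 + 256, 256)    # Equation (2)
        self.bn2 = nn.BatchNorm1d(256)
        d2 = d0 + 512                          # width of h2 after the skip concatenations
        self.alpha_heads = nn.ModuleList(nn.Linear(d2, 1) for _ in n_modes)   # Eq. (3)
        self.beta_heads = nn.ModuleList(nn.Linear(d2, m) for m in n_modes)    # Eq. (4)
        self.d_heads = nn.ModuleList(nn.Linear(d2, k) for k in n_categories)  # Eq. (5)

    def forward(self, z, cond):
        h0 = torch.cat([z, cond], dim=1)
        h1 = torch.cat([h0, F.relu(self.bn1(self.fc1(h0)))], dim=1)
        h2 = torch.cat([h1, F.relu(self.bn2(self.fc2(h1)))], dim=1)
        alphas = [torch.tanh(f(h2)) for f in self.alpha_heads]                    # scalar parts
        betas = [F.gumbel_softmax(f(h2), tau=self.tau) for f in self.beta_heads]  # mode one-hots
        ds = [F.gumbel_softmax(f(h2), tau=self.tau) for f in self.d_heads]        # discrete one-hots
        return torch.cat(alphas + betas + ds, dim=1)

# Generator side of the Wasserstein loss in Equation (6): minimizing
# -critic(fake) maximizes C(g(z)); the regularization term (gradient
# penalty) is applied when training the critic.
gen = CTGANGenerator()
fake = gen(torch.randn(32, 128), torch.zeros(32, 10))
```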

2.1.2. TVAE

The TVAE shares the same architecture as the VAE (Figure 1B) [14], and the neural network model is detailed below (Equations (7)–(19)). Here, $E(\cdot)$ represents the probabilistic encoder, whereas $D(\cdot)$ denotes the probabilistic decoder. Both are implemented as neural network models, described below.
$$r_j = \mathrm{cat}(\alpha_{1,j}, \beta_{1,j}, \ldots, \alpha_{N_c,j}, \beta_{N_c,j}, d_{1,j}, \ldots, d_{N_d,j}) \tag{7}$$
$$h_1 = \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC}_{u \to v}(r_j))), \quad u = |r_j|,\; v = 128 \tag{8}$$
$$h_2 = \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC}_{u \to v}(h_1))), \quad u = 128,\; v = 128 \tag{9}$$
$$\mu = \mathrm{FC}_{u \to v}(h_2), \quad u = 128,\; v = 128 \tag{10}$$
$$\sigma = \exp\!\left(\tfrac{1}{2}\,\mathrm{FC}_{u \to v}(h_2)\right), \quad u = 128,\; v = 128 \tag{11}$$
$$q_\phi(z_j \mid r_j) \sim N(\mu, \sigma I) \tag{12}$$
The encoder network is similar to that of a variational autoencoder (VAE). The symbols $\alpha_{i,j}$, $\beta_{i,j}$, $d_{i,j}$, $h_i$, $N_c$, $N_d$, BN, FC, $u$, $v$, and ReLU are defined as in the CTGAN description above. The operator "cat" denotes concatenation. The symbols $\mu$ and $\sigma$ denote the location and scale parameter vectors of the hidden layers, respectively, and "exp" is the exponential function. The variable $r_j$ refers to the $j$-th row of the tabular data $T$, obtained by concatenating the continuous and discrete values of that row. $z_j$ is a hidden representation, i.e., a hidden vector sampled from a Gaussian distribution ($z_j = \mu + \sigma \varepsilon$, $\varepsilon \sim N(0, I)$). A detailed description is given in Figure 1. The expression $q_\phi(z_j \mid r_j)$ denotes the probabilistic encoder model (Equation (12)), assumed to be a multivariate Gaussian distribution $N(\mu, \sigma I)$.
Concerning the decoder network, $\bar{\alpha}_{i,j}$ denotes values computed by the activation function, whereas $\hat{\alpha}_{i,j}$, $\hat{\beta}_{i,j}$, and $\hat{d}_{i,j}$ denote random variables. $\delta_i$ represents a trained network parameter, and "softmax" indicates the softmax function. The expression $p_\theta(r_j \mid z_j)$ denotes the probabilistic decoder model (Equation (19)), assuming the probability distributions denoted by $P(\cdot)$.
$$h_1 = \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC}_{u \to v}(z_j))), \quad u = 128,\; v = 128 \tag{13}$$
$$h_2 = \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC}_{u \to v}(h_1))), \quad u = 128,\; v = 128 \tag{14}$$
$$\bar{\alpha}_{i,j} = \tanh(\mathrm{FC}_{u \to v}(h_2)), \quad 1 \le i \le N_c,\; u = 128,\; v = 1 \tag{15}$$
$$\hat{\alpha}_{i,j} \sim N(\bar{\alpha}_{i,j}, \delta_i), \quad 1 \le i \le N_c \tag{16}$$
$$\hat{\beta}_{i,j} = \mathrm{softmax}(\mathrm{FC}_{u \to v}(h_2)), \quad 1 \le i \le N_c,\; u = 128,\; v = m_i \tag{17}$$
$$\hat{d}_{i,j} = \mathrm{softmax}(\mathrm{FC}_{u \to v}(h_2)), \quad 1 \le i \le N_d,\; u = 128,\; v = |D_i| \tag{18}$$
$$p_\theta(r_j \mid z_j) = \prod_{i=1}^{N_c} P(\hat{\alpha}_{i,j} = \alpha_{i,j}) \prod_{i=1}^{N_c} P(\hat{\beta}_{i,j} = \beta_{i,j}) \prod_{i=1}^{N_d} P(\hat{d}_{i,j} = d_{i,j}) \tag{19}$$
Because $p_\theta(r_j \mid z_j)$ is a joint distribution, the formula can be transformed into an additive (log-likelihood) model, as described below (Equation (20)). $\mathrm{CE}(\cdot)$ denotes the cross-entropy loss. The weights of the layers and $\delta_i$ are trained using the Adam algorithm.
$$\log p_\theta(r_j \mid z_j) = \sum_{i=1}^{N_c} \log\!\left(\frac{1}{\sqrt{2\pi}\,\delta_i} \exp\!\left(-\frac{(\alpha_{i,j} - \bar{\alpha}_{i,j})^2}{2\delta_i^2}\right)\right) + \sum_{i=1}^{N_c} \mathrm{CE}(\hat{\beta}_{i,j}, \beta_{i,j}) + \sum_{i=1}^{N_d} \mathrm{CE}(\hat{d}_{i,j}, d_{i,j}) + \mathrm{const.} \tag{20}$$
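The following PyTorch sketch illustrates the encoder of Equations (7)–(12), including the reparameterization step behind Equation (12); in_dim (the row width $|r_j|$) is a placeholder, and the decoder of Equations (13)–(19) would mirror this structure with tanh and softmax heads. This is an illustrative reading of the equations, not the SDV implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TVAEEncoder(nn.Module):
    """Sketch of the probabilistic encoder in Equations (7)-(12);
    in_dim stands for |r_j|, the width of one concatenated row."""
    def __init__(self, in_dim=63):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 128)     # Equation (8)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 128)        # Equation (9)
        self.bn2 = nn.BatchNorm1d(128)
        self.fc_mu = nn.Linear(128, 128)      # Equation (10)
        self.fc_logvar = nn.Linear(128, 128)  # Equation (11): sigma = exp(FC(h2)/2)

    def forward(self, r):
        h1 = F.relu(self.bn1(self.fc1(r)))
        h2 = F.relu(self.bn2(self.fc2(h1)))
        mu = self.fc_mu(h2)
        sigma = torch.exp(0.5 * self.fc_logvar(h2))
        # Reparameterization trick: z ~ N(mu, sigma I), Equation (12)
        z = mu + sigma * torch.randn_like(sigma)
        return z, mu, sigma

enc = TVAEEncoder()
z, mu, sigma = enc(torch.randn(16, 63))
```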

2.1.3. CopulaGAN

In CopulaGAN, the dependencies between the columns of the input data are preprocessed using the Gaussian copula function as the cumulative distribution prior to data generation. The cumulative distribution functions of columns $0$ to $n$ are expressed as $F_0, \ldots, F_n$. The elements of each row are denoted in vector form as $X = (x_0, \ldots, x_n)$, and $F_i(x_i)$ is the marginal distribution function. The Gaussian copula function $Y$ is defined below (Equation (21)). $\Phi_n(\cdot\,; \Sigma)$ is the joint distribution function of the $n$-variate Gaussian distribution with parameters $\mu$, $\sigma$, and $\Sigma$ (the coefficient matrix). $\Phi_1^{-1}(\cdot)$ is the inverse of the univariate cumulative Gaussian distribution function. After all rows are processed, the parameters in Equation (22) are estimated and the covariance matrix $\Sigma$ is calculated. The information on $\mu$, $\sigma$, and $\Sigma$ is used to generate the synthesized data. The remaining processes are the same as those in CTGAN.
$$Y = \Phi_n\!\left(\Phi_1^{-1}(F_0(x_0)), \Phi_1^{-1}(F_1(x_1)), \ldots, \Phi_1^{-1}(F_n(x_n));\, \Sigma\right) = \Phi_n\!\left(\frac{x_0 - \mu_0}{\sigma_0}, \ldots, \frac{x_n - \mu_n}{\sigma_n};\, \Sigma\right) \tag{21}$$
$$F_i(x_i) = \Phi_1\!\left(\frac{x_i - \mu_i}{\sigma_i}\right), \quad 0 \le i \le n \tag{22}$$
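A minimal sketch of the preprocessing in Equations (21) and (22) is given below; for brevity it assumes Gaussian marginals (so the probability-integral transform is exact), whereas the SDV implementation estimates the marginal distributions from the data.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_fit(data):
    """Sketch of Equations (21)-(22): standardize each column, map it
    through the Gaussian CDF to the uniform scale, pull it back with
    the inverse CDF, and capture the dependency structure in Sigma."""
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)
    u = norm.cdf((data - mu) / sigma)   # F_i(x_i), Equation (22)
    u = np.clip(u, 1e-6, 1 - 1e-6)      # avoid infinities at the tails
    g = norm.ppf(u)                     # Phi^-1(F_i(x_i)), Equation (21)
    Sigma = np.cov(g, rowvar=False)     # dependencies between columns
    return mu, sigma, Sigma

# Hypothetical correlated numeric survey columns.
rng = np.random.default_rng(0)
x = rng.multivariate_normal([5, 6], [[1.0, 0.7], [0.7, 1.0]], size=200)
mu, sigma, Sigma = gaussian_copula_fit(x)
```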

2.2. Questionnaire Survey Research

We conducted questionnaire survey research to validate the augmented data based on the synthesized data generated by the DGNNs. A survey evaluating product images in the motorbike category, for manufacturers from China and Japan, was performed online with residents of Japan as participants. We designed 26 questionnaire items related to product evaluation attributes for each country. The respondents were required to rate each item on a Likert scale from 1 to 11. Table S1 ("S" denotes the Supplementary Materials; see the Supplementary Materials section) describes the questionnaire items in detail. The survey was executed through the Internet based on a secured survey panel database provided by a marketing research agency. We cannot access the respondents' detailed personal data because the data are strictly managed by the agency's officers. When conducting this survey, the agency confirmed each respondent's approval of the use of their data for this study.

2.3. Validation Procedures

2.3.1. Dataset

Eleven datasets were classified into two large groups: "Nonaugmented" and "Augmented" datasets (Figure 2). The "Nonaugmented" group comprises both the real datasets and the datasets synthesized by the DGNNs (CTGAN, TVAE, and CopulaGAN). The real data comprise the original and seed datasets. The original dataset is a large-sample-sized dataset used as a reference (Table S10); it is the innately desirable dataset for making marketing decisions. The seed dataset is a small-sample-sized dataset that serves as the seed for synthesizing data using the DGNNs (Table S11); it is illustrative of a dataset obtained under severe marketing research budget constraints. The remaining datasets in this group are the synthesized datasets generated by each DGNN. The "Augmented" group comprises six datasets that combine the real and synthesized datasets; concretely, the original or seed dataset is combined with one of the synthesized datasets. To assess the effectiveness of data augmentation for small-sample-sized survey data, the combined seed and synthesized datasets are used.
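The construction of the eleven datasets can be summarized in a few lines of Python; the placeholder frames below merely stand in for the survey data and the DGNN outputs.

```python
import pandas as pd

# Placeholder frames standing in for the survey data and the outputs of
# the fitted DGNNs; column names and values are purely illustrative.
original = pd.DataFrame({"Q1": [3, 7, 5, 9], "country": ["JP", "CN", "JP", "CN"]})
seed = original.iloc[:2].reset_index(drop=True)
synthesized = {
    "CTGAN": pd.DataFrame({"Q1": [4, 8], "country": ["JP", "CN"]}),
    "TVAE": pd.DataFrame({"Q1": [5, 6], "country": ["JP", "CN"]}),
    "CopulaGAN": pd.DataFrame({"Q1": [3, 9], "country": ["JP", "CN"]}),
}

# "Nonaugmented" group: the two real datasets plus the three synthesized ones.
datasets = {"original": original, "seed": seed}
datasets.update({f"syn_{k}": v for k, v in synthesized.items()})

# "Augmented" group: each real dataset concatenated with each synthesized one.
for real_name, real_df in [("original", original), ("seed", seed)]:
    for syn_name, syn_df in synthesized.items():
        datasets[f"{real_name}+{syn_name}"] = pd.concat(
            [real_df, syn_df], ignore_index=True)

assert len(datasets) == 11  # the eleven datasets of Figure 2
```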

2.3.2. Validation Approach

We validated the effectiveness of the synthesized data in two steps (Figure 3). First, we assessed the quality of the synthesized data. Second, we evaluated the practicability of data augmentation using the synthesized data assessed in the previous step. Outcome prediction and feature evaluation play crucial roles in marketing analysis, as they do in machine learning applications outside marketing research. Outcome prediction is applied to predict preferred brands, while feature evaluation is used to identify variables that influence outcomes. The latter analysis is referred to as key driver analysis in marketing and plays a crucial role in marketing decision-making [37]. Thus, assessing both the predictive performance and the feature importance is necessary to validate the usefulness of data augmentation with synthesized data.

2.3.3. Assessing the Synthesized Data Accuracy

We assessed the synthesized data accuracy by adopting the following three metrics: column shape, column pair trend, and overall quality score [27].
The column shape score ranges from 0.0 to 1.0. A score of 1.0 indicates that the two datasets have identical column distributions. This metric represents the similarity of the column distributions between the two datasets. The column shape formula is as follows:
$$\text{Column shape score} = 1 - (\text{KS statistic})$$
The KS statistic represents the values calculated by the Kolmogorov–Smirnov test.
The column pair trend score is a metric based on correlation coefficients. The best value is 1.0, whereas the worst is 0.0. This metric represents the similarity of the pairwise column structure between the two datasets. The column pair trend formula is as follows:
$$\text{Column pair trend score} = 1 - \frac{\left| S_{A,B} - R_{A,B} \right|}{2}$$
Here, $S$ denotes the synthesized data, $R$ denotes the original data, and $A$ and $B$ are columns in each dataset. $S_{A,B}$ is the correlation between columns $A$ and $B$ in the synthesized data, and $R_{A,B}$ is the corresponding correlation in the original data.
The overall quality score is the value calculated by averaging both the column shape score and the column pair trend score.
$$\text{Overall quality score} = \frac{\text{Column shape score} + \text{Column pair trend score}}{2}$$
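The three scores can be re-implemented compactly, as in the following sketch; it assumes numeric columns, uses the two-sample Kolmogorov–Smirnov statistic from SciPy, and uses Pearson correlations for the pair trend, whereas the SDV/SDMetrics library computes these scores internally.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def column_shape_score(real_col, syn_col):
    # 1 - (KS statistic), per the column shape formula
    return 1.0 - ks_2samp(real_col, syn_col).statistic

def column_pair_trend_score(real, syn, a, b):
    # 1 - |S_{A,B} - R_{A,B}| / 2, using Pearson correlations
    r_ab = real[a].corr(real[b])
    s_ab = syn[a].corr(syn[b])
    return 1.0 - abs(s_ab - r_ab) / 2.0

# Hypothetical numeric columns for two datasets.
real = pd.DataFrame({"Q1": [1, 3, 5, 7, 9], "Q2": [2, 4, 6, 8, 10]})
syn = pd.DataFrame({"Q1": [2, 3, 5, 6, 8], "Q2": [1, 5, 6, 9, 9]})
shapes = [column_shape_score(real[c], syn[c]) for c in real.columns]
trends = [column_pair_trend_score(real, syn, "Q1", "Q2")]
overall = (np.mean(shapes) + np.mean(trends)) / 2.0  # overall quality score
```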

2.3.4. Evaluating the Practicability of Data Augmentation

We evaluated the practicability of data augmentation by defining the questionnaire items Q1 to Q26 as the features and the country (Japan or China) as the outcome. We addressed this task using classification algorithms in machine learning.
Even if the synthesized data can be generated accurately in terms of data quality, high quality alone is not necessarily sufficient for predicting outcomes and evaluating the contributions of features to those outcomes based on the synthesized data. Regarding the latter, the contribution to the outcome is measured using feature importance scores. Feature importance scores calculated from large-sample-sized data are considered credible. Given that we treat the feature importance scores calculated on the original data as the benchmark in this work, we evaluated the degree of concordance between the feature importance scores for the original data and those for the other datasets. We thus assessed the practicability of data augmentation in terms of both predictive performance and the concordance of feature importance scores. Following several previous studies [22,24,38,39], the efficacy of the 11 prepared datasets was validated by conducting experiments with several major machine learning algorithms (i.e., boosting, random forest, support vector machine <radial-based function>, and support vector machine <linear function>) in JASP 0.18.1 [40]. The performance of these machine learning algorithms was evaluated with the prevailing evaluation metrics (i.e., accuracy, precision, recall, and F1 score).
Regarding predictive performance, we calculated the averages and standard deviations of the evaluation metrics (accuracy, precision, recall, F1 score, and area under the curve (AUC)) for each machine learning algorithm on each dataset. The average values represent the predictive performance obtained when using the data, while the standard deviations depict the stability of that performance. The usefulness of the data for predictive analysis is easily assessed by organizing these values into a two-dimensional (2D) map and plotting each dataset on it (Figure 4). The horizontal axis represents the predictive performance, i.e., the average value of the evaluation metrics, whereas the vertical axis denotes the stability of the evaluation metrics, i.e., their standard deviation. A high average combined with a low standard deviation indicates data quality that does not depend on the ability of a particular machine learning algorithm, under uniform conditions for the algorithms' hyperparameters. On this map, the data with the best predictive performance and the highest stability are the most useful for prediction tasks in marketing analysis, regardless of the machine learning algorithm used.
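Computing the two coordinates of the map is straightforward; the sketch below uses hypothetical metric values for a single dataset across the four algorithms.

```python
import pandas as pd

# Hypothetical evaluation metrics for one dataset across the four
# machine learning algorithms; the numbers are placeholders.
metrics = pd.DataFrame(
    {"accuracy": [0.82, 0.80, 0.78, 0.79],
     "precision": [0.81, 0.79, 0.77, 0.80],
     "recall": [0.83, 0.80, 0.76, 0.78],
     "F1": [0.82, 0.79, 0.76, 0.79],
     "AUC": [0.88, 0.86, 0.83, 0.85]},
    index=["boosting", "random_forest", "svm_rbf", "svm_linear"])

x = metrics.values.mean()        # horizontal axis: averaged evaluation metrics
y = metrics.values.std(ddof=1)   # vertical axis: stability (standard deviation)
print(f"performance = {x:.3f}, stability (SD) = {y:.3f}")
```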
In relation to the degree of concordance of the feature importance scores (Figure 5), the original data, which are the benchmark, correspond to Data A, while the other data, such as the seed and synthesized data, correspond to Data B and C. We calculated the correlation coefficients (Pearson's r and Spearman's rho) between the feature importance scores of the original data and those of the other data, as computed by each machine learning algorithm. These correlation coefficients express the degree of concordance between the original and other data, indicating whether the other data could serve as alternatives to the original data when predicting outcomes and assessing the features that influence them. Data with higher correlation coefficients are prospective alternatives to the original data, whereas data with low correlation coefficients are less desirable as alternatives.
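A sketch of the concordance computation is shown below; the feature importance vectors are simulated placeholders for the Q1–Q26 scores.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical feature importance scores for Q1-Q26 computed on the
# original data and on a candidate (seed, synthesized, or augmented) dataset.
rng = np.random.default_rng(1)
fi_original = rng.random(26)
fi_candidate = fi_original + rng.normal(0.0, 0.1, 26)

r, _ = pearsonr(fi_original, fi_candidate)     # concordance of values
rho, _ = spearmanr(fi_original, fi_candidate)  # concordance of rank orders
averaged = (r + rho) / 2.0                     # averaged correlation coefficient
```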
To judge which dataset types are the most appropriate for marketing analysis, we propose 2D maps that organize the predictive performance results and the degree of concordance of the evaluation metrics (Figure 6). As in Figure 4, predictive performance is depicted on the horizontal axis. The degree of concordance, computed as the average of the correlation coefficients based on Pearson's r and Spearman's rho, is set as the vertical axis. Thus, the higher the value on the horizontal axis, the better the results obtained with the concerned data in the machine learning calculations, and the higher the value on the vertical axis, the closer the pattern of feature importance scores between the original and other data. On this map, the data with the best predictive performance and the feature importance scores closest to those of the original data are the most useful for marketing analysis.

3. Results

3.1. Questionnaire Survey Research

Table S2 presents the survey results. These survey data are referred to as the real data here; we present them split into the original and seed data. The acquired survey data were first transformed for the subsequent analysis using machine learning algorithms to validate practicability. Accordingly, the descriptive summary in Table S2 is provided based on the transformed data, which are explained in Figure 7. After the initial data transformation, the country column, used as the outcome variable in the subsequent machine learning analysis, was added.

3.2. Validation Results

3.2.1. Assessing the Synthesized Data Accuracy

We generated the synthesized data from the seed data using the three DGNNs at each epoch setting. The hyperparameters of the three DGNNs are described in Table S3. The DGNNs were implemented with the SDV package in Python [27], and we employed the default hyperparameter values of the SDV package. Each DGNN was trained for 10 to 30,000 epochs. We assessed the quality of the synthesized data using the overall quality score. Although the CTGAN score increased slightly up to the 30,000th epoch, the overall quality scores of the GAN-based synthesizers (CTGAN and CopulaGAN) were almost saturated around the 5000th epoch, and the TVAE score saturated around the 3000th epoch. We adopted the 30,000th epoch because each performance index was considered saturated between the 3000th or 5000th epoch and the 30,000th epoch; the models learned at 30,000 epochs were therefore considered to achieve the best overall quality scores. Table 2 presents the obtained results. We adopted these synthesized data for the subsequent analysis, and the augmented datasets (Figure 2) were created from them as well. Table S4 shows a descriptive summary of the three synthesized datasets. Consistent with previous studies [41,42], CTGAN took longer to synthesize the data than TVAE.
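For reproducibility, the following sketch shows how such synthesizers are fitted, assuming the SDV 1.x single-table API; the file path and the sample size are placeholders, the epoch count is carried over from the text, and all other hyperparameters are left at the package defaults, as in this study.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import (CTGANSynthesizer, TVAESynthesizer,
                              CopulaGANSynthesizer)

# `seed_df` stands in for the small-sample-sized seed data (Table S11).
seed_df = pd.read_csv("seed_data.csv")  # hypothetical path

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(seed_df)

synthesizers = {
    "CTGAN": CTGANSynthesizer(metadata, epochs=30000),
    "TVAE": TVAESynthesizer(metadata, epochs=30000),
    "CopulaGAN": CopulaGANSynthesizer(metadata, epochs=30000),
}
synthetic = {}
for name, synth in synthesizers.items():
    synth.fit(seed_df)  # remaining hyperparameters left at SDV defaults
    synthetic[name] = synth.sample(num_rows=len(seed_df) * 10)  # illustrative size
```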

3.2.2. Evaluating the Practicability of Data Augmentation

In this study, 11 datasets were created to validate the augmented data (Figure 8). Four machine learning algorithms (i.e., boosting, random forest, support vector machine <radial-based function>, and support vector machine <linear function>) were evaluated on the 11 datasets (Figure 8). When running these algorithms, the data were split into three subsets: training, validation, and testing. First, 20% of the total data were assigned as the test data, and the remainder as the training data; then, 20% of the training data were assigned as the validation data. The evaluation metrics were calculated from the results on the test data. Table S5 describes the parameter settings for each machine learning algorithm, while Table S6 shows the results of these machine learning algorithms.
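The splitting scheme corresponds to the following scikit-learn sketch, with placeholder data standing in for the Q1–Q26 features and the country outcome.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: X stands in for the Q1-Q26 features and y for the
# Japan/China outcome of the classification task.
rng = np.random.default_rng(0)
X = rng.integers(1, 12, size=(200, 26)).astype(float)
y = rng.integers(0, 2, size=200)

# 20% of all rows held out as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
# 20% of the training rows held out as the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=0, stratify=y_train)
```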
To assess the datasets in terms of predictive performance and stability, the average and standard deviation values of the evaluation metrics across the machine learning algorithms were calculated. The evaluation metrics and confusion matrices are shown in Table 3 and Supplementary Table S7. Among the synthesized data in the "Nonaugmented" group, the data generated by the GAN-based synthesizers (CTGAN and CopulaGAN) outperformed those generated by TVAE in terms of predictive performance, in line with recent studies [26,43,44,45]. Notably, the real datasets exhibited the lowest performance, and the augmented datasets ranked second or third. Augmenting the original dataset improved the evaluation metrics of predictive performance by an average of 8.16%, and augmenting the seed dataset improved them by an average of 9.64%. The evaluation of the predictive analysis indicated that the GAN-based datasets in both the "Nonaugmented" and "Augmented" groups exhibited higher performance than the other datasets. In particular, for the seed dataset in the "Augmented" group, combining the seed data with the data synthesized by CTGAN and CopulaGAN improved the evaluation metrics by an average of 11.41% and 10.93%, respectively, whereas TVAE enhanced the performance by 6.58%. The augmented and synthesized datasets had almost the same standard deviations as the original dataset and lower standard deviations than the seed dataset. The TVAE-related datasets had relatively lower standard deviations than the other datasets.
Regarding the concordance of the feature importance scores, we performed two correlation analyses (Pearson and Spearman) on the feature importance scores for each dataset. Table S8 shows the feature importance scores computed for each dataset by the machine learning algorithms. The feature importance scores of the support vector machines could not be calculated in JASP 0.18.1 [40]; therefore, the scores of these models are not included in Table S8. Table 4 presents the correlation analysis results between the original and other data, and Figure S1A,B and Table S9A,B display the detailed results. The averaged correlation coefficients are the averages of each correlation coefficient across the machine learning algorithms for each dataset. A value close to 1 indicates that the orders or numerical values of the feature importance in a dataset resemble those in the original dataset. Most of the datasets yielded averaged correlation coefficients from 0.500 to approximately 0.700, excluding two "Augmented" group datasets ("Real (original) + Synthesized (CTGAN)" and "Real (original) + Synthesized (TVAE)"). The averaged correlation coefficients of the synthesized datasets ("Nonaugmented" group) were almost the same as those of the "Augmented" group, excluding the datasets related to the original data. The dataset synthesized by TVAE yielded the worst concordance, whereas that synthesized by CopulaGAN yielded the highest averaged correlation coefficient among the synthesized datasets; in particular, its Spearman's rho values were higher than those of the other synthesized and seed datasets. In the "Augmented" group, only the CopulaGAN-related dataset yielded a higher value than the averaged correlation coefficient between the original and seed datasets. This suggests that the concordance of feature importance between the original and small-sample-sized datasets might improve by 3.42% when augmenting a small-sample-sized dataset with synthesized data generated by CopulaGAN ("Real (original) vs. Real (seed)" = 0.701 vs. "Real (original) vs. Augmented [Real (seed) + Syn (CopulaGAN)]" = 0.725).
The actual predictive performance vs. stability map (Figure 9A) was created based on the columns "Averaged value of evaluation metrics" and "SD of AVEM" in Table 3. The predictive performance vs. concordance of feature importance map (Figure 9B) was created based on "Averaged value of evaluation metrics" in Table 3 and "Averaged correlation coefficients" in Table 4. Regarding the averaged correlation coefficient of the original dataset in Table 4 and Figure 9B, we set it to 1.00 because the original dataset is defined as the benchmark. Both maps show that the augmented and synthesized datasets achieved higher predictive performance than the real datasets, regardless of the data scale. The actual predictive performance vs. stability map (Figure 9A) showed that the augmented and synthesized datasets achieved higher predictive performance than the two real datasets (the original and seed datasets), with somewhat lower fluctuation than the original dataset and much lower fluctuation than the seed dataset. Two augmented datasets ("Real (seed) + Synthesized (CTGAN)" and "Real (seed) + Synthesized (CopulaGAN)") and two synthesized datasets ("Synthesized (CTGAN)" and "Synthesized (CopulaGAN)") occupied desirable positions because they yielded high predictive performance and universality regardless of the machine learning algorithm. However, unlike in predictive performance, the augmented and synthesized datasets had the same or somewhat lower averaged correlation coefficients than the seed dataset. Although the augmented datasets related to the original data ("Real (original) + Synthesized (CTGAN)" and "Real (original) + Synthesized (TVAE)"), which yielded averaged correlation coefficients of approximately 0.850, achieved relatively high concordance with the original dataset, these values are presumed to reflect the contribution of the original data, not of the data synthesized by each DGNN. Among the augmented datasets based on the seed data, the one composed of the seed data and the data synthesized by CopulaGAN achieved the highest predictive performance and relatively higher concordance.

4. Discussion

To the best of our knowledge, this study represents the first attempt to apply a data augmentation approach to consumer survey data through the usage of synthesized data generated using DGNNs. Our findings revealed that the augmented and synthesized data outperformed real data in terms of predictive performance. This suggests that augmenting or synthesizing data with DGNNs may yield a superior performance in marketing analysis classification tasks, irrespective of the data scale and machine learning algorithms used. Consequently, augmenting and synthesizing small-sample-sized survey data could considerably enhance predictive performance. Moreover, even with a large-sample-sized dataset, if marketers aim to achieve better predictive performance than the current levels, augmenting or synthesizing the dataset could prove effective.
Furthermore, the predictive performance of the synthesized data approximated or surpassed that of the augmented data. The synthesized data generated by CTGAN obtained the highest predictive performance, surpassing that of the augmented data ("Real (seed) + Synthesized (CTGAN)" and "Real (original) + Synthesized (CTGAN)"). A similar trend was observed in the datasets related to CopulaGAN and partially in those related to TVAE, indicating that synthesized data can effectively emulate real data as a "clone" of the original dataset. In particular, the GAN-based synthesizers (CTGAN and CopulaGAN) were superior to the TVAE synthesizer. The differences between the GAN-based and VAE-based network architectures might have caused these differences in predictive performance. The mode-specific normalization module is a function unique to the GAN-based synthesizers and might operate effectively when the synthesizers are trained on the data: given that it fits complicated data composed of conditional and continuous columns to an appropriate kernel function using a variational Gaussian mixture model (VGM), ideal data might be generated from a probabilistic distribution. However, as this is only an assumption, this phenomenon must be investigated further in future research. Data cloning techniques using DGNN-generated synthetic data have proven successful in computer vision tasks such as image, movie, and audio recognition [46,47,48,49].
The use of synthesized data as cloned real data may help in addressing security-related concerns in consumer survey research. Respondents often hesitate to disclose personal information regarding wealth and health owing to privacy issues [50], which could impact their willingness to participate in marketing activities. Marketers face challenges in obtaining such sensitive information and may incur higher costs compared with conventional consumer surveys. However, informing respondents in advance that “cloned” survey data synthesized by DGNNs will be used for analysis might alleviate concerns regarding data usage and participation. As synthesized data do not contain specified personal information, even in the event of a data breach, respondents’ privacy remains protected.
Data synthesis techniques can be integrated into the management of highly secure marketing research monitoring databases, such as those containing information on individuals with high incomes or specific medical conditions. This strategy is a promising privacy preserving strategy [51] in the marketing fields. Blockchain technology plays a pivotal role in privacy and security preservation. Joint technologies combining blockchain [52,53] and DGNN-synthesized data [54,55] have offered robust security measures, facilitating the access and utilization of customer databases and thereby enhancing the execution of marketing strategies.
In this way, the present study may point to promising applications because the predictive performance of the synthesized data exceeded that of the original data. However, this also implies possible discrepancies in predictive results between synthesized and original data. Although the data generated by the GAN-based synthesizers yielded better performance indices, the TVAE results were closer to those of the original data, suggesting that TVAE might be the better synthesizer in terms of fidelity to the original data. The drawbacks identified by these results need to be investigated in future research.
The concordance analysis results revealed a discrepancy between the original data and other datasets, despite the improved predictive performances observed in augmented and synthesized data compared with the real data (seed and original data). This implies that achieving proximity to the values and orders of feature importance in the original data through data augmentation and synthesis is challenging. Consequently, there are concerns regarding the divergence in marketing decisions based on feature importance calculated using augmented and synthesized data compared with those based on the original data. Considering that the degree of concordance of feature importance between the original and other datasets may fall at approximately 85% or lower, marketers should exercise caution when making marketing decisions based on driver analysis, considering these observed discrepancies.
While our findings suggest that reproducing the feature importance observed in the original data presents challenges, augmented and synthesized data related to CopulaGAN exhibited correlation coefficients higher than the average, surpassing 0.7, unlike other DGNNs, which yielded coefficients below 0.7. This indicates that CopulaGAN, with the incorporation of copula function during processing, tends to perform better in achieving concordance with feature importance observed in the original data. This observation aligns with previous studies [28,32], suggesting that CopulaGAN might demonstrate superior performance in concordance analysis compared with other DGNNs, particularly owing to its effectiveness in handling correlated variables commonly found in consumer survey data. Therefore, the synthesized data generated using CopulaGAN might effectively augment data using dependencies between variables in table data other than financial data.
In general, the use of data augmentation and synthesis techniques can mitigate the reliance on low-credibility data and bolster the implementation of more effective marketing strategies by improving predictive performance. However, a cautious approach may be necessary when assessing the alignment of feature importance obtained through these methods. Implementing the findings of this study in marketing research practice could alleviate the financial burden associated with acquiring large sample sizes and soliciting sensitive information in consumer surveys. In addition, the synthesized data hold promise for establishing a highly accessible and secure consumer survey database. These findings may offer valuable insights for small- and medium-sized organizations with limited marketing budgets seeking to conduct accurate marketing analyses. Although our study presents useful findings for practitioners and academics, several limitations and caveats remain. In this work, we assessed the usefulness of the data augmentation approach for a classification task alone. Although classification is a major approach in marketing analysis, many conventional marketing analysis methods are available (e.g., regression tasks, factor analysis, and structural equation models) [56]. Our results are therefore insufficient for application to the full range of marketing issues, and further validation of the applicability of the data augmentation approach in marketing analysis is needed. The data in this study were well balanced; however, imbalanced data are observed in several marketing research cases. Because effective approaches to overcoming imbalanced data issues have been developed to improve the accuracy of deep learning algorithms [57], these approaches need to be applied to handle imbalanced survey data in future studies. Moreover, when the current approach is difficult to apply, it may be necessary to modify the algorithms in this study with reference to useful approaches for improving deep learning algorithms [58]. Some points must also be considered when applying our validation approach. It can be applied directly when the nature of the original data is known. However, if the nature is unknown, confirming the quality of the generated data relative to the original data becomes challenging, and setting an appropriate number of epochs using the overall quality score, column shape, and column pair trend metrics is difficult. In this situation, machine learning algorithms must be run on all generated data in order to select the appropriate generated data. Further research is required to confirm the usefulness of data augmentation with DGNN-synthesized data in various fields of marketing analysis.
In conclusion, applying a data augmentation method using DGNNs to consumer survey data with a small sample size could enhance predictive performance. Moreover, the predictive performance of the data synthesized by DGNNs was superior to that of the actual data. However, our results also showed that the methods in the current study yield only slight improvement and that it remains challenging to substantially reduce the discrepancy in feature importance scores between large- and small-sample-sized data, despite augmenting or synthesizing the small data. Given that the data generated by CopulaGAN yielded better performance than the other DGNN-based synthesizers, our results might provide a starting point for reducing these discrepancies by considering the correlations between variables in an algorithm.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14199030/s1. Figure S1(A). Correlation analysis results: (A) boosting classification correlation plot. Figure S1(B). Correlation analysis results: (B) random forest classification correlation plot. Table S1. Questionnaires. Table S2. Descriptive statistics: (A) original and (B) seed data. Table S3. Hyperparameters. Table S4. Descriptive statistics: (A) synthesized data from the CTGAN; (B) synthesized data from the TVAE; and (C) synthesized data from the CopulaGAN. Table S5. Parameter setting in each machine learning algorithm: (A) boosting classification; (B) random forest classification; (C) support vector machine classification <radial-based function>; and (D) support vector machine classification <linear function>. Table S6. Results of the machine learning algorithms on each dataset. Table S7. Confusion matrix. Table S8. Feature importance: (A) boosting classification; (B) random forest classification. Table S9(A). Correlation analysis results: (A) boosting classification correlation table. Table S9(B). Correlation analysis results: (B) random forest classification correlation table. Table S10. Original dataset. Table S11. Seed dataset.

Author Contributions

Conceptualization, S.W., K.E. and T.M.; methodology, S.W.; software, S.W.; validation, S.W., K.E. and T.M.; formal analysis, S.W.; investigation, S.W.; resources, S.W., K.E. and T.M.; data curation, S.W. and K.E.; writing—original draft preparation, S.W.; writing—review and editing, S.W.; visualization, S.W.; supervision, S.W., K.E. and T.M.; project administration, S.W. and K.E.; funding acquisition, K.E. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Acknowledgments

We thank Kenji Fujimoto for his assistance with data collection and screening.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Global Market Research 2022. An ESOMAR Industry Report; ESOMAR: Amsterdam, The Netherlands, 2022. [Google Scholar]
  2. McDaniel, S.W.; Parasuraman, A. Small Business Experience With and Attitudes Toward Formal Marketing Research. Am. J. Small Bus. 1985, 9, 1–6. [Google Scholar]
  3. O’Donnell, A. Small Firm Marketing: Synthesising and Supporting Received Wisdom. J. Small Bus. Enterp. Dev. 2011, 18, 781–805. [Google Scholar] [CrossRef]
  4. Bruno, A.V.; Leidecker, J.K. Causes of New Venture Failure: 1960s vs. 1980s. Bus. Horiz. 1988, 31, 51–56. [Google Scholar] [CrossRef]
  5. Malhotra, N.K.; Agarwal, J.; Peterson, M. Methodological Issues in Cross-Cultural Marketing Research: A State-of-the-Art Review. Int. Mark. Rev. 1996, 13, 7–43. [Google Scholar] [CrossRef]
  6. Thompson, W. Sampling Rare or Elusive Species: Concepts, Designs, and Techniques for Estimating Population Parameters; Island Press: Washington, DC, USA, 2013. [Google Scholar]
  7. Alomar, K.; Aysel, H.I.; Cai, X. Data Augmentation in Classification and Segmentation: A Survey and New Strategies. J. Imaging 2023, 9, 46. [Google Scholar] [CrossRef]
  8. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar] [CrossRef]
  9. Lim, S.; Kim, I.; Kim, T.; Kim, C.; Kim, S. Fast Autoaugment. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2019; Volume 32. [Google Scholar]
  10. Ho, D.; Liang, E.; Chen, X.; Stoica, I.; Abbeel, P. Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules. In Proceedings of the International Conference on Machine Learning, PMLR 2019, Long Beach, CA, USA, 9–15 June 2019; pp. 2731–2741. [Google Scholar]
  11. Kumar, K.K.; Dinesh, P.M.; Rayavel, P.; Vijayaraja, L.; Dhanasekar, R.; Kesavan, R.; Raju, K.; Khan, A.A.; Wechtaisong, C.; Haq, M.A. Brain Tumor Identification Using Data Augmentation and Transfer Learning Approach. Comput. Syst. Sci. Eng. 2023, 46, 1845–1861. [Google Scholar] [CrossRef]
  12. Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2014; Volume 27, pp. 2672–2680. [Google Scholar]
  13. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. [Google Scholar] [CrossRef]
  14. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
  15. Lu, Y.; Wang, H.; Wei, W. Machine Learning for Synthetic Data Generation: A Review. arXiv 2023, arXiv:2302.04062. [Google Scholar] [CrossRef]
  16. Assefa, S.A.; Dervovic, D.; Mahfouz, M.; Tillman, R.E.; Reddy, P.; Veloso, M. Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; pp. 1–8. [Google Scholar] [CrossRef]
  17. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  18. Jeong, D.-H.; Kim, S.-E.; Choi, W.-H.; Ahn, S.-H. A Comparative Study on the Influence of Undersampling and Oversampling Techniques for the Classification of Physical Activities Using an Imbalanced Accelerometer Dataset. Healthcare 2022, 10, 1255. [Google Scholar] [CrossRef] [PubMed]
  19. Wu, X.; Meng, S. E-Commerce Customer Churn Prediction Based on Improved SMOTE and AdaBoost. In Proceedings of the 2016 13th International Conference on Service Systems and Service Management (ICSSSM), Kunming, China, 24–26 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [Google Scholar] [CrossRef]
  20. Choi, E.; Biswal, S.; Malin, B.; Duke, J.; Stewart, W.F.; Sun, J. Generating Multi-Label Discrete Patient Records Using Generative Adversarial Networks. In Proceedings of the Machine Learning for Healthcare Conference, PMLR 2017, Boston, MA, USA, 18–19 August 2017; pp. 286–305. [Google Scholar]
  21. Park, N.; Mohammadi, M.; Gorde, K.; Jajodia, S.; Park, H.; Kim, Y. Data Synthesis Based on Generative Adversarial Networks. arXiv 2018, arXiv:1806.03384. [Google Scholar] [CrossRef]
  22. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular Data Using Conditional Gan. Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2019; Volume 32, pp. 7335–7345. [Google Scholar]
  23. Kotnana, S.; Han, D.; Anderson, T.; Züfle, A.; Kavak, H. Using Generative Adversarial Networks to Assist Synthetic Population Creation for Simulations. In Proceedings of the 2022 Annual Modeling and Simulation Conference (ANNSIM), San Diego, CA, USA, 18–20 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–12. [Google Scholar] [CrossRef]
  24. Bourou, S.; El Saer, A.; Velivassaki, T.-H.; Voulkidis, A.; Zahariadis, T. A Review of Tabular Data Synthesis Using GANs on an IDS Dataset. Information 2021, 12, 375. [Google Scholar] [CrossRef]
  25. McCoy, S.V. Exploration of User Privacy Preservation via CTGAN Data Synthesis for Deep Recommenders. Available online: https://cs230.stanford.edu/projects_fall_2021/reports/103173308.pdf (accessed on 25 September 2024).
  26. Xu, L. Synthesizing Tabular Data Using Conditional GAN. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2020. [Google Scholar]
27. Patki, N.; Wedge, R.; Veeramachaneni, K. The Synthetic Data Vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 399–410. [Google Scholar] [CrossRef]
  28. Kamthe, S.; Assefa, S.; Deisenroth, M. Copula Flows for Synthetic Data Generation. arXiv 2021, arXiv:2101.00598. [Google Scholar] [CrossRef]
  29. Embrechts, P.; McNeil, A.; Straumann, D. Correlation and Dependence in Risk Management: Properties and Pitfalls. In Risk Management: Value at Risk and Beyond; Dempster, M., Ed.; Cambridge University Press: Cambridge, UK, 2002; pp. 176–223. [Google Scholar]
  30. Peña, J.-M.; Suárez, F.; Larré, O.; Ramírez, D.; Cifuentes, A. A Modified CTGAN-Plus-Features Based Method for Optimal Asset Allocation. arXiv 2023, arXiv:2302.02269v3. [Google Scholar] [CrossRef]
  31. Potluru, V.K.; Borrajo, D.; Coletta, A.; Dalmasso, N.; El-Laham, Y.; Fons, E.; Ghassemi, M.; Gopalakrishnan, S.; Gosai, V.; Kreačić, E.; et al. Synthetic Data Applications in Finance. arXiv 2023, arXiv:2401.00081. [Google Scholar] [CrossRef]
  32. Corluy, H.; Nijssen, S. Generating Data for Financial Portfolio Optimization. Master’s Thesis, Ecole Polytechnique de Louvain, Université Catholique de Louvain, Ottignies-Louvain-la-Neuve, Belgium, 2022. [Google Scholar]
  33. Baumgartner, H.; Homburg, C. Applications of Structural Equation Modeling in Marketing and Consumer Research: A Review. Int. J. Res. Mark. 1996, 13, 139–161. [Google Scholar] [CrossRef]
  34. Anand, P.; Lee, C. Using Deep Learning to Overcome Privacy and Scalability Issues in Customer Data Transfer. Mark. Sci. 2023, 42, 189–207. [Google Scholar] [CrossRef]
  35. Burnap, A.; Hauser, J.R.; Timoshenko, A. Product Aesthetic Design: A Machine Learning Augmentation. Mark. Sci. 2023, 42, 1029–1056. [Google Scholar] [CrossRef]
36. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved Training of Wasserstein GANs. Adv. Neural Inf. Process. Syst. 2017, 30, 5767–5777. [Google Scholar]
  37. Chapman, C.; Feit, E.M. R for Marketing Research and Analytics; Springer Nature: New York, NY, USA, 2019. [Google Scholar]
38. Zhao, Z.; Kunar, A.; Birke, R.; Van der Scheer, H.; Chen, L.Y. CTAB-GAN+: Enhancing Tabular Data Synthesis. Front. Big Data 2023, 6, 1296508. [Google Scholar] [CrossRef]
  39. Arunnehru, J.; Thalapathiraj, S.; Dhanasekar, R.; Vijayaraja, L.; Kannadasan, R.; Khan, A.A.; Haq, M.A.; Alshehri, M.; Alwanain, M.I.; Keshta, I. Machine Vision-Based Human Action Recognition Using Spatio-Temporal Motion Features (STMF) with Difference Intensity Distance Group Pattern (DIDGP). Electronics 2022, 11, 2363. [Google Scholar] [CrossRef]
  40. Love, J.; Selker, R.; Marsman, M.; Jamil, T.; Dropmann, D.; Verhagen, J.; Ly, A.; Gronau, Q.F.; Šmíra, M.; Epskamp, S. JASP: Graphical Statistical Software for Common Statistical Designs. J. Stat. Softw. 2019, 88, 1–17. [Google Scholar] [CrossRef]
  41. Watson, D.S.; Blesch, K.; Kapar, J.; Wright, M.N. Smooth Densities and Generative Modeling with Unsupervised Random Forests. arXiv 2022, arXiv:2205.09435. [Google Scholar] [CrossRef]
  42. Muñoz-Cancino, R.; Bravo, C.; Ríos, S.A.; Graña, M. Assessment of Creditworthiness Models Privacy-Preserving Training with Synthetic Data. In Proceedings of the Hybrid Artificial Intelligent Systems: 17th International Conference, HAIS 2022, Salamanca, Spain, 5–7 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 375–384. [Google Scholar] [CrossRef]
43. Chia, M.Y.; Koo, C.H.; Huang, Y.F.; Di Chan, W.; Pang, J.Y. Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation. Water Resour. Manag. 2023, 37, 6183–6198. [Google Scholar] [CrossRef]
  44. Pasha Syed, A.R.; Anbalagan, R.; Setlur, A.S.; Karunakaran, C.; Shetty, J.; Kumar, J.; Niranjan, V. Implementation of Ensemble Machine Learning Algorithms on Exome Datasets for Predicting Early Diagnosis of Cancers. BMC Bioinform. 2022, 23, 496. [Google Scholar] [CrossRef]
  45. Inan, M.S.K.; Hossain, S.; Uddin, M.N. Data Augmentation Guided Breast Cancer Diagnosis and Prognosis Using an Integrated Deep-Generative Framework Based on Breast Tumor’s Morphological Information. Inform. Med. Unlocked 2023, 37, 101171. [Google Scholar] [CrossRef]
  46. Kim, H.; Garrido, P.; Tewari, A.; Xu, W.; Thies, J.; Niessner, M.; Pérez, P.; Richardt, C.; Zollhöfer, M.; Theobalt, C. Deep Video Portraits. ACM Trans. Graph. 2018, 37, 1–14. [Google Scholar] [CrossRef]
  47. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  48. Cao, Y.; Li, S.; Liu, Y.; Yan, Z.; Dai, Y.; Yu, P.S.; Sun, L. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT. arXiv 2023, arXiv:2303.04226. [Google Scholar] [CrossRef]
  49. Liu, L.; Xu, W.; Zollhöfer, M.; Kim, H.; Bernard, F.; Habermann, M.; Wang, W.; Theobalt, C. Neural Rendering and Reenactment of Human Actor Videos. ACM Trans. Graph. 2019, 38, 1–14. [Google Scholar] [CrossRef]
  50. Edwards, P.; Roberts, I.; Clarke, M.; DiGuiseppi, C.; Pratap, S.; Wentz, R.; Kwan, I. Increasing Response Rates to Postal Questionnaires: Systematic Review. BMJ 2002, 324, 1183. [Google Scholar] [CrossRef] [PubMed]
  51. Carvalho, T.; Moniz, N.; Faria, P.; Antunes, L. Survey on Privacy-Preserving Techniques for Microdata Publication. ACM Comput. Surv. 2023, 55, 1–42. [Google Scholar] [CrossRef]
  52. Bathula, A.; Merugu, S.; Skandha, S.S. Academic Projects on Certification Management Using Blockchain-A Review. In Proceedings of the 2022 International Conference on Recent Trends in Microelectronics, Automation, Computing and Communications Systems (ICMACC), Hyderabad, India, 28–30 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar] [CrossRef]
  53. Bathula, A.; Muhuri, S.; Gupta, S.K.; Merugu, S. Secure Certificate Sharing Based on Blockchain Framework for Online Education. Multimed. Tools Appl. 2023, 82, 16479–16500. [Google Scholar] [CrossRef]
54. Veeraragavan, N.R.; Nygård, J.F. Generating Synthetic Data in a Secure Federated Generative Adversarial Networks for a Consortium of Health Registries. arXiv 2022, arXiv:2212.01629. [Google Scholar] [CrossRef]
  55. Cui, L.; Qu, Y.; Xie, G.; Zeng, D.; Li, R.; Shen, S.; Yu, S. Security and Privacy-Enhanced Federated Learning for Anomaly Detection in IoT Infrastructures. IEEE Trans. Ind. Inform. 2021, 18, 3492–3500. [Google Scholar] [CrossRef]
  56. Bagozzi, R. Advanced Marketing Research; John Wiley & Sons: Oxford, UK, 2019. [Google Scholar]
  57. Ahmad, U.; Ali, M.J.; Khan, F.A.; Khan, A.A.; Rehman, A.U.; Shahid, M.M.A.; Haq, M.A.; Khan, I.; Alzamil, Z.S.; Alhussen, A. Large Scale Fish Images Classification and Localization Using Transfer Learning and Localization Aware CNN Architecture. Comput. Syst. Sci. Eng. 2023, 45, 2125–2140. [Google Scholar] [CrossRef]
  58. Jawaharlalnehru, A.; Sambandham, T.; Sekar, V.; Ravikumar, D.; Loganathan, V.; Kannadasan, R.; Khan, A.A.; Wechtaisong, C.; Haq, M.A.; Alhussen, A. Target Object Detection from Unmanned Aerial Vehicle (UAV) Images Based on Improved YOLO Algorithm. Electronics 2022, 11, 2343. [Google Scholar] [CrossRef]
Figure 1. DGNN architectures: (A) CTGAN and (B) TVAE.
Figure 2. Datasets used in this study. The original data are consumer survey data with a large sample size. The seed data are equivalent to the consumer survey data with a small sample size.
Figure 3. Validation procedures.
Figure 4. Conceptual predictive performance vs. stability map.
Figure 5. Assessment of the degree of concordance in the feature importance scores.
Figure 6. Conceptual predictive performance vs. concordance of the feature importance score map.
Figure 7. The data transformation procedure: the initial data are transformed into a form suitable for the machine learning models.
Figure 8. Sample sizes (number of rows) of each dataset: the number in the black square indicates the sample size.
Figure 9. (A) Actual predictive performance vs. stability map. (B) Actual predictive performance vs. concordance of the feature importance score map. Abbreviations: SD, standard deviation; Ori, original data; Sd, seed data; Syn, synthetic data; CG, CTGAN; T, TVAE; CpG, CopulaGAN; and Averaged Correl Coef, averaged correlation coefficients.
Table 1. Characteristics of the DGNNs.
Characteristic | CTGAN | TVAE | CopulaGAN
Handles mixed-type tabular data | Yes | Yes | Yes
Captures dependencies between variables | No | No | Yes
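All three DGNNs in Table 1 are implemented in the open-source Synthetic Data Vault (SDV) library [27]. The following is a minimal sketch of fitting them to a small seed dataset, assuming the SDV 1.x Python API; the file name, epoch count, and number of sampled rows are hypothetical placeholders, not the study's original code.

```python
# Minimal sketch: fitting the three DGNNs of Table 1 on a small "seed"
# consumer survey dataset with the SDV library (SDV 1.x API assumed).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import (CTGANSynthesizer, TVAESynthesizer,
                              CopulaGANSynthesizer)

seed_df = pd.read_csv("survey_seed.csv")  # hypothetical small-sample survey data

# Infer column types (Likert items, demographics, etc.) from the dataframe.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=seed_df)

# 30,000 epochs gave the best quality scores in Table 2.
synthesizers = {
    "CTGAN": CTGANSynthesizer(metadata, epochs=30_000),
    "TVAE": TVAESynthesizer(metadata, epochs=30_000),
    "CopulaGAN": CopulaGANSynthesizer(metadata, epochs=30_000),
}

synthetic = {}
for name, synth in synthesizers.items():
    synth.fit(seed_df)                            # train the DGNN on the seed data
    synthetic[name] = synth.sample(num_rows=500)  # hypothetical target size
```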
Table 2. Quality of the synthesized data generated by the DGNNs.
Epochs | Wall Times (CTGAN / TVAE / CP-GAN) | Overall Quality Scores (CTGAN / TVAE / CP-GAN) | Column Shapes (CTGAN / TVAE / CP-GAN) | Column Pair Trends (CTGAN / TVAE / CP-GAN)
10 | 2.72 s / 1.81 s / 3.35 s | 69.56% / 67.07% / 66.76% | 66.03% / 67.48% / 62.60% | 73.09% / 71.83% / 70.93%
100 | 9.23 s / 3.06 s / 9.64 s | 65.98% / 79.72% / 65.01% | 59.00% / 73.83% / 59.02% | 72.96% / 85.61% / 70.99%
500 | 43.8 s / 8.55 s / 35.6 s | 78.18% / 82.22% / 70.81% | 69.28% / 73.59% / 65.71% | 87.08% / 90.85% / 75.91%
1000 | 1 min 32 s / 16.6 s / 1 min 8 s | 79.02% / 82.99% / 71.88% | 67.07% / 73.78% / 64.51% | 90.97% / 92.19% / 79.24%
3000 | 4 min 6 s / 47.3 s / 3 min 34 s | 79.45% / 83.51% / 78.14% | 68.81% / 74.61% / 66.41% | 90.08% / 92.40% / 89.86%
5000 | 6 min 55 s / 1 min 15 s / 5 min 58 s | 81.03% / 83.56% / 80.50% | 70.80% / 75.05% / 70.95% | 91.26% / 92.07% / 90.05%
10,000 | 16 min 33 s / 3 min 3 s / 12 min 5 s | 81.34% / 83.75% / 81.77% | 71.09% / 75.73% / 72.16% | 91.58% / 91.78% / 91.38%
20,000 | 31 min 4 s / 5 min 53 s / 25 min 19 s | 81.98% / 83.41% / 81.34% | 72.09% / 74.83% / 72.02% | 91.86% / 91.98% / 90.66%
30,000 | 48 min 32 s / 9 min 56 s / 37 min 9 s | 82.45% / 83.93% / 82.80% | 72.68% / 75.70% / 74.24% | 92.23% / 92.15% / 91.36%
The bold and underlined values express the best performance within the epochs in each DGNN. Abbreviations: CTGAN, Conditional Tabular GAN; CP-GAN, CopulaGAN; TVAE, Tabular Variational AutoEncoder.
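The three quality measures in Table 2 (overall quality score, column shapes, and column pair trends) are those produced by the SDV/SDMetrics quality report. Below is a hedged sketch of computing them, continuing the variables from the previous sketch (SDV 1.x evaluation API assumed).

```python
# Sketch: Table 2-style quality metrics via SDV's evaluation module
# (backed by SDMetrics). Scores are returned on a 0-1 scale; Table 2
# reports them as percentages. Reuses seed_df, synthetic, and metadata
# from the previous sketch.
from sdv.evaluation.single_table import evaluate_quality

report = evaluate_quality(
    real_data=seed_df,
    synthetic_data=synthetic["CopulaGAN"],
    metadata=metadata,
)
print(report.get_score())       # overall quality score
print(report.get_properties())  # per-property scores: "Column Shapes"
                                # and "Column Pair Trends"
```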
Table 3. Predictive performances across the machine learning algorithms when using each datum.
Data Type | Data | Accuracy (Ave / SD) | Precision (Ave / SD) | Recall (Ave / SD) | F1 Score (Ave / SD) | AUC (Ave / SD) | AVEM | SD of AVEM
Nonaugmented | Real (original) | 0.783 / 0.099 | 0.797 / 0.094 | 0.783 / 0.099 | 0.779 / 0.104 | 0.813 / 0.123 | 0.791 | 0.104
Nonaugmented | Real (seed) | 0.788 / 0.220 | 0.871 / 0.095 | 0.788 / 0.220 | 0.755 / 0.277 | 0.859 / 0.211 | 0.812 | 0.205
Nonaugmented | Synthesized (CTGAN) | 0.900 / 0.097 | 0.917 / 0.067 | 0.900 / 0.097 | 0.897 / 0.103 | 0.916 / 0.110 | 0.906 | 0.095
Nonaugmented | Synthesized (TVAE) | 0.852 / 0.076 | 0.867 / 0.052 | 0.853 / 0.076 | 0.851 / 0.081 | 0.886 / 0.105 | 0.862 | 0.078
Nonaugmented | Synthesized (CopulaGAN) | 0.896 / 0.071 | 0.905 / 0.054 | 0.896 / 0.071 | 0.891 / 0.080 | 0.906 / 0.116 | 0.899 | 0.079
Augmented | Real (original) + Synthesized (CTGAN) | 0.848 / 0.086 | 0.859 / 0.074 | 0.848 / 0.085 | 0.845 / 0.089 | 0.873 / 0.109 | 0.854 | 0.089
Augmented | Real (original) + Synthesized (TVAE) | 0.830 / 0.060 | 0.847 / 0.040 | 0.830 / 0.060 | 0.827 / 0.065 | 0.862 / 0.091 | 0.839 | 0.063
Augmented | Real (original) + Synthesized (CopulaGAN) | 0.886 / 0.059 | 0.892 / 0.047 | 0.886 / 0.059 | 0.870 / 0.085 | 0.828 / 0.168 | 0.872 | 0.084
Augmented | Real (seed) + Synthesized (CTGAN) | 0.896 / 0.103 | 0.914 / 0.068 | 0.896 / 0.103 | 0.893 / 0.110 | 0.914 / 0.115 | 0.902 | 0.100
Augmented | Real (seed) + Synthesized (TVAE) | 0.855 / 0.067 | 0.866 / 0.047 | 0.855 / 0.067 | 0.853 / 0.071 | 0.889 / 0.100 | 0.863 | 0.070
Augmented | Real (seed) + Synthesized (CopulaGAN) | 0.898 / 0.077 | 0.908 / 0.058 | 0.898 / 0.077 | 0.893 / 0.087 | 0.895 / 0.113 | 0.899 | 0.082
The bold and underlined values express the best performances, the bold values the second-ranked performances, and the underlined values the third-ranked performances in each column. Abbreviations: AUC, area under the curve; Ave, average; SD, standard deviation; AVEM, averaged value of evaluation metrics.
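Table 3 averages five standard classification metrics over the machine learning algorithms for each dataset. A minimal sketch of this evaluation loop for one algorithm, assuming scikit-learn, a binary target, and features that are already numerically encoded (cf. Figure 7); the column name, split ratio, and classifier are illustrative assumptions, not the study's exact configuration:

```python
# Sketch: scoring one classifier on one (possibly augmented) dataset with
# the five metrics of Table 3. "purchase" is a hypothetical binary target.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def score_dataset(df: pd.DataFrame, target: str = "purchase") -> dict:
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]  # assumes a binary target
    return {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }

# Augmentation = real rows stacked on synthesized rows, e.g., seed + CopulaGAN:
# augmented_df = pd.concat([seed_df, synthetic["CopulaGAN"]], ignore_index=True)
# print(score_dataset(augmented_df))
```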
Table 4. Concordance analysis: correlation coefficients between the original and other data.
Pairs of Data Types | Compared Data | Boosting (Pearson r / Spearman rho) | Random Forest (Pearson r / Spearman rho) | Averaged Correlation Coefficients
Benchmark | Real (original) | 1.000 *** / 1.000 *** | 1.000 *** / 1.000 *** | 1.000
Original real data vs. nonaugmented data | Real (seed) | 0.751 *** / 0.723 *** | 0.646 *** / 0.683 *** | 0.701
 | Synthesized (CTGAN) | 0.735 *** / 0.555 ** | 0.737 *** / 0.747 *** | 0.694
 | Synthesized (TVAE) | 0.672 *** / 0.684 *** | 0.401 * / 0.704 *** | 0.615
 | Synthesized (CopulaGAN) | 0.634 *** / 0.770 *** | 0.820 *** / 0.702 *** | 0.732
Original real data vs. augmented data | Real (original) + Synthesized (CTGAN) | 0.821 *** / 0.881 *** | 0.849 *** / 0.837 *** | 0.847
 | Real (original) + Synthesized (TVAE) | 0.779 *** / 0.859 *** | 0.886 *** / 0.890 *** | 0.854
 | Real (original) + Synthesized (CopulaGAN) | 0.020 / 0.669 *** | 0.839 *** / 0.710 *** | 0.560
 | Real (seed) + Synthesized (CTGAN) | 0.767 *** / 0.518 ** | 0.706 *** / 0.717 *** | 0.677
 | Real (seed) + Synthesized (TVAE) | 0.718 *** / 0.692 *** | 0.46 * / 0.745 *** | 0.654
 | Real (seed) + Synthesized (CopulaGAN) | 0.585 ** / 0.765 *** | 0.839 *** / 0.710 *** | 0.725
*** p < 0.001. ** p < 0.01. * p < 0.05. The bold and underlined values express the best performances, the bold values the second-ranked performances, and the underlined values the third-ranked performances in each column for each data pair.
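The concordance analysis in Table 4 correlates the feature importance scores of models trained on the original data with those of models trained on each compared dataset. A sketch of that computation follows, assuming scikit-learn tree ensembles (GradientBoostingClassifier standing in for the study's boosting model, which is an assumption) and SciPy:

```python
# Sketch: Pearson/Spearman concordance of feature importances between the
# original data and a compared (synthesized or augmented) dataset, mirroring
# the layout of Table 4.
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def importance_concordance(X_orig, y_orig, X_cmp, y_cmp):
    results = {}
    for name, Model in [("Boosting", GradientBoostingClassifier),
                        ("Random Forest", RandomForestClassifier)]:
        imp_orig = Model(random_state=0).fit(X_orig, y_orig).feature_importances_
        imp_cmp = Model(random_state=0).fit(X_cmp, y_cmp).feature_importances_
        r, _ = pearsonr(imp_orig, imp_cmp)     # linear agreement of the scores
        rho, _ = spearmanr(imp_orig, imp_cmp)  # agreement of the rank order
        results[name] = {"pearson_r": r, "spearman_rho": rho}
    return results
```

High values on both coefficients indicate that the compared dataset would lead a driver analysis to the same conclusions as the original large-sample data, which is the stricter criterion discussed in the abstract.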
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
