3.1. Problem Statement
In this study, we first define the original data matrix as
$$X = [x_1, x_2, \ldots, x_m]^{\mathrm{T}} \in \mathbb{R}^{m \times n},$$
where $x_1$ to $x_m$ are $n$-dimensional row vectors representing the $m$ samples under $n$ attributes, and $\mathbb{R}$ represents the set of real numbers. The matrix containing randomly missing elements derived from the original matrix is denoted $\tilde{X}$.
Then, we extract the mask matrix $M$ of the missing data, simulate the distribution of missing and non-missing entries, and record the raw and generated data during the training process. The dimension of the mask matrix is the same as that of the original dataset, and the missing identification matrix $M = (m_{ij}) \in \{0, 1\}^{m \times n}$ corresponding to the missing data is defined as
$$m_{ij} = \begin{cases} 1, & \text{if } \tilde{x}_{ij} \text{ is observed}, \\ 0, & \text{if } \tilde{x}_{ij} \text{ is missing}. \end{cases}$$
In addition, for the convenience of subsequent training, we convert the missing entries of type NaN into random noise $Z$ and define a numerical matrix $X_{\mathrm{in}}$, in which each missing position is filled with the corresponding noise value, as the input for model training.
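The following minimal NumPy sketch illustrates this preprocessing step under the notation above; the function name build_inputs, the noise scale, and the use of NaN as the missing marker are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch: build the mask matrix M and the noise-filled input matrix X_in
# from a data matrix that uses NaN at the missing positions (assumed setup).
import numpy as np

def build_inputs(x_tilde: np.ndarray, noise_std: float = 0.01, seed: int = 0):
    """Return (mask M, numerical input X_in) for an m x n matrix with NaN as missing."""
    rng = np.random.default_rng(seed)
    m_mask = (~np.isnan(x_tilde)).astype(np.float32)     # 1 = observed, 0 = missing
    z = rng.normal(0.0, noise_std, size=x_tilde.shape)   # zero-mean Gaussian noise
    x_in = np.nan_to_num(x_tilde, nan=0.0) * m_mask + z * (1.0 - m_mask)
    return m_mask, x_in.astype(np.float32)

# Example: a 3-sample, 2-attribute matrix with two missing entries.
x_tilde = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
M, X_in = build_inputs(x_tilde)
```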
3.2. Model Structure
Generative adversarial networks typically consist of a generator, $G$, and a discriminator, $D$, where $G$ learns and reconstructs the distribution of the data, while $D$ assesses the probability that input data originate from real data as opposed to being generated by $G$. However, this basic model is overly naive and fails to capture the heterogeneity between different subsets or class labels within real datasets [34]. In order to accomplish the imputation task, as shown in Figure 1, the DTAE-CGAN process begins with the combination of the complete data $X$ and the mask matrix $M$, resulting in data with missing values, denoted as $\tilde{X} = X \odot M$, where the operator $\odot$ represents element-wise multiplication. Here, $M$ is a binary matrix of the same shape as $X$, with $m_{ij} = 1$ indicating that $x_{ij}$ is observed (not missing) and $m_{ij} = 0$ indicating that $x_{ij}$ is missing. Subsequently, $\tilde{X}$ and the label $y$, as additional conditional information, along with zero-mean Gaussian noise $Z$, are fed into $G$, resulting in the output data $\bar{X} = G(\tilde{X}, y, Z)$. However, we are only concerned with the imputed values at the positions where $m_{ij} = 0$. Further, the imputed matrix $\hat{X}$ obtained from the original data is
$$\hat{X} = \tilde{X} \odot M + \bar{X} \odot (1 - M).$$
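A short sketch of this imputation step is given below, assuming a PyTorch-style generator callable; the helper name impute and the noise scale are hypothetical.

```python
# Sketch of the imputation step under the notation above: bar_X = G(tilde_X, y, Z);
# hat_X keeps observed entries and takes generated values only where the mask is 0.
import torch

def impute(generator, x_tilde, y, mask, noise_std=0.01):
    z = torch.randn_like(x_tilde) * noise_std         # zero-mean Gaussian noise Z
    x_bar = generator(x_tilde, y, z)                  # generator output, same shape as x_tilde
    x_hat = x_tilde * mask + x_bar * (1.0 - mask)     # imputed matrix hat_X
    return x_hat
```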
In DTAE-CGAN, the generator is internally implemented as an autoencoder, which consists of an encoder and a decoder, each equipped with three sets of convolutional layers, batch normalization layers, and ReLU activation functions. The encoder compresses the input missing data into multiple low-dimensional representations, extracting the intrinsic representation of the data; meanwhile, the decoder maps the compressed representation back to a higher dimension, reconstructing the original input data. Furthermore, this study introduces dynamic filling of missing values through a detracking autoencoder within the generator, aiming to prevent the generator from learning a meaningless identity mapping and to enhance diversity. In the improved detracking encoder, the computation rule for the $k$-th node of the first hidden layer connected to the input layer is as follows:
$$h_k^{(1)} = f\Big(\sum_{l=1,\, l \neq k}^{n} w_{lk}^{(1)} x_l + b_k^{(1)}\Big).$$
The activation function of the neuron is denoted by $f(\cdot)$. The weight parameter connecting the $l$-th node of the input layer to the $k$-th node of the first hidden layer is represented by $w_{lk}^{(1)}$, and $b_k^{(1)}$ is the bias for the $k$-th node of the hidden layer. In the subsequent $h$-th hidden layer, the input at the corresponding $k$-th position is likewise ignored, and the calculation rule for the $k$-th node is
$$h_k^{(h)} = f\Big(\sum_{l=1,\, l \neq k}^{n_{h-1}} w_{lk}^{(h)} h_l^{(h-1)} + b_k^{(h)}\Big),$$
where $n_{h-1}$ represents the number of nodes in the previous hidden layer, the weight parameter $w_{lk}^{(h)}$ connects the $l$-th node of the previous hidden layer to the $k$-th node of the current hidden layer, $h_l^{(h-1)}$ is the output of the $l$-th node of the previous hidden layer under this computation rule, and $b_k^{(h)}$ denotes the bias for the $k$-th node of the $h$-th hidden layer. Consequently, the final output disregards the corresponding network input at the same position and is instead calculated from the other network inputs. This approach allows output values to be computed without directly relying on the specific input value at the same position, effectively inferring missing or unavailable information from the available data in the network.
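As an illustration of the detracking rule, the sketch below realizes it for a fully connected layer by zeroing the diagonal of the weight matrix, so that output node $k$ never depends on input node $k$; the paper's encoder uses convolutional blocks, so this is a simplified, assumed variant rather than the authors' implementation.

```python
# Hedged sketch of a "detracking" fully connected layer: the diagonal weights
# w_kk are masked out, so computing the k-th output excludes the k-th input.
import torch
import torch.nn as nn

class DetrackingLinear(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zero the diagonal so output node k never sees input node k.
        eye = torch.eye(self.weight.shape[0], device=x.device)
        w = self.weight * (1.0 - eye)
        return torch.relu(x @ w.t() + self.bias)
```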
The discriminator primarily consists of convolutional layers and attention layers, into which the imputed matrix $\hat{X}$ and the label $y$ are fed. After $\hat{X}$ is reduced in dimensionality through the convolutional layers, the attention layer is mainly used to select the data most relevant to the output, capturing local information. This process yields a probability distribution matrix of the same dimension as the original data. With its structure depicted in Figure 2, the computation method is as follows:
$$e_i = v^{\mathrm{T}} \tanh\left( W_1 h_i + W_2 s_t \right), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}, \qquad c = \sum_{i} \alpha_i h_i.$$
The current training state $s_t$ is compared with the discriminator's encoding $h_i$ to calculate the similarity $e_i$. This results in an attention probability distribution, represented by the attention weights $\alpha_i$. Here, $v$ represents the weight vector, and $W_1$ and $W_2$ are the weight matrices. The context vector $c$ is the weighted sum of the encodings under the attention probability distribution.
Upon completing the calculation of the attention vector $c$, the data are passed into a fully connected layer, where they are combined with the constructed loss function for classification. In this setup, an output of 1 indicates that the discriminator judges the data produced by the generator to be close to the real data, while an output of 0 indicates that the data are judged to be fake data generated by the generator. This process is repeated throughout training. The discriminator takes as input both the output from the generator, $\hat{X}$, and the conditional label $y$, and outputs a probability matrix $D(\hat{X}, y)$ that represents the likelihood of each entry being classified as real data.
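The attention computation described above corresponds to additive attention; the following sketch is an assumed PyTorch realization in which the weight matrices $W_1$, $W_2$ and the weight vector $v$ map the encodings and the current state to similarity scores, softmax weights, and a context vector.

```python
# Hedged sketch of the additive attention used in the discriminator head.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim: int, state_dim: int, attn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(enc_dim, attn_dim, bias=False)    # W1
        self.w2 = nn.Linear(state_dim, attn_dim, bias=False)  # W2
        self.v = nn.Linear(attn_dim, 1, bias=False)           # weight vector v

    def forward(self, h: torch.Tensor, s: torch.Tensor):
        # h: (batch, seq, enc_dim) encodings; s: (batch, state_dim) current state.
        scores = self.v(torch.tanh(self.w1(h) + self.w2(s).unsqueeze(1)))  # e_i
        alpha = torch.softmax(scores, dim=1)                               # attention weights
        context = (alpha * h).sum(dim=1)                                   # weighted sum c
        return context, alpha.squeeze(-1)
```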
3.3. Loss Functions
The loss function of $G$ comprises three parts: the generative loss, the reconstruction loss, and the correlation loss. When optimizing $G$, the objective is to maximize the probability of the generated data being judged as real by $D$, which involves minimizing the cross-entropy (CE) loss so as to deceive the discriminator. The goal is to make the discriminator output values as close to 1 as possible at the positions where $m_{ij} = 0$ in the mask matrix.
During the reconstruction process, we minimize the $\ell_2$ distance (i.e., the MSE for continuous data) or the cross-entropy loss (for binary data) between the non-missing parts of $\bar{X}$ and the corresponding parts of the original data $X$.
To maintain the correlation between the local features of the generated data and the original data, we also introduce a correlation coefficient matrix and use the mean squared error to quantitatively measure the difference between the two matrices. The matrix element $r_{ij}$ denotes the Pearson correlation coefficient between the $i$-th and $j$-th columns of the data,
$$r_{ij} = \frac{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)^2}\, \sqrt{\sum_{k=1}^{m} (x_{kj} - \bar{x}_j)^2}},$$
where $\bar{x}_i$ represents the mean of the $i$-th column.
The loss function of $D$ is defined as follows, where $\lambda$ is a hyperparameter that controls the degree of regularization and an $L_2$ regularization term involving the weights $w$ of the discriminator is introduced to prevent overfitting:
$$\mathcal{L}_D = -\mathbb{E}\left[ M \odot \log D(\hat{X}, y) + (1 - M) \odot \log\left( 1 - D(\hat{X}, y) \right) \right] + \lambda \lVert w \rVert_2^2.$$
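The loss terms above can be assembled roughly as follows. This is a hedged, GAIN-style sketch: the helper names, the term weights alpha, beta, and lam, and the exact averaging are assumptions rather than the paper's precise definitions.

```python
# Sketch of the generator and discriminator losses described in Section 3.3.
import torch
import torch.nn.functional as F

def pearson_corr(x: torch.Tensor) -> torch.Tensor:
    """Column-wise Pearson correlation matrix of an (m, n) data matrix."""
    xc = x - x.mean(dim=0, keepdim=True)
    cov = xc.t() @ xc / (x.shape[0] - 1)
    std = xc.std(dim=0, unbiased=True).clamp_min(1e-8)
    return cov / (std.unsqueeze(0) * std.unsqueeze(1))

def generator_loss(d_prob, x_hat, x, mask, alpha=1.0, beta=1.0):
    gen = -(torch.log(d_prob + 1e-8) * (1 - mask)).mean()     # fool D at missing positions
    rec = (((x_hat - x) * mask) ** 2).mean()                  # MSE on observed entries
    corr = F.mse_loss(pearson_corr(x_hat), pearson_corr(x))   # correlation consistency
    return gen + alpha * rec + beta * corr

def discriminator_loss(d_prob, mask, weights, lam=1e-4):
    ce = -(mask * torch.log(d_prob + 1e-8)
           + (1 - mask) * torch.log(1 - d_prob + 1e-8)).mean()  # observed vs. generated
    l2 = sum((w ** 2).sum() for w in weights)                   # L2 weight regularization
    return ce + lam * l2
```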
3.4. Optimization Goals
Overall, the optimization goal of DTAE-CGAN is
$$\min_{G} \max_{D} \; \mathbb{E}\left[ M \odot \log D(\hat{X}, y) + (1 - M) \odot \log\left( 1 - D(\hat{X}, y) \right) \right].$$
The decision-making process is as follows: the training objective of the generator is to maximize the probability of the generated samples being identified as "real", while minimizing the generative error, the reconstruction error, and the correlation error. This is achieved by the generator function $G(\tilde{X}, y, Z)$, which takes the missing-data matrix, the conditional label, and the noise as input to generate the estimated matrix $\bar{X}$. The training objective of the discriminator is to accurately identify real samples as "real" and generated samples as "fake" [1]. This is accomplished through the discriminator function $D(\hat{X}, y)$, which accepts the generator output and the label and ultimately produces a distribution matrix representing the probability that each input entry corresponds to real data. By minimizing the loss of the generator and maximizing the loss of the discriminator, the generator and discriminator engage in a competitive interaction. As training progresses, the samples generated by the generator become sufficiently realistic that the discriminator is unable to accurately distinguish between real and generated data; simultaneously, the discriminator cannot provide additional information to further improve the generator. This signifies the attainment of a Nash equilibrium, in which neither party can unilaterally modify its strategy to obtain a better return. This game-theoretic process drives our model to learn an effective representation of the data distribution, as shown in Algorithm 1.
Algorithm 1. DTAE-CGAN for Missing Value Imputation
Input:
• Dataset
• Number of epochs
Output:
• Trained DTAE-CGAN model
• Best generator model
• Filled missing values in the test set
• RMSE value
1: Load dataset
2: Perform standardization
3: Split dataset into train and test sets
4: Generate random missing-data mask
5: Define generator model
6: Define discriminator model
7: Define DTAE-CGAN model using generator and discriminator
8: Set best value to positive infinity
9: Set best test value to positive infinity
10: Set no-improvement count to 0
11: for each epoch in range(epochs) do
12:     Randomly sample generated data from train set
13:     Generate noise using normal distribution
14:     Train DTAE-CGAN with Adam optimizer
15:     Calculate current value and accuracy
16:     Calculate filled data and compute test value
17:     if current value is less than best value and test value is less than best test value then
18:         Update best test value
19:         Save current generator model
20:     end if
21: end for
22: Load best model
23: Fill missing data in test set using generator
24: Compute RMSE value
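Algorithm 1 can be realized with a training loop along the following lines. This is a simplified, assumed sketch: the interfaces of generator and discriminator, the noise scale, and the model-selection criterion are illustrative, and the losses are reduced to an adversarial plus reconstruction term rather than the full formulation of Section 3.3.

```python
# Minimal training-loop sketch following Algorithm 1 (assumed interfaces:
# generator(x_tilde, y, z) and discriminator(x_hat, y) return tensors of the
# same shape as the data matrix).
import copy
import torch

def train_dtae_cgan(generator, discriminator, x, y, mask,
                    x_test, y_test, mask_test, epochs=200, lr=1e-3):
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    best_rmse, best_g, eps = float("inf"), None, 1e-8

    for _ in range(epochs):
        z = torch.randn_like(x) * 0.01
        x_tilde = x * mask                                  # data with missing values
        x_bar = generator(x_tilde, y, z)
        x_hat = x_tilde + x_bar * (1 - mask)                # imputed matrix

        # Discriminator step: classify observed vs. generated positions.
        d_prob = discriminator(x_hat.detach(), y)
        d_loss = -(mask * torch.log(d_prob + eps)
                   + (1 - mask) * torch.log(1 - d_prob + eps)).mean()
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator step: adversarial term at missing positions + reconstruction.
        d_prob = discriminator(x_hat, y)
        g_loss = (-((1 - mask) * torch.log(d_prob + eps)).mean()
                  + ((mask * (x_bar - x)) ** 2).mean())
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

        # Model selection by test RMSE at the simulated missing positions.
        with torch.no_grad():
            z_t = torch.randn_like(x_test) * 0.01
            x_bar_t = generator(x_test * mask_test, y_test, z_t)
            diff = (x_bar_t - x_test) * (1 - mask_test)
            rmse = torch.sqrt((diff ** 2).sum() / (1 - mask_test).sum().clamp_min(1))
        if rmse.item() < best_rmse:
            best_rmse = rmse.item()
            best_g = copy.deepcopy(generator.state_dict())

    return best_g, best_rmse
```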