1. Introduction
The traditional approach to the spectral classification of stars combines photometric and spectroscopic data. The most widely used system, the Harvard stellar spectral classification, was proposed at the Harvard College Observatory in the late 19th century [1]. In order of decreasing surface temperature, the system divides stellar spectra into types O, B, A, F, G, K, and M, among others [2].
M dwarfs are the most common stars in the Galaxy [3] and are characterized by low luminosity, small diameter and mass, and surface temperatures around or below 3500 K. Because nuclear fusion proceeds slowly inside them, M dwarfs have long life spans, and they exist at all stages of the evolution of the Galaxy [4]. Huge numbers of spectra have been obtained with the emergence of sky survey telescopes such as the Sloan Digital Sky Survey (SDSS) [5,6] and the Large Sky Area Multi-Object Fiber Spectroscopy Telescope (LAMOST) [7,8].
However, the distribution of the subtypes is unbalanced. In SDSS-DR15, as shown in Figure 1, spectra of subtypes M0–M4 are relatively numerous, whereas those of M5–M9 are scarce. Generating spectra of specific M dwarf subtypes helps to alleviate this imbalance and provides more reliable samples for research. When data are this limited, it is difficult for astronomers to analyze them with machine learning or deep learning methods such as classification, clustering, and dimensionality reduction. If we can effectively expand the data, we can improve the M dwarf dataset and better understand these stars. In this study, we select M-class stars with an unbalanced distribution of subtypes M0–M6 (signal-to-noise ratio > 5) to verify the effectiveness of our method.
With the development of machine learning and deep learning, generative models have improved remarkably. An increasing number of methods have been proposed to address data scarcity, and many kinds of data, especially two-dimensional data from the real world, have been expanded effectively. The AutoEncoder is a type of neural network that, after encoding and decoding, produces an output similar to its input. The Variational AutoEncoder (VAE), proposed by Kingma and Welling [9], combines variational Bayesian inference with neural networks and achieves good results in data generation. Generative Adversarial Nets (GAN), proposed by Goodfellow et al. [10], likewise address data scarcity, especially for 2D real-world data [10,11,12,13,14]. A GAN consists of a discriminator and a generator: the discriminator is trained to determine whether its input is real data or fake data produced by the generator, while the generator tries to generate fake data that confuse the discriminator as much as possible. Through this dynamic game, data similar to the real data are generated.
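The dynamic game described above is usually formalized as a minimax problem; the standard objective of Goodfellow et al. is shown here for reference (it is a textbook formulation, not an equation of this paper):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes this value, while the generator minimizes it, which is the "zero-sum game" invoked later in Section 2.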
The Adversarial AutoEncoder (AAE), proposed by Makhzani et al. [15], replaces the generator of the traditional GAN with an AutoEncoder, which can better learn the features of discrete data. At the same time, a discriminator is used to shape the distribution of the encoded vectors. In this way, the difficulty traditional GANs have with generating discrete data is effectively resolved. However, owing to the constraints of the underlying GAN structure, the AAE still suffers from problems such as unstable training and mode collapse, and training a good AAE with a small amount of data is difficult.
To assess the quality of the generated spectra, it is necessary to test their similarity to the original spectra qualitatively. Principal Component Analysis (PCA) [16] is a widely used dimensionality reduction algorithm that can extract features from high-dimensional data. t-Distributed Stochastic Neighbor Embedding (t-SNE), proposed by van der Maaten and Hinton [17], is a visualization method for high-dimensional data. These two methods can visually demonstrate the similarity between the observed spectra and the generated spectra.
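As a concrete illustration of the dimensionality reduction used for visualization, a minimal PCA projection can be written with numpy via the SVD (this is a generic sketch, not the authors' implementation; the data here are mock spectra):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto their top principal components via SVD."""
    Xc = X - X.mean(axis=0)          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # coordinates in the PC subspace

# Example: project 100 mock "spectra" of 3522 flux values onto 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3522))
Y = pca_project(X, 2)
print(Y.shape)  # (100, 2)
```

Plotting the two resulting coordinates for observed and generated spectra in one scatter plot is what makes their overlap (or lack of it) visible.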
Simultaneously, we use the generated spectra to augment the training data of a classifier in order to quantitatively verify their value. A fully connected neural network is a commonly used feature extractor; through multiple fully connected layers, spectral features can be effectively extracted. Training the classifier with and without the generated spectra directly shows the performance improvement brought by data augmentation. The contributions of this paper can be summarized as three-fold:
We used AAE to generate spectral data, and the model performed well with various kinds of spectral data, providing new ideas for the generation of spectral data.
From a qualitative and quantitative perspective, we proved the high quality of the generated spectra and the effectiveness and robustness of the AAE.
Our work provides a new direction for the combination of astronomy and machine learning.
2. Method
In this work, we propose to use an Adversarial AutoEncoder (AAE) to generate spectral data. The model is composed of a generator and a discriminator: the generator is an AutoEncoder consisting of an encoder and a decoder, and the discriminator is a GAN discriminant network. The AAE does not directly train the network to generate spectral data. Instead, the output of the encoder is constrained to conform to a preselected prior distribution through the game between the discriminator and the generator. The network parameters of the AutoEncoder are continuously optimized to make the output of the decoder as consistent as possible with the input of the AutoEncoder. Finally, the decoder is obtained as a spectral data generator, which can stably decode any vector conforming to the prior distribution into high-quality spectral data.
In this study, we use two fully connected neural networks to form the encoder and decoder of the AutoEncoder, and the GAN discriminant network is composed of a two-layer fully connected neural network. The model is shown in Figure 2. The training process is divided into two stages: the reconstruction stage, which aims to obtain a decoder that can stably reconstruct the encoding vector into high-quality spectral data, and the regularization stage, which aims to constrain the encoding vectors produced by the encoder to an artificially selected prior distribution through GAN-style adversarial training.
Formally, we denote the encoder as Q, the decoder as P, and the discriminator of the GAN as D. The input spectral data and the output of the decoder are denoted as $x$ and $\hat{x}$, respectively. The encoding vectors generated by the encoder are denoted as $z$ and are constrained to a prior distribution $p(z)$; in this paper, $p(z)$ is the standard normal distribution. The parameters of our model are shown in Table 1.
2.1. Reconstruction Stage
In the reconstruction stage, the encoder encodes the discrete spectral data $x$, with 3522-dimensional features, into a 32-dimensional code vector $z$, and a symmetrical decoder then restores it to a 3522-dimensional reconstruction $\hat{x}$. To ensure the data reconstructed by the decoder are similar to the real data, binary cross entropy between the original spectrum and the generated spectrum is used as the loss function to measure their similarity, as shown in Equation (1); the gradient descent algorithm minimizes this loss, and back-propagation updates the parameters.
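The binary cross entropy reconstruction loss of Equation (1) can be sketched in numpy as follows (a sketch assuming the spectra are rescaled to [0, 1], which binary cross entropy requires; the paper's exact normalization is not stated here):

```python
import numpy as np

def bce_loss(x, x_hat, eps=1e-7):
    """Binary cross entropy between the input spectrum x and its
    reconstruction x_hat, both assumed rescaled to [0, 1]."""
    x_hat = np.clip(np.asarray(x_hat), eps, 1 - eps)  # avoid log(0)
    x = np.asarray(x)
    return -np.mean(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([0.2, 0.8, 0.5])
print(bce_loss(x, x))      # minimal when x_hat == x ...
print(bce_loss(x, 1 - x))  # ... and larger for a poor reconstruction
```

For a fixed $x$, this loss is minimized when the reconstruction equals the input, which is exactly the behavior the reconstruction stage optimizes for.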
2.2. Regularization Stage
In the regularization stage, we conduct adversarial training similar to GAN, where the encoder of the AutoEncoder acts as the generator and the GAN discriminant network is used as the discriminator.
First, standard normal code vectors of size 32 are randomly generated as positive samples, and the code vectors produced by the encoder are used as negative samples. The discriminator then extracts features from the input encoding vector and applies a Sigmoid activation in its last layer to normalize the final output to the range [0, 1]. This output represents the probability that the input is true, that is, whether the input is drawn from the standard normal distribution or produced by encoding real spectral data. The parameters of the discriminator are therefore optimized according to the idea of a zero-sum game, as shown in Equation (2), which represents the probability that the discriminator correctly recognizes the positive samples as true and the generated codes as fake.
When training the generator, we improve its ability to confuse the discriminator by optimizing the parameters of the encoder so as to minimize the probability of the discriminator successfully identifying the generated data. As a result, the encoding vector $z$ generated by the encoder conforms to the prior distribution $p(z)$ as much as possible, as shown in Equation (3), which represents the probability that the generated data are recognized as fake by the discriminator.
During model training, the ability of the decoder to reconstruct the code vector $z$ into a spectrum continuously improves, and the distribution of code vectors produced by the encoder simultaneously approaches the standard normal distribution. Through this model, we obtain a decoder that serves as the spectrum generator and can reconstruct any vector conforming to the standard normal distribution into a high-quality target spectrum. When generating spectra, we first randomly generate a set of vectors conforming to the standard normal distribution as encoding vectors, and then directly use the trained decoder as the generator to produce the spectra. The detailed optimization procedure is summarized in Algorithm 1.
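The two adversarial objectives can be sketched in numpy; here `d_real` and `d_fake` stand for the discriminator's Sigmoid outputs on prior samples and on encoder codes (illustrative names, not from the paper), and the generator loss is written in the saturating form that matches the text, whereas implementations often use the non-saturating `-log D(z)` variant:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Equation (2)-style objective: the discriminator should score
    prior samples (d_real) as true and encoder codes (d_fake) as fake."""
    d_real = np.clip(np.asarray(d_real), eps, 1 - eps)
    d_fake = np.clip(np.asarray(d_fake), eps, 1 - eps)
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake, eps=1e-7):
    """Equation (3)-style objective: the probability that the generated
    codes are recognized as fake; minimized as D(z) approaches 1."""
    d_fake = np.clip(np.asarray(d_fake), eps, 1 - eps)
    return np.mean(np.log(1.0 - d_fake))

# A confident discriminator yields a lower loss than a confused one.
print(discriminator_loss([0.9], [0.1]) < discriminator_loss([0.5], [0.5]))  # True
```

Alternating gradient steps on these two losses is the "dynamic game" through which the code distribution is pushed toward the prior.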
Algorithm 1 Adversarial AutoEncoder Training Strategy.
Input: Target spectral data $x$;
Output: Spectrum $\hat{x}$ generated by AAE;
1: for each epoch do
2:   for each mini-batch do
       Reconstruction phase:
3:     Encode $x$ to $z$ by Q;
4:     Decode $z$ to $\hat{x}$ by P;
5:     Compute the reconstruction loss between $x$ and $\hat{x}$ as in Equation (1) and update the encoder and decoder;
       Regularization phase:
6:     Randomly choose vectors $z'$ from a Gaussian distribution as true data;
7:     Generate vectors $z$ from $x$ as false data by the generator G (the encoder Q);
8:     Combine $z'$ and $z$ as training data Z;
9:     Discriminate Z and compute the regularization losses as in Equations (2) and (3), then update the discriminator D and the generator G;
10:  end for
11: end for
Generate data:
12: Randomly choose vectors $z$ from a Gaussian distribution;
13: Use the decoder P as generator to transform $z$ to $\hat{x}$ as generated data;
14: return $\hat{x}$
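The generation step at the end of Algorithm 1 amounts to sampling standard-normal codes and decoding them. With a stand-in linear map as a placeholder for the trained decoder P (the real decoder is a trained neural network):

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder for the trained decoder P: a fixed linear map from the
# 32-dimensional code space to the 3522-point spectrum space.
W = rng.normal(size=(32, 3522))

def decode(z):
    return z @ W  # stand-in for the trained decoder network

# Sample codes from the standard normal prior and decode them.
z = rng.standard_normal(size=(10, 32))
spectra = decode(z)
print(spectra.shape)  # (10, 3522)
```

Because the regularization stage forces the encoder's code distribution toward the same prior, codes sampled this way land in the region the decoder was trained on.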
2.3. Visualization of Dimensionality Reduction
PCA and t-SNE convert high-dimensional data to low-dimensional data with little information loss, which is conducive to visualization. The goal of t-SNE is to map the data distribution of the high-dimensional space into the low-dimensional space, taking the difference between the probability distributions of the data in the two spaces as the constraint. We use the KL divergence to represent the difference between the two distributions; the smaller the KL divergence, the smaller the difference:

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}},$$

where $x_i$ and $x_j$ represent two different spectral data points, $y_i$ and $y_j$ correspond to $x_i$ and $x_j$ in the low-dimensional space, $\sigma_i$ represents the variance of the Gaussian distribution centered on $x_i$, $p_{j|i}$ represents the conditional probability that $x_j$ is a neighbor of $x_i$, and $q_{j|i}$ represents the conditional probability that $y_j$ is a neighbor of $y_i$. The gradient descent method is used to minimize the KL divergence. During iterative optimization, the data distribution in the low-dimensional space gets continuously closer to that in the high-dimensional space; finally, the resulting low-dimensional data can be considered the mapping of the high-dimensional data into low dimensions.
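The conditional neighborhood probabilities and the KL objective can be sketched in numpy. This toy version uses a single bandwidth and a Gaussian kernel in both spaces for brevity, whereas full t-SNE sets a per-point bandwidth $\sigma_i$ via perplexity and uses a Student-t kernel in the low-dimensional space:

```python
import numpy as np

def conditional_probs(X, sigma=1.0):
    """Row-stochastic matrix of Gaussian neighborhood probabilities p_{j|i}."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)              # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all ordered point pairs."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 10))  # mock high-dimensional data
Y = X[:, :2]                   # a crude 2-D "embedding" for illustration
P = conditional_probs(X)
Q = conditional_probs(Y)
print(kl_divergence(P, P))     # 0.0: identical distributions
```

In t-SNE proper, the low-dimensional coordinates are the free variables, and gradient descent on this divergence moves them until the two neighborhood distributions agree.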