Article

Analyzing Supervised Machine Learning Models for Classifying Astronomical Objects Using Gaia DR3 Spectral Features

by Orestes Javier Pérez Cruz 1,*, Cynthia Alejandra Martínez Pinto 1,*, Silvana Guadalupe Navarro Jiménez 2, Luis José Corral Escobedo 2 and Minia Manteiga Outeiro 3

1 Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Guzmán, Av. Tecnológico #100, Ciudad Guzmán 49000, Mexico
2 Instituto de Astronomía y Meteorología, Universidad de Guadalajara, Av. Vallarta 2602, Guadalajara 44100, Mexico
3 Departamento de Ciencias de la Computación y Tecnologías de la Información, Universidad Da Coruña, Paseo de Ronda 51, 15001 A Coruña, Spain
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9058; https://doi.org/10.3390/app14199058
Submission received: 30 July 2024 / Revised: 6 September 2024 / Accepted: 25 September 2024 / Published: 8 October 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In this paper, we present an analysis of the effectiveness of various machine learning algorithms in classifying astronomical objects using data from the third release (DR3) of the Gaia space mission. The dataset used includes spectral information from the satellite's red and blue spectrophotometers. The primary goal is to achieve reliable classification with high confidence for symbiotic stars, planetary nebulae, and red giants. Symbiotic stars are binary systems formed by a high-temperature star (a white dwarf in most cases) and an evolved star (a Mira-type or red giant star); their spectra vary between those typical of these components (depending on the orbital phase of the system) and can present emission lines similar to those observed in planetary nebula spectra, which motivated this selection of classes. Several classification algorithms are evaluated, including Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Gradient Boosting (GB), and the Naive Bayes classifier. The evaluation is based on different metrics such as Precision, Recall, F1-Score, and the Kappa index. The study confirms the effectiveness of classifying the mentioned stars using only their spectral information. The models trained with Artificial Neural Networks and Random Forest demonstrated superior performance, surpassing an accuracy rate of 94.67%.

1. Introduction

Gaia, a project of the European Space Agency (ESA), began scientific operations in mid-2014, after a successful commissioning phase. Gaia's primary scientific aim is to scrutinize the kinematic, dynamic, chemical, and evolutionary status of the Milky Way galaxy. Through continuous sky scanning, Gaia gathers astrometric, photometric, and spectroscopic data from an extensive set of stars. Because the mission repeatedly covers extended regions of the sky, the multitemporal nature of its observations facilitates the systematic identification, characterization, and classification of diverse objects, including variable stars [1].
The spectrophotometer is one of the key instruments on the satellite, and it consists of two photometers: one covering the blue region of the electromagnetic spectrum (BP), with wavelengths ranging from 330 to 680 nm, and another in the red region (RP), covering the range of 640 to 1050 nm. These devices generate low-resolution spectra consisting of 62 pixels each [1].
The Gaia DR3 catalog, released on 13 June 2022, represents a significant advancement in astronomical data, surpassing its predecessors in several key aspects. This release not only expands the number of sources studied, but also enhances the quality and diversity of the data provided. The dataset underwent rigorous validation processes to ensure its reliability for scientific use, offering an unparalleled combination of quantity, diversity, and quality in astronomical measurements [2].
For the first time, GDR3 includes calibrated spectra obtained with the blue and red spectrophotometers (BP and RP, respectively) [3]. The internal calibration process standardizes spectra obtained at different times, adjusting them to a common flux and pixel scale. This approach considers variations in the focal plane over time, resulting in an average spectrum of observations for each source [4]. While this method eliminates information on quick spectral variations, it allows for better source identification.
There are some peculiar stars that are difficult to classify using conventional methods due to their unique characteristics. Additionally, the vast amount of spectral information generated by modern telescopes makes it impractical for astronomers to process such data individually. Automatic classification has become imperative in the current era, where a large volume of data needs to be handled. In its DR3 catalog, Gaia released approximately 470 million sources with astrophysical parameters derived from BP/RP spectra [5].
Modern astronomy, driven by humanity’s insatiable curiosity and the desire to understand the universe, is essential for developing a scientific worldview based on empirical observations, verified theories, and logical reasoning [6].
This research specifically focuses on three types of astronomical objects: symbiotic stars, planetary nebulae, and red giants. These types of objects are of interest because they originate from the evolution of low-mass stars. As detailed below, symbiotic stars are distinguished by their binary nature. However, the similarity of their spectra sometimes leads to misclassification.
  • Red giant (RG). These stars evolve from low- to intermediate-mass stars (0.8 to 8 M⊙) and have a larger radius but a lower surface temperature than their progenitors. As stars like our Sun evolve, they deplete the hydrogen that sustains the nuclear fusion occurring at their core. This causes the core to contract, increasing temperature and pressure and leading to the expansion of the outer layers. As they evolve further, their outer layers are expelled in the form of low-speed winds that form envelopes around the remaining high-temperature star (a white dwarf), thus becoming a planetary nebula. White dwarfs represent the final stage in the life cycle of stars like our Sun. These stars return a significant amount of fusion products to the interstellar medium, where these elements then reside [7].
  • Planetary nebulae (PN). A planetary nebula is an expanding ionized circumstellar cloud ejected during the asymptotic giant branch (AGB) phase of its progenitor star, a star with an initial mass below 8 to 9 solar masses [8]. A residual remnant of the star persists in the form of a white dwarf, characterized by its elevated temperature. The ultraviolet (UV) photons emitted by this star ionize the envelope, and the recombination processes in the ionized gas cause emissions in the visible and infrared (IR) range. Additionally, the physical conditions in these ionized envelopes are such that "forbidden" lines of oxygen, nitrogen, sulfur ([O II], [O III], [N II], [S II]), etc., are also produced, which are characteristic of these nebulae. These nebulae generally form rings or bubbles, but depending on the characteristics of the surrounding material or the binary nature of the progenitor (as in the case of symbiotic stars), they can also have elliptical, bipolar, and even quadrupolar or more complex morphologies [9].
  • Symbiotic stars (SS). These are stellar systems composed of two separate stars orbiting their common center of mass. They consist of an evolved red giant star (spectral type K or M) that loses and transfers mass to its companion, typically a hot white dwarf that emits a significant fraction of its energy as ionizing photons. Occasionally, a significant wind is detected in this component. The term "stellar symbiosis" is used because each star depends on and influences the evolution of the other. Understanding the processes of mass transfer and accretion in these systems is essential not only for understanding the evolution of stars in general, but also for understanding any binary interaction involving evolved giants [10]. It is also important to know the fraction of objects of this type, both to count them among binary (or multiple) systems and to test theories of star formation and evolution.
The astronomical objects, PN and SS, are difficult to distinguish from each other due to their shared characteristics and their low representation compared to other celestial objects.
Spectroscopy is a powerful observational technique that provides the ability to obtain and analyze the emission of objects across all wavelengths. In the visible and IR region, it enables the analysis of ionized or neutral chemical elements, as well as the determination of the physical characteristics of the gas. In the mid to far-infrared range, it allows for the analysis of the dust present around objects [11].
PN and SS are characterized by intense emission lines in their visible spectrum, although they can be easily confused with each other. Moreover, photometrically, they can be mistaken for stars of other types, such as red giants or main sequence stars (MS). As mentioned earlier, it is important to identify and quantify these objects as they are the primary contributors to the chemical evolution of galaxies.
To identify the best classification models, several machine learning techniques were analyzed, including Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Gradient Boosting, and Naive Bayes Classifier. The models were trained using solely the spectral information of stars of this type, which was obtained from the Gaia DR3 Catalog.
It is important to note that the spectral resolution of the Gaia dataset used in this study is relatively low compared to that obtained with other instruments on Earth or in orbit. A higher spectral resolution would make it easier to differentiate and distinguish these objects. However, those instruments do not have the coverage and depth that Gaia provides, as Gaia has observed around 220 million low-resolution BP/RP spectra, reaching a magnitude of G < 17.65 [12]. The results were satisfactory, achieving high classification precision values. The resulting models are a valuable tool that can support the classification of these types of peculiar stars.

2. Materials and Methods

2.1. Acquisition and Processing of Data

The first step was data acquisition, with data retrieved from the Gaia DR3 Catalog. Initially, the stars were identified using SIMBAD (https://simbad.cds.unistra.fr/simbad/, accessed on 16 January 2024), a dynamic database that provides information on astronomical objects published in scientific articles and in free databases [13].
Subsequently, a crossmatch was performed in the xp_continuous_mean_spectrum table within Gaia DR3 to determine the astronomical objects by star type. The table contains the mean BP and RP spectra based on the continuous representation in basis functions [3]. Table 1 displays some of its columns, which include the necessary information used to reconstruct the calibrated spectra of the astronomical objects.
The calibrated spectra are represented as a linear combination of basis functions instead of using the conventional flux and wavelength table. This approach helps to avoid potential loss of information when sampling the spectra [4]. The pseudowavelength, denoted as $u$, is used to represent the spectrum $h_{s\kappa}$ of a source $s$ observed in calibration unit $\kappa$. In this representation, the spectrum is transformed into a linear combination of bases $\sum_{n=0}^{N} b_{s,n}\,\varphi_n$, which is defined as the mean spectrum and can be expressed by the following equation:
$$h_{s\kappa}(u_i) = \sum_{n=0}^{N} b_{s,n} \sum_{j=-J}^{J} A_{\kappa}(u_i, u_{i+j})\, \varphi_n(u_{i+j}),$$
where:
$b_{s,n}$ represents the spectral coefficients of the source spectrum $s$;
$\varphi_n$ is a linear combination of basis functions;
$A_{\kappa}$ represents the convolution kernel, which can be expressed as a linear combination of polynomial basis functions,
$$A_{\kappa}(u_i, u_{i+j}) = \sum_{l=0}^{L} c_{jl}\,(u_i - u_{\mathrm{ref}})^{l},$$
where $u_{\mathrm{ref}}$ is a conveniently chosen reference pseudowavelength, and the $c_{jl}$ coefficients are defined as polynomials in the AC (across-scan) coordinate [4].
The following quantities of spectra were retrieved per type of stars from the Gaia DR3 (GDR3) catalog, available on the Gaia Archive website: 201 symbiotic stars, 574 planetary nebulae, and 69,146 red giants. This count resulted from the crossmatch between the SIMBAD and GDR3 databases. As can be seen, the number of red giants is considerably higher compared to the other types of stars. This is because they have a greater representation in our galaxy. Therefore, a subset of these red giants was selected, specifically a sample of 1200. Table 2 displays a comparative class distribution between originally downloaded stellar spectra and those selected for the initial analysis dataset, illustrating the imbalance in the representation of different star types.

2.2. Data Preprocessing

The raw downloaded spectra were internally calibrated within each wavelength range of BP and RP. These were processed using the GaiaXPy library, where each spectrum is calibrated and sampled to a default uniform wavelength grid using the calibrate routine, resulting in a single spectrum on the wavelength range covered by BP and RP.
This calibration and sampling process generates flux values for all the sampled absolute spectra, resulting in a total of 343 values per spectrum. The default sampling was used, resulting in a wavelength range from 336 to 1020 nm, with a 2 nm increment between each sampling point.
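As an illustration of this step, the following sketch assumes the continuous mean spectra have already been exported from the Gaia Archive to a local CSV file (the file name is hypothetical); GaiaXPy's calibrate routine then returns the sampled absolute spectra together with the common wavelength grid.

```python
# Minimal sketch of the calibration step, assuming the xp_continuous_mean_spectrum
# rows were previously exported to 'xp_continuous_raw.csv' (hypothetical file name).
from gaiaxpy import calibrate

# calibrate() applies the external calibration and samples every spectrum on the
# default grid of 343 points (roughly 336-1020 nm in 2 nm steps, as described above).
calibrated_df, sampling = calibrate("xp_continuous_raw.csv", save_file=False)

print(sampling.shape)           # (343,) sampled wavelengths in nm
print(calibrated_df.head())     # columns include 'source_id' and the sampled 'flux' array
```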
To improve the performance and stability of machine learning algorithms during training and inference, min–max normalization was applied to the flux values of each spectrum, setting a scale of 0–1 [14]. This approach expressed all spectrum values as a fraction of the maximum value, establishing a common scale across different spectra (See Figure 1). This prevents certain variables from dominating others due to their absolute values. The following equation demonstrates how min-max normalization is applied:
$$X_{\mathrm{scaled}} = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}$$
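A minimal NumPy sketch of this per-spectrum scaling is shown below; the spectra array is a random stand-in for the matrix of calibrated fluxes.

```python
import numpy as np

def minmax_normalize(flux: np.ndarray) -> np.ndarray:
    """Scale one spectrum to the 0-1 range following the equation above."""
    return (flux - flux.min()) / (flux.max() - flux.min())

# Stand-in for the (n_spectra, 343) matrix of calibrated fluxes.
spectra = np.random.rand(5, 343)

# Applied row by row, so each spectrum is expressed relative to its own flux range.
normalized = np.apply_along_axis(minmax_normalize, 1, spectra)
```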
The normalization process allows for a clearer distinction between different types of spectra, even at a glance. The spectrum of the PN is predominantly composed of emission lines, while the RG spectra mainly exhibit a continuum with many absorption lines and/or bands. On the other hand, the SS spectra represent a combination of both, displaying emission lines and absorption bands in the IR region of the spectrum (700 to 1000 nm).
This dataset consists of 1975 records representing the spectra of the target stars. Each record is composed of 343 features, corresponding to the normalized flux values within the wavelength range of 336 to 1020 nm.
Furthermore, an extra column was incorporated in the dataset, which contains the corresponding labels for the star types. These labels are crucial for identifying and classifying each spectrum based on its category, enabling the utilization of supervised algorithms for training and prediction purposes.
The data exhibit a notable class imbalance, as there is a significant difference in the number of spectra for each star type. Data imbalance can have a detrimental impact on the performance of machine learning algorithms because they may struggle to learn patterns for minority classes and make inaccurate decisions about them. This assertion is validated in Section 3.3, which further substantiates the analysis through weighted-loss calculations and presents the outcomes of a ten-fold cross-validation procedure, assessing the findings by computing mean and standard deviation values.
To mitigate this issue and improve classification accuracy, data balancing techniques were implemented, such as oversampling the minority class and undersampling the majority class, ensuring an equitable distribution between both classes [15]. This approach allows classification algorithms to receive a balanced representation of the classes during training.
In addition to the data balancing techniques applied in this study, such as oversampling of minority classes and synthetic data generation with noise, it is important to consider other methods to address the imbalance. An alternative approach that can be effective is the use of weighted loss during model training. This technique assigns a higher weight to minority classes in the algorithm’s loss function, thus compensating for the disproportion in class representation without modifying the original dataset [16].
Our study will adopt a comparative approach, first analyzing the results obtained with the original imbalanced dataset and then comparing them with the outcomes after applying class balancing techniques. This methodology will allow us to objectively assess the impact of class imbalance on our specific problem and justify any decisions regarding the use of data balancing techniques.
The original dataset exhibits a significant imbalance in the representation of different star types, as illustrated in Table 2. This imbalance poses potential challenges for training machine learning algorithms, as it could lead to bias towards the majority class (red giant stars) and poor performance in classifying minority classes (symbiotic stars and planetary nebulae).
To address this issue, we propose a two-phase approach instead of immediately applying class balancing techniques:
  • Initial analysis with imbalanced data—First, we will train and evaluate our models using the original imbalanced dataset. This will allow us to establish a baseline performance and assess the actual impact of class imbalance on our specific problem;
  • Comparison with balanced data—If significant bias or poor performance is observed in the minority classes, we will proceed to apply class balancing techniques. We will use the oversampling method for minority classes, as described earlier, and compare the results with the original dataset.
As will be demonstrated in the following section, due to the class imbalance and the suboptimal performance exhibited by some algorithms on the imbalanced dataset, a decision was made to construct a balanced dataset. This new dataset ensures that each star type is represented by 1000 samples. The choice of selecting one thousand objects per class is based on several key factors. Firstly, this number is large enough to provide a representative and robust sample of each class, allowing machine learning algorithms to adequately capture the features and variability of the data. Additionally, having one thousand objects per class ensures a proper balance, mitigating bias toward the majority class and enhancing the model’s ability to generalize and correctly recognize objects from minority classes [17].
This balanced approach aims to address both the inherent class imbalance in the original data and the performance issues observed with certain algorithms (SVM and Naive Bayes), potentially leading to more accurate and reliable classification results across all star types. The comparative results of the algorithms’ performance on both the original imbalanced dataset and this new balanced dataset will be presented and discussed in detail in the Results section.
In the case of red giants, there was no issue, as the recovered quantity exceeded this number, so samples were randomly selected until the desired quantity was reached. However, for symbiotic stars and planetary nebulae the number of samples was insufficient, and new spectra were therefore generated from the original ones. To achieve this, white noise was added: a sequence of random numbers was generated following a normal distribution with mean 0 and a variable standard deviation ranging from 0.01 to 0.05.
The process of generating new spectra involved combining the original data with the generated white noise (see Figure 2), and this allowed for an expansion of the dataset and a balancing of the classes, ensuring that all categories were adequately represented.
Applying the same standard deviation to all spectra ensures that all samples receive the same amount of added random variability. This avoids possible biases or excessive differences between the generated spectra, which could affect the interpretation and comparison of the results. These standard deviation values are in line with typical noise fluctuations observed in many scientific experiments and spectroscopic measurements.
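The augmentation procedure can be sketched as follows; the input array and the number of noisy copies per spectrum are illustrative assumptions rather than values taken from the study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_white_noise(flux: np.ndarray, sigma: float) -> np.ndarray:
    """Return a synthetic spectrum: the original flux plus N(0, sigma) noise."""
    return flux + rng.normal(loc=0.0, scale=sigma, size=flux.shape)

# Stand-in for the normalized spectra of a minority class (e.g., symbiotic stars).
original = np.random.rand(201, 343)

# Generate noisy copies with standard deviations in the 0.01-0.05 range mentioned
# above; the number of copies per spectrum shown here is an illustrative value.
copies_per_spectrum = 4
synthetic = np.array([add_white_noise(flux, rng.uniform(0.01, 0.05))
                      for flux in original
                      for _ in range(copies_per_spectrum)])
```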
This new balanced dataset, like the previous one, consists of 343 features representing the flux values, plus an additional column containing the spectrum label. In this case, a final count of 3000 spectra was achieved, with an equal distribution of 1000 spectra for each object type. This balanced dataset ensures that each type of star is adequately and proportionally represented, which is crucial to avoiding biases and enabling more accurate analysis and modeling. Table 3 presents the distribution of stellar spectra in the balanced dataset, categorized by star type. It illustrates the composition of each class, distinguishing between original spectra obtained from observations and those synthetically generated to achieve balance.

2.3. Exploring Class Differences through t-SNE Visualization

To analyze potential differences between classes in our study, we employed the t-SNE (t-Distributed Stochastic Neighbor Embedding) algorithm. t-SNE is a popular unsupervised machine learning technique for data visualization and dimensionality reduction [18]. We applied t-SNE to our dataset, projecting it into a two-dimensional space. By analyzing the resulting plots, we were able to identify clusters or groupings of samples that shared similar characteristics. These clusters provided insights into the presence of distinct classes and shed light on the differences between them. Additionally, t-SNE enabled the identification of outliers or samples that deviated from the main clusters.
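For reference, the two-dimensional projection can be reproduced with scikit-learn as sketched below; X is a random stand-in for the matrix of normalized fluxes.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the (n_samples, 343) matrix of normalized fluxes.
X = np.random.rand(300, 343)

# Embed the spectra in two dimensions for visual inspection of class clusters.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)   # (300, 2)
```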
The t-SNE projection of the balanced data shows a significant improvement in the separation and definition of class clusters. This indicates that the classes are more distinguishable from each other compared to the unbalanced data. The resulting visualization provides a clearer representation of the inherent differences between the classes (see Figure 3).
However, despite this improvement, the presence of overlapping points between the classes can still be observed. This suggests that there may be inherent similarities or shared characteristics between certain samples from different classes. These areas of overlap indicate that the boundaries between the classes are not clearly defined and may represent cases where classification is more challenging.
It is important to note that, when applying machine learning algorithms to classify these classes, it is possible to achieve good overall results due to the improved separation and definition of the clusters, but it is also normal to expect some classification errors due to the presence of overlapping points and similarities between the classes. Nevertheless, classification performance is expected to exceed what the t-SNE projection suggests, since the classifiers use the full set of features, whereas the visualization retains only two embedded dimensions.

2.4. Analysis and Selection of Algorithms

The formed datasets were each divided into two subsets. The first subset, representing 80% of the total samples, served as the training set, which was used to train the various machine learning algorithms. This 80/20 split was chosen based on the Pareto principle or 80/20 rule, a common practice in machine learning. This division strikes a balance between having enough data to train robust models and retaining an adequate amount for subsequent validation [19]. During the training process, the algorithms learn patterns and relationships in the data to make predictions or decisions based on new data. The goal is for the algorithms to capture the underlying patterns in the training data and be able to generalize that knowledge to unseen data.
The other subset, representing the remaining 20%, was reserved for testing purposes. This dataset is used exclusively for evaluation and is not used during training. The aim of testing is to determine whether the algorithms have successfully learned and generalized without overfitting. Using a separate test set helps detect whether an algorithm has overfitted the training data and provides a more realistic estimate of its performance.
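In scikit-learn, the split described above can be sketched as follows; the arrays and the integer label encoding are stand-ins for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the normalized fluxes and integer class labels (encoding is illustrative).
X = np.random.rand(3000, 343)
y = np.random.randint(0, 3, size=3000)

# 80/20 split; stratify keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
```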
For the analysis, the following supervised ML algorithms were used for classification: Random Forest, Support Vector Machine, Artificial Neural Networks, Gradient Boosting, and Naive Bayes. The selection of these algorithms provides a diverse combination of classification approaches, allowing for the evaluation and comparison of their performance on the test set. This enables us to obtain a more comprehensive understanding of their classification capability and determine which one best suits our specific problem.

2.4.1. Random Forest Algorithm

Random Forest is a supervised machine learning algorithm that combines tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [20]. Decision trees tend to overfit, meaning they learn the training data accurately but struggle to apply that knowledge to new data. However, it is possible to enhance their generalization ability by combining multiple trees into a set. This technique, known as an ensemble, has proven to be highly effective in various problems, striking a balance between ease of use, flexibility, and the ability to apply learning to different situations.
An advantage of this algorithm is that it does not require scaled data. However, in our case, the data were normalized, which allows all parameters to have equal importance. Several training tests were conducted by varying the parameters provided to the algorithm in each case (See Table 4).
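A hedged scikit-learn sketch of one such training run is given below; the hyperparameter values are illustrative, and the configurations actually explored are those listed in Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data with the same shape as the normalized spectra.
X_train = np.random.rand(2400, 343)
y_train = np.random.randint(0, 3, size=2400)

# Illustrative hyperparameters; see Table 4 for the configurations evaluated.
rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)
rf.fit(X_train, y_train)
```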

2.4.2. Support Vector Machine Algorithm

Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for data classification. Instead of directly operating on the original data, SVM represents them as points in a multi-dimensional space [21]. Each feature becomes a coordinate of these points, enabling us to visualize and analyze the relationships between variables. The goal of SVM is to find the hyper-plane that optimally separates the classes.
Different parameter tests were conducted, using different kernels for each one. Table 5 displays the analyzed configurations.
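One such configuration can be sketched as follows; the kernel and regularization values are placeholders rather than the exact settings of Table 5.

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(2400, 343)           # stand-in normalized spectra
y_train = np.random.randint(0, 3, size=2400)  # stand-in class labels

# Placeholder kernel and regularization; Table 5 lists the kernels actually compared.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
```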

2.4.3. Artificial Neural Networks Algorithm

Artificial Neural Networks (ANN) are a subset of machine learning tools and are at the core of deep learning algorithms. Their name and structure are inspired by the human brain, trying to reproduce the way biological neurons send signals to each other. They consist of several layers of nodes, including an input layer, one or more hidden layers, and an output layer. ANNs possess high processing speeds and the ability to learn the solution to a problem from a set of examples [22].
The designed neural network has the following topology: an input layer of 64 neurons, followed by three hidden layers of 32 neurons each. All layers are dense, meaning all neurons are fully connected, and they use the ReLU activation function to introduce nonlinearity into the data. After each dense layer, a Dropout layer is added, which randomly deactivates 10% of the neurons during training. This helps prevent overfitting and improves the generalization ability of the model. The output layer consists of three neurons and uses the softmax activation function, commonly used in multiclass classification problems.
To compile the model, the “adam” optimizer is used, which is an optimization algorithm that adjusts the weights of the neural network during training. The loss function is set as “sparse_categorical_crossentropy”, which is suitable for multiclass classification problems with integer labels. Table 6 showcases the neural network configuration.
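Based on this description and Table 6, the topology can be sketched in Keras as follows; this is an illustrative reconstruction, not the authors' exact code.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dense(64) followed by three Dense(32) layers, ReLU activations, 10% Dropout
# after every dense layer, and a 3-neuron softmax output, as described above.
model = keras.Sequential([
    layers.Input(shape=(343,)),              # 343 normalized flux values per spectrum
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.10),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.10),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.10),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.10),
    layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```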

2.4.4. Gradient Boosting Algorithm

Gradient Boosting is an algorithm that focuses on numerical optimization of the function space rather than the parameter space. It is based on additive stage-wise expansions and aims to find an approximation of the objective function that minimizes a specific loss function. It works iteratively, where, at each stage, a new component is added to the existing approximation, adjusting it based on the gradient of the loss function. This allows for a gradual improvement of the approximation, and achieves competitive results in both regression and classification problems [23].
Different parameter combinations were tested, using various loss functions and different learning rates, among others, resulting in the following configurations (See Table 7).
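A scikit-learn sketch of one Gradient Boosting configuration is given below; the learning rate and tree depth shown are illustrative, with the combinations actually evaluated listed in Table 7.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X_train = np.random.rand(2400, 343)           # stand-in normalized spectra
y_train = np.random.randint(0, 3, size=2400)  # stand-in class labels

# Illustrative settings; Table 7 lists the loss functions and learning rates tested.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
```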

2.4.5. Naive Bayes Algorithm

The Naive Bayes classifier is a mathematical classification technique widely used in machine learning. It is based on Bayes' Theorem and uses probabilistic calculations to find the most appropriate classification for a given dataset within a problem domain. It is very useful for cases where the number of target classes is greater than two, making it suitable for real-life classification applications [24]. The algorithm was trained with the configuration shown in Table 8.
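As an illustration, a Gaussian Naive Bayes classifier (a common variant for continuous inputs such as flux values; the exact configuration used in the study is the one in Table 8) can be trained as follows.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_train = np.random.rand(2400, 343)           # stand-in normalized spectra
y_train = np.random.randint(0, 3, size=2400)  # stand-in class labels

# GaussianNB assumes conditionally independent, normally distributed features.
nb = GaussianNB()
nb.fit(X_train, y_train)
```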

3. Results

3.1. Definition of Metrics

To compare the accuracy of the previously presented algorithms, the following metrics were used: Precision, F1-score, Recall (Sensitivity), and Cohen's Kappa coefficient. These metrics were calculated from the confusion matrix obtained after evaluating the algorithms on the test dataset. This matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each class.
Precision is a metric that calculates the proportion of positive predictions that are actually correct [25]. It is useful for evaluating how reliable a classifier's positive predictions are. The formula is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall is a metric that measures the proportion of positive instances that were correctly identified [25]. It is useful for evaluating the classifier’s ability to find all relevant samples of a specific class. The formula is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1-score is a measure that combines Precision and Recall into a single value. It provides a balanced measure between the classifier’s precision and recall capabilities [25]. It is particularly useful when the dataset is imbalanced in terms of classes. The formula is as follows:
$$F1\text{-}\mathrm{Score} = \frac{TP}{TP + \frac{1}{2}(FP + FN)}$$
Cohen’s Kappa coefficient is a measure that expresses the level of agreement between two annotators in a classification problem [26]. It is defined as follows:
$$\kappa = \frac{2 \times (TP \times TN - FN \times FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$$
In our multiclass classification study, the evaluation metrics commonly used for binary classification problems were adapted for application in a multiclass context. To assess the performance of our classification model, we used the macro-averaged technique [27].
The choice of this approach was based on the need to treat all classes equally during evaluation, regardless of their size or data distribution. First, the metrics were calculated in a binary manner for each class individually, and then these metrics were averaged to obtain an overall evaluation of the model.
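With scikit-learn, the macro-averaged metrics and the Kappa coefficient can be computed as sketched below; the label vectors are tiny stand-ins for the true and predicted classes of the test set.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score)

# Tiny stand-ins for the true and predicted labels of the 20% test set.
y_test = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# average="macro" computes each metric per class and takes the unweighted mean,
# treating all classes equally regardless of their size.
precision = precision_score(y_test, y_pred, average="macro")
recall    = recall_score(y_test, y_pred, average="macro")
f1        = f1_score(y_test, y_pred, average="macro")
kappa     = cohen_kappa_score(y_test, y_pred)
```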
It is important to note that results from models trained on balanced and imbalanced datasets are not directly comparable due to differences in data distribution and model behavior. In addition to evaluating the performance, this study highlights how data balancing influences a model’s ability to generalize and accurately classify instances. By presenting the results separately, we aim to demonstrate the impact of data distribution on model performance and provide insights into the advantages and limitations of using balanced datasets.

3.2. Evaluation of Trained Models

3.2.1. Random Forest

Figure 4 displays the confusion matrix generated when running the Random Forest algorithm. It can be observed that the classification of symbiotic stars and planetary nebulae was less precise. However, in the balanced dataset, although errors are still present, they are fewer than in the imbalanced dataset. This is reflected in the improved Recall value, which increased from 0.8575 to 0.9467, indicating a higher capability of the model to correctly identify stars. Table 9 presents the values obtained for all evaluated metrics using the Random Forest algorithm.

3.2.2. Support Vector Machine

The SVM algorithm exhibited poor performance in the tests conducted with the imbalanced dataset. In this scenario, the model tended to overfit and failed to generalize correctly, resulting in an incorrect classification of 12% of the total samples using the imbalanced dataset (See Figure 5). However, when using the balanced dataset, a significant improvement in performance was observed, with a considerable reduction in the classification error (6.16%). The recall value also improved from 0.7807 to 0.9383. Table 10 presents the values obtained for all evaluated metrics using the SVM algorithm.

3.2.3. Gradient Boosting

The Gradient Boosting algorithm has shown similar performance to Random Forest in our study. However, there is a presence of false positives due to the imbalance in the training data. Nevertheless, significant improvements are achieved when using the balanced dataset (See Figure 6). The F1-Score value increased from 0.8666 to 0.9283. This increase indicates a higher capability of the algorithm to correctly identify the relevant stars, thereby reducing false positives. Table 11 presents the values obtained for all evaluated metrics using the Gradient Boosting algorithm.

3.2.4. Naive Bayes

The Naive Bayes algorithm achieved poor performance compared to the other algorithms used in this study. One of the reasons for this low performance is its assumption of independence between features. This assumption may not hold true in many real-world cases, resulting in poor generalization and unsatisfactory results, as evidenced by the analysis of the confusion matrix in Figure 7. Even when conducting tests using the balanced dataset, Naive Bayes fails to achieve adequate generalization compared to the previously used algorithms. This is reflected in lower F1-Score values, with 0.7877 for imbalanced data and 0.8005 for balanced data. Table 12 presents the values obtained for all evaluated metrics using the Naive Bayes algorithm.

3.2.5. Artificial Neural Networks

The Artificial Neural Networks algorithm showed superior results compared to the previous algorithms, achieving the highest overall precision values. This is evident in the confusion matrix shown in Figure 8, where a decrease in the number of false positives is observed for both the balanced and imbalanced datasets. The F1-Score values support this claim, at 0.9012 and 0.9533 for the imbalanced and balanced datasets, respectively (See Table 13). Its ability to capture complex relationships among the data and adapt to different patterns positions it as a favorable option, particularly in cases of imbalanced datasets.

3.3. Comparison of the Results

To comprehensively address class disparity, the models were evaluated using the imbalanced dataset with weighted loss, as illustrated in Table 14. This technique assigns different weights to minority and majority classes, allowing it to compensate for the imbalance and improve the model’s performance in predicting minority classes.
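One common way to implement such a weighted loss is through inverse-frequency class weights, as sketched below with scikit-learn; the data arrays are stand-ins and the exact weighting scheme used in the study may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Stand-in imbalanced data, with class proportions roughly mimicking Table 2.
X = np.random.rand(1975, 343)
y = np.random.choice([0, 1, 2], size=1975, p=[0.29, 0.61, 0.10])

# Inverse-frequency weights give minority classes a larger contribution to the loss.
weights = dict(enumerate(
    compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)))

rf = RandomForestClassifier(n_estimators=200, class_weight=weights, random_state=0)
rf.fit(X, y)
```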
Among the various algorithms evaluated, the Random Forest and Artificial Neural Networks stood out as the top performers in handling the class imbalance issue. These two algorithms exhibited remarkable performance, achieving precision values of 0.9467 and 0.9532, respectively, when trained on the balanced dataset. Even with the imbalanced dataset, their performance remained robust—Random Forest achieved a precision of 0.9031, while the Neural Network reached 0.9312. These findings demonstrate the superior effectiveness of these two algorithms in handling imbalanced data compared to other approaches.
Table 15 and Table 16 provide a comprehensive comparison of the metrics obtained using balanced and imbalanced data, respectively. These two standout algorithms will be utilized to validate new spectra candidates for symbiotic stars and planetary nebulae classification in the next section.
Given their outstanding performance, the Random Forest and Artificial Neural Networks models were selected for further validation using a rigorous ten-fold cross-validation approach. This validation was applied exclusively to the balanced dataset, which was augmented with synthetic noisy spectra through oversampling. To ensure the robustness of the evaluation, each original spectrum and its noisy artificial versions were kept within the same fold, so that if an original record appeared in the training set of a given fold, its artificial counterparts could not appear in the test set of that fold. The results of this evaluation are summarized in Table 17 and Table 18.
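One way to satisfy this constraint is a group-aware split in which an original spectrum and all of its noisy copies share a group identifier, sketched below with scikit-learn's GroupKFold; this illustrates the requirement described above rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import GroupKFold

# Stand-ins: balanced spectra, labels, and a group id shared by each original
# spectrum and its synthetic (noise-added) copies.
X = np.random.rand(3000, 343)
y = np.random.randint(0, 3, size=3000)
groups = np.random.randint(0, 900, size=3000)

scores = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(precision_score(y[test_idx], model.predict(X[test_idx]),
                                  average="macro"))

print(np.mean(scores), np.std(scores))
```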
The results obtained through 10-fold cross-validation on the balanced dataset were notable, with mean precision values of 0.92086 for Random Forest and 0.9235 for Artificial Neural Networks. Despite mitigating the potential bias associated with the addition of synthetic data, the results were satisfactory, slightly surpassing those achieved with the imbalanced dataset.

3.4. Second Evaluation of Trained Models

3.4.1. Suspected Symbiotic Stars

To assess the effectiveness and demonstrate the practical application of the trained models, we conducted the classification of two sets of stars suspected to be symbiotic stars. For this task, we utilized the ANN and Random Forest models, which showed superior performance in the earlier tests.
The first dataset used was obtained from the article "A catalogue of symbiotic stars" [28], initially consisting of 30 candidate symbiotic stars. From this original dataset, stars used in the model training process were excluded, as well as those lacking spectral information in GDR3. The result of this careful selection process was a set of 15 stars, which were subsequently evaluated using our previously trained models, as detailed in Table 19. Figure 9 shows the spectra of the stars obtained from this catalogue.
The second dataset was obtained from the article "A machine learning approach for identification and classification of symbiotic stars using 2MASS and WISE" [29]. As with the previous dataset, stars used to train the models were removed, as well as those lacking spectral information in GDR3. The outcome of this selection process was a set of 17 stars, which were subsequently evaluated using the previously trained models, as detailed in Table 20. Figure 10 shows the spectra of the stars obtained from this article.
The use of spectral information in the training of the models allowed for capturing specific characteristics and patterns associated with symbiotic stars. Spectral properties, such as emission and absorption lines, played a crucial role in the classification of the stars.
However, the quality and reliability of the obtained results must be analyzed with caution. Although most of the stars were classified as symbiotic by the models, further validation is necessary to effectively confirm the true nature of the classified stars. To achieve this, new models trained with the whole spectra could be employed.
Furthermore, it is important to consider other types of stars to increase the size of the dataset and achieve the better generalization of the models.

3.4.2. Suspected Planetary Nebulae

In addition to the previous tests, a further analysis was conducted using a carefully selected set of 15 stars that are considered candidates for planetary nebulae. Following this selection process, this group of stars was evaluated using our previously trained models, as detailed in Table 21. This information was obtained from the SIMBAD database.
The purpose of including this group of 15 candidate planetary nebulae in the study was to further expand the understanding of the models and their ability to identify and classify specific celestial objects. Figure 11 shows the spectra of these stars.

4. Discussion

This study focuses on the classification of symbiotic stars, planetary nebulae, and red giants using machine learning techniques, contributing to the growing application of artificial intelligence in astronomy. Unlike previous works centered on more general classifications, this specific approach represents a significant advancement in the classification of stellar objects in advanced evolutionary stages. The results obtained with Random Forest and Artificial Neural Networks surpass previous studies, such as that of Kheirdastan and Bazarghan [30], achieving accuracies above 90% in both balanced and unbalanced datasets, compared to the previously reported 80%.
A distinctive aspect of this work is the explicit approach to the problem of class imbalance, a common challenge with real astronomical data. While Qi used the SMOTE technique to handle imbalance [31], our study directly compares the performance of models on balanced and unbalanced sets, providing valuable insights into the robustness of the algorithms under realistic conditions. The effectiveness demonstrated by Artificial Neural Networks, especially with unbalanced data, aligns with recent trends, such as the work of Zhao et al. on the classification of stellar spectra [32].
Despite the significant achievements, limitations similar to those mentioned by Tamez Villarreal and Barton are acknowledged in terms of the diversity and size of the datasets used. It is suggested that future studies focus on expanding the database, including a greater variety of objects and observational conditions [33], so as to further improve the robustness and generalization of the models.

5. Conclusions

The study focused on the classification of the following astronomical objects: symbiotic stars, planetary nebulae, and red giants. It was observed that red giants were more numerous and represented a larger proportion compared to the other classes. This imbalance caused difficulties in classification, as the models tended to confuse part of the minority classes (symbiotic stars and planetary nebulae) with the majority class (red giants).
It was found that the Random Forest and ANN algorithms showed satisfactory results, with accuracy values of 0.9467 and 0.9532, respectively, for the balanced dataset. On the other hand, for the imbalanced dataset, Random Forest achieved an accuracy of 0.9031, while the ANN achieved an accuracy of 0.9312. Therefore, these algorithms demonstrate greater effectiveness on imbalanced datasets compared to other approaches. ANN proved to be highly effective when working with imbalanced datasets due to their ability to learn complex patterns and adapt to different class distributions.
As a result of our research, two machine learning models were successfully trained to classify peculiar stars with high accuracy. These models have been validated using diverse datasets, demonstrating their effectiveness in real-world scenarios. Additionally, these models can be employed to classify stars that are difficult to categorize, including those without prior classification or those considered potential candidates for peculiar stars. This functionality enhances their effectiveness in identifying and assessing previously ambiguous or uncertain stellar objects.
These classification models present a valuable and highly useful tool for astronomers, providing effective support in the classification of peculiar stars. Their application significantly contributes to a better understanding of the stellar life cycle.

Author Contributions

Data Curation, O.J.P.C.; Formal analysis, O.J.P.C.; Conceptualization, C.A.M.P., S.G.N.J., L.J.C.E. and M.M.O.; Methodology, C.A.M.P., S.G.N.J. and L.J.C.E.; Software, O.J.P.C.; Validation, S.G.N.J. and O.J.P.C.; Visualization, O.J.P.C.; Investigation, O.J.P.C., C.A.M.P. and S.G.N.J.; Resources, C.A.M.P. and M.M.O.; Writing—original draft, O.J.P.C.; Writing—review & editing, C.A.M.P., S.G.N.J., L.J.C.E. and M.M.O.; Supervision, C.A.M.P., S.G.N.J., L.J.C.E. and M.M.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All original contributions and findings presented in this study are documented within the article. The data supporting the results, along with the tests conducted and the corresponding code, are available in the publicly accessible GitHub repository: https://github.com/opcruz/gaiaDR3ML (accessed on 29 July 2024). Should additional details or clarifications be required, further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to acknowledge the use of artificial intelligence, specifically ChatGPT 3.5, in the preparation of this article. AI was utilized to assist us in improving the clarity and readability of certain paragraphs, ensuring the language was more accessible. Additionally, AI supported us in translating specific sections into English to maintain accuracy. However, we would like to emphasize that AI was not used to generate any new information or content; it solely served as a tool to refine the presentation of the material we had already developed. All original research, analysis, and conclusions presented in this article are the work of the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Prusti, T.; de Bruijne, J.H.J.; Vallenari, A.; Babusiaux, C.; Bailer-Jones, C.A.L.; Bastian, U.; Biermann, M.; Evans, D.W.; Eyer, L.; Gaia Collaboration; et al. The Gaia mission. Astron. Astrophys. 2016, 595, A1. [Google Scholar] [CrossRef]
  2. Witten, C.E.C.; Aguado, D.S.; Sanders, J.L.; Belokurov, V.; Evans, N.W.; Koposov, S.E.; Prieto, C.A.; De Angeli, F.; Irwin, M.J. Information content of BP/RP spectra in Gaia DR3. Mon. Not. R. Astron. Soc. 2022, 516, 3254–3265. [Google Scholar] [CrossRef]
  3. Babusiaux, C.; Fabricius, C.; Khanna, S.; Muraveva, T.; Reylé, C.; Spoto, F.; Vallenari, A. Gaia Data Release 3—Catalogue validation. Astron. Astrophys. 2023, 674, A32. [Google Scholar] [CrossRef]
  4. Carrasco, J.M.; Weiler, M.; Jordi, C.; Fabricius, C.; De Angeli, F.; Evans, D.W.; van Leeuwen, F.; Riello, M.; Montegriffo, P. Internal calibration of Gaia BP/RP low-resolution spectra. Astron. Astrophys. 2021, 652, A86. [Google Scholar] [CrossRef]
  5. Gaia Data Release 3 Contents Summary—Gaia-Cosmos. Available online: https://www.cosmos.esa.int/web/gaia/dr3 (accessed on 23 November 2022).
  6. Karttunen, H.; Kröger, P.; Oja, H.; Poutanen, M.; Donner, K.J. Fundamental Astronomy; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar] [CrossRef]
  7. Jastrow, R. Red Giants and White Dwarfs; W. W. Norton & Company: New York, NY, USA, 1990; p. 269. [Google Scholar]
  8. Frankowski, A.; Soker, N. Very late thermal pulses influenced by accretion in planetary nebulae. New Astron. 2009, 14, 654–658. [Google Scholar] [CrossRef]
  9. Kwok, S. The Origin and Evolution of Planetary Nebulae; Cambridge University Press: New York, NY, USA, 2000; Volume 243. [Google Scholar]
  10. Mikolajewska, J. Symbiotic stars: Observations confront theory. arXiv 2012, arXiv:1110.2361. [Google Scholar] [CrossRef]
  11. Carroll, B.W.; Ostlie, D.A. An Introduction to Modern Astrophysics, 2nd ed.; Cambridge University Press (CUP): Cambridge, UK, 2019; pp. 1–1341. [Google Scholar] [CrossRef]
  12. Gaia Collaboration; Vallenari, A.; Brown, A.; Prusti, T. Gaia Data Release 3: Summary of the content and survey properties. Astron. Astrophys. 2022, 674, A1. [Google Scholar] [CrossRef]
  13. Wenger, M.; Ochsenbein, F.; Egret, D.; Dubois, P.; Bonnarel, F.; Borde, S.; Genova, F.; Jasniewicz, G.; Laloë, S.; Lesteven, S.; et al. The SIMBAD astronomical database—The CDS reference database for astronomical objects. Astron. Astrophys. Suppl. Ser. 2000, 143, 9–22. [Google Scholar] [CrossRef]
  14. Gopal, S.; Patro, K.; Sahu, K.K. Normalization: A Preprocessing Stage. Int. Adv. Res. J. Sci. Eng. Technol. 2015, 2, 20–22. [Google Scholar] [CrossRef]
  15. Tyagi, S.; Mittal, S. Sampling Approaches for Imbalanced Data Classification Problem in Machine Learning. In Lecture Notes in Electrical Engineering; Springer: Cham, Switzerland, 2020; Volume 597, pp. 209–221. [Google Scholar] [CrossRef]
  16. Fernando, K.R.M.; Tsokos, C.P. Dynamically Weighted Balanced Loss: Class Imbalanced Learning and Confidence Calibration of Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 2940–2951. [Google Scholar] [CrossRef]
  17. Rahman, M.M. Sample Size Determination for Survey Research and Non-Probability Sampling Techniques: A Review and Set of Recommendations. J. Entrep. Bus. Econ. 2023, 11, 42–62. [Google Scholar]
  18. Van Der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  19. Vrigazova, B. The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems. Bus. Syst. Res. Int. J. Soc. Adv. Innov. Res. Econ. 2021, 12, 228–242. [Google Scholar] [CrossRef]
  20. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  21. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567. [Google Scholar] [CrossRef]
  22. Bishop, C.M. Neural networks and their applications. Rev. Sci. Instrum. 1998, 65, 1803. [Google Scholar] [CrossRef]
  23. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  24. Yang, F.J. An implementation of naive bayes classifier. In Proceedings of the 2018 International Conference on Computational Science and Computational Intelligence, CSCI, Las Vegas, NV, USA, 12–14 December 2018; pp. 301–306. [Google Scholar] [CrossRef]
  25. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
  26. Chicco, D.; Warrens, M.J.; Jurman, G. The Matthews Correlation Coefficient (MCC) is More Informative Than Cohen’s Kappa and Brier Score in Binary Classification Assessment. IEEE Access 2021, 9, 78368–78381. [Google Scholar] [CrossRef]
  27. Farhadpour, S.; Warner, T.A.; Maxwell, A.E. Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. Remote Sens. 2024, 16, 533. [Google Scholar] [CrossRef]
  28. Belczyński, K.; Mikołajewska, J.; Munari, U.; Ivison, R.J.; Friedjung, M. A catalogue of symbiotic stars. Astron. Astrophys. Suppl. Ser. 2000, 146, 407–435. [Google Scholar] [CrossRef]
  29. Akras, S.; Leal-Ferreira, M.L.; Guzman-Ramirez, L.; Ramos-Larios, G. A machine learning approach for identification and classification of symbiotic stars using 2MASS and WISE. Mon. Not. R. Astron. Soc. 2019, 483, 5077–5104. [Google Scholar] [CrossRef]
  30. Kheirdastan, S.; Bazarghan, M. SDSS-DR12 bulk stellar spectral classification: Artificial neural networks approach. Astrophys. Space Sci. 2016, 361, 304. [Google Scholar] [CrossRef]
  31. Qi, Z. Stellar Classification by Machine Learning. SHS Web Conf. 2022, 144, 03006. [Google Scholar] [CrossRef]
  32. Zhao, Z.; Wei, J.; Jiang, B. Automated Stellar Spectra Classification with Ensemble Convolutional Neural Network. Adv. Astron. 2022, 2022, 4489359. [Google Scholar] [CrossRef]
  33. Villarreal, J.T.; Barton, S. Stellar Classification based on Various Star Characteristics using Machine Learning Algorithms. J. Stud. Res. 2023, 12. [Google Scholar] [CrossRef]
Figure 1. The figures represent three different types of stars, (blue) symbiotic star [EM* AS 323], (red) red giant [2MASS J04390924+2847545], and (green) planetary nebula [Hen 2-442]. In the initial diagram, the flux values are denoted in Watts per nanometer per square meter (W/nm/m2), derived from external calibration using the GaiaXPy library. In the adjacent figure, the flux values are normalized to a range between 0 and 1.
Figure 2. The figure displays the original spectrum of the symbiotic star (blue) and the spectrum resulting from the addition of noise (green), following a normal distribution with a mean of 0 and a standard deviation of 0.05.
Figure 3. t-SNE representation of unbalanced and balanced data in a two-dimensional space. The points represent samples from different classes. (Left) t-SNE projection of unbalanced data. A prominent cluster of the RG class is evident in the left side, whereas samples from the PN class are in the upper right region. The SS class exhibits weak clustering, blending with the other classes. (Right) t-SNE projection of balanced data in a two-dimensional space reveals improved separation and clustering of classes. The class clusters are more well-defined, although there still exist points that overlap with other classes.
Figure 4. Confusion matrix resulting from the execution of the Random Forest algorithm. On the left are the results with the unbalanced dataset and on the right with balanced dataset.
Figure 5. Confusion matrix resulting from the execution of the SVM algorithm. On the left are the results derived with unbalanced dataset and on the right with balanced dataset.
Figure 6. Confusion matrix resulting from the execution of the Gradient Boosting algorithm. On the left are the results derived with the unbalanced dataset and on the right with the balanced dataset.
Figure 6. Confusion matrix resulting from the execution of the Gradient Boosting algorithm. On the left are the results derived with the unbalanced dataset and on the right with the balanced dataset.
Figure 7. Confusion matrix resulting from the execution of the Naive Bayes algorithm. On the left are the results derived with the unbalanced dataset and on the right with the balanced dataset.
Figure 8. Confusion matrix resulting from the execution of the ANN algorithm. On the left are the results derived with the unbalanced dataset and on the right with the balanced dataset.
Figure 9. The figure displays the spectra obtained from the Gaia DR3 catalog that were sourced from the article “A catalogue of symbiotic stars” [28].
Figure 10. The figure displays the spectra obtained from the Gaia DR3 catalog that were sourced from the article “A machine learning approach for identification and classification of symbiotic stars using 2MASS and WISE” [29].
Figure 11. The figure displays the spectra of suspected planetary nebulae obtained from the Gaia DR3 catalog, with the candidate list sourced from the SIMBAD online database.
Table 1. Description of the fields of the XP_CONTINUOUS_MEAN_SPECTRUM table of the Gaia DR3 catalog.
Field | Description
source_id | Unique source identifier (unique within a particular data release).
bp_basis_function_id / rp_basis_function_id | Identifier that defines the set of basis functions used to represent the BP/RP spectrum.
degrees_of_freedom | Degrees of freedom for the spectrum representation.
bp_n_parameters / rp_n_parameters | Number of parameters for the BP/RP spectrum representation (number of parameters in the LSQ system).
bp_n_measurements / rp_n_measurements | Number of measurements used for generating the BP/RP spectrum.
bp_n_rejected_measurements / rp_n_rejected_measurements | Number of rejected measurements in the generation of the BP/RP spectrum.
bp_standard_deviation / rp_standard_deviation | Standard deviation for the representation of the BP/RP spectrum.
bp_chi_squared / rp_chi_squared | Chi-square for the representation of the BP/RP spectrum.
bp_coefficients / rp_coefficients | Coefficients (C) of the basis functions for the representation of the BP/RP spectrum.
bp_coefficient_errors / rp_coefficient_errors | Errors in the coefficients of the basis functions for the representation of the BP/RP spectrum.
bp_coefficient_correlations / rp_coefficient_correlations | Upper triangular part of the correlation matrix M of the coefficients C for the basis-function representation of the BP/RP spectrum.
bp_n_relevant_bases / rp_n_relevant_bases | Number of relevant bases for representing this mean BP/RP spectrum.
bp_relative_shrinking / rp_relative_shrinking | Measure of the relative shrinking of the coefficient vector when truncation is applied to the mean BP/RP spectrum.
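As an illustration of how sources with these continuous-spectrum records can be located, a hedged sketch using astroquery is given below; the query, the selected columns, and the row limit are examples only, not the authors' exact selection.

```python
from astroquery.gaia import Gaia

# Example ADQL query: pick a few DR3 sources that have an XP continuous spectrum,
# whose coefficient representation is stored in the fields listed in Table 1.
query = """
SELECT TOP 5 source_id, phot_g_mean_mag
FROM gaiadr3.gaia_source
WHERE has_xp_continuous = 'True'
"""
job = Gaia.launch_job(query)
print(job.get_results())
```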
Table 2. Class distribution between the originally downloaded stellar spectra and those selected for the initial analysis, revealing imbalances in the representation of star types.
Stars | Downloaded | % | Selection | %
Symbiotic Stars | 201 | 0.29% | 201 | 10.18%
Planetary Nebulae | 574 | 0.82% | 574 | 29.06%
Red Giants | 69,146 | 98.89% | 1200 | 60.76%
Total | 69,921 | 100% | 1975 | 100%
Table 3. Class distribution in the balanced dataset, showing the proportion between original stellar spectra and synthetically generated ones for each star type.
Stars | Original | % | Generated | %
Symbiotic Stars | 201 | 20.1% | 799 | 79.9%
Planetary Nebulae | 574 | 57.4% | 426 | 42.6%
Red Giants | 1000 | 100% | 0 | 0.0%
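A minimal sketch of how a minority class could be grown to 1000 spectra with noise-augmented copies, consistent with Figure 2 and Table 3; the function name, the seed, and the placeholder input spectra are assumptions, not the authors' code.

```python
import numpy as np

def balance_with_noise(original, target=1000, sigma=0.05, seed=0):
    """Grow a class to `target` samples by adding Gaussian-noise copies
    (mean 0, standard deviation `sigma`) of the original normalized spectra."""
    rng = np.random.default_rng(seed)
    original = np.asarray(original, dtype=float)
    n_new = max(0, target - len(original))
    picks = rng.integers(0, len(original), size=n_new)
    synthetic = original[picks] + rng.normal(0.0, sigma, size=(n_new, original.shape[1]))
    return np.vstack([original, synthetic])

# e.g., 201 symbiotic-star spectra grown to 1000 (799 synthetic), as in Table 3.
symbiotic = np.random.rand(201, 343)        # placeholder spectra
balanced_symbiotic = balance_with_noise(symbiotic)
print(balanced_symbiotic.shape)             # (1000, 343)
```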
Table 4. Description of the parameters used in the training of the Random Forest algorithm.
Parameters | Selection
Number of trees | [10, 500]
Max depth | Without restrictions
Criterion | gini (Gini impurity)
Maximum features | Square root of the number of features
Maximum number of leaves | Unlimited
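A scikit-learn sketch that mirrors the settings in Table 4; the specific grid points for the number of trees, the random seed, and the placeholder training arrays are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data (in the paper: normalized spectra and class labels).
X_train = np.random.rand(300, 343)
y_train = np.random.randint(0, 3, 300)

rf = RandomForestClassifier(
    criterion="gini",      # Gini impurity
    max_depth=None,        # unrestricted depth
    max_features="sqrt",   # square root of the number of features
    max_leaf_nodes=None,   # unlimited leaves
    random_state=0,
)
# Table 4 reports trees in the range [10, 500]; the grid points below are an assumption.
search = GridSearchCV(rf, {"n_estimators": [10, 100, 250, 500]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```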
Table 5. Description of the parameters used in training the SVM algorithm.
Parameters | Selection
Kernel | [‘linear’, ‘poly’, ‘rbf’]
C (regularization parameter) | 1.0
Polynomial kernel degree | 3
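A corresponding sketch for the SVM settings in Table 5; the placeholder data and the number of cross-validation folds are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train = np.random.rand(300, 343)        # placeholder spectra
y_train = np.random.randint(0, 3, 300)    # placeholder labels

svm = SVC(C=1.0, degree=3)                # degree only affects the 'poly' kernel
search = GridSearchCV(svm, {"kernel": ["linear", "poly", "rbf"]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```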
Table 6. Description of the parameters used in training the ANN algorithm.
Parameters | Selection
Layers | (343-64-32-32-3)
Activation function | ReLU
Loss function | sparse_categorical_crossentropy
Optimizer | adam
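A Keras sketch matching the architecture in Table 6; the softmax output activation, the number of epochs, the batch size, and the placeholder data are not stated in the table and are therefore assumptions.

```python
import numpy as np
from tensorflow import keras

X_train = np.random.rand(300, 343).astype("float32")   # placeholder spectra
y_train = np.random.randint(0, 3, 300)                  # integer labels: SS / PN / RG

model = keras.Sequential([
    keras.layers.Input(shape=(343,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # softmax output is an assumption
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)  # epochs/batch assumed
```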
Table 7. Description of the parameters used in training the Gradient Boosting algorithm.
Parameters | Selection
Loss function | [‘log_loss’, ‘deviance’, ‘exponential’]
Number of estimators | [100–500]
Learning rate | [0.1–0.5]
Max depth | 3
Number of features | All features
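A sketch for the Gradient Boosting settings in Table 7. Note that in scikit-learn ‘deviance’ is the older alias of ‘log_loss’ and ‘exponential’ supports only binary problems, so the grid below searches ‘log_loss’ alone for this three-class task; the specific grid points and placeholder data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(300, 343)
y_train = np.random.randint(0, 3, 300)

gb = GradientBoostingClassifier(max_depth=3, max_features=None)  # all features
grid = {
    "loss": ["log_loss"],             # 'deviance' is an alias; 'exponential' is binary-only
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.1, 0.3, 0.5],
}
search = GridSearchCV(gb, grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```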
Table 8. Description of the parameters used in the training of the Naive Bayes algorithm.
Parameters | Selection
Prior class probabilities | None
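The configuration in Table 8 corresponds to a Naive Bayes classifier with no fixed priors; the choice of the Gaussian variant in the sketch below is an assumption, as is the placeholder data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_train = np.random.rand(300, 343)
y_train = np.random.randint(0, 3, 300)

nb = GaussianNB(priors=None)   # priors=None: class frequencies estimated from the data
nb.fit(X_train, y_train)
```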
Table 9. Metrics obtained from the Random Forest algorithm.
Metrics | Unbalanced Dataset | Balanced Dataset
Precision | 0.9031 | 0.9467
Recall | 0.8575 | 0.9467
F1-Score | 0.8780 | 0.9467
Kappa | 0.8401 | 0.9200
Table 10. Metrics obtained from the SVM algorithm.
Metrics | Unbalanced Dataset | Balanced Dataset
Precision | 0.8567 | 0.9384
Recall | 0.7807 | 0.9383
F1-Score | 0.8111 | 0.9382
Kappa | 0.7818 | 0.9075
Table 11. Metrics obtained from the Gradient Boosting algorithm.
Metrics | Unbalanced Dataset | Balanced Dataset
Precision | 0.8848 | 0.9286
Recall | 0.8515 | 0.9283
F1-Score | 0.8666 | 0.9283
Kappa | 0.8158 | 0.8925
Table 12. Metrics obtained from the Naive Bayes algorithm.
Metrics | Unbalanced Dataset | Balanced Dataset
Precision | 0.7800 | 0.8052
Recall | 0.8307 | 0.8000
F1-Score | 0.7877 | 0.8005
Kappa | 0.7380 | 0.7000
Table 13. Metrics obtained from the ANN algorithm.
Metrics | Unbalanced Dataset | Balanced Dataset
Precision | 0.9312 | 0.9532
Recall | 0.8867 | 0.9533
F1-Score | 0.9067 | 0.9532
Kappa | 0.8686 | 0.9300
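The metrics reported in Tables 9–13 can be computed with scikit-learn as sketched below; weighted averaging over the three classes is an assumption, since the averaging mode is not stated, and the label arrays are random placeholders.

```python
import numpy as np
from sklearn.metrics import (cohen_kappa_score, f1_score, precision_score,
                             recall_score)

rng = np.random.default_rng(0)
y_test = rng.integers(0, 3, 200)   # placeholder held-out labels
y_pred = rng.integers(0, 3, 200)   # placeholder model predictions

print("Precision:", precision_score(y_test, y_pred, average="weighted", zero_division=0))
print("Recall:   ", recall_score(y_test, y_pred, average="weighted"))
print("F1-Score: ", f1_score(y_test, y_pred, average="weighted"))
print("Kappa:    ", cohen_kappa_score(y_test, y_pred))
```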
Table 14. Comparison of the metrics obtained by the models on the unbalanced dataset with weighted loss.
Metrics | Naive Bayes | SVM | Gradient Boosting | Random Forest | ANN
Precision | 0.7800 | 0.8190 | 0.8729 | 0.8550 | 0.8878
Recall | 0.8307 | 0.8611 | 0.8683 | 0.8501 | 0.8785
F1-Score | 0.7877 | 0.8275 | 0.8701 | 0.8406 | 0.8783
Kappa | 0.7380 | 0.7919 | 0.8234 | 0.8001 | 0.8318
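One common way to obtain the class weighting used in weighted-loss experiments is scikit-learn's balanced weighting scheme, sketched below; whether the authors used exactly this scheme is an assumption, and the labels are placeholders.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.random.randint(0, 3, 300)   # placeholder labels

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)

# The mapping can then be passed to the classifiers, e.g.
#   RandomForestClassifier(class_weight=class_weight)
#   SVC(class_weight=class_weight)
#   model.fit(X_train, y_train, class_weight=class_weight)   # Keras ANN
```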
Table 15. Comparison of the metrics obtained by the models on the unbalanced dataset.
Metrics | Naive Bayes | SVM | Gradient Boosting | Random Forest | ANN
Precision | 0.7800 | 0.8567 | 0.8848 | 0.9031 | 0.9312
Recall | 0.8307 | 0.7807 | 0.8515 | 0.8575 | 0.8867
F1-Score | 0.7877 | 0.8111 | 0.8666 | 0.8780 | 0.9067
Kappa | 0.7380 | 0.7818 | 0.8158 | 0.8401 | 0.8686
Table 16. Comparison of the metrics obtained by the models on the balanced dataset.
Metrics | Naive Bayes | SVM | Gradient Boosting | Random Forest | ANN
Precision | 0.8052 | 0.9384 | 0.9286 | 0.9467 | 0.9532
Recall | 0.8000 | 0.9383 | 0.9283 | 0.9467 | 0.9533
F1-Score | 0.8005 | 0.9382 | 0.9283 | 0.9467 | 0.9533
Kappa | 0.8052 | 0.9384 | 0.9286 | 0.9467 | 0.9532
Table 17. Validation metrics (average and standard deviation) for Artificial Neural Networks using ten-fold cross-validation on the balanced dataset.
Fold | Precision | Recall | F1-Score | Kappa
1 | 0.947488 | 0.944805 | 0.945072 | 0.825000
2 | 0.891020 | 0.883333 | 0.882066 | 0.885000
3 | 0.932514 | 0.923333 | 0.924476 | 0.845000
4 | 0.899270 | 0.896667 | 0.894430 | 0.890000
5 | 0.929290 | 0.926667 | 0.926388 | 0.889623
6 | 0.931456 | 0.926421 | 0.926638 | 0.879615
7 | 0.922921 | 0.919732 | 0.918512 | 0.829484
8 | 0.897014 | 0.886288 | 0.882648 | 0.884217
9 | 0.932910 | 0.922819 | 0.920468 | 0.924242
10 | 0.951290 | 0.949495 | 0.949701 | 0.825000
Mean | 0.923517 | 0.917956 | 0.917040 | 0.876941
Std | 0.020972 | 0.022543 | 0.023614 | 0.033809
Table 18. Validation metrics (average and standard deviation) for Random Forest using ten-fold cross-validation on the balanced dataset.
Fold | Precision | Recall | F1-Score | Kappa
1 | 0.914182 | 0.909091 | 0.908246 | 0.863749
2 | 0.895438 | 0.893333 | 0.893091 | 0.840000
3 | 0.927076 | 0.926667 | 0.926477 | 0.890000
4 | 0.923895 | 0.923333 | 0.922410 | 0.885000
5 | 0.936138 | 0.933333 | 0.933392 | 0.900000
6 | 0.873845 | 0.872910 | 0.872848 | 0.809385
7 | 0.918529 | 0.916388 | 0.914746 | 0.874606
8 | 0.936141 | 0.933110 | 0.932169 | 0.899675
9 | 0.910052 | 0.899329 | 0.897137 | 0.848958
10 | 0.973351 | 0.973064 | 0.973094 | 0.959596
Mean | 0.920865 | 0.918056 | 0.917361 | 0.877097
Std | 0.026444 | 0.027226 | 0.027407 | 0.040834
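The per-fold figures in Tables 17 and 18 follow the usual stratified ten-fold cross-validation pattern, sketched below with placeholder data; the Random Forest settings, the metric averaging mode, and the seeds are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (cohen_kappa_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((600, 343))            # placeholder balanced dataset
y = rng.integers(0, 3, 600)

folds = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    folds.append([
        precision_score(y[test_idx], pred, average="weighted", zero_division=0),
        recall_score(y[test_idx], pred, average="weighted"),
        f1_score(y[test_idx], pred, average="weighted"),
        cohen_kappa_score(y[test_idx], pred),
    ])

folds = np.array(folds)
print("mean:", folds.mean(axis=0))
print("std: ", folds.std(axis=0))
```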
Table 19. Results of the classification using ANN and Random Forest. Both algorithms were trained on both the balanced and the unbalanced datasets. The list of suspected symbiotic stars was obtained from the article “A catalogue of symbiotic stars” [28]. The ANN results are reported as class probabilities (SY / PN / RG) to four decimal places, whereas only the predicted class is shown for the Random Forest.
Suspected Symbiotic Stars | ANN: SY / PN / RG (balanced) | RF class (balanced) | ANN: SY / PN / RG (unbalanced) | RF class (unbalanced)
RAW 1691 | 0.9898 / 0.0060 / 0.0042 | Sy | 0.8839 / 0.0414 / 0.0748 | Sy
[BE74] 583 | 0.9824 / 0.0148 / 0.0029 | Sy | 0.9334 / 0.0547 / 0.0119 | Sy
StHA 55 | 0.5252 / 0.3066 / 0.1682 | PN | 0.5468 / 0.1763 / 0.2769 | Sy
WRAY 16−51 | 0.9998 / 0.0002 / 0.0000 | Sy | 0.9829 / 0.0171 / 0.0000 | Sy
NSV 05572 | 0.9845 / 0.0154 / 0.0000 | Sy | 0.9787 / 0.0213 / 0.0000 | Sy
AE Cir | 0.9839 / 0.0155 / 0.0006 | PN | 0.7039 / 0.2945 / 0.0015 | PN
V748 Cen | 0.9999 / 0.0001 / 0.0000 | PN | 0.0222 / 0.9778 / 0.0000 | PN
V345 Nor | 0.9830 / 0.0170 / 0.0000 | Sy | 0.9492 / 0.0508 / 0.0000 | Sy
V934 Her | 0.9769 / 0.0030 / 0.0201 | Sy | 0.8235 / 0.0426 / 0.1340 | Sy
Hen 3−1383 | 0.9491 / 0.0486 / 0.0022 | Sy | 0.9010 / 0.0990 / 0.0000 | Sy
WRAY 16−294 | 0.9134 / 0.0799 / 0.0067 | Sy | 0.8504 / 0.1486 / 0.0010 | Sy
DT Ser | 0.0000 / 0.9084 / 0.0916 | RG | 0.0004 / 0.1259 / 0.8737 | RG
V618 Sgr | 0.9962 / 0.0038 / 0.0000 | Sy | 0.9759 / 0.0241 / 0.0000 | Sy
V335 Vul | 0.8891 / 0.0648 / 0.0461 | PN | 0.7550 / 0.2011 / 0.0439 | PN
V627 Cas | 0.8628 / 0.0892 / 0.0481 | Sy | 0.8563 / 0.1435 / 0.0003 | Sy
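Classifications such as those in Tables 19–21 are obtained by passing the candidate spectra through the trained models. In the sketch below, `model` stands for the trained Keras ANN from the Table 6 sketch and `rf_model` for a fitted Random Forest (e.g., the best estimator from the Table 4 sketch); these names, the input spectra, and the object names are placeholders, not the authors' code.

```python
import numpy as np

# Placeholders: reuse the trained models from the earlier sketches, e.g.
#   rf_model = search.best_estimator_   # fitted Random Forest
X_new = np.random.rand(3, 343).astype("float32")      # dummy normalized candidate spectra
object_names = ["candidate 1", "candidate 2", "candidate 3"]

probs = model.predict(X_new)            # ANN class probabilities, ordered [SY, PN, RG]
labels = rf_model.predict(X_new)        # Random Forest hard classes

for name, p, c in zip(object_names, probs, labels):
    print(f"{name}: SY={p[0]:.4f}  PN={p[1]:.4f}  RG={p[2]:.4f}  ->  RF class: {c}")
```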
Table 20. Results of the classification using ANN and Random Forest. Both algorithms were trained on both the balanced and the unbalanced datasets. The list of suspected symbiotic stars was obtained from the article “A machine learning approach for identification and classification of symbiotic stars using 2MASS and WISE” [29]. The ANN results are reported as class probabilities (SY / PN / RG) to four decimal places, whereas only the predicted class is shown for the Random Forest.
Suspected Symbiotic Stars | ANN: SY / PN / RG (balanced) | RF class (balanced) | ANN: SY / PN / RG (unbalanced) | RF class (unbalanced)
V748 Cen | 1.0000 / 0.0000 / 0.0000 | PN | 0.0222 / 0.9778 / 0.0000 | PN
WRAY 16-294 | 0.9134 / 0.0799 / 0.0067 | Sy | 0.8504 / 0.1486 / 0.0010 | Sy
DASCH J075731.1+201735 | 0.7153 / 0.0142 / 0.2704 | Sy | 0.7673 / 0.0485 / 0.1841 | RG
ASAS J174600-2321.3 | 0.0642 / 0.1435 / 0.7923 | RG | 0.0091 / 0.0910 / 0.8999 | RG
IPHASJ201550.96+373004.2 | 0.2938 / 0.7030 / 0.0032 | PN | 0.1461 / 0.8369 / 0.0170 | PN
IPHASJ202058.52+380949.8 | 0.9299 / 0.0687 / 0.0013 | Sy | 0.7117 / 0.2695 / 0.0188 | Sy
IPHASJ202947.93+355926.5 | 0.9343 / 0.0654 / 0.0003 | Sy | 0.8125 / 0.1869 / 0.0005 | RG
IPHASJ204713.69+463517.5 | 0.9189 / 0.0737 / 0.0075 | Sy | 0.8513 / 0.1480 / 0.0006 | Sy
IPHASJ215628.47+571445.5 | 0.9720 / 0.0280 / 0.0000 | Sy | 0.8200 / 0.1799 / 0.0001 | PN
IPHASJ231735.92+634506.4 | 0.9772 / 0.0223 / 0.0006 | Sy | 0.8664 / 0.1307 / 0.0029 | Sy
VPHASDR2J175320.4-295327.4 | 0.8999 / 0.0862 / 0.0139 | RG | 0.7163 / 0.2261 / 0.0577 | RG
VPHASDR2J175346.2-284826.6 | 0.8452 / 0.1433 / 0.0116 | Sy | 0.8842 / 0.1157 / 0.0002 | Sy
VPHASDR2J172830.6-292124.5 | 0.9833 / 0.0157 / 0.0010 | Sy | 0.9507 / 0.0493 / 0.0000 | Sy
VPHASDR2J181154.5-243536.2 | 0.5373 / 0.2681 / 0.1946 | Sy | 0.7686 / 0.2258 / 0.0055 | Sy
VPHASDR2J181333.6-245225.0 | 0.9604 / 0.0329 / 0.0067 | Sy | 0.8790 / 0.1208 / 0.0001 | Sy
VPHASDR2J181123.2-241430.0 | 0.9174 / 0.0714 / 0.0111 | Sy | 0.7778 / 0.2096 / 0.0126 | Sy
VPHASDR2J141301.4-653320.1 | 0.9903 / 0.0094 / 0.0003 | Sy | 0.9544 / 0.0456 / 0.0000 | Sy
Table 21. Results of the classification using ANN and Random Forest. Both algorithms were trained on both the balanced and the unbalanced datasets. The list of suspected planetary nebulae was obtained from the SIMBAD online database. The ANN results are reported as class probabilities (SY / PN / RG) to four decimal places, whereas only the predicted class is shown for the Random Forest.
Suspected Planetary Nebulae | ANN: SY / PN / RG (balanced) | RF class (balanced) | ANN: SY / PN / RG (unbalanced) | RF class (unbalanced)
IRAS 02379+5724 | 0.8607 / 0.0888 / 0.0505 | PN | 0.0204 / 0.1176 / 0.8621 | RG
UCAC2 46104304 | 0.9584 / 0.0372 / 0.0044 | Sy | 0.5947 / 0.3960 / 0.0093 | Sy
IRAS 05495+2620 | 0.0000 / 0.9937 / 0.0063 | PN | 0.0001 / 0.9979 / 0.0021 | PN
IRAS 06549-2330 | 0.0033 / 0.9135 / 0.0833 | PN | 0.0073 / 0.9610 / 0.0316 | PN
PN G228.1+00.8 | 0.0000 / 0.9667 / 0.0333 | PN | 0.0012 / 0.9821 / 0.0167 | PN
PN G239.3-02.7 | 0.0000 / 0.8194 / 0.1806 | RG | 0.0004 / 0.1959 / 0.8038 | RG
PN Y-C 40 | 0.0171 / 0.0416 / 0.9413 | RG | 0.0073 / 0.1335 / 0.8592 | RG
EGB 2 | 0.0127 / 0.6487 / 0.3386 | PN | 0.0071 / 0.5841 / 0.4088 | RG
IRAS 10348-6320 | 0.0000 / 1.0000 / 0.0000 | PN | 0.0058 / 0.9942 / 0.0000 | PN
IRAS 11555-6031 | 0.0092 / 0.0612 / 0.9297 | RG | 0.0065 / 0.0445 / 0.9490 | RG
IRAS 11415-6540 | 0.0188 / 0.0728 / 0.9084 | RG | 0.0091 / 0.0515 / 0.9394 | RG
PN G292.3+00.5 | 0.0000 / 1.0000 / 0.0000 | PN | 0.0245 / 0.9755 / 0.0000 | PN
2MASS J05221214-6949580 | 0.0020 / 0.9980 / 0.0000 | PN | 0.0031 / 0.9969 / 0.0000 | PN
[RP2006] J052438-690413 | 0.0000 / 0.7944 / 0.2056 | RG | 0.0007 / 0.1282 / 0.8711 | RG
[RP2006] 312 | 0.0100 / 0.9900 / 0.0000 | PN | 0.0067 / 0.9933 / 0.0000 | PN
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
