Proceeding Paper

Spectral Classification of Quasar Subject to Redshift: A Statistical Study †

by Prithwish Ghosh 1,*,‡ and Shinjon Chakraborty 2,‡

1 Department of Statistics, Visva Bharati, Santiniketan 731235, India
2 Department of Statistics, University of Calcutta, Kolkata 700019, India
* Author to whom correspondence should be addressed.
† Presented at the 1st International Online Conference on Mathematics and Applications, 1–15 May 2023; Available online: https://iocma2023.sciforum.net/.
‡ These authors contributed equally to this work.
Comput. Sci. Math. Forum 2023, 7(1), 43; https://doi.org/10.3390/IOCMA2023-14418
Published: 28 April 2023

Abstract: Quasars are star-like astronomical objects with a large ultraviolet flux of radiation, accompanied by generally broad emission lines and, in some cases, absorption lines, found at large redshift. The data used are extracted from the Veron Cetti Catalogue of AGN and Quasars. The objective of this work is to partition the quasars based on their spectral properties using multivariate techniques and to classify them with respect to the obtained clusters. Using the K-means partitioning method, two robust clusters were obtained, with cluster sizes of 39,581 and 129,377. The percentage of misclassification with respect to the obtained clusters was then evaluated using a multivariate classification technique and a machine learning classification algorithm, namely linear discriminant analysis and XG-Boost, respectively. Linear discriminant analysis and XG-Boost yielded misclassification rates of around 0.84% and 0.15%, respectively. Additionally, a heuristic, literature-based categorization subject to redshift yielded an accuracy of around 96%. This provides cross-validating evidence that, on astronomical data, machine learning algorithms can perform on par with conventional multivariate techniques, if not better.

1. Introduction

Quasars are star-like astronomical objects with a large ultraviolet radiation flux, accompanied by generally broad emission lines and, in some cases, absorption lines, found at large redshift. Nearly 10% of quasars are radio-loud. Observed quasar redshifts extend up to about z = 7, and the number of detected quasars falls off at higher redshift. We know that quasars are extremely luminous, distant objects in our universe; for their light to reach the Earth, it is subjected to redshift due to the metric expansion of space [1]. The power of quasars is believed to originate from supermassive black holes at the cores of galaxies. Near the galaxy cores, the Doppler shifts of stars tell us that they are rotating around tremendous masses with very steep gravitational gradients, suggestive of black holes. The taxonomy of quasars includes various sub-types with distinct properties: radio-loud, weak emission line, broad absorption-line (BAL), optically violent variable (OVV), radio-quiet, type 2 (or type II), and red quasars. These sub-types are characterized by parameters such as colour index, redshift, absolute magnitude, and magnitude. Assuming a deceleration parameter $q_0 = 0$ and a Hubble constant $H_0 = 71$ km/s/Mpc, the dataset used here consists of parameters such as the declination of the object, the right ascension of the object, the (B-V) colour of the object when known, the redshift of the object, the (U-B) colour of the object when known, etc.

2. Materials and Methods

2.1. Missing Value Imputations

The censored values in this dataset have been imputed by a multiple imputation technique known as predictive mean matching (PMM). PMM calculates the predicted value of the target variable Y according to the specified imputation model and then, for each missing entry, draws an observed value from a donor case whose predicted value is close to that of the missing case.
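A minimal single-imputation sketch of PMM in Python is given below (the study presumably used a standard multiple imputation package; the donor-pool size k = 5 and the linear imputation model are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(X, y, k=5, seed=0):
    """Predictive mean matching (single-imputation sketch).

    Fit the imputation model on complete cases, predict y for all cases,
    and for each missing case draw the observed value of one of the k
    donors whose predictions are closest to the missing case's prediction.
    """
    rng = np.random.default_rng(seed)
    miss = np.isnan(y)
    model = LinearRegression().fit(X[~miss], y[~miss])
    pred = model.predict(X)                    # predicted values for all cases
    obs_pred, obs_y = pred[~miss], y[~miss]
    y_imp = y.copy()
    for i in np.where(miss)[0]:
        donors = np.argsort(np.abs(obs_pred - pred[i]))[:k]  # k nearest donors
        y_imp[i] = obs_y[rng.choice(donors)]                 # sample a donor value
    return y_imp
```

Multiple imputation repeats this draw several times, with perturbed model parameters, and pools the results.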

2.2. Choice of Optimal Clusters

2.2.1. Distortion Plot

The initial step in any unsupervised learning task is to find the optimal number of partitions into which the data may be divided. The distortion plot is one of the most popular methods for determining the optimal value of k, the number of partitions to make. The distortion is calculated as the sum of squared distances from the samples to their closest cluster centres; here we used the Euclidean distance metric. The point where the curve bends, the "elbow" of the graph, indicates the optimal cluster number.
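As an illustration, a minimal sketch of the distortion computation with scikit-learn, assuming the imputed quasar features are stacked in an array X (the range of k is an illustrative choice):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def distortion_plot(X, k_max=10):
    # Total within-cluster sum of squared Euclidean distances (inertia)
    # for k = 1..k_max; the "elbow" of the curve suggests the optimal k.
    ks = list(range(1, k_max + 1))
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]
    plt.plot(ks, inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Total within-cluster sum of squares")
    plt.show()
```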

2.2.2. Dunn Index

Dunn’s index [2] tries to find partitioned sets that are compact and well separated. For a given number of partitions k, where $c_i$ represents the i-th cluster, Dunn’s index D is calculated with the formula given below:

$$D = \min_{1 \le i \le k} \left\{ \min_{i+1 \le j \le k} \frac{\mathrm{dist}(c_i, c_j)}{\max_{1 \le l \le k} \mathrm{diam}(c_l)} \right\}$$

where $\mathrm{dist}(c_i, c_j)$ is the distance between clusters $c_i$ and $c_j$, defined as $\mathrm{dist}(c_i, c_j) = \min_{x_i \in c_i,\, x_j \in c_j} d(x_i, x_j)$; $d(x_i, x_j)$ is the distance between data points $x_i \in c_i$ and $x_j \in c_j$; and $\mathrm{diam}(c_l)$ is the diameter of $c_l$, defined as $\mathrm{diam}(c_l) = \max_{x_{l1}, x_{l2} \in c_l} d(x_{l1}, x_{l2})$.
The optimal number of clusters is the one that maximizes Dunn’s index.
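The definition above can be computed directly; the following brute-force sketch (quadratic in the sample size, so practical only on subsamples for a catalogue of this size) is an illustrative implementation, not the routine the study used:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Dunn's index: minimum inter-cluster distance / maximum cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest within-cluster pairwise distance (cluster diameter).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Smallest single-linkage distance over all pairs of clusters.
    min_dist = min(cdist(ci, cj).min()
                   for a, ci in enumerate(clusters)
                   for cj in clusters[a + 1:])
    return min_dist / max_diam
```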

2.3. Clustering (Partitioning) Algorithms and Discriminant Analysis

Clustering is the technique of grouping individuals with multiple characteristics according to their similarities or dissimilarities. We partitioned the data to learn how the information is structured within the quasar data. The algorithms used in the study are listed below.

2.3.1. K-Means

K-means clustering is a technique that aims to find homogeneous subgroups within the data. The idea is to divide the quasar data into k distinct groups such that observations within a group are as similar as possible, and observations between groups are as different as possible [3,4]. The objective function is
$$J(X, V) = \sum_{j=1}^{k} J_j(X, v_j) = \sum_{j=1}^{k} \sum_{i=1}^{m} u_{ij}\, d^2(x_i, v_j)$$

where $J_j(X, v_j) = \sum_{i=1}^{m} u_{ij}\, d^2(x_i, v_j)$ is the objective function within cluster $c_j$, and $u_{ij} = 1$ if $x_i \in c_j$ and $0$ otherwise.

$d^2(x_i, v_j)$ is the squared distance between $x_i$ and $v_j$, $d^2(x_i, v_j) = \sum_{k=1}^{n} (x_{ki} - v_{kj})^2$, where $n$ is the number of dimensions of each point, $x_{ki}$ is the value of the $k$-th dimension of $x_i$, and $v_{kj}$ is the value of the $k$-th dimension of $v_j$.

The partitions are defined by an $m \times k$ binary membership matrix $U$ with elements $u_{ij}$:

$$u_{ij} = \begin{cases} 1 & \text{if } d^2(x_i, v_j) \le d^2(x_i, v_{j^*}) \text{ for all } j^* \ne j,\ j^* = 1, \dots, k \\ 0 & \text{otherwise} \end{cases}$$

For a fixed membership matrix $U = [u_{ij}]$, the centre $v_j$ that minimizes $J(X, V)$ is the mean of all data points in cluster $j$:

$$v_j = \frac{1}{|c_j|} \sum_{i:\, x_i \in c_j} x_i$$

where $|c_j|$ is the size of partition $c_j$, $|c_j| = \sum_{i=1}^{m} u_{ij}$.
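A compact sketch of Lloyd's algorithm implementing this objective is shown below (in practice a library routine such as scikit-learn's KMeans would be used; this illustrative version omits multiple random restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate membership update and centre update."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)]         # initial centres
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)  # d^2(x_i, v_j)
        labels = d2.argmin(axis=1)                           # membership u_ij
        V_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else V[j] for j in range(k)])      # cluster means
        if np.allclose(V, V_new):
            break                                            # converged
        V = V_new
    return labels, V
```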

2.3.2. The Linear Discriminant Analysis

To transform the features into a lower-dimensional space, the linear discriminant analysis technique was used, in which the ratio of the between-class variance to the within-class variance is maximized, guaranteeing maximal class separability. The aim of linear discriminant analysis is to express the original data matrix in a lower-dimensional space. For two classes, the linear discriminant coefficients are given by $(\mu_1 - \mu_2)^{\top} \Sigma^{-1}$.
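For the two-class case this amounts to the following sketch, assuming equal class covariances (a library routine such as scikit-learn's LinearDiscriminantAnalysis would be used in practice):

```python
import numpy as np

def lda_coefficients(X1, X2):
    """Fisher's two-class discriminant coefficients (mu1 - mu2)' Sigma^{-1},
    with Sigma estimated by the pooled within-class covariance."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    Sigma = ((n1 - 1) * np.cov(X1, rowvar=False)
             + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    return np.linalg.solve(Sigma, mu1 - mu2)   # coefficient vector w

# Classify x to class 1 if w @ x > w @ (mu1 + mu2) / 2 (equal priors assumed).
```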

2.3.3. XG-Boost Algorithm

Tree boosting is a widely used machine learning technique for a variety of problems. XG-Boost [5] is a scalable, end-to-end tree boosting system. In function space, the XG-Boost algorithm works like the Newton–Raphson method: a second-order Taylor approximation of the loss function provides the link to Newton–Raphson. A generic unregularized XG-Boost algorithm takes as input a training set $(x_i, y_i)$, $i = 1, \dots, N$, a differentiable loss function $L(y_i, \theta)$, a learning rate $\alpha$, and a number of weak learners $M$.
The model is initialized with a constant value:

$$\hat{f}^{(0)}(x) = \arg\min_{\theta} \sum_{i=1}^{N} L(y_i, \theta)$$

For $m = 1$ to $M$, we compute the gradients and Hessians:

$$\hat{g}_m(x_i) = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}^{(m-1)}(x)}, \qquad \hat{h}_m(x_i) = \left[ \frac{\partial^2 L(y_i, f(x_i))}{\partial f(x_i)^2} \right]_{f(x) = \hat{f}^{(m-1)}(x)}$$

We then fit a base learner on the training set $\left\{ \left( x_i, -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} \right) \right\}_{i=1}^{N}$ by solving the optimization problem

$$\hat{\phi}_m = \arg\min_{\phi \in \Phi} \sum_{i=1}^{N} \frac{1}{2}\, \hat{h}_m(x_i) \left[ -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} - \phi(x_i) \right]^2, \qquad \hat{f}_m(x) = \alpha\, \hat{\phi}_m(x)$$

The output is

$$\hat{f}(x) = \hat{f}^{(M)}(x) = \sum_{m=0}^{M} \hat{f}_m(x)$$
The features of the implementation include:
Speed: it can automatically perform parallel computation on Windows and Linux with OpenMP, and is generally more than 10 times faster than classical gradient boosting implementations.
Input type: several types of input data can be used, including local data files and its own xgb.DMatrix class.
Sparsity: both the tree booster and the linear booster accept optimized sparse input.
Customization: it supports customized objective and evaluation functions [6].
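A minimal sketch of this classification step with the xgboost Python package is shown below (the feature list above suggests the paper used the R interface; the placeholder data, 60/40 split, and hyperparameters are illustrative assumptions):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Placeholders standing in for the imputed features and k-means labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
cluster_labels = (X[:, 0] > 0).astype(int)

# 60% training / 40% testing split, as used in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, cluster_labels,
                                          test_size=0.4, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)    # xgboost's own input class
dtest = xgb.DMatrix(X_te, label=y_te)
params = {"objective": "binary:logistic", "eta": 0.3, "max_depth": 6}
booster = xgb.train(params, dtrain, num_boost_round=100)

pred = (booster.predict(dtest) > 0.5).astype(int)
print("accuracy:", (pred == y_te).mean())
```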

3. Results and Discussion

After applying the aforementioned statistical techniques to the relevant dataset, we observe from the elbow plot (Figure 1) and Dunn’s index (Table 1) that the optimal number of distinct clusters is two. The robust clusters obtained by applying the k-means partitioning algorithm over the combined Lick indices are shown in Figure 2. To check the extent of the clustering, i.e., the efficacy of the k-means algorithm, the percentage accuracy was calculated using linear discriminant analysis (a multivariate technique) and XG-Boost (a machine learning algorithm). The data were partitioned into training and testing sets of 60% and 40%, respectively, and the fitted classifiers were used to classify the test set. The percentage accuracy under linear discriminant analysis is 99.16%, while under XG-Boost it is 99.85%, as evident from Table 2 and Table 3, respectively.
Redshift is one of the most complex and fundamental characteristics of an astronomical object: the properties of astronomical objects are believed to change with distance, which is measured by redshift, so redshift acts as a parameter in spectral analysis. We partitioned the quasars heuristically based on redshift into three distinct categories: low redshift (0–2), medium redshift (2–4.1), and high redshift (4.1–6.44). Using an 80%/20% training/test partition, the resulting classification is shown in Table 4, where an accuracy of 95.92% was observed.
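The labelling and classification step can be sketched as follows (placeholder data; the bin edges and 80/20 split follow the text, while hyperparameters are illustrative):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Placeholders standing in for the spectral features and redshifts.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
z = rng.uniform(0.0, 6.44, size=1000)

# 0 = low (0-2), 1 = medium (2-4.1), 2 = high (4.1-6.44).
labels = np.digitize(z, bins=[2.0, 4.1])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels,
                                          test_size=0.2, random_state=0)
params = {"objective": "multi:softmax", "num_class": 3, "eta": 0.3}
booster = xgb.train(params, xgb.DMatrix(X_tr, label=y_tr), num_boost_round=100)
pred = booster.predict(xgb.DMatrix(X_te))
print("accuracy:", (pred == y_te).mean())
```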

4. Conclusions

Summarising the results of the series of multivariate and machine learning techniques applied to the Veron Cetti Catalogue (13th Edition): two distinct clusters were observed with respect to a combination of Lick indices (including redshift), designated as spectral quasar properties, and were recovered with around 99% accuracy, as computed by linear discriminant analysis and the XG-Boost algorithm.
Heuristically partitioning the quasars by redshift into three distinct groups (low, medium, and high) yielded an accuracy of 96% when applying the machine learning classification algorithm XG-Boost, indicating that the heuristic, literature-based grouping is valid with respect to redshift.
In the context of astronomical studies, we show that newly developed machine learning algorithms perform on par with widely used multivariate techniques, if not better.
Combining the results of the classification based on the Lick indices (including redshift) with those based on redshift alone, it is evident that, even though the redshift-only classifier was given more training data, the accuracy of the classification over the combined indices is greater. We may therefore conclude that clustering quasars based on the Lick indices (including redshift) is more effective when working with the Veron Cetti Catalogue.

Author Contributions

Conceptualization, P.G. and S.C.; methodology, P.G. and S.C.; software, P.G.; validation, S.C.; formal analysis, S.C. and P.G.; investigation, P.G.; resources, S.C.; data curation, S.C. and P.G.; writing—original draft preparation, P.G.; writing—review and editing, S.C.; visualization, S.C. and P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used here is extracted from the Veron Cetti Catalogue [7] of AGN and Quasars (13th Edition). The dimension of this dataset is 168,940 × 13. The catalogue contains 133,336 quasars, 1374 BL Lac objects, and 34,231 active galaxies (including 15,627 Seyfert 1 galaxies), totalling 168,941 objects. It includes positions and redshifts as well as photometry (U, B, and V) and 6 cm and 20 cm flux densities. https://heasarc.gsfc.nasa.gov/W3Browse/all/veroncat.html (accessed on 20 January 2023).

Acknowledgments

We are thankful to the Professors of Visva Bharati University and Calcutta University for their constant support and motivation, resulting in the successful completion of this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Grupen, C.; Cowan, G.; Eidelman, S.; Stroh, T. Astroparticle Physics; Springer: Berlin/Heidelberg, Germany, 2005; Volume 50.
  2. Dunn, J.C. Well-Separated Clusters and Optimal Fuzzy Partitions. J. Cybern. 1974, 4, 95–104.
  3. Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. Appl. Stat. 1979, 28, 100–108.
  4. Ansari, Z.; Azeem, M.F.; Ahmed, W.; Babu, A.V. Quantitative Evaluation of Performance and Validity Indices for Clustering the Web Navigational Sessions. World Comput. Sci. Inf. Technol. J. (WCSIT) 2011, 1, 217–226.
  5. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  6. Ghosh, P. Breast Cancer (Wisconsin) Diagnostic Prediction. Int. J. Sci. Res. 2021, 11, 178–185.
  7. Veron-Cetty, M.P.; Veron, P. A catalogue of quasars and active nuclei. Astron. Astrophys. 2010, 518, A10.
Figure 1. Elbow plot.
Figure 2. Cluster plot.
Table 1. Dunn’s index for clustering.

Number of Partitions (k)    Dunn’s Index
2                           18.03072757
3                           −60.43461534
4                           −7.093826221
Table 2. Confusion matrix for k-means, where k = 2, with an accuracy of 99.16%.

Actual \ Predicted      1st Predicted Group     2nd Predicted Group
1st Actual Group        38,496                  324
2nd Actual Group        1085                    129,035
Table 3. Confusion matrix for clustering by XG-Boost, with an accuracy of 99.85% with respect to the clusters, after splitting the data into testing (40%) and training (60%) sets.

Predicted \ Reference    Reference Cluster 1     Reference Cluster 2
Predicted 1st Cluster    15,784                  48
Predicted 2nd Cluster    51                      51,692
Table 4. Confusion matrix for XG-Boost, with an accuracy of 95.92% with respect to redshift.

Predicted \ Reference    Reference High    Reference Medium    Reference Low
Predicted High           169               0                   67
Predicted Medium         0                 26,561              284
Predicted Low            26                1002                5678