1. Introduction
Most galaxies can be broadly separated into two morphological types: spiral and elliptical [
1]. Galaxy Zoo [
2] was the first attempt to analyze the distribution of a large number of spiral and elliptical galaxies in the local universe. Using the power of crowd sourcing, it provided morphological classifications of nearly 900,000 galaxies. Subsets of “clean” and “superclean” datasets were used to deduce the distribution of elliptical and spiral galaxies in the local universe [
3,
4,
5].
A more recent catalog of galaxy morphology is the catalog of ∼3,000,000 Sloan Digital Sky Survey (SDSS) galaxies [
6] classified automatically using machine learning [
7,
8]. While the catalog is large, it is limited in the sense that the vast majority of the galaxies in that catalog do not have spectroscopic data.
Photometric redshift (photo-z) plays a vital role in the study of astronomy and cosmology. Spectroscopic measurement of millions of celestial objects is technically daunting and expensive compared to photometric measurement. The redshift can be estimated from the photometric measurements, and that estimation is often sufficient for many applications involving statistical analysis of a large population of astronomical objects [
9]. Photometric measurements can also provide more redshift estimates per unit of telescope time than spectroscopic measurements [
10]. Therefore, during the past decade significant efforts have been aimed toward developing photometric redshift estimation methods, most of which can be classified into two types: template-fitting and empirical methods [
11].
Empirical methods use celestial objects with known spectroscopic redshift as training data to estimate the redshift based on patterns in the photometric measurements. The performance of empirical methods is limited to the range of the spectroscopic redshift of the samples in the training set. Template-based methods estimate the photometric redshift by using spectral templates. The estimation is accomplished by selecting the Spectral Energy Distribution (SED) from a library of templates such that the SED best reproduces the observed fluxes in the broadband filters [
9]. These methods are preferred when exploring new regimes, while empirical methods are preferred when large training sets with spectroscopic redshift are available [
11]. In general, empirical methods provide better accuracy [
11]. For detailed discussion of photometric redshift techniques see reviews [
11,
12].
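The template-fitting idea described above can be illustrated with a minimal sketch. It assumes a precomputed grid of model fluxes in which each row stands for one (template, redshift) combination, fits a free amplitude to each row, and selects the row with the lowest chi-square. The function name, the array layout, and the toy numbers are hypothetical and are not taken from any of the cited codes, which rely on real SED libraries and filter curves.

```python
import numpy as np

def best_template(observed_flux, flux_error, model_flux):
    """Return (chi2_min, index) of the model row that best reproduces the fluxes.

    model_flux has shape (n_models, n_bands); each row is the predicted
    broadband flux of one hypothetical (template, redshift) combination.
    """
    weights = 1.0 / flux_error**2
    # Analytic least-squares amplitude for every model row.
    scale = (model_flux * observed_flux * weights).sum(axis=1) / (
        (model_flux**2) * weights
    ).sum(axis=1)
    residual = observed_flux - scale[:, None] * model_flux
    chi2 = (residual**2 * weights).sum(axis=1)
    best = int(np.argmin(chi2))
    return float(chi2[best]), best

# Toy grid of three models observed in five bands (illustrative values only).
model_flux = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
])
observed_flux = 2.0 * model_flux[1]   # a scaled copy of the second model
flux_error = np.full(5, 0.1)

chi2_min, best_index = best_template(observed_flux, flux_error, model_flux)
```

In a real application the redshift attached to the winning row would be reported as the photometric redshift.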
In some cases hybrid methods that combine empirical and template approaches can improve the accuracy of the photometric redshift reconstruction. Examples of empirical methods include prediction trees and random forests [
13,
14], Polynomial Fitting [
15], the Nearest Neighbour Polynomial (NNP) technique [
16], Decision Trees [
17], Artificial Neural Networks [
18,
19], and Support Vector Machines [
20]. Gaussian Process Regression (GPR) can provide competitive results compared to ANN and least-square fitting methods [
21]. Some studies showed improved prediction accuracy using the combination of templates and magnitude priors [
22]. The inclusion of near IR magnitude and angular size also showed significant contribution to the accuracy of the photometric redshift prediction [
23]. The combination of morphological and photometric variables has been shown to increase the accuracy, especially in cases where fewer bands are available [
24].
Examples of photometric redshift catalogs include the catalog of ∼10
SDSS DR4 objects with redshift values in the range of
[
18], and the catalog of SDSS DR9 galaxies, in which an artificial neural network was used [
25]. The ANNz2 artificial neural network [
26] was used to create a photometric redshift catalog of ∼3.9 × 10
for the Kilo-Degree Survey Data Release 3 [
27]. Another large catalog contains the photometric redshifts of ∼2 × 10
galaxies from SDSS DR12, with redshift range of
0.8 [
28].
Here we test several machine learning algorithms and sets of photometric variables for the purpose of photometric redshift estimation, and apply the method to create a catalog of ∼3,000,000 galaxies that have information about their broad morphology and their photometric redshift. That information can be used to profile the distribution of broad morphology of galaxies in different redshift ranges.
2. Data
The initial data were taken from the catalog of ∼3 × 10^6
SDSS galaxies separated by their broad morphology into elliptical and spiral galaxies [
6], all taken from SDSS DR8 [
29]. The vast majority of these DR8 galaxies do not have photometric redshift computed through previous catalogs. For instance, a joint query with the redshift catalog of ∼2 × 10
SDSS DR12 galaxies [
28] only includes 827,591 of the galaxies in the catalog, which is merely about 27.6%.
The reason for the exclusion of these galaxies from the
photoz table [
28] of SDSS DR12 could be that the galaxies in the photoz table are objects identified as galaxies by all primary photometric measurements included in the GalaxyTag view [
28], while the objects in the catalog of broad morphology were selected by using the “type” field of the PhotoObjAll table, and then filtered by applying further analysis of the image based on the morphology of the object. Therefore, many objects that have an object type “galaxy” (type = 3) in the PhotoObjAll table and are included in the catalog of broad morphology might not have been included in the set of objects in the photoz table. For instance, DR12 has 48,528,684 objects with model i magnitude smaller than 19 that are identified as type “galaxy”, while only 11,761,054 of these objects are included in the photoz table.
Figure 1 shows examples of objects that are identified as galaxies in SDSS DR12 PhotoObjAll but are not included in the photoz table of DR12.
The catalog also provides the certainty with which each galaxy belongs to each of the broad morphological classes, and a threshold on that certainty can be used to control the consistency of the subset [
6]. The certainty values are used in a similar fashion to the way the degree of agreement between human annotators is used in Galaxy Zoo [
2]. By using a certainty threshold of 0.54 for the spiral galaxies and 0.8 for the elliptical galaxies, the catalog contains ∼9 × 10
spiral galaxies and ∼6 × 10
elliptical galaxies with consistency of ∼98% with the Galaxy Zoo debiased “superclean” dataset [
2], as thoroughly described in [
6]. All galaxies are bright (i magnitude brighter than 18) and large (Petrosian radius measured in the r band larger than 5.5″), which allows the identification of the morphologies of the galaxies in the catalog while excluding small and faint objects whose morphology cannot be identified.
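As a minimal sketch of how the certainty thresholds quoted above could be applied, the following snippet builds boolean masks for the spiral and elliptical subsets. The function name, the array names, and the use of an inclusive comparison are assumptions made for illustration; the catalog's own selection code may differ.

```python
import numpy as np

SPIRAL_THRESHOLD = 0.54      # certainty threshold quoted in the text for spirals
ELLIPTICAL_THRESHOLD = 0.8   # certainty threshold quoted in the text for ellipticals

def select_clean_subsets(p_spiral, p_elliptical):
    """Return boolean masks of galaxies that pass each certainty threshold."""
    p_spiral = np.asarray(p_spiral, dtype=float)
    p_elliptical = np.asarray(p_elliptical, dtype=float)
    # Inclusive comparison is an assumption; the source does not state it.
    spiral_mask = p_spiral >= SPIRAL_THRESHOLD
    elliptical_mask = p_elliptical >= ELLIPTICAL_THRESHOLD
    return spiral_mask, elliptical_mask

# Four hypothetical galaxies with their marginal probabilities.
p_spiral = [0.95, 0.60, 0.30, 0.10]
p_elliptical = [0.05, 0.40, 0.70, 0.90]
spiral_mask, elliptical_mask = select_clean_subsets(p_spiral, p_elliptical)
```

Galaxies that pass neither threshold, such as the third one in this toy example, stay in the full catalog but fall outside the high-consistency subsets.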
The source code used to create the catalog is also publicly available [
30].
Figure 2 shows the distribution of the r model magnitude, the Petrosian radius measured in the r band, and the distribution of the redshift among 115,359 galaxies included in the catalog that also had spectra.
A subset of 20,000 galaxies that have spectra was used for training and testing the algorithm. Naturally, these galaxies need to have spectra so that the predicted photometric redshift can be compared to their spectroscopic redshift to deduce the efficacy of the algorithm. The photometric information was taken from the PhotoObjAll table of SDSS DR8.
The purpose of the set of galaxies with spectra is to train and test a model to estimate the redshift of the galaxies in the catalog of SDSS galaxies with broad morphology classification [
6]. For that purpose, the set of galaxies with spectra that are used for training and testing needs to be as similar as possible to the entire population of galaxies in the catalog it aims at analyzing.
Figure 3 shows the distribution of the magnitude and size of the galaxies in the catalog of galaxies with broad morphological classification. As the figure shows, the brightness and size of the galaxies in the catalog are very similar to those of the galaxies in the training set.
Selecting training samples from the same set of galaxies that need to be classified might lead to performance evaluation that reflects the sample from which the galaxies were taken, and not necessarily the entire set of SDSS galaxies, which also includes small and faint galaxies. Also, the training set contains galaxies that were selected as spectroscopic targets, and are therefore not necessarily a random representation of SDSS general galaxy population. However, since the purpose of the algorithm is to estimate the redshift only for galaxies within that sample, higher accuracy can be achieved if the population of the samples in the training set is similar to the population of the samples that will be classified with the machine learning system. The galaxies in both the catalog and the training set are galaxies that were selected using the same criteria, and
Figure 2 and
Figure 3 show that the population of galaxies in the catalog is similar to the population of galaxies included in the training set. The training set does not include random galaxies from SDSS spectroscopic sample, but just galaxies that are part of the catalog, and their population is similar to the galaxy population in the catalog. While the solution is expected to achieve poor performance for galaxies outside of that sample, it is designed specifically for a certain catalog.
3. Methods
3.1. Pattern Recognition Algorithms
Several supervised machine learning algorithms were tested, and the performance was evaluated to identify the algorithms that demonstrated the highest efficacy. These algorithms included Simple Linear Regression, MultiLayer Perceptron [
31], M5P [
32], ZeroR, Decision Table [
33], and Random Forest [
34]. Given the photometric redshift is a continuous value and not a crisp class, suitable algorithms need to be able to perform a regression and compute a continuous value as their output.
The
Simple Linear Regression algorithm predicts the output value with a linear model fitted by minimizing the squared error.
MultiLayer Perceptron builds a multi-layer neural network of weighted perceptron nodes. Each node receives several input values, and “fires” a value to the next layer if the result of the function (called the “activation function”) applied to these input values reaches a certain threshold [
35]. The weights in the nodes are optimized by running the training samples through the network iteratively, and adjusting the weights based on the results of the training samples. Given the output layer contains multiple perceptrons, their values can be interpolated to provide a continuous value.
The
M5P algorithm implements M5 model trees and rules [
32]. As each leaf in the M5 model is a linear regression function, it is suitable for predicting continuous values rather than a crisp class.
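ZeroR is a simple baseline that ignores the input variables and makes its prediction based only on the distribution of the output variable in the training set; for a numeric output it predicts the mean of the training values. A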
Decision Table [
33] is a rule-based method that uses a frequency table to make a prediction. The frequency table is built based on the frequency of the features in the training samples, and their distribution in different ranges [
33]. The tree-based
Random Forest algorithm [
34] builds a classifier using a large number of random decision tree classifiers. Each decision tree is created randomly such that each node is a different feature, and the decision is made based on the value of that feature in a specific given test sample, until reaching the leaf that is assigned with an output value. These trees are used to create an ensemble classifier such that each tree is a classifier, and the output is determined by an interpolation of the results of all decision trees. Given each decision tree provides an output, the high number of outputs being interpolated makes the method suitable also for the prediction of continuous values. The implementation of the algorithm was taken from the Weka open source machine learning toolbox [
36].
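The ensemble idea behind the random forest can be sketched in a few lines. The experiments described here used Weka; the snippet below swaps in scikit-learn's `RandomForestRegressor` purely for illustration, with synthetic magnitudes and redshifts that are not real SDSS data. It shows that the forest output for regression is the average of the outputs of the individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_galaxies = 400

# Synthetic stand-ins for five magnitudes per galaxy (not real SDSS data).
magnitudes = rng.uniform(14.0, 18.0, size=(n_galaxies, 5))
# Toy relation: fainter objects in the fourth band have higher redshift.
redshift = 0.02 * (magnitudes[:, 3] - 14.0) + rng.normal(scale=0.005, size=n_galaxies)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(magnitudes[:300], redshift[:300])

# Prediction of the whole ensemble for the held-out galaxies.
forest_prediction = forest.predict(magnitudes[300:])

# The same prediction, rebuilt by averaging the individual trees.
tree_predictions = np.stack(
    [tree.predict(magnitudes[300:]) for tree in forest.estimators_]
)
mean_of_trees = tree_predictions.mean(axis=0)
```

Because many tree outputs are averaged, the ensemble can return values between the leaf values of any single tree, which is what makes it usable for a continuous quantity such as redshift.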
3.2. Variable Selection
Feature selection in a multidimensional environment is a complex task that often requires heuristics or assumptions, which can then be tested empirically. Several different methods were used for variable selection, including hand-crafted variable selection and automatic statistical selection of the variables. The hand-crafted set of variables consists of the variables used in [
25] to compute the photometric redshift of SDSS DR9 objects. These variables are listed in
Table 1, and a short description of each variable can be found in
Table A1.
Another method of variable selection is based on computing the variable’s analysis of variance (ANOVA) F-value, and then selecting the highest rated features [
37]. That was done using the Python library
scikit-learn. The function
sklearn.feature_selection.SelectKBest takes a dataset and a scoring function and selects the most informative features. For the scoring function, we used the ANOVA F-value computation function
sklearn.feature_selection.f_classif. These variables are selected automatically, and therefore some of the variables might not necessarily have a straightforward physical explanation, but in the context of the database can provide useful patterns when used in combination with other variables. We chose the top 13 (KBest13), 21 (KBest21), and 31 (KBest31) variables and ran the random forest algorithm on each subset of variables.
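A minimal sketch of this selection step is given below. Since `f_classif` treats the target as class labels, the sketch bins a synthetic redshift into five quantile classes before scoring; that binning, the synthetic data, and all variable names are assumptions for illustration, because the text does not describe how the target was passed to the scoring function.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_galaxies, n_features = 500, 40

# Synthetic feature matrix standing in for the photometric variables.
X = rng.normal(size=(n_galaxies, n_features))
# Toy redshift that depends only on the first three columns.
z = (0.1 + 0.02 * X[:, 0] - 0.015 * X[:, 1] + 0.01 * X[:, 2]
     + rng.normal(scale=0.005, size=n_galaxies))

# Assumption: discretize the redshift into quantile bins for the ANOVA F-test.
bin_edges = np.quantile(z, [0.2, 0.4, 0.6, 0.8])
z_bins = np.digitize(z, bin_edges)

# Keep the 21 highest-scoring variables, mirroring the KBest21 set.
selector = SelectKBest(score_func=f_classif, k=21)
X_selected = selector.fit_transform(X, z_bins)
selected_columns = selector.get_support(indices=True)
```

Changing `k` to 13 or 31 would produce the analogues of the KBest13 and KBest31 sets.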
Table 2 shows the mean absolute error when using the different feature sets. The mean absolute error is defined by $\frac{1}{N}\sum_{i=1}^{N}\left|z_{phot,i}-z_{spec,i}\right|$, where $z_{phot,i}$ is the photometric redshift of galaxy $i$, $z_{spec,i}$ is the spectroscopic redshift of galaxy $i$, and $N$ is the total number of galaxies in the test samples. That process is repeated 10 times using a 10-fold cross-validation test strategy, meaning that the test is performed 10 times such that in each run a different set of 10% of the samples is used for testing, and the remaining 90% for training.
As
Table 2 shows, using 21 variables provided the best performance, and the larger feature set did not lead to better accuracy, although the difference in performance when using different sets of variables is small. All experiments were performed with a standard 10-fold cross-validation testing strategy.
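The 10-fold cross-validation procedure described above can be sketched as follows. The snippet uses scikit-learn rather than Weka, and synthetic data rather than SDSS photometry, so the numbers it produces are illustrative only; the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)

# Synthetic stand-ins for four photometric variables and a redshift.
X = rng.uniform(14.0, 18.0, size=(300, 4))
z_spec = 0.03 * (X[:, 0] - 14.0) + rng.normal(scale=0.004, size=300)

fold_errors = []
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    # Train on 90% of the samples and test on the remaining 10%.
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(X[train_idx], z_spec[train_idx])
    z_phot = model.predict(X[test_idx])
    fold_errors.append(np.mean(np.abs(z_phot - z_spec[test_idx])))

# Average of the per-fold mean absolute errors.
mean_absolute_error = float(np.mean(fold_errors))
```

Every galaxy is used for testing exactly once, so the reported error is never computed on a sample the model was trained on.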
Finally, the set of KBest21 variables was enhanced with the four color variables, which are not included in SDSS and were therefore not analyzed by ANOVA, creating the KBestMod variable set. That variable set included the following variables: deVMag_g, deVMag_r, deVMag_u, dered_g, dered_r, dered_u, expMag_g, expMag_r, fiberMag_g, fiberMag_u, g, petroMag_g, petroMag_r, psfMag_g, psfMag_u, ra, dec, i, r, u, u-g, g-r, r-i, i-z.
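The four color variables are differences between magnitudes in adjacent bands. A minimal sketch of how they could be derived is shown below; which magnitude type is used for the subtraction is not stated in the text, so the inputs here are generic per-band magnitudes and the function name is hypothetical.

```python
import numpy as np

def add_color_features(u, g, r, i, z_band):
    """Compute the four colors from per-band magnitudes (arrays of equal length)."""
    u, g, r, i, z_band = (
        np.asarray(band, dtype=float) for band in (u, g, r, i, z_band)
    )
    return {
        "u-g": u - g,
        "g-r": g - r,
        "r-i": r - i,
        "i-z": i - z_band,
    }

# One hypothetical galaxy; magnitudes are illustrative, not catalog values.
colors = add_color_features(u=[19.0], g=[17.5], r=[16.8], i=[16.4], z_band=[16.1])
```

The resulting columns would be appended to the selected SDSS variables before training.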
3.3. Performance Evaluation
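In order to evaluate the performance we used three metrics: the mean absolute error, the root mean square error, and the normalized error. With $z_{phot,i}$ and $z_{spec,i}$ denoting the photometric and spectroscopic redshift of galaxy $i$, and $N$ the number of test galaxies, the root mean square error (RMSE) is defined by Equation (1):
$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(z_{phot,i}-z_{spec,i}\right)^{2}} \tag{1}$$
The normalized error $\Delta z_{norm}$ of a galaxy is defined by Equation (2):
$$\Delta z_{norm}=\frac{z_{phot}-z_{spec}}{1+z_{spec}} \tag{2}$$
and the mean normalized error $\langle\Delta z_{norm}\rangle$ is the mean $\Delta z_{norm}$ of the test galaxies.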
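The three metrics can be computed with a few lines of NumPy. The sketch below assumes the common convention in which the normalized error of a galaxy is the signed difference between photometric and spectroscopic redshift divided by one plus the spectroscopic redshift; the function name and the example values are hypothetical.

```python
import numpy as np

def redshift_metrics(z_phot, z_spec):
    """Return MAE, RMSE, and mean/std of the normalized error."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    diff = z_phot - z_spec
    # Assumed convention: signed error scaled by (1 + z_spec).
    normalized = diff / (1.0 + z_spec)
    return {
        "mae": float(np.mean(np.abs(diff))),
        "rmse": float(np.sqrt(np.mean(diff**2))),
        "mean_normalized": float(np.mean(normalized)),
        "std_normalized": float(np.std(normalized)),
    }

# Three hypothetical galaxies.
metrics = redshift_metrics(z_phot=[0.10, 0.20, 0.30], z_spec=[0.12, 0.20, 0.27])
```

Because the normalized error is signed in this convention, positive and negative errors partly cancel in its mean, which is why its standard deviation is reported alongside it.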
Table 3 shows the performance for the different variable sets when using the random forest classifier. As the table shows, the
KBestMod feature set performs better than the other feature sets for all three performance metrics.
Additionally, the random forest algorithm performs better than other algorithms.
Table 4 shows the performance of the Simple Linear Regression, MultiLayer Perceptron, M5P, ZeroR, Decision Table, and Random Forest machine learning algorithms when using the
KBestMod feature set.
As the table shows, the random forest algorithm provided the best performance, with a mean normalized error of ∼0.0022, lower than any of the other algorithms. The standard deviation of the normalized error is also the lowest when using the random forest classifier (0.0107). The median absolute deviation of the random forest classifier is ∼0.056. M5P produced a marginally higher mean absolute error and root mean square error than the random forest, and Simple Linear Regression provided the worst performance in both metrics (Table 4).
It should be mentioned that the performance figures provided in
Table 4 reflect the performance on the galaxy sample of the catalog described in
Section 2, but not the entire galaxy population in SDSS.
The performance of the algorithm was also compared to the performance of the redshift estimation using Multilayer Perceptron, as well as the photometric redshift methods based on the Nearest Neighbour Polynomial (NNP), and the neural network algorithms CC2 and D1 [
9].
Table 5 shows that the photometric redshift estimation based on the random forest is favorably comparable to other methods, and therefore can provide a solution to the estimation of the redshift of the galaxies in the catalog.
Photometric redshift accuracy can change in different color ranges.
Figure 4 and
Figure 5 show the change in the absolute and normalized error, respectively, at different magnitudes in the different bands. The red lines show the least square error linear regression.
Figure 6 and
Figure 7 show the absolute and normalized error, respectively, as a function of the color.
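The trend lines in such plots are ordinary least-squares fits of the error against magnitude or color. A minimal sketch with synthetic data is shown below; the slope, intercept, and noise level used to generate the points are arbitrary illustrative values, not measurements from the catalog.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic magnitudes and absolute errors with a built-in upward trend.
magnitude = rng.uniform(14.0, 18.0, size=1000)
abs_error = 0.005 + 0.002 * (magnitude - 14.0) + rng.normal(scale=0.001, size=1000)

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(magnitude, abs_error, deg=1)
```

A positive fitted slope corresponds to an error that grows as the objects get dimmer.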
The figures show that the error increases as the objects get dimmer. The correlation between the error and the magnitude can be expected, as dimmer galaxies also tend to have higher redshift, and the error is expected to increase as the redshift gets higher. The threshold of 0.54 was selected based on previous experiments by comparing it to the Galaxy Zoo “superclean” samples [
6]. Any threshold higher than 0.54 does not increase the consistency of the dataset substantially.
As
Figure 6 and
Figure 7 show, the error also tends to decrease slightly when the galaxies are bluer. That can be explained by the observation that spiral galaxies in the catalog tend to have lower redshift. Since spiral galaxies also tend to be bluer than elliptical galaxies, bluer galaxies in the catalog have lower redshift, and therefore their estimated photometric redshifts have a lower error.
Figure 8 shows the frequency of galaxies as a function of the normalized and absolute error. As the figure shows, catastrophic outliers with error of more than 0.15 are very rare, and are less than 0.2% of the cases.
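The fraction of catastrophic outliers can be computed directly from the test predictions. The sketch below assumes that the 0.15 limit applies to the absolute difference between photometric and spectroscopic redshift; the function name and the example values are hypothetical.

```python
import numpy as np

def catastrophic_outlier_fraction(z_phot, z_spec, threshold=0.15):
    """Fraction of galaxies whose absolute redshift error exceeds the threshold."""
    error = np.abs(np.asarray(z_phot, dtype=float) - np.asarray(z_spec, dtype=float))
    return float(np.mean(error > threshold))

# Four hypothetical galaxies; only the third has an error above 0.15.
fraction = catastrophic_outlier_fraction(
    z_phot=[0.10, 0.12, 0.50, 0.08],
    z_spec=[0.11, 0.10, 0.20, 0.08],
)
```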
3.4. Dependence on the Size of the Training Set
The performance of machine learning algorithms is heavily dependent on the size of the dataset on which they are trained, and larger training sets normally lead to improved performance of the algorithm. However, the accuracy does not grow with the size of the training set in a linear fashion, and at a certain point it is expected that increasing the size of the training set has a negligible contribution to the performance of the machine learning algorithm [
38].
To determine the effect of the size of the training set on the performance, the random forest algorithm and the
KBestMod feature set were used with several training set sizes ranging from 1000 to 20,000 galaxies. The results of these experiments are shown in
Table 6. As the table shows, the accuracy of the algorithm improves as the size of the training set gets larger, but the improvement in the mean normalized error becomes negligible when the number of training samples reaches ∼5000. The mean absolute error and the root mean square error also show a very small decrease beyond 5000 training samples, but the decrease is more substantial compared to the mean normalized error. Therefore, more than 5000 training samples will make a minor contribution to the mean normalized error, while having a somewhat higher impact on the mean absolute error and the root mean square error. It should be noted that for the machine learning algorithms used in this study, using large training sets as shown in the table does not add substantial computing requirements.
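An experiment of this kind can be sketched by training the same model on progressively larger subsets and evaluating each on a fixed held-out set. The snippet below uses scikit-learn and synthetic data with much smaller sizes than those in the study, so it only illustrates the procedure; all names and numbers are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

# Synthetic photometric variables and a nonlinear toy redshift relation.
X = rng.uniform(14.0, 18.0, size=(2500, 4))
z_spec = (0.01 * (X[:, 0] - 14.0) ** 2 + 0.005 * np.sin(X[:, 1])
          + rng.normal(scale=0.003, size=2500))

# The last 500 samples are held out and never used for training.
X_test, z_test = X[2000:], z_spec[2000:]

errors_by_size = {}
for n_train in (100, 500, 2000):
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(X[:n_train], z_spec[:n_train])
    z_phot = model.predict(X_test)
    errors_by_size[n_train] = float(np.mean(np.abs(z_phot - z_test)))
```

Plotting `errors_by_size` against the training set size shows where the curve flattens and additional samples stop paying off.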
4. Catalog
The purpose of the photometric redshift methods described in
Section 3 is to compute the photometric redshifts of the objects in the catalog of galaxy morphologies described in
Section 2. The photometric redshift of the galaxies in the catalog was computed with the
KBestMod feature set and the random forest algorithm.
The catalog contains the 2,912,341 SDSS galaxies classified automatically into spiral and elliptical galaxies [
6]. For each galaxy, the catalog contains the SDSS DR8 object ID of the galaxy, its right ascension, declination, elliptical and spiral marginal probabilities, and the computed photometric redshift.
Figure 9 displays the distribution of the galaxies in the catalog across different photometric redshift ranges. The figure shows that the number of elliptical galaxies remains fairly constant across the redshift ranges, but increases at around redshift of 0.35. The higher number of galaxies in that redshift range is aligned with previous studies, showing a peak in the total number of galaxies at around z = 0.35 [
39]. On the other hand, it should be noted that the drop in the number of elliptical galaxies beyond z = 0.35 can be related to the limiting magnitude of the catalog. Galaxies with i magnitude dimmer than 18 are excluded from the catalog, and the number of galaxies with redshift greater than 0.35 that satisfy the magnitude threshold is small, and gets smaller as the redshift increases and consequently the galaxies get dimmer. The number of spiral galaxies peaks at around redshift of 0.085, and then decreases gradually.
The catalog was also analyzed regarding the distribution of broad morphology of the galaxies in each redshift range.
Figure 10 shows the fraction of galaxies identified as spiral among the total number of galaxies within each photometric redshift range. As the figure shows, the proportion of spiral galaxies in the catalog drops as the redshift gets higher. In the redshift range of 0.05–0.1 the fraction of spiral galaxies in the total number of galaxies is ∼0.62, while it is ∼0.4 at the redshift of 0.15, and drops to less than 0.2 when the redshift is 0.2 or higher.
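The fraction of spiral galaxies in each redshift range can be obtained by histogramming the photometric redshifts of all galaxies and of the spiral galaxies with the same bin edges. The following sketch uses hypothetical inputs and bin edges; it is not the code used to produce the figure.

```python
import numpy as np

def spiral_fraction_by_redshift(z_phot, is_spiral, bin_edges):
    """Fraction of spirals per redshift bin; NaN where a bin is empty."""
    z_phot = np.asarray(z_phot, dtype=float)
    is_spiral = np.asarray(is_spiral, dtype=bool)
    total, _ = np.histogram(z_phot, bins=bin_edges)
    spirals, _ = np.histogram(z_phot[is_spiral], bins=bin_edges)
    # Suppress the divide-by-zero warning raised for empty bins.
    with np.errstate(invalid="ignore", divide="ignore"):
        fraction = np.where(total > 0, spirals / total, np.nan)
    return fraction

# Eight hypothetical galaxies spread over four redshift bins.
z_phot = [0.06, 0.07, 0.08, 0.12, 0.13, 0.22, 0.24, 0.26]
is_spiral = [True, True, False, True, False, False, False, False]
bin_edges = [0.05, 0.10, 0.15, 0.20, 0.30]
fraction = spiral_fraction_by_redshift(z_phot, is_spiral, bin_edges)
```

Empty bins are returned as NaN rather than zero so that they are not mistaken for bins that contain only elliptical galaxies.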
It should be noted that the population of galaxies in the catalog might not represent a random sample of the galaxies in the local universe, but is limited to galaxies that are sufficiently bright and sufficiently large to be analyzed morphologically given the limitation of SDSS imaging power. Elliptical galaxies are brighter than spiral galaxies [
40], and therefore the magnitude threshold (i magnitude < 18) can lead to the higher number of elliptical galaxies that meet that threshold to be included in the catalog. Because the redshift and magnitude are strongly correlated, at higher redshifts more elliptical galaxies pass the magnitude threshold compared to spiral galaxies, consequently leading to the higher population of elliptical galaxies compared to spiral galaxies at these redshift ranges.
The increased population of elliptical galaxies at higher redshifts can also be the result of the fact that spiral patterns become more difficult to identify in fainter and smaller galaxies, although the galaxies are all relatively large (Petrosian radius larger than 5.5″) and bright (i magnitude < 18), and the annotations of the morphologies of these galaxies agree to a very high extent of ∼98% with the debiased “superclean” annotations of Galaxy Zoo. Because the initial selection of galaxies is based on the magnitude, more elliptical galaxies with higher photometric redshift can be included in the catalog, and therefore the increase of their population at higher redshifts does not necessarily reflect higher population in the higher redshift ranges of this catalog. On the other hand, the size threshold ensures that only large objects are selected, so that bright and small objects are excluded from the catalog.
Figure 11 shows the distribution of the galaxies in the catalog by their broad morphology, right ascension, and photometric redshift. Similarly to
Figure 10, the lower redshift range has a much higher number of spiral galaxies, which can also be related to the fact that spiral patterns are more difficult to identify as the redshift gets higher.
The redshift range of the galaxies used in the catalog is relatively small in terms of galaxy evolution. Some observations within that redshift range have been noted, such as the higher population of faint blue galaxies at redshift range of 0.3 to ∼1 [
41]. A more recent observation showed that the population of settled disk galaxies changes in the range of
[
42]. The absolute magnitude of galaxies also increases (becomes dimmer) when the redshift increases in the range of
[
39]. Another study showed a decrease in the population of massive late type galaxies in the
, which largely agrees with the distribution of the late-type galaxies in this catalog [
43].
5. Conclusions
The primary goals of this study are to test machine learning and variable selection algorithms for computing photometric redshift, optimize them for a specific population of galaxies, and, most importantly, apply these algorithms to provide a large catalog of galaxy morphology and photometric redshift.
The catalog presented in this paper is similar to the early Galaxy Zoo 1 catalog, but because it was classified automatically it provides a much higher number of galaxies. Of the ∼3 × 10 galaxies in the catalog, ∼1.5 × 10 are galaxies with 98% agreement rate with the Galaxy Zoo 1 debiased “superclean” accuracy. It is limited in the sense that, like Galaxy Zoo, it represents the galaxies in the catalog, and not necessarily a complete and unbiased sample of SDSS galaxies.