1. Introduction
In spatial epidemiology, the spatial distribution of diseases is used to construct disease maps that reveal complex spatial patterns of the diseases of interest. When Bayesian hierarchical models are used for disease mapping, various spatially structured random effects can be included in the models. Interest in disease mapping has recently resurged, and many efficient methods have been proposed in the literature (see Moraga and Lawson 2012 [1]; Duncan et al., 2017 [2]; Lawson 2018 [3]; Baer and Lawson 2019 [4]). To the best of our knowledge, the application of disease mapping concepts to related problems in astronomy within the context of spatial regression remains unaddressed. This knowledge gap is the driving force behind our investigation into whether spatial disease mapping techniques can be used to examine the occurrence of Earth-size planets in the Kepler survey. Disease mapping leverages information from neighboring regions for parameter estimation in epidemiology, leading to more accurate spatial predictions. In this study, we extended this approach by incorporating spatial random effects to capture the spatial correlation in the data. The incorporation of neighboring region information is still relatively unexplored in astronomy (e.g., Petigura et al., 2018 [5]).
The Kepler mission aims to explore the diversity of planets and planetary systems. The discovery of thousands of transiting planets and planet candidates by the Kepler mission has drastically broadened our knowledge of exoplanets, especially in the category of close-in (≲1 AU) and small (≲4 Earth radii) planets around main-sequence dwarf stars (see Batalha 2014 [6]; Burke et al., 2014 [7]; Mullally et al., 2015 [8]). The inference of the occurrence of Earth-size planets is a problem that has attracted the attention of astronomers because of its importance for theories of planet formation and evolution (see Benz et al., 2014 [9]). Owing to the low false positive rate of the survey (see Fressin et al., 2013 [10]; Lissauer et al., 2014 [11]; but see Santerne et al., 2016 [12] for different results on giant-planet candidates), numerous works have offered statistical studies of planet occurrence rates in terms of orbital period and planet radius (see Dong and Zhu 2013 [13]; Fressin et al., 2013 [10]; Petigura et al., 2013 [5]; Burke 2015 [14]; Dressing and Charbonneau 2015 [15]; Silburt et al., 2015 [16]; Morton et al., 2016 [17]).
In this paper, we took the exoplanet sample and its corresponding survey completeness from Petigura et al. (2013) [5]. In the proposed methodology, we defined planet occurrence based on the detection of a planet within a specified range of orbital period and planet radius. To account for the spatial dependence in the data, we applied a spatial Poisson regression model (e.g., Besag et al., 1991 [18]; Chen and Yang 2011 [19]; Cressie 2015 [20]) to model the detection probability of an exoplanet. Further, to infer the posterior probability of detecting an exoplanet, we designed a stochastic algorithm based on Markov chain Monte Carlo (MCMC) under the Bayesian framework. Finally, the posterior inferences simultaneously describe the number of exoplanets and the corresponding occurrence rate in the study region.
The remainder of this paper is organized as follows. In Section 2, we introduce a joint modeling methodology and present how to estimate the parameters of the proposed model. Section 3 applies the proposed model to determine the occurrence rate of Kepler planets. We conclude the paper with a discussion in Section 4.
2. Methodology
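Let $D$ be a bounded continuous random field in $\mathbb{R}^2$, which is partitioned into $n$ regular grids $D_1, \ldots, D_n$ with $\bigcup_{i=1}^{n} D_i = D$ and $D_i \cap D_j = \emptyset$ for $i \neq j$. Let $Y_i$, $i = 1, \ldots, n$, be a random variable that counts the number of exoplanets in grid $D_i$. For grid $D_i$, the expected number, $E_i$, of exoplanets can be easily evaluated by: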
Motivated by the concept of a standardized mortality ratio in epidemiology (see Kelsall and Wakefield 2002 [21]; Lawson 2018 [3]), a standardized occurrence ratio of exoplanets for the grid $D_i$ is defined by the ratio of the observed to the expected count, $Y_i / E_i$.
In general, one can simply use $Y_i / E_i$ as the occurrence rate of exoplanets in grid $D_i$. One potentially influential factor, however, is that planets generally interact gravitationally, so the correlation of the data among grids should be considered when estimating such an occurrence rate. The quantity $Y_i / E_i$ does not take into account the dependence among grids. Thus, using $Y_i / E_i$ to estimate the occurrence rate of exoplanets in grid $D_i$ may yield inaccurate results. Motivated by existing works (see Kelsall and Wakefield 2002 [21]; Chen and Yang 2011 [19]; Moraga and Lawson 2012 [1]; Lawson 2018 [3]; Baer and Lawson 2019 [4]), we applied a spatial conditional autoregressive (CAR) model (see Moraga and Lawson 2012 [1]; Cressie 2015 [20]; Lawson 2018 [3]; Baer and Lawson 2019 [4]) to describe possible spatial correlations among grids and then proposed estimates of the occurrence rate of exoplanets in each grid $D_i$, $i = 1, \ldots, n$.
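As a minimal illustration of the standardized occurrence ratio, the following Python sketch computes observed-to-expected ratios on a small grid. The counts are invented for illustration, the function name `standardized_occurrence_ratio` is hypothetical, and the expected count is assumed to be an equal share of the total count in every cell; this equal-share form is an assumption of the sketch and not necessarily the definition used in this paper.

```python
import numpy as np

def standardized_occurrence_ratio(counts):
    """Return (expected, ratio) for a grid of observed counts.

    ASSUMPTION: every cell is expected to hold the same share of the
    total count, so E_i = sum(Y) / n for all cells.
    """
    counts = np.asarray(counts, dtype=float)
    expected = np.full(counts.shape, counts.sum() / counts.size)
    return expected, counts / expected

# Hypothetical counts on a 2 x 3 grid (illustrative only, not Kepler data).
y = np.array([[4, 0, 2],
              [6, 3, 3]])
expected, ratio = standardized_occurrence_ratio(y)
print(ratio)
```

A ratio above one marks a cell with more detections than expected, and a ratio of zero marks an empty cell. Each ratio uses only its own cell, which is the limitation that the spatial model is meant to address.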
2.1. Spatial Poisson Regression Model
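For $i = 1, \ldots, n$, let $\theta_i$ be the occurrence rate of exoplanets in grid $D_i$. Then, an intuitive model for $Y_i$ given $\theta_i$, $i = 1, \ldots, n$, is a Poisson distribution as follows:
$$Y_i \mid \theta_i \sim \mathrm{Poisson}(E_i \theta_i). \tag{1}$$
In Equation (1), $E_i \theta_i$ represents the intensity rate of the Poisson process and $\theta_i$ is the main parameter of interest in this research. In this paper, our goal was to incorporate the spatial dependence of $Y_1, \ldots, Y_n$ to estimate the unobserved variables $\theta_1, \ldots, \theta_n$. Suppose that there are $p$ grid-level covariates observed in grid $D_i$, denoted together with 1 for the intercept by $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ip})'$. As suggested in Besag et al. (1991) [18], the occurrence rate $\theta_i$ of interest can be modeled in the following manner:
$$\log \theta_i = \mathbf{x}_i' \boldsymbol{\beta} + \varepsilon_i, \tag{2}$$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)'$ is a spatial random error process. In spatial statistics, the spatial random errors $\varepsilon_i$ capture the spatial variation and can offer a local adjustment to the mean trend due to unobserved covariates. In general, we assume that $\boldsymbol{\varepsilon}$ follows a multivariate Gaussian process as follows: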
where the
matrix
is a spatial correlation matrix,
is an unknown parameter, and
is a variance component. According to the CAR model,
given in Equation (
3) can be further decomposed as
where
is an
spatial association matrix,
is an identity matrix, and
. Under these settings, we have the following facts: (i)
is nonsingular; (ii) when
,
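is symmetric and positive-definite, where the lower and upper limits of the unknown parameter are given, respectively, by the inverses of the smallest and the largest eigenvalues of the spatial association matrix. For the sake of simplicity, in this paper, we constructed $C$ according to the rook contiguity structure; that is, the $(i, j)$th element of $C$ is of the following form:
$$c_{ij} = \begin{cases} 1, & \text{if } D_i \text{ and } D_j \text{ share a common boundary } (i \neq j), \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$
Note that $c_{ij} = 1$ in Equation (4) represents that $D_i$ and $D_j$ are neighbors with a common boundary.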
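To make the rook contiguity structure and the eigenvalue bounds concrete, the following Python sketch builds the association matrix for a rectangular grid and inverts its extreme eigenvalues. The function name `rook_adjacency` and the 6 x 4 layout are assumptions made for illustration; the sketch is not the code used in this paper.

```python
import numpy as np

def rook_adjacency(n_rows, n_cols):
    """Rook-contiguity matrix: c_ij = 1 when cells i and j share an edge."""
    n = n_rows * n_cols
    C = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:          # right-hand neighbor
                C[i, i + 1] = C[i + 1, i] = 1.0
            if r + 1 < n_rows:          # neighbor below
                C[i, i + n_cols] = C[i + n_cols, i] = 1.0
    return C

# ASSUMPTION: a 6 x 4 layout, chosen only for illustration.
C = rook_adjacency(6, 4)
eigvals = np.linalg.eigvalsh(C)         # eigenvalues in ascending order
phi_lower = 1.0 / eigvals[0]            # inverse of the smallest (negative) eigenvalue
phi_upper = 1.0 / eigvals[-1]           # inverse of the largest eigenvalue
print(round(eigvals[-1], 2), round(phi_upper, 2))
```

Under this assumed layout the largest eigenvalue is about 3.42 and its inverse is about 0.29, which agrees with the values quoted in Section 3, although the actual grid layout is not restated here. Because the rook graph of a rectangular grid is bipartite, the smallest eigenvalue is the negative of the largest, so the admissible interval is symmetric about zero.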
We define
to be the neighborhood set of grid
and
; then, the conditional distribution of
conditioned on
is given by
for
. Note that the joint distribution of
can be shown to be a multivariate Gaussian distribution as in Equation (3) based on the factorization theorem of Besag (1974) [22] and the properties of multivariate Gaussian distributions. Readers can refer to De Oliveira (2012) [23] for a comprehensive and systematic introduction to the CAR model, including the justification of Equation (5). It is clear from Equation (5) that spatial dependence enters through the information derived from neighbors. The spatial Poisson regression model thus offers the advantage of incorporating information from neighboring regions to enhance parameter estimation and prediction. The consideration of data correlation is still relatively uncommon in the recent literature, as observed in studies such as Petigura et al. (2018) [24].
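The link between the joint Gaussian distribution and the neighbor-based conditional distribution can be checked numerically. The sketch below assumes the standard proper CAR form, in which the joint covariance is $\sigma^2 (I - \phi C)^{-1}$; this form, the 3 x 3 grid, and the parameter values are assumptions for illustration. It uses the general Gaussian identity that the conditional mean and variance of one component follow from the corresponding row of the precision matrix.

```python
import numpy as np

# ASSUMPTIONS: joint covariance sigma2 * inv(I - phi * C); a hypothetical
# 3 x 3 rook grid; illustrative values for phi and sigma2.
n_rows, n_cols, phi, sigma2 = 3, 3, 0.2, 1.5
n = n_rows * n_cols
C = np.zeros((n, n))
for i in range(n):
    r, c = divmod(i, n_cols)
    if c + 1 < n_cols:
        C[i, i + 1] = C[i + 1, i] = 1.0
    if r + 1 < n_rows:
        C[i, i + n_cols] = C[i + n_cols, i] = 1.0

Q = (np.eye(n) - phi * C) / sigma2       # joint precision matrix
eps = np.random.default_rng(0).normal(size=n)

i = 4                                     # centre cell with four rook neighbors
others = [j for j in range(n) if j != i]
# General Gaussian result: E[e_i | rest] = -(1 / Q_ii) * sum_j Q_ij * e_j
cond_mean = -Q[i, others] @ eps[others] / Q[i, i]
cond_var = 1.0 / Q[i, i]
neighbor_mean = phi * C[i] @ eps          # phi times the sum over rook neighbors
print(np.isclose(cond_mean, neighbor_mean), np.isclose(cond_var, sigma2))
```

Under these assumptions the conditional mean of a cell equals the spatial dependence parameter times the sum of its neighbors' errors, and the conditional variance equals the variance component, which is the sense in which neighboring information enters the model.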
2.2. Prior Specifications and Posterior Distribution
Using the Bayesian approach, we set mutually independent prior distributions on parameters
,
, and
as shown in
Table 1. For
and
, the hyper-parameters are pre-specified constants such that the corresponding priors are nearly flat. Based on the CAR model, the spatial dependence parameter
must fall within
to ensure that
is a positive-definite matrix. However,
can be less than zero, leading to a negative spatial correlation, which is rare in practice. Hence, we further restricted the spatial correlation parameter
domain to
, ensuring positive spatial correlation. This modification ensures that the model captures the desired spatial dependence structure and aligns with common practices in the field. According to the priors in
Table 1, the joint prior distribution of
,
, and
, denoted as
, is given by
Combining Equations (
1)–(
3) and Equation (
6), the joint posterior distribution of
,
,
, and
conditioned on observed data
satisfies:
Because posterior samples of the model parameters cannot be generated directly from the joint posterior distribution in Equation (7), a Markov chain Monte Carlo (MCMC) method is introduced in the following to generate them.
2.3. Posterior Inferences of Model Parameters
To generate posterior samples of
,
,
, and
, the conditional posterior distributions of each parameter given all of the others are needed. One can then successively sample these conditional posterior distributions and obtain Markov chains in the parameter spaces that will converge to the joint posterior distribution of Equation (
7) under Tierney’s conditions (1994) [
25].
Next, we summarize all necessary conditional posterior distributions for
,
,
, and
, based on Equations (
1)–(
7) as follows:
We notice that
is an inverse gamma distribution; that is,
. Therefore, a Gibbs sampling algorithm (see Geman and Geman 1984 [
26]) can be used to generate the posterior samples of
. However,
,
, and
, are not all standard distributions; hence, a Metropolis–Hastings algorithm (see Chib and Greenberg 1995 [
27]) can be applied to
,
, and
, respectively, to iteratively generate an ergodic Markov chain that yields the corresponding posterior samples. In particular, generating the posterior samples of
is relatively difficult because
appears in the covariance matrix
. In this paper, we treated
as a discrete random variable that is defined on finite grid points from 0 to
; hence, the values of matrix
on these finite grid points can be computed in advance. For each step, the posterior sample of
is generated from a probability mass function, which is based on the values of
evaluated on the finite grid points of
.
Based on the posterior samples of these quantities, inferences on the model parameters and on the occurrence rate of exoplanets in each grid can be obtained.
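The discretized update of the spatial dependence parameter described above can be sketched as follows. This is a minimal Python illustration and not the R code used in this paper: it assumes a Gaussian random effect with covariance $\sigma^2 (I - \phi C)^{-1}$ and a flat prior on the grid points, and the function names `precompute_logdets` and `sample_phi`, the 4-cell neighbor chain, and the input values are all hypothetical.

```python
import numpy as np

def precompute_logdets(C, phi_grid):
    """Evaluate log det(I - phi * C) once for every candidate phi."""
    eye = np.eye(C.shape[0])
    return np.array([np.linalg.slogdet(eye - p * C)[1] for p in phi_grid])

def sample_phi(eps, sigma2, C, phi_grid, logdets, rng):
    """Draw phi from its discretized conditional (flat prior on the grid).

    ASSUMPTION: eps ~ N(0, sigma2 * inv(I - phi * C)), so the log density is
    0.5 * log det(I - phi * C) - eps' (I - phi * C) eps / (2 * sigma2) + const.
    """
    quad = eps @ eps - phi_grid * (eps @ C @ eps)
    log_w = 0.5 * logdets - quad / (2.0 * sigma2)
    weights = np.exp(log_w - log_w.max())   # subtract max for numerical stability
    return rng.choice(phi_grid, p=weights / weights.sum())

# Hypothetical 4-cell chain of neighbors and illustrative inputs.
C = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
phi_max = 1.0 / np.linalg.eigvalsh(C)[-1]
phi_grid = np.linspace(0.0, phi_max, 50, endpoint=False)  # excludes phi_max
logdets = precompute_logdets(C, phi_grid)
rng = np.random.default_rng(1)
eps = rng.normal(size=4)
phi_draw = sample_phi(eps, 1.0, C, phi_grid, logdets, rng)
```

Computing the log-determinants once is what makes the discrete treatment economical: each MCMC step then needs only one quadratic form and a draw from a probability mass function. The upper endpoint is excluded from the grid because the matrix becomes singular there.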
3. Application of the Proposed Methodology
To model the occurrence distribution of planets as a function of the planet period and radius, Petigura et al. (2013) [5] considered transiting planets that are all hosted by GK-type stars. They defined GK-type stars as those with surface temperatures of $T_{\mathrm{eff}} = 4100$–$6100$ K and gravities of $\log g = 4.0$–$4.9$. Furthermore, these planets are restricted to the brightest GK-type stars observed by Kepler ($K_p = 10$–$15$ mag). These 42,557 stars have the lowest photometric noise in the Kepler survey, thereby maximizing the detectability of Earth-size planets. In the present work, we studied the occurrence rate of planets based on the catalog by Petigura et al. (2013) [5]; adopting the same study region allows us to compare our findings with their seminal work.
Figure 1 shows the scatter plot of the data. Let $P$ be the orbital period (days), $R$ be the planet size (Earth radii), and $D$ be the region of interest for this work; it is divided into the $n$ grids shown in Figure 2. Let $Y_i$ record the number of events in grid $D_i$ for $i = 1, \ldots, n$. Please note that the region $D$ is the same as in Petigura et al. (2013) [5].
We applied the linear regression model illustrated in Equation (2) of Section 2.1 to model the occurrence rate $\theta_i$ and considered two grid-level covariates, $x_{i1}$ and $x_{i2}$, in the model, where $x_{i1}$ and $x_{i2}$ are, respectively, defined by the central points of the orbital period (days) and planet size (Earth radii) of the grid $D_i$ (i.e., the central coordinate of the grid $D_i$) for $i = 1, \ldots, n$. As a result, the model used, called Model 1, is given by
$$\log \theta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \tag{8}$$
where $\beta_0$, $\beta_1$, and $\beta_2$ are unknown regression coefficients, and $\varepsilon_i$ is a spatial random error process. Based on the Bayesian approach in
Section 2.3, prior distributions of parameters in
are, respectively, set as follows:
Note that the upper limit of the spatial dependence parameter is 0.29 because the largest eigenvalue of C is 3.42 and its inverse is approximately 0.29. Since we lacked additional information about the central tendencies of the parameters, we selected hyper-parameter values that give the prior distributions large variances. Although larger variances may slow convergence, the MCMC algorithm can still converge. Larger variances also allow more flexibility and variation in the MCMC updates, enhancing exploration of the parameter space.
Next, we first examined whether the hypothesized model (i.e., Equations (1)–(3)) is suitable for analyzing the occurrence rate of Earth-size planets in the Kepler survey. We conducted a simulation study based on the Pearson chi-squared test to illustrate the goodness of fit of the model used (i.e., Model 1 in Equation (8)). In addition, as listed at the bottom of Table 2, a model with only the regressors (Model 2) and a model with only the spatial random error process (Model 3) were also used for comparison. Let
;
, be independently generated from
, with
E being the expected number of exoplanets evaluated according to the observed data
, where
is an estimate of the occurrence rate
based on the posterior medians of
under the used model (i.e., Model 1, Model 2, or Model 3) and
, represents the
t-th simulation. For each simulation replicate, the goodness-of-fit test statistic is computed in the following manner:
where
is the expected number of exoplanets evaluated based on the
t-th simulated data
;
. The simulation results are displayed in
Table 2. First, we notice that Model 2, with only the regressors, has a large goodness-of-fit statistic for each simulation replicate. This indicates that Model 2, which ignores the spatial correlation of the data, is inappropriate. Model 1 (the proposed model) and Model 3 both have relatively small values of the statistic, and hence both are appropriate for analyzing the occurrence rate of Earth-size planets. Overall, the values for Model 1 are slightly smaller than those for Model 3, which further suggests using Model 1 (i.e., Equation (8)) to analyze the data set. Even though the estimated regression coefficients are not significant (see Table 3), the regressors should, in general, contribute slightly to evaluating the occurrence rate. Moreover, Figure 3 shows credible intervals of the occurrence rates for Model 1, Model 2, and Model 3. The results are in accord with Table 2; that is, Figure 3 reveals that Model 2 performs poorly and that Model 1 and Model 3 are fairly comparable. On the other hand, the data may contain biases arising from limited observational precision, which can result in inaccurate estimates of the underlying occurrence rates. In our proposed methodology, the random effects describe the spatial correlation in the data and serve as a remedy for missing explanatory variables, addressing the limitations caused by important variables that were not collected. The simulation results indicate the effectiveness of our approach in mitigating potential biases and enhancing the model's explanatory power. Based on the results in Table 2 and Figure 3, Model 1 in Equation (8) is acceptable, and hence we used it to analyze the occurrence rate of Earth-size planets in what follows.
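The simulation-based goodness-of-fit check described above can be sketched in Python as follows. This is a hedged illustration rather than the procedure used in this paper: the function names `pearson_chi2` and `simulate_gof` are hypothetical, the inputs are invented, and the expected counts are assumed to be re-evaluated from each simulated data set as an equal share of the simulated total.

```python
import numpy as np

def pearson_chi2(observed, fitted):
    """Pearson statistic: sum over cells of (observed - fitted)^2 / fitted."""
    return float(np.sum((observed - fitted) ** 2 / fitted))

def simulate_gof(expected, theta_hat, n_sim, rng):
    """Score n_sim data sets simulated from the fitted Poisson model."""
    stats = np.empty(n_sim)
    for t in range(n_sim):
        y_t = rng.poisson(expected * theta_hat)          # t-th simulated counts
        # ASSUMPTION: expected counts are re-evaluated from the simulated
        # data as an equal share of the simulated total.
        expected_t = np.full(y_t.shape, y_t.sum() / y_t.size)
        stats[t] = pearson_chi2(y_t, expected_t * theta_hat)
    return stats

# Illustrative inputs only (not the Kepler estimates).
rng = np.random.default_rng(42)
expected = np.full(12, 5.0)
theta_hat = rng.uniform(0.5, 1.5, size=12)
chi2_values = simulate_gof(expected, theta_hat, n_sim=100, rng=rng)
print(chi2_values.mean())
```

Comparing the distribution of the simulated statistic across candidate models gives a rough ranking of fit: a model whose replicates yield consistently large values reproduces the data poorly.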
We ran 200,000 iterations for the posterior calculations to obtain a convergent sequence and approximately independent posterior samples. The first 100,000 iterations were discarded as burn-in. Subsampling every 10th scan of the remaining iterations then gives an approximately independent joint posterior sample of size 10,000. The execution time for 200,000 MCMC iterations was 56.26471 s on a PC with a 2.10 GHz i7-12700 processor. The system environment was R version 4.2.3 linked to Intel's Math Kernel Library (MKL) on Windows 11. The core of the MCMC process was implemented in custom-written code without relying on external packages. The trace plot in Figure 4 displays the logarithm values of Equation (7) for the 200,000 MCMC iterations. Given that the proposed model incorporates multiple parameters and random effect terms, we assessed the overall convergence of the MCMC process using these logarithm values. The trace plot shows that these values stay within a bounded interval over the 200,000 iterations, implying that the MCMC process has reached convergence.
Table 3 presents posterior inferences based on 10,000 posterior samples for model parameters. Furthermore, the posterior means of
for
, are shown in
Figure 5.
Figure 6 displays the results with estimated occurrence rates
,
in each grid.
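The bookkeeping of burn-in and subsampling can be verified with a short Python sketch. The chain here is a placeholder of iteration indices, not actual posterior draws; only the iteration counts are taken from the text.

```python
import numpy as np

n_iter, burn_in, thin = 200_000, 100_000, 10

# Placeholder chain: iteration indices stand in for real posterior draws.
chain = np.arange(n_iter)

# Drop the burn-in iterations, then keep every 10th remaining draw.
kept = chain[burn_in::thin]
print(kept.size)
```

Discarding the first 100,000 of 200,000 iterations and keeping every 10th of the remainder leaves 10,000 draws, which matches the number of posterior samples on which Table 3 is based.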
After obtaining the estimated occurrence rates in each cell shown in Figure 6, we further considered the variable detection efficiency (or survey completeness) in order to identify realistic occurrence rates. The values of the completeness function used here were constructed by Foreman-Mackey et al. (2014) [28]. We can thus obtain the true occurrence rates
,
in each cell, as shown in
Figure 7. Because the method proposed in this paper differs fundamentally from that of Petigura et al. (2013) [5], a comparison with their method is needed. We computed realistic occurrence rates for different ranges of orbital period ($P$) and planet radius ($R$), as shown in Table 4. Note that case (i) in Table 4 corresponds to Jupiter-size planets.
From Table 4, we find that (1) for cases (ii), (iii), (iv), (vii), and (ix), the occurrence rates obtained from the proposed method are larger than those of Petigura et al. (2013) [5] by approximately a factor of two; (2) for cases (i), (v), and (vi), the occurrence rates obtained by the proposed method are almost the same as those of Petigura et al. (2013) [5]; and (3) for cases (viii) and (x), the occurrence rates obtained by Petigura et al. (2013) [5] are larger than those of the proposed method. Because the proposed model considers the information of neighbors, a grid in a densely populated neighborhood will produce a higher occurrence rate; conversely, if the grid density is low, a lower rate will result. Furthermore, both methods agree on the occurrence rates of planets with (i) $P = 5$–$100$ d and size 8–16 $R_\oplus$; (ii) $P = 25$–$50$ d and size 1–16 $R_\oplus$; and (iii) $P = 50$–$100$ d and size 1–16 $R_\oplus$.
Furthermore, we are interested in the occurrence rate of Earth analogs hosted by GK dwarf stars, i.e., $P = 200$–$400$ d and size 1–2 $R_\oplus$. From the scatter plot shown in Figure 1, there are no planets in this grid, and there are few planets in the neighborhood of this grid. Thus, it is reasonable that the occurrence rate of this grid is very small. After completeness correction, we find the occurrence rate to be 0.18% (see case (viii) in Table 4), whereas the values obtained by Petigura et al. (2013) [5], Foreman-Mackey et al. (2014) [28], and Chen and Hung (2019) [29] are 5.7%, 1.9%, and 2.5%, respectively. The proposed method indicates that 46% of Sun-like stars have an Earth-size (1–2 $R_\oplus$) planet with $P = 5$–$100$ d. This value is higher than that of Petigura et al. (2013) [5] because the spatial model considers the information of neighbors. We further extrapolated the hot Jupiter occurrence rate (i.e., the occurrence rate for periods of 1–10 days and sizes of 8–24 $R_\oplus$) and compared it to the findings of Petigura et al. (2018) [24]. Their study reported a hot Jupiter occurrence rate of 0.57%, whereas our extrapolated estimate stands at 4.17%. Owing to its scalability, the proposed model can provide such an extrapolation with new data. To the best of our knowledge, utilizing neighboring data information for occurrence rate estimation in astronomy is a novel approach that has not been previously reported. Following the inference of Petigura et al. (2013) [5] and adopting the 46% occurrence rate, we may infer that the nearest Earth-size planets in habitable zones of Sun-like stars are expected to orbit a star further than 12 light-years from Earth.