*Article* **On Study of the Occurrence of Earth-Size Planets in Kepler Mission Using Spatial Poisson Model**

**Hong-Ding Yang 1,\*, Yun-Huan Lee <sup>2</sup> and Che-Yang Lin <sup>3</sup>**


**Abstract:** The problem of determining the occurrence rate for Earth-size planets orbiting Sun-like stars is emerging in the universe. We propose a methodology based on a spatial Poisson regression model with model parameters being inferred by the Bayesian framework to investigate this occurrence rate. We analyzed an exoplanet sample and its corresponding survey completeness data. Our results suggest that 46% of Sun-like stars have an Earth-size (i.e., 1–2 times Earth radii) planet with an orbital period of 5–100 days. Furthermore, we are also interested in the occurrence rate of Earth analogs hosted by GK dwarf stars (i.e., orbital period of 200–400 days and size 1–2 times Earth radii). After completeness correction, we obtained an occurrence rate of 0.18% based on the proposed methodology.

**Keywords:** conditional autoregressive model; Markov chain Monte Carlo; occurrence rate; spatial Poisson model

**MSC:** 62J05; 62J12; 62J20; 85-10; 85A35

#### **1. Introduction**

In spatial epidemiology, the spatial distribution of diseases is used to construct disease maps for finding the complex spatial patterns of interesting diseases. When Bayesian hierarchical models are used to investigate the disease mapping, various spatially structured random effects can be considered in models. Recently, we have been witnessing a resurgence of interest in disease mapping, and many efficient methods have been proposed in the literature (see Moraga and Lawson 2012 [1]; Duncan et al., 2017 [2]; Lawson 2018 [3]; Baer and Lawson 2019 [4]). To the best of our knowledge, the application of disease mapping concepts to explore related issues in astronomy within the context of spatial regression remains unaddressed. This knowledge gap is the driving force behind our investigation into whether spatial disease mapping techniques can be utilized to examine the occurrence of Earth-size planets in the Kepler survey. Disease mapping leverages neighboring region information for parameter estimation in epidemiology, leading to more accurate spatial predictions. In this study, we extended this approach by incorporating spatial random effects to capture the spatial correlation in the data. Interestingly, the incorporation of neighboring region information is still relatively unexplored in astronomy (e.g., Petigura et al., 2018 [5]).

To the best of our knowledge, however, how to apply the concepts of disease mapping to discuss the related issues in astronomy has not been adequately addressed under the spatial regression settings. This motivates us to explore whether the techniques of spatial disease mapping can be applied to investigate the occurrence of Earth-size planets in the Kepler survey.

The Kepler mission aims to explore the diversity of planets and planetary systems. The discovery of thousands of transiting planets and planet candidates by the Kepler mission drastically broadens our knowledge of exoplanets, especially in the category of

**Citation:** Yang, H.-D.; Lee, Y.-H.; Lin, C.-Y. On Study of the Occurrence of Earth-Size Planets in Kepler Mission Using Spatial Poisson Model. *Mathematics* **2023**, *11*, 2508. https:// doi.org/10.3390/math11112508

Academic Editors: Wen Zhang, Xiaofeng Xu, Jun Wu and Kaijian He

Received: 9 April 2023 Revised: 26 May 2023 Accepted: 29 May 2023 Published: 30 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

close-in (-1 AU) and small (-4 earth radii) planets around main-sequence dwarf stars (see Batalha 2014 [6]; Burke et al., 2014 [7]; Mullally et al., 2015 [8]). The inference of the occurrence of Earth-size planets is an interesting problem that has attracted the attention of astronomers because of the important theories regarding planet formation and evolution models (see Benz et al., 2014 [9]). Owing to the low false positive rate of the survey (see Fressin et al., 2013 [10]; Lissauer et al., 2014 [11]) while seeing different results from Santerne et al. (2016) [12] for giant-planet candidates, numerous works offered a window into the statistical studies of planet occurrence rates in terms of orbital periods and planet radius (see Dong and Zhu 2013 [13]; Fressin et al., 2013 [10]; Petigura et al., 2013 [5]; Burke 2015 [14]; Dressing and Charbonneau 2015 [15]; Silburt et al., 2015 [16]; Morton et al., 2016 [17]).

In this paper, we took the exoplanet sample and its corresponding survey completeness from Petigura et al., 2013 [5]. In the proposed methodology, we defined the planet occurrence to be based on the detection of a planet within a specified range of orbital period and orbital radius. To consider the spatial dependences of the data, we applied a spatial Poisson regression model (e.g., Besag et al., 1991 [18]; Chen and Yang 2011 [19]; Cressie 2015 [20]) to model the detection probability of an exoplanet. Further, to infer the posterior probability of detecting an exoplanet, a stochastic algorithm based on Markov chain Monte Carlo (MCMC) under the Bayesian framework was designed. Finally, the posterior inferences can simultaneously describe the number of exoplanets and the corresponding occurrence rate in the study region.

The remainder of this paper is organized as follows. In Section 2, we introduce a joint modeling methodology and present how to estimate parameters in the proposed model. Section 3 applies the proposed model to determine the occurrence rate of the Kepler planet. We conclude the paper with a discussion in Section 4.

#### **2. Methodology**

Let *D* be a bounded continuous random field in the -2, which is partitioned into *<sup>n</sup>* = *<sup>n</sup>*<sup>1</sup> × *<sup>n</sup>*<sup>2</sup> regular grids *<sup>D</sup>*1, ... , *Dn* with *<sup>D</sup>* = *n i*=1 *Di* and *Di Dj* = ∅ for *i* = *j*. Let *Yi*, *i* = 1, ... , *n*, be a random variable that counts the number of exoplanets in grid *Di*. For grid *Di*, the expected number, *E*, of exoplanets can be easily evaluated by:

$$E = \frac{1}{n} \sum\_{i=1}^{n} \mathbb{Y}\_i.$$

Motivated by the concept of a standardized mortality ratio in epidemiology (see Kelsall and Wakefield 2002 [21]; Lawson 2018 [3]), a standardized occurrence ratio of exoplanets for the grid *Di* is defined by

$$r\_i = \frac{\chi\_i}{E}, \ i = 1, \dots, n.$$

In general, one can simply use *ri* as the occurrence rate of exoplanets in grid *Di*. Here, one potential influential factor is that a large amount of gravity generally exists among planets, and the correlation of the data set among grids should be considered in estimating such an occurrence rate. Obviously, the quantity *ri* does not take into account the dependence among {*Y*1, ... ,*Yn*}. Thus, using *ri* to estimate the occurrence rate of exoplanets of the grid *Di* may yield inaccurate results. Motivated from existing works (see Kelsall and Wakefield 2002 [21]; Chen and Yang 2011 [19]; Moraga and Lawson 2012 [1]; Lawson 2018 [3]; Baer and Lawson 2019 [4]), a spatial conditional autoregressive (CAR) model (see Moraga and Lawson 2012 [1]; Cressie 2015 [20]; Lawson 2018 [3]; Baer and Lawson 2019 [4]) was applied, which was used to describe possible spatial correlations among {*Y*1,...,*Yn*}. The estimates of the occurrence rate of exoplanets in the grid *Di*, *i* = 1, ··· , *n*, were then proposed.

*2.1. Spatial Poisson Regression Model*

For *i* = 1, ... , *n*, let *Ri* be the occurrence rate of exoplanets in grid *Di*. Then, an intuitive model for *Yi* given *Ri*; *i* = 1, . . . , *n*, is a Poisson distribution as follows:

$$\mathcal{Y}\_i \mid \mathcal{R}\_i \sim \text{Poi}(\mathcal{R}\_i E). \tag{1}$$

In Equation (1), *RiE* represents the intensity rate of the Poisson process and *Ri* > 0 is the main parameter of interest in this research. In this paper, our goal was to incorporate the spatial dependence of *Y*1, ... ,*Yn* to estimate the unobserved variables *R*1, ... , *Rn*. Suppose that there are *p* grid-level covariates observed in grid *Di* denoted together with 1 for the intercept by *x<sup>i</sup>* = (1, *xi*1, ... , *xip*) . As suggested in Basag et al. (1991) [18], the occurrence rate *Ri* of interest can be modeled in the following manner:

$$\ln(R\_i) = \mathbf{x}\_i^\prime \mathcal{B} + \delta\_i; \ i = 1, \ldots, n,\tag{2}$$

where *β* = (*β*0, *β*1, ... , *βp*) is the vector of regression coefficients and *δ<sup>i</sup>* is a spatial random error process. In spatial statistics, the spatial random errors *δ* = (*δ*1, ... , *δn*) capture the spatial variation and can offer a local adjustment to the mean trend due to unobserved covariates. In general, we assume that *δ* follows a multivariate Gaussian process as follows:

$$\delta \mid \sigma^2, \phi \sim N\left(\mathbf{0}, \sigma^2 V(\phi)\right),\tag{3}$$

where the *n* × *n* matrix *V*(*φ*) is a spatial correlation matrix, *φ* is an unknown parameter, and *σ*<sup>2</sup> is a variance component. According to the CAR model, *σ*2*V*(*φ*) given in Equation (3) can be further decomposed as

$$
\sigma^2 V(\phi) = (I - \phi \mathcal{C})^{-1} M\_{\prime \prime}
$$

where *C* = (*cij*) is an *n* × *n* spatial association matrix, *I* is an identity matrix, and *M* = *σ*<sup>2</sup> *I*. Under these settings, we have the following facts: (i) (*I* − *φC*) is nonsingular; (ii) when *φ* ∈ (*φ*min, *φ*max), (*I* − *φC*)−1*M* is symmetric and positive-definite, where the upper and lower limits of *φ* are evaluated by the inverses of the smallest and the largest eigenvalues of the spatial association matrix. For the sake of simplicity, in this paper, we constructed *C* according to the rook contiguity structure; that is, the (*i*, *j*)th element of *C* is of the following form:

$$\mathcal{L}\_{ij} = \begin{cases} 1, & i \sim j; \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

Note that *i* ∼ *j* in Equation (4) represents that *Di* and *Dj* are neighbors with a common boundary.

We define *Ni* ≡ {*j* | *j* ∼ *i*} to be the neighborhood set of grid *Di* and *δ*−*<sup>i</sup>* ≡ (*δ*1, ... , *δi*<sup>−</sup>1, *δi*<sup>+</sup>1, ... , *δn*) ; then, the conditional distribution of *δ<sup>i</sup>* conditioned on *δ*−*<sup>i</sup>* is given by

$$\delta\_i \mid \sigma^2, \phi, \delta\_{-i} \sim N\left(\phi \sum\_{j \in N\_i} c\_{ij} \delta\_{j\prime} \sigma^2\right),\tag{5}$$

for *i* = 1, ... , *n*. Note that the joint distribution of *δ<sup>i</sup>* | *σ*2, *φ*, *δ*−*i*, *i* = 1, ... , *n* can be shown to be a multivariate Gaussian distribution as in Equation (3) based on the factorization theorem of Besag (1974) [22] and the properties of multivariate Gaussian distributions. Readers can better understand the correctness of Equation (5) by referring to De Oliveira (2012) [23] for a comprehensive and systematic introduction to the CAR model. It is obvious from Equation (5) that the spatial dependence is considered through the information derived from neighbors. Notice that the spatial Poisson regression model offers the advantage of incorporating information from neighboring regions to enhance parameter estimation and prediction. Additionally, it is worth noting that the consideration of data correlation in recent literature is still relatively uncommon, as observed in studies such as Petigura et al. (2018) [24].

#### *2.2. Prior Specifications and Posterior Distribution*

Using the Bayesian approach, we set mutually independent prior distributions on parameters *β*, *σ*2, and *φ* as shown in Table 1. For *β* and *σ*2, the hyper-parameters are pre-specified constants such that the corresponding priors are nearly flat. Based on the CAR model, the spatial dependence parameter *φ* must fall within (*φmin*, *φmax*) to ensure that (*I* − *φC*)−<sup>1</sup> is a positive-definite matrix. However, *φmin* can be less than zero, leading to a negative spatial correlation, which is rare in practice. Hence, we further restricted the spatial correlation parameter *φ* domain to (0, *φmax*), ensuring positive spatial correlation. This modification ensures that the model captures the desired spatial dependence structure and aligns with common practices in the field. According to the priors in Table 1, the joint prior distribution of *β*, *σ*2, and *φ*, denoted as *π*(*β*, *σ*2, *φ*), is given by

$$
\pi(\mathfrak{g}, \sigma^2, \phi) = \pi(\mathfrak{g})\pi(\sigma^2)\pi(\phi) \approx \sigma^{-2(a+1)} \exp\left(-\frac{1}{b\sigma^2}\right); \text{ } \sigma > 0, \text{ } \phi \in (0, \phi\_{\text{max}}).\tag{6}
$$

Combining Equations (1)–(3) and Equation (6), the joint posterior distribution of *σ*2, *φ*, *β*, and *δ* conditioned on observed data *Y* = (*Y*1,...,*Yn*) satisfies:

$$\begin{split} p(\boldsymbol{\sigma}^{2}, \boldsymbol{\phi}, \boldsymbol{\theta}, \boldsymbol{\delta} \mid \mathbf{Y}) &= \quad \frac{p(\boldsymbol{\sigma}^{2}, \boldsymbol{\phi}, \boldsymbol{\theta}, \boldsymbol{\delta}, \boldsymbol{Y})}{p(\boldsymbol{Y})} \\ &\propto \quad \prod\_{i=1}^{n} p(Y\_{i} \mid \boldsymbol{R}\_{i}) p(\boldsymbol{\delta} \mid \boldsymbol{\sigma}^{2}, \boldsymbol{\phi}) \pi(\boldsymbol{\theta}, \boldsymbol{\sigma}^{2}, \boldsymbol{\phi}) \\ &\propto \quad \exp\left(\sum\_{i=1}^{n} Y\_{i} (\mathbf{x}\_{i}^{\prime} \boldsymbol{\theta} + \boldsymbol{\delta}\_{i}) - E \sum\_{i=1}^{n} \exp(\mathbf{x}\_{i}^{\prime} \boldsymbol{\theta} + \boldsymbol{\delta}\_{i})\right) \\ &\propto \Big( \det\Big(\boldsymbol{\sigma}^{2} \boldsymbol{V}(\boldsymbol{\phi}) \Big) \Big)^{-1/2} \exp\Big( -\frac{1}{2\sigma^{2}} \boldsymbol{\delta}^{\prime} \boldsymbol{V}(\boldsymbol{\phi})^{-1} \boldsymbol{\delta} \Big) \\ &\propto \boldsymbol{\sigma}^{-2(a+1)} \exp\Big( -\frac{1}{b\sigma^{2}} \Big). \end{split} \tag{7}$$

Because the joint posterior distribution in Equation (7) cannot be applied directly to generate posterior samples of model parameters, an alternative method called a Markov chain Monte Carlo (MCMC) method will be introduced in the following to generate posterior samples of model parameters.

**Table 1.** Priors for model parameters *β*, *σ*2, and *φ*.


#### *2.3. Posterior Inferences of Model Parameters*

To generate posterior samples of *σ*2, *φ*, *β*, and *δ*, the conditional posterior distributions of each parameter given all of the others are needed. One can then successively sample these conditional posterior distributions and obtain Markov chains in the parameter spaces that will converge to the joint posterior distribution of Equation (7) under Tierney's conditions (1994) [25].

Next, we summarize all necessary conditional posterior distributions for *σ*2, *φ*, *β*, and *δi*, *i* = 1, . . . , *n*, based on Equations (1)–(7) as follows:

$$\begin{split} p(\sigma^{2} \mid \phi, \mathfrak{H}, \delta, Y) &\quad \propto \quad p(\delta \mid \sigma^{2}, \phi) \pi(\sigma^{2})\\ &\propto \quad \propto \quad \sigma^{-2(n/2+\varrho+1)} \exp\left(-\frac{1}{\sigma^{2}} \left(\frac{1}{b} + \frac{1}{2}\delta'(I-\mathfrak{g}\mathscr{C})\delta\right)\right)\\ p(\phi \mid \sigma^{2}, \mathfrak{H}, \delta, Y) &\propto \quad p(\delta \mid \sigma^{2}, \phi) \pi(\phi)\\ &\propto \quad \text{(det}(V(\phi)))^{-1/2} \exp\left(\frac{\oint p(\mathbf{x}\_{i} \mid \mathbf{R}\_{i}) \pi(\mathbf{g})}{2\sigma^{2}}\right)\\ p(\varPhi \mid \sigma^{2}, \phi, \delta, Y) &\propto \quad \prod\_{i=1}^{n} p(Y\_{i} \mid R\_{i}) \pi(\mathfrak{g})\\ &\propto \quad \exp\left(\sum\_{i=1}^{n} Y\_{i} \mathbf{x}\_{i}^{\prime} \mathfrak{g} - E \sum\_{i=1}^{n} \exp\left(\mathbf{x}\_{i}^{\prime} \mathfrak{g} + \delta\_{i}\right)\right)\\ p(\delta\_{i} \mid \sigma^{2}, \phi, \delta, \delta\_{-i}, Y) &\propto \quad p(Y\_{i} \mid R\_{i}) p(\delta\_{i} \mid \sigma^{2}, \phi, \delta\_{-i})\\ &\propto \quad \exp\left(Y\_{i} \delta\_{i} - E \exp\left(\mathbf{x}\_{i}^{\prime} \mathfrak{g} + \delta\_{i}\right) - \frac{1}{2\sigma^{2}} \left(\delta\_{i}^{2} - 2\delta\_{i} \Phi \sum\_{j \in N\_{i}} c\_{ij} \delta\_{j}\right)\right) \end{split}$$

We notice that *p*(*σ*<sup>2</sup> | *φ*, *β*, *δ*, *Y*) is an inverse gamma distribution; that is, *σ*<sup>2</sup> | *φ*, *β*, *δ*, *Y* ∼ *IG n*/2 + *a*,(1/*b* + *δ* (*I* − *φC*)*δ*/2)−<sup>1</sup> . Therefore, a Gibbs sampling algorithm (see Geman and Geman 1984 [26]) can be used to generate the posterior samples of *σ*2. However, *p*(*φ* | *σ*2, *β*, *δ*, *Y*), *p*(*β* | *σ*2, *φ*, *δ*, *Y*), and *p*(*δ<sup>i</sup>* | *σ*2, *φ*, *β*, *δ*−*i*, *Y*), *i* = 1, ... , *n*, are not all standard distributions; hence, a Metropolis–Hastings algorithm (see Chib and Greenberg 1995 [27]) can be applied to *φ*, *β*, and *δi*, respectively, to iteratively generate an ergodic Markov chain that yields the corresponding posterior samples. In particular, generating the posterior samples of *φ* is relatively difficult because *φ* appears in the covariance matrix *V*(*φ*). In this paper, we treated *φ* as a discrete random variable that is defined on finite grid points from 0 to *φmax*; hence, the values of matrix *V*(*φ*) on these finite grid points can be computed in advance. For each step, the posterior sample of *φ* is generated from a probability mass function, which is based on the values of (det(*V*(*φ*)))−1/2 exp *<sup>φ</sup>* <sup>2</sup>*σ*<sup>2</sup> *<sup>δ</sup> Cδ* 

evaluated on the finite grid points of *φ* ∈ (0, *φmax*).

Based on the posterior samples of *σ*2, *φ*, *β*, and *δi*, *i* = 1, ... , *n*, the inferences of model parameters and the occurrence rate of exoplanets in grid *Di*, *i* = 1, . . . , *n*, can be obtained.

#### **3. Application of the Proposed Methodology**

To model the occurrence distribution of planets as a function of the planet period and radius, Petigura et al. (2013) [5] considered transiting planets that are all hosted by GK-type stars. They defined GK-type stars as those with surface temperatures of 4100 K ≤ Teff ≤ 6100 K and gravities of 4.0 cm/s<sup>2</sup> ≤ log *g* ≤ 4.9 cm/s2. Furthermore, these planets are restricted to the brightest GK-type stars observed by *Kepler* (*Kp* = 10–15 mag). These 42,557 stars have the lowest photometric noise in the Kepler survey, thereby maximizing the detectability of Earth-size planets. In the present work, we mainly studied the occurrence rate of planets based on the catalog by Petigura et al. (2013) [5], which can compare our findings with their seminal work by adopting the same study region. Figure 1 shows the scatter plot of the data. Let *x*<sup>1</sup> be the orbital period (days), *x*<sup>2</sup> be the planet size (Earth radii), and *D* = [6.25, 400] × [1, 16] be the region of interest for this work; it is divided into the 6 × 4 grids shown in Figure 2. Let *Yi* record the number of events in grid *Di* for *i* = 1, ... , 24. Please note that the region *D* is the same as in Petigura et al. (2013) [5].

**Figure 1.** The scatterplot of exoplanets in the *x*1–*x*<sup>2</sup> space and the 24 subregions *Di*, *i* = 1, ··· , 24.

**Figure 2.** The values of *Yi* for *i* = 1, ··· , 24.

We applied the linear regression model illustrated in Equation (2) of Section 2.1 to model the occurrence rate *Ri* and considered two grid-level covariates, *xi*<sup>1</sup> and *xi*2, in the model, where *xi*<sup>1</sup> and *xi*<sup>2</sup> are, respectively, defined by the central points of the orbital period (days) and planet size (earth radii) of the grid *Di* (i.e., the central coordinate of the grid *Di*) for *i* = 1, . . . , 24. As a result, the used model, called Model 1, is given by

$$\ln(R\_i) = \beta\_0 + \beta\_1 \mathbf{x}\_{1,i} + \beta\_2 \mathbf{x}\_{2,i} + \delta\_i; \ i = 1, \ldots, 24,\tag{8}$$

where *β*0, *β*1, and *β*<sup>2</sup> are unknown regression coefficients, and *δ<sup>i</sup>* is a spatial random error process. Based on the Bayesian approach in Section 2.3, prior distributions of parameters in *θ* = (*σ*2, *φ*, *β*0, *β*1, *β*2) are, respectively, set as follows:

$$\begin{array}{rcl} \sigma^2 & \sim & IG(3, 0.001) \\ \Phi & \sim & \mathcal{U}(0, \phi\_{\text{max}}) \\ \beta\_0 & \sim & \mathcal{U}(-20, 20) \\ \beta\_1 & \sim & \mathcal{U}(-20, 20) \\ \beta\_2 & \sim & \mathcal{U}(-20, 20) \end{array}$$

Note that *φ*max is 0.29 because the smallest eigenvalue of *C* is 3.42. Since we lacked additional information about the central tendencies of the parameters, we selected the hyper-parameter values for the prior distributions based on the preference for larger variances. Although larger variances may result in a slower convergence, the MCMC algorithm can still converge. Additionally, the larger variances allow for more flexibility and variation in the MCMC updates, enhancing the parameter space exploration.

Next, we first examined the hypothesized model (i.e., Equations (1)–(3)) that is suitable for analyzing the occurrence rate of Earth-size planets in the Kepler survey. In this paper, we conducted a simulation study based on the Pearson chi-squared test to illustrate the goodness of fit of the used model (i.e., Equation (8)); Model 1). In addition, as listed in the bottom of Table 2, a model (i.e., Model 2) with only the regressors and a model (i.e., Model 3) with only the spatial random error process were also used for comparison. Let *<sup>Y</sup>*∗(*t*) *<sup>i</sup>* ; *i* = 1, ... , 24, be independently generated from Poi *<sup>R</sup>*∗(*t*) *<sup>i</sup> E* , with *E* being the expected number of exoplanets evaluated according to the observed data *<sup>Y</sup>*, where *<sup>R</sup>*∗(*t*) *<sup>i</sup>* is an estimate of the occurrence rate *Ri* based on the posterior medians of *θ* under the used model (i.e., Model 1, Model 2, or Model 3) and *t* = 1, ... , 5, represents the *t*-th simulation. For each simulation replicate, the goodness-of-fit test statistic is computed in the following manner:

$$\chi^{2(t)} \equiv \sum\_{i=1}^{24} \frac{\left(Y\_i^{\*(t)} - R\_i^{\*(t)} E^{\*(t)}\right)^2}{R\_i^{\*(t)} E^{\*(t)}}$$

where *E*∗(*t*) is the expected number of exoplanets evaluated based on the *t*-th simulated data *<sup>Y</sup>*∗(*t*) *<sup>i</sup>* ; *i* = 1, ... , 24. The simulation results are displayed in Table 2. First, we notice that Model 2 with only the regressors has a large *χ*2(*t*) value for each simulation replicate. This indicates that Model 2 without considering the spatial correlation of the data is very inappropriate. Comparing the proposed model (i.e., Model 1) versus Model 3, they have relatively small *χ*2(*t*) values and hence Model 1 and Model 3 are both appropriate for the analysis of the occurrence rate of Earth-size planets. Overall, the *χ*2(*t*) values of Model 1 are slightly smaller than those of Model 3, which further suggests to us to use Model 1 (i.e., Equation (8)) to analyze the data set. Even if all the estimated regression coefficients are not significant (see Table 3), in general, the regressors should slightly contribute to evaluating the occurrence rate. Moreover, Figure 3 shows 95% credible intervals of *Yi*; *i* = 1, ... , 24, for Model 1, Model 2, and Model 3. The results are in accord with Table 2; that is, Figure 3 reveals that Model 2 performs poorly and that Model 1 and Model 3

are fairly comparable. On the other hand, we notice that the data may contain potential biases that may arise from observational precision that results in inaccurate estimates of the underlying occurrence rates. In our proposed methodology, the random effects describe the spatial correlation in the data and are a suitable remedy for missing explanatory variables, addressing the limitations caused by uncollected vital variables. The simulation results indicate the effectiveness of our approach in mitigating potential biases and enhancing the model's explanatory power. Based on the results in Table 2 and Figure 3, Model 1 in Equation (8) is acceptable and hence we used it to analyze the occurrence rate of Earth-size planets in the next content.

**Figure 3.** The 95% credible intervals for Models 1–3. Model 1: a model with the regressors and the spatial component; Model 2: a model with only the regressors; Model 3: a model with only the spatial component.

**Table 2.** The expected numbers and the values of chi-squared test statistics for Model 1, Model 2, and Model 3 based on the observed data *Y* and the simulated data *Y*∗(*t*) = *<sup>Y</sup>*∗(*t*) <sup>1</sup> , ... ,*Y*∗(*t*) 24 , where *t* represents the *t*-th simulation with *t* = 1, . . . , 5.


Note: Model 1: ln(*Ri*) = *β*<sup>0</sup> + *β*1*x*1,*<sup>i</sup>* + *β*2*x*2,*<sup>i</sup>* + *δi*; Model 2: ln(*Ri*) = *β*<sup>0</sup> + *β*1*x*1,*<sup>i</sup>* + *β*2*x*2,*i*; Model 3: ln(*Ri*) = *β*<sup>0</sup> + *δi*; *i* = 1, . . . , 24.


**Table 3.** Summary of posterior inferences for model parameters.

Note: S.D. represents the standard deviations for each model parameter.

We implemented 200,000 iterations for the posterior calculations to obtain a convergent sequence and approximately independent posterior samples. The first 100,000 iterations were discarded as burn-in. Then, one has an approximately independent joint posterior sample size of 100,000 by subsampling every 10th scan. The execution time for 200,000 MCMC iterations was 56.26471 s on an i7-12700 2.10 GHz PC. The system environment was R language version 4.2.3 lined to Intel's Math Kernal Library (MKL) on Windows 11. The core codes of the MCMC process were implemented using custom-written code without relying on external packages.The trace plot in Figure 4 displays the logarithm values of Equation (7) for the 200,000 MCMC iterations. Given that the proposed model incorporates multiple parameters and random effect terms, we assessed the overall convergence of the MCMC process using these logarithm values. Notably, the trace plot reveals that it belongs to an interval within the 200,000 iterations, implying that the MCMC process has reached convergence. Table 3 presents posterior inferences based on 10,000 posterior samples for model parameters. Furthermore, the posterior means of *Ri* = exp(*x i β* + *δi*) for *i* = 1, ... , 24, are shown in Figure 5. Figure 6 displays the results with estimated occurrence rates *Pi*, *i* = 1, . . . , 24 in each grid.

**Figure 4.** The logarithm trace plot of Equation (7) for the 200,000 MCMC iterations.

**Figure 5.** The posterior means of *Ri* = exp(*x i β* + *δi*) for *i* = 1, ··· , 24.

**Figure 6.** The estimated occurrence rates *Pi* = *Ri*/ ∑<sup>24</sup> *<sup>j</sup>*=<sup>1</sup> *Rj* for *i* = 1, ··· , 24 without using completeness.

Next, we considered the variable detection efficiency (or completeness) in order to identify realistic occurrence rates. After obtaining the estimated occurrence rates in each cell shown in Figure 6, we further considered the survey completeness in order to identify realistic occurrence rates. The values of completeness function used here were constructed by Foreman-Mackey et al. (2014) [28]. We can thus obtain the true occurrence rates *Ptr i* , *i* = 1, ... , 24 in each cell, as shown in Figure 7. Because the method proposed in this paper is presented as a totally different approach to that of Petigura et al. (2013) [5], we need to make a comparison with Petigura et al.'s method (2013) [5]. We computed realistic occurrence rates with different values of orbital period (*P*) and planet radius (*R*) and the corresponding realistic occurrence rates, as shown in Table 4. Note that case (i) in Table 4 corresponds to Jupiter-size planets.

From Table 4, we find that (1) for cases (ii), (iii), (iv), (vii), and (ix), the occurrence rates obtained from the proposed method are larger than those of Petigura et al. (2013) [5] by approximately a factor of two; (2) for cases (i), (v), and (vi), the occurrence rates obtained the proposed method are almost the same as Petigura et al.'s (2013) [5]; and (3) for cases (viii) and (x), the occurrence rates obtained by Petigura et al. (2013) [5] are larger than the proposed method herein. Because the proposed model considers the information of neighbors, the grid density is high, which will produce higher occurrence rates. On the contrary, if the grid density is low, lower rates will occur. Furthermore, both methods confirm the occurrence rates of planets with (i) *P* = 5–100 d and size 8–16 *R*⊕; (ii) *P* = 25–50 d and size 1–16 *R*⊕; and (iii) *P* = 50–100 d and size 1–16 *R*⊕.

**Figure 7.** The true occurrence rates *Ptr <sup>i</sup>* for *i* = 1, ··· , 24.

Furthermore, we are interested in the occurrence rate of Earth analogs hosted by GK dwarf stars, i.e., *P* = 200–400 d and size 1–2 *R*⊕. From the scatter plot shown in Figure 1, there are no planets in this grid, and there are few planets in the neighborhood of this grid. Thus, it is reasonable that the occurrence rate of this grid is very small. After completeness correction, we find the occurrence rate to be 0.18% (please see case (viii) in Table 4), whereas the values obtained by Petigura et al. (2013) [5], Foreman-Mackey et al. (2014) [28], and Chen and Hung (2019) [29] are 5.7%, 1.9%, and 2.5%, respectively. The proposed method indicates that 46% of Sun-like stars have an Earth-size (1–2 *R*⊕) planet with *P* = 5–100 d. This value is higher than Petigura et al.'s (2013) [5] due to the spatial model considering the information of neighbors. We further conducted an additional extrapolation of the hot Jupiter occurrence rate (i.e., the occurrence rate of 1–10 days and 8–24 *R*⊕) and compared it to the findings of Petigura et al. (2018) [24]. Their study reported a hot Jupiter occurrence rate of 0.57%, whereas our extrapolated estimate stands at 4.17%. According to the scalability of our proposed model, it provides an extrapolation with new data. To the best of our knowledge, utilizing neighboring data information for occurrence rate estimation in astronomy is a novel approach that has not been previously observed. According to the inference of Petigura et al. (2013) [5], we may imply that the nearest Earth-size planets in habitable zones of Sun-like stars are expected to orbit a star further than 12 light-years from Earth because we adopted the 46% occurrence rate.

**Table 4.** Comparison of realistic occurrence rates with different values of orbital period (*P*) and planet radius (*R*).


\* represents the occurrence rates obtained by the proposed method are larger than Petigura et al. (2013) by approximately a factor of 2.

#### **4. Discussion**

Motivated by the study of Petigura et al. (2013) [5] on the prevalence of Earth-size planets orbiting Sun-like stars, we adopted a joint modeling approach to investigate the occurrence rates of planets around GK dwarfs. The inferred occurrence rate of Earth analogs around GK dwarfs increases to 46%. Compared with that of Petigura et al. (2013) [5], our approach increases the occurrence rate of Earth analogs by approximately a factor of two. Nevertheless, our model suggests that the occurrence rate for Kepler planets with radii between 1 and 2 earth radii and orbital periods between 50 and 400 days is 0.1451. Similar to most of the results in the literature, our occurrence rate of 0.1451 is also larger than the results computed by Petigura et al. (2013) [5], Dong and Zhu (2013) [13], and Foreman-MacKey et al. (2014) [28]. We cautiously contend that our proposed model exhibits a higher occurrence rate compared to other methods, attributed to the incorporation of spatial random effects. These effects effectively capture the spatial correlation in the data and moderately compensate for any missing explanatory variables. Applying our analysis to the entire Kepler planet sample (*Q*1 − *Q*16) will be left to future work. On the other hand, the current approach does not consider the influence of time on occurrence rates. All the data are treated as being from the same time point. Given the flexible nature of the proposed model, we plan to incorporate the effect of time using a Poisson process in future expansions. The expanded model will allow for dynamic predictions of variables over time.

Taking into account the survey incompleteness, we confirm the study of Petigura et al. (2013) [5]: the occurrence rates of planets with (i) *P* = 5–100 d and size 8–16 *R*⊕; (ii) *P* = 25–50 d and size 1–16 *R*⊕; and (iii) *P* = 50–100 d and size 1–16 *R*⊕. The inferred occurrence rates of Kepler planets suffer severely from systematic uncertainties (see Burke 2015 [14]). Follow-up spectroscopic observations of host stars will refine some of these

uncertainties, providing a planet sample with better stellar parameters and pipeline completeness for our model and others to revise the proposed model, and thus present better constraining theories of planet formation and evolution.

**Author Contributions:** Methodology, H.-D.Y.; software, H.-D.Y.; formal analysis, Y.-H.L. and C.-Y.L.; resources, H.-D.Y.; writing—original draft, H.-D.Y., Y.-H.L. and C.-Y.L.; writing—review and editing, Y.-H.L. and C.-Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research was funded by the National Science and Technology Council, R.O.C., grant number MOST 109-2118-M-390-001, and MOST 111-2118-M-390-003.

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
