*Article* **Deep Q-Learning-Based Transmission Power Control of a High Altitude Platform Station with Spectrum Sharing**

**Seongjun Jo <sup>1</sup> , Wooyeol Yang <sup>1</sup> , Haing Kun Choi <sup>2</sup> , Eonsu Noh <sup>3</sup> , Han-Shin Jo 1,\* and Jaedon Park 3,\***


**\*** Correspondence: hsjo@hanbat.ac.kr (H.-S.J.); jaedon2@add.re.kr (J.P.)

**Abstract:** A High Altitude Platform Station (HAPS) can facilitate high-speed data communication over wide areas using high-power line-of-sight communication; however, it can significantly interfere with existing systems. Under spectrum sharing with existing systems, the HAPS transmission power must be adjusted to satisfy the interference requirement for incumbent protection. However, excessive transmission power reduction can lead to severe degradation of the HAPS coverage. To solve this problem, we propose a multi-agent Deep Q-learning (DQL)-based transmission power control algorithm to minimize the outage probability of the HAPS downlink while satisfying the interference requirement of an interfered system. In addition, a double DQL (DDQL) is developed to prevent the potential risk of action-value overestimation from the DQL. With a proper state, reward, and training process, all agents cooperatively learn a power control policy for achieving a near-optimal solution. The proposed DQL power control algorithm performs equal or close to the optimal exhaustive search algorithm for varying positions of the interfered system. The proposed DQL and DDQL power control yield the same performance, which indicates that action-value overestimation does not adversely affect the quality of the learned policy.

**Keywords:** Deep Q-learning (DQL); Double Deep Q-learning (DDQL); dynamic spectrum sharing; High Altitude Platform Station (HAPS); cellular communications; power control; interference management

#### **1. Introduction**

A High Altitude Platform Station (HAPS) is a network node operating in the stratosphere at an altitude of approximately 20 km. The International Telecommunication Union (ITU) defines a HAPS in Article 1.66A as "A station on an object at an altitude of 20 to 50 km and a specified, nominal, fixed point relative to the Earth". Various studies have been performed on HAPS in recent years, and the commercial applications of HAPS have significantly increased [1]. In addition, the HAPS has potential as a significant component of wireless network architectures [2]. It is also an essential component of next-generation wireless networks, with considerable potential as a wireless access platform for future wireless communication systems [3–5].

Because the HAPS is located at high altitudes ranging from 20 to 50 km, the HAPS-to-ground propagation generally experiences lower path loss and a higher line-of-sight probability than typical ground-to-ground propagation. Thus, the HAPS can provide a high data rate over wide coverage; however, it is likely to interfere with various other terrestrial services, e.g., fixed, mobile, and radiolocation. The World Radiocommunication Conference 2019 (WRC-19) adopted the use of a HAPS as an IMT Base Station (HIBS) in the frequency bands below 2.7 GHz previously identified for IMT through Resolution 247 [6], which addresses the potential interference of HAPS with an existing service. In such a situation,

**Citation:** Jo, S.; Yang, W.; Choi, H.K.; Noh, E.; Jo, H.-S.; Park, J. Deep Q-Learning-Based Transmission Power Control of a High Altitude Platform Station with Spectrum Sharing. *Sensors* **2022**, *22*, 1630. https://doi.org/10.3390/s22041630

Academic Editor: Margot Deruyck

Received: 12 January 2022 Accepted: 18 February 2022 Published: 19 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

if the existing service is not safe from HAPS interference, the two systems cannot coexist. Therefore, the HAPS transmitter is requested to reduce its transmission power to satisfy the interference-to-noise ratio (*INR*) requirement for protecting the receiver of the existing service. However, if the HAPS transmission power is excessively reduced, the signal-to-interference-plus-noise ratio (SINR) of the HAPS downlink decreases; thus, the outage probability may exceed the desired level. Herein, a HAPS transmission power control algorithm is proposed that aims to minimize the outage probability of the HAPS downlink while satisfying the *INR* requirement for protecting incumbents.

#### *1.1. Related Works*

Studies have been performed on improving the performance of HAPS. In [7], resource allocation for an Orthogonal Frequency Division Multiple Access (OFDMA)-based HAPS system that uses multicasting in the downlink to maximize the number of user terminals by maximizing the radio resources was studied. The authors of [8] proposed a wireless channel allocation algorithm for a HAPS 5G massive multiple-input multiple-output (MIMO) communication system based on reinforcement learning. Combining Q-learning and backpropagation neural networks allows the algorithm to learn intelligently for varying channel load and block conditions. In [9], a criterion for determining the minimum distance in a mobile user access system was derived, and a channel allocation approach based on predicted changes in the number of users and the call volume was proposed.

Additionally, spectrum sharing studies on HAPS have been performed. In [10], a spectrum sharing study was conducted to share a fixed service using a HAPS with other services in the 31/28-GHz band. Interference mitigation techniques were introduced, e.g., increasing the minimum operational elevation angle or improving the antenna radiation pattern to facilitate sharing with other services. In addition, the possibility of dynamic channel allocation was analyzed. In [11], sharing between a HAPS and a fixed service in the 5.8-GHz band was investigated using a coexistence methodology based on a spectrum emission mask.

In contrast to previous studies in which HAPS communication improvement and spectrum sharing were dealt with separately, in the present study, a combination of spectrum sharing with other systems and HAPS downlink coverage improvement is considered. In this regard, this study is more advanced than previous HAPS-related studies.

Deep Q-learning (DQL) is a reinforcement learning algorithm that applies deep neural networks to reinforcement learning to solve complex problems in the real world. DQL is widely used in various fields, including UAV, drone, and HAPS systems. In [12], the optimal UAV-BS trajectory was presented using a DQL for optimal placement of UAVs, and the author of [13] used a DQL to determine the optimal link between two UAV nodes. In [14], a DQL is used to find the optimal flight parameters for the collision-free trajectory of the UAV. In [15], two-hop communication was considered to optimize the drone base station trajectory and improve network performance, and a DQL was used to solve the joint two-hop communication scenario. In [16], a DQL was used for multiple-HAPS coordination for communications area coverage. Double Deep Q-learning (DDQL) is an algorithm developed to prevent the action-value overestimation of DQL and shows better performance than the DQL in various fields [17].

#### *1.2. Contributions*

The contributions of the present study are as follows. (1) For the first time, a multi-agent DQL was used to improve the HAPS outage performance and solve the problem of spectrum sharing with existing services. (2) We defined the power control optimization problem to minimize the outage probability of the HAPS downlink under the interference constraint for protecting the existing system. The state and reward for the training agent were designed to consider the objective function and constraints of the optimization problem. (3) Because the HAPS has a multicell structure, the number of power combinations increases exponentially as the number of cells (*Ncell*) and power levels increase linearly. Thus, the optimal exhaustive search method requires an impractically long computation time to solve the multicell power optimization problem. The proposed DQL algorithm performs comparably to an optimal exhaustive search with a feasible computation time. (4) Even for varying positions of the interfered system, the proposed DQL produces a proper power control policy, maintaining stable performance. (5) Comparing the proposed DQL algorithm with the DDQL algorithm shows no performance degradation due to overestimation in the proposed DQL.

The remainder of this paper is organized as follows.



Section 2 presents the system model, including the system deployment model, HAPS model, interfered system model, and path loss model. In Section 3, the downlink SINR and *INR* are calculated. In Section 4, a DQL-based HAPS power control algorithm is proposed. Section 5 presents the simulation results, and Section 6 concludes the paper.

#### **2. System Model**

#### *2.1. System Deployment Model*

HAPS communication networks are assumed to consist of a single HAPS, multiple ground user equipment (*UE*) devices (referred to as *UE*s hereinafter), and a ground interfered receiver. The HAPS, *UE*s, and interfered receiver are distributed in the three-dimensional Cartesian coordinate system, as shown in Figure 1. The coordinates of the HAPS antenna and the interfered receiver antenna are (0, 0, *hHAPS*) and (X, Y, *hV*), respectively. The *NUE* *UE* devices with an antenna height of *hUE* are uniformly distributed within the circular HAPS service area.
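As a concrete illustration, the uniform drop of *UE*s over the circular service area can be sketched as follows. The function name, the radius argument, and the 1.5 m antenna-height default are illustrative assumptions, not values taken from the paper; the square-root trick is the standard way to keep the point density uniform in area.

```python
import math
import random

def drop_ues(num_ue: int, radius_km: float, h_ue_km: float = 0.0015):
    """Uniformly distribute UEs over the circular HAPS service area.

    Sampling r = R*sqrt(u) keeps the density uniform in area; plain
    r = R*u would cluster UEs near the center.
    """
    ues = []
    for _ in range(num_ue):
        r = radius_km * math.sqrt(random.random())
        phi = 2.0 * math.pi * random.random()
        ues.append((r * math.cos(phi), r * math.sin(phi), h_ue_km))
    return ues
```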

**Figure 1.** System deployment model.

#### *2.2. HAPS Model*

We modeled the HAPS cell deployment and system parameters with reference to the working document for a HAPS coexistence study performed in preparation for WRC-23 [18]. As shown in Figure 2, a single HAPS serves multiple cells that consist of one 1st layer cell denoted as *Cell*\_1 and six 2nd layer cells denoted as *Cell*\_2 to *Cell*\_7. The six cells of the 2nd layer are arranged at intervals of 60° in the horizontal direction. Figure 3 presents a typical HAPS antenna design for seven-cell structures [4], where seven phased-array antennas conduct beamforming toward the ground to form seven cells, as shown in Figure 2. The 1st layer cell has an antenna tilt of 90°, i.e., perpendicular to the ground; the 2nd layer cell has an antenna tilt of 23°.

The antenna pattern of the HAPS was designed using the antenna gain formula presented in Recommendation ITU-R M.2101 [19]. The transmitting antenna gain is calculated as the sum of the gain of a single element and the beamforming gain of a multi-antenna array. The single-element antenna gain is determined by the azimuth angle (*φ*) and the elevation angle (*θ*) between the transmitter and receiver and is calculated as follows:

$$A\_E(\phi, \theta) = G\_{E,max} - \min \left\{ -\left[A\_{E,H}(\phi) + A\_{E,V}(\theta)\right], A\_m \right\}, \tag{1}$$

where *GE*,*max* represents the maximum antenna gain of a single element, *AE*,*H*(*φ*) represents the horizontal radiation pattern calculated using Equation (2), and *AE*,*v*(*θ*) represents the vertical radiation pattern calculated using Equation (3).

$$A\_{E,H}(\phi) = -\min\left[12\left(\frac{\phi}{\phi\_{3dB}}\right)^2, A\_m\right] \tag{2}$$

Here, *φ*3dB represents the horizontal 3 dB beamwidth of a single element, and *A<sup>m</sup>* represents the front-to-back ratio.
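A minimal sketch of the single-element pattern combines Equation (1) with the horizontal pattern of Equation (2) and the vertical pattern of Equation (3) given below. The default gain, beamwidth, and attenuation values here are typical ITU-R M.2101 element parameters, used only for illustration; the paper's actual values are not restated.

```python
import math

def element_gain_db(phi_deg, theta_deg, g_e_max=5.0, phi_3db=65.0,
                    theta_3db=65.0, a_m=30.0, sla_v=30.0):
    """Single-element gain of Equations (1)-(3); angles in degrees.

    phi is measured from boresight in azimuth, theta from the zenith
    (so theta = 90 deg is boresight in elevation).
    """
    a_eh = -min(12.0 * (phi_deg / phi_3db) ** 2, a_m)                 # Eq. (2)
    a_ev = -min(12.0 * ((theta_deg - 90.0) / theta_3db) ** 2, sla_v)  # Eq. (3)
    return g_e_max - min(-(a_eh + a_ev), a_m)                         # Eq. (1)
```

At boresight (*φ* = 0°, *θ* = 90°) both attenuation terms vanish and the gain equals *GE*,*max*.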

**Figure 2.** HAPS seven-cell layout.

**Figure 3.** Typical antenna structure for multi-cell HAPS communication.


$$A\_{E,V}(\theta) = -\min\left[12\left(\frac{\theta - 90}{\theta\_{\text{3dB}}}\right)^2, SLA\_v\right] \tag{3}$$

Here, *θ*3dB represents the vertical 3 dB beamwidth of a single element, and *SLA<sup>v</sup>* represents the vertical sidelobe attenuation limit.

The transmitting antenna gain of the HAPS is calculated using the antenna arrangement and spacing, as well as the target beamforming direction. The gain for beam *i* is calculated as follows:

$$A\_{A,Beam\_i}(\theta, \phi) = A\_E(\theta, \phi) + 10 \log\_{10} \left( \left| \sum\_{m=1}^{N\_H} \sum\_{n=1}^{N\_V} w\_{i,n,m} \cdot v\_{n,m} \right|^2 \right), \tag{4}$$

where *N<sup>H</sup>* and *N<sup>V</sup>* represent the number of antennas in the horizontal and vertical directions, respectively. *vn*,*<sup>m</sup>* is the superposition vector that overlaps the beams of the antenna elements, which is calculated using Equation (5), and *wi*,*n*,*<sup>m</sup>* is the weight that directs the antenna element in the beamforming direction, which is calculated using Equation (6).

$$v\_{n,m} = \exp\left(\sqrt{-1} \cdot 2\pi \left( (n-1) \cdot \frac{d\_V}{\lambda} \cdot \cos(\theta) + (m-1) \cdot \frac{d\_H}{\lambda} \cdot \sin(\theta) \cdot \sin(\phi) \right) \right); \quad n = 1, 2, \dots, N\_V; \; m = 1, 2, \dots, N\_H \tag{5}$$

Here, *d<sup>H</sup>* and *d<sup>V</sup>* represent the intervals between the horizontal and vertical antenna arrays, respectively, and *λ* represents the wavelength.

$$w\_{i,n,m} = \frac{1}{\sqrt{N\_H N\_V}} \cdot \exp\left(\sqrt{-1} \cdot 2\pi \left( (n-1) \cdot \frac{d\_V}{\lambda} \cdot \sin(\theta\_{i,etilt}) - (m-1) \cdot \frac{d\_H}{\lambda} \cdot \cos(\theta\_{i,etilt}) \cdot \sin(\phi\_{i,escan}) \right) \right) \tag{6}$$

Here, *φi*,*escan* and *θi*,*etilt* represent the *φ* and *θ* of the main beam direction, respectively. The 1st layer cell of the HAPS uses a 2 × 2 antenna array, and the 2nd layer cell uses a 4 × 2 antenna array. Figure 4 shows the antenna pattern of the 1st layer cell, and Figure 5 shows the antenna pattern of the 2nd layer cell.
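The beamforming computation of Equations (4)-(6) can be sketched as follows. Half-wavelength element spacing and a 0 dB element gain are simplifying assumptions made here for illustration; in the paper, the element gain of Equation (1) would be substituted for `element_db`.

```python
import cmath
import math

def array_gain_db(theta_deg, phi_deg, n_h, n_v, etilt_deg, escan_deg,
                  d_over_lambda=0.5, element_db=0.0):
    """Composite beam gain of Equations (4)-(6) for one beam.

    theta/phi give the direction toward the receiver; etilt/escan give
    the beam's main-lobe direction.
    """
    th, ph = math.radians(theta_deg), math.radians(phi_deg)
    tt, ss = math.radians(etilt_deg), math.radians(escan_deg)
    total = 0.0 + 0.0j
    for n in range(1, n_v + 1):
        for m in range(1, n_h + 1):
            # Eq. (5): superposition vector of element (n, m)
            v = cmath.exp(1j * 2 * math.pi * (
                (n - 1) * d_over_lambda * math.cos(th)
                + (m - 1) * d_over_lambda * math.sin(th) * math.sin(ph)))
            # Eq. (6): beamforming weight steering toward (etilt, escan)
            w = (1.0 / math.sqrt(n_h * n_v)) * cmath.exp(1j * 2 * math.pi * (
                (n - 1) * d_over_lambda * math.sin(tt)
                - (m - 1) * d_over_lambda * math.cos(tt) * math.sin(ss)))
            total += w * v
    return element_db + 10.0 * math.log10(abs(total) ** 2)   # Eq. (4)
```

For a 2 × 2 array steered at its own boresight, the beamforming gain reaches 10 log10(*NHN<sup>V</sup>*) ≈ 6 dB, as expected for coherent combining.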

**Figure 4.** 1st layer cell antenna pattern.

**Figure 5.** 2nd layer cell antenna pattern.


#### *2.3. Interfered System Model*

Various interfered systems, e.g., fixed, mobile, and radiolocation services, can be considered for the interference scenario involving a HAPS. We adopted a ground IMT base station (BS) for the interfered system, referring to the potential interference scenario [6]. The antenna pattern of the interfered system was applied by referring to Recommendation ITU-R F.1336 [20]. The receiving antenna gain is calculated as follows:

$$G(\phi, \theta) = G\_0 + G\_{hr}(x\_h) + R \cdot G\_{vr}(x\_v), \tag{7}$$

where *G*<sup>0</sup> represents the maximum gain in the azimuth plane; *Ghr*(*xh*) represents the relative reference antenna gain in the azimuth plane in the normalized direction of (*x<sup>h</sup>*, 0), which is calculated using Equation (8); and *Gvr*(*xv*) represents the relative reference antenna gain in the elevation plane in the normalized direction of (0, *xv*), which is calculated using Equation (9). *R* represents the horizontal gain compression ratio when the azimuth angle is shifted from 0° to *φ*, which is calculated using Equation (10).


$$\begin{array}{ll} G\_{hr}(x\_h) = -12x\_h^{2} & \text{for } x\_h \le 0.5\\ G\_{hr}(x\_h) = -12x\_h^{(2-k\_h)} - \lambda\_{kh} & \text{for } 0.5 < x\_h \\ G\_{hr}(x\_h) \ge G\_{180} \end{array} \tag{8}$$

$$\begin{array}{ll} G\_{vr}(x\_v) = -12x\_v^{2} & \text{for } x\_v < x\_k \\ G\_{vr}(x\_v) = -15 + 10\log(x\_v^{-1.5} + k\_v) & \text{for } x\_k \le x\_v < 4 \\ G\_{vr}(x\_v) = -\lambda\_{kv} - 3 - C\log(x\_v) & \text{for } 4 \le x\_v < 90/\theta\_3 \\ G\_{vr}(x\_v) = G\_{180} & \text{for } x\_v \ge 90/\theta\_3 \end{array} \tag{9}$$


$$R = \frac{G\_{hr}(x\_h) - G\_{hr}(180^{\circ}/\phi\_3)}{G\_{hr}(0) - G\_{hr}(180^{\circ}/\phi\_3)} \tag{10}$$


Here, *x<sup>h</sup>* and *λkh* are given by Equations (11) and (12), respectively; *φ*<sup>3</sup> represents the 3 dB beamwidth in the azimuth plane; and *k<sup>h</sup>* is an azimuth pattern adjustment factor based on the leaked power. The relative minimum gain *G*<sup>180</sup> was calculated using Equation (13).

$$x\_h = |\phi|/\phi\_3 \tag{11}$$


$$
\lambda\_{\rm kh} = 3 \left( 1 - 0.5^{-k\_h} \right) \tag{12}
$$


$$G\_{180} = -15 + 10\log(1 + 8k\_p) - 15\log\left(\frac{180^\circ}{\theta\_3}\right) \tag{13}$$

Returning to Equation (9), *x<sup>v</sup>* is given by Equation (14), and the 3 dB beamwidth in the elevation plane *θ*<sup>3</sup> is calculated using Equation (15), where *G*<sup>0</sup> represents the maximum gain in the azimuth plane. In addition, *x<sup>k</sup>* is calculated using Equation (16), where *k<sup>v</sup>* is an elevation pattern adjustment factor based on the leaked power. *λkv* was calculated using Equation (17), and the attenuation inclination factor *C* was calculated using Equation (18). Figure 6 shows the antenna pattern of the interfered system calculated using Equation (7), which is the pattern for a typical terrestrial BS with a broad beamwidth in the azimuth plane but a narrow beamwidth in the elevation plane.


$$x\_v = |\theta|/\theta\_3 \tag{14}$$

$$\theta\_3 = 107.6 \times 10^{-0.1G\_0} \tag{15}$$

$$\mathbf{x}\_k = \sqrt{1.33 - 0.33k\_v} \tag{16}$$


$$
\lambda\_{\rm kv} = 12 - \mathcal{C} \log(4) - 10 \log \left( 4^{-1.5} + k\_v \right) \tag{17}
$$

$$C = \frac{10\log\left(\frac{\left(\frac{180^{\circ}}{\theta\_3}\right)^{1.5}\cdot\left(4^{-1.5} + k\_v\right)}{1 + 8k\_p}\right)}{\log\left(\frac{22.5^{\circ}}{\theta\_3}\right)}\tag{18}$$
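Equations (7)-(18) chain together as follows. This sketch assumes base-10 logarithms throughout and uses illustrative sector-antenna values for *G*0, *φ*3, and the pattern adjustment factors; the paper's actual parameter values are not restated here.

```python
import math

def f1336_gain_db(phi_deg, theta_deg, g0=18.0, phi3=65.0,
                  k_h=0.7, k_v=0.7, k_p=0.7):
    """Receive antenna gain of Equations (7)-(18); angles in degrees."""
    theta3 = 107.6 * 10 ** (-0.1 * g0)                      # Eq. (15)
    x_h = abs(phi_deg) / phi3                               # Eq. (11)
    x_v = abs(theta_deg) / theta3                           # Eq. (14)
    x_k = math.sqrt(1.33 - 0.33 * k_v)                      # Eq. (16)
    lam_kh = 3.0 * (1.0 - 0.5 ** (-k_h))                    # Eq. (12)
    g180 = (-15.0 + 10 * math.log10(1 + 8 * k_p)
            - 15 * math.log10(180.0 / theta3))              # Eq. (13)
    c = (10 * math.log10((180.0 / theta3) ** 1.5 * (4 ** -1.5 + k_v)
                         / (1 + 8 * k_p))
         / math.log10(22.5 / theta3))                       # Eq. (18)
    lam_kv = (12 - c * math.log10(4.0)
              - 10 * math.log10(4 ** -1.5 + k_v))           # Eq. (17)

    def g_hr(x):                                            # Eq. (8)
        g = -12.0 * x ** 2 if x <= 0.5 else -12.0 * x ** (2 - k_h) - lam_kh
        return max(g, g180)

    def g_vr(x):                                            # Eq. (9)
        if x < x_k:
            return -12.0 * x ** 2
        if x < 4:
            return -15.0 + 10 * math.log10(x ** -1.5 + k_v)
        if x < 90.0 / theta3:
            return -lam_kv - 3.0 - c * math.log10(x)
        return g180

    # Eq. (10): horizontal gain compression ratio
    r = (g_hr(x_h) - g_hr(180.0 / phi3)) / (g_hr(0.0) - g_hr(180.0 / phi3))
    return g0 + g_hr(x_h) + r * g_vr(x_v)                   # Eq. (7)
```

At boresight the relative patterns vanish and the gain equals *G*0; off-boresight in either plane, the gain rolls off as Figure 6 illustrates.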

**Figure 6.** Interfered system antenna pattern.

#### *2.4. Path Loss Model*

The path loss model of Recommendation ITU-R P.619 [21], as applied in the working document for the HAPS coexistence study performed in preparation for WRC-23 [22], was used. The total path loss that occurs when the HAPS signal reaches the *UE* and the IMT BS is expressed as follows:

$$L\_p = FSL + A\_{xp} + A\_g + A\_{bs}, \tag{19}$$

where *FSL* represents the free-space path loss calculated using Equation (20), which occurs in a straight path from a transmitting antenna to a receiving antenna in a vacuum, and *Axp* is assumed to be 3 dB for depolarization attenuation. *A<sup>g</sup>* represents the attenuation due to atmospheric gases, and *Abs* represents the loss due to the spreading of the antenna beam as it traverses the atmosphere. *A<sup>g</sup>* and *Abs* were calculated using the formulae in P.619.

$$FSL = 92.45 + 20\log(f \cdot d) \tag{20}$$

Here, *f* represents the carrier frequency (in GHz), and *d* represents the distance (in km) between the transmitter and receiver.
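For example, Equation (20) in code (the sample frequency and distance values are purely illustrative):

```python
import math

def free_space_loss_db(f_ghz: float, d_km: float) -> float:
    """Free-space path loss of Equation (20): f in GHz, d in km."""
    return 92.45 + 20.0 * math.log10(f_ghz * d_km)

# e.g., a 2 GHz carrier over the 20 km HAPS altitude:
# free_space_loss_db(2.0, 20.0) = 92.45 + 20*log10(40) ≈ 124.5 dB
```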

#### **3. Calculation of Downlink SINR and** *INR*

#### *3.1. Calculation of Downlink SINR*

The signal received by the *UE* from the HAPS transmission for the *i*th cell (*Cell*\_*i*) is calculated as follows:

$$S\_{Cell\\_i} = P\_{Cell\\_i} + G\_{Cell\\_i} + G\_p + G\_{r,UE} - L\_p - L\_{ohm}, \tag{21}$$

where *PCell*\_*<sup>i</sup>* represents the HAPS transmission power for *Cell*\_*i*, *GCell*\_*<sup>i</sup>* represents the transmitting antenna gain of *Cell*\_*i*, *G<sup>p</sup>* represents the polarization gain, *Gr*,*UE* represents the receiving antenna gain, and *Lohm* represents the ohmic loss. The *UE* receives signals from all *Ncell* cells and considers the remaining signals (except for the strongest *Cell*\_*j* signal) as interference. Equation (22) is used to calculate the signal and interference, and the receiver noise is calculated using Equation (23).

$$\begin{aligned} j &= \underset{i}{\text{argmax}} \, S\_{Cell\\_i}\\ S\_{HAPS} &= S\_{Cell\\_j}\\ I\_{HAPS} &= 10 \log\left(\sum\_{\substack{i=1 \\ i \neq j}}^{N\_{cell}} 10^{\frac{S\_{Cell\\_i}}{10}}\right) \end{aligned} \tag{22}$$

$$N = 10\log(k \times T \times BW) + N\_f \tag{23}$$

Here, *k* and *T* represent the Boltzmann constant and noise temperature, respectively, and *BW* represents the channel bandwidth. *N<sup>f</sup>* represents the noise figure. Finally, the downlink SINR is calculated as follows:

$$\eta = 10 \log \left( \frac{10^{\frac{S\_{HAPS}}{10}}}{10^{\frac{I\_{HAPS}}{10}} + 10^{\frac{N}{10}}} \right). \tag{24}$$
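The serving-cell selection and SINR computation of Equations (22)-(24) can be sketched as follows; taking per-cell received powers as a list of dB values is an illustrative interface choice.

```python
import math

def downlink_sinr_db(s_cell_db, noise_db):
    """SINR of Equations (22)-(24) from per-cell received powers (dB).

    The strongest cell serves the UE; all other cells are interference.
    """
    j = max(range(len(s_cell_db)), key=lambda i: s_cell_db[i])  # Eq. (22)
    i_lin = sum(10 ** (s / 10.0) for i, s in enumerate(s_cell_db) if i != j)
    n_lin = 10 ** (noise_db / 10.0)
    return 10 * math.log10(10 ** (s_cell_db[j] / 10.0) / (i_lin + n_lin))  # Eq. (24)
```

With a single cell the interference term vanishes and the result reduces to the plain SNR.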

#### *3.2. Calculation of INR*

The interference power received by the interfered receiver from the HAPS transmitter servicing *Cell*\_*i* is calculated as follows:

$$I\_{Cell\\_i} = P\_{Cell\\_i} + G\_{Cell\\_i} + G\_p + G\_{r,V} - L\_p - L\_{ohm}, \tag{25}$$

where *Gr*,*<sup>V</sup>* represents the antenna gain of the interfered receiver. The aggregated interference power at the interfered receiver is calculated as follows:

$$I\_{HAPS,V} = 10\log\left(\sum\_{i=1}^{N\_{cell}} 10^{\frac{I\_{Cell\\_i}}{10}}\right). \tag{26}$$

Finally, after converting the aggregated interference into *INR* form in accordance with Equation (27) and comparing it with the protection criteria (*INRth*) of the interfered receiver, it is possible to check whether the interfered receiver is protected from the interference of the HAPS.

$$INR = I\_{HAPS,V} - N \tag{27}$$
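Equations (26)-(27) and the protection check can be sketched as follows; the −6 dB default for *INRth* is a commonly used protection criterion, assumed here only as an illustrative value.

```python
import math

def inr_db(i_cell_db, noise_db):
    """Aggregate INR of Equations (26)-(27) from per-cell interference (dB)."""
    i_agg = 10 * math.log10(sum(10 ** (i / 10.0) for i in i_cell_db))  # Eq. (26)
    return i_agg - noise_db                                            # Eq. (27)

def incumbent_protected(i_cell_db, noise_db, inr_th=-6.0):
    """True when the interfered receiver meets its protection criterion."""
    return inr_db(i_cell_db, noise_db) <= inr_th
```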

#### **4. DQL-Based HAPS Transmission Power Control Algorithm**

#### *4.1. Problem Formulation*

To satisfy the *INRth* of the interfered system, the transmission power of the HAPS must be reduced. However, as the power of the HAPS is reduced, the *η* of the *UE* decreases, and the outage probability *Pout* increases. Thus, the objective of this study was to find a HAPS transmission power set for each cell, i.e., *P* = {*PCell*\_*<sup>i</sup>* |*i* = 1, · · · , *Ncell*}, that satisfies the *INRth* of the interfered system while minimizing *Pout*. The optimization problem of the HAPS transmission power can be formulated as follows:

$$\begin{array}{cl} \min\limits\_{\mathbf{P}} & P\_{out} = \frac{N\_{UE,o}(\mathbf{P})}{N\_{UE}}\\ \text{s.t.} & \text{C1}: \; INR \le INR\_{th}\\ & \text{C2}: \; P\_{min} \le P\_{Cell\\_i} \le P\_{max} \quad \forall i \in \{1, \dots, N\_{cell}\}, \end{array} \tag{28}$$

where *NUE*,*o*(*P*) represents the number of UEs that do not satisfy the minimum required SINR *η<sup>o</sup>* for a given HAPS transmission power set *P*.

#### *4.2. Proposed Algorithm*

To control the HAPS transmission power, it is necessary to independently determine the power level of each cell. Accordingly, the total number of HAPS transmission power sets increases exponentially as $N\_p^{N\_{cell}}$ when the number of selectable powers *N<sup>p</sup>* increases linearly. Although an exhaustive search algorithm can be used to find optimal solutions, this incurs excessive complexity and a long computation time. To solve this problem, we propose a DQL-based power optimization algorithm that can find a near-optimal *P* with low complexity. In the proposed DQL model, each agent functions as the power controller of a cell; accordingly, the number of agents is *Ncell*.
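A brute-force baseline makes the combinatorial complexity concrete. Here `evaluate` is a hypothetical callback standing in for the simulation that checks constraint C1 and computes *Pout*; it is not part of the paper.

```python
from itertools import product

def exhaustive_search(power_levels, n_cell, evaluate):
    """Brute-force baseline: try all N_p ** n_cell power sets.

    evaluate(power_set) -> (feasible, outage); "feasible" stands in for
    the INR constraint C1, "outage" for P_out.
    """
    best_set, best_outage = None, float("inf")
    for p in product(power_levels, repeat=n_cell):   # N_p ** n_cell tuples
        feasible, outage = evaluate(p)
        if feasible and outage < best_outage:
            best_set, best_outage = p, outage
    return best_set, best_outage

# With N_p = 5 levels and 7 cells: 5 ** 7 = 78,125 candidate sets,
# each requiring a full system-level simulation to evaluate.
```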

The agent, the subject of learning, trains a deep neural network called a Deep Q-Network (DQN) and selects an action using this network. DQL is an improved Q-learning method. Q-learning selects the best action in a specific state through a Q-table of state–action pairs. As the state–action space grows in Q-learning, creating the Q-table and finding the best policy become highly complex. In addition, the use of Q-learning is limited because learning in the Q-table format becomes more complex when multiple agents are used. In contrast, DQL is a promising way to overcome the curse of dimensionality by approximating the Q function with a deep neural network instead of a Q-table. To solve the multiple-agent problem, the proposed algorithm uses a method in which each agent learns a policy based on its own observation and action while treating all other agents as part of the environment.
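For contrast with the DQN described below, a single tabular Q-learning update looks like this; the state names, table values, and hyperparameters are arbitrary illustrations.

```python
# One tabular Q-learning update: the table maps (state, action) -> value.
alpha, gamma = 0.1, 0.9                # learning rate, discount factor
q = {("s0", "a0"): 0.0, ("s1", "a0"): 2.0, ("s1", "a1"): 1.0}

s, a, r, s_next = "s0", "a0", 1.0, "s1"
best_next = max(q[(s_next, a2)] for a2 in ("a0", "a1"))   # max_a' Q(s', a')
q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # Bellman update
# q[("s0", "a0")] is now 0.1 * (1 + 0.9 * 2) = 0.28
```

The DQN replaces this explicit table with a neural approximator, which is what makes large or continuous state spaces tractable.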

The basic DQL parameters (state, action, and reward) are presented below. Each agent learns the policy independently using the training data at each timestep *t*. The state space of the *<sup>m</sup>*th agent comprises a set of (*Ncell* <sup>−</sup> 1) interferences that the agent provides to UEs located at the centers of other cells and the agent's interference to the interfered receiver, which is expressed as

$$\mathbf{s}_m = \left\{ I_{\mathrm{I}},\ \left\{ I_{\mathrm{UE},i} \mid i = 1, \dots, N_{\mathrm{cell}} \text{ and } i \neq m \right\} \right\}. \tag{29}$$

Two power sets configure the action space of an agent: *A*<sub>1</sub> = {29, 31, 33, 35, 37} and *A*<sub>2</sub> = {26, 28, 30, 32, 34} (unit: dBm). The agent of *Cell*\_1, the 1st-layer cell, selects an action from *A*<sub>1</sub>, and the agents of the 2nd-layer cells select an action from *A*<sub>2</sub>. All agent actions are initialized to the minimum power value at the beginning of the learning process to minimize the interference to the interfered receiver. The reward is calculated as follows. First, because the interfered receiver must be kept safe from HAPS interference, an agent receives a fixed penalty *r<sub>t</sub>* = −100 for *INR* > *INRth*. In contrast, for *INR* ≤ *INRth*, an agent receives an *r<sub>t</sub>* computed from the lower 5% downlink SINR of each cell {*η̂<sub>i</sub>* | *i* = 1, 2, · · · , *N<sub>cell</sub>*} and the required SINR *η<sub>o</sub>*. The reward can be expressed as

$$r_t = \begin{cases} r_{1,t} + r_{2,t} & \text{for } INR \le INR_{th} \\ -100 & \text{otherwise} \end{cases} \tag{30}$$

where

$$\begin{array}{ll} r_{1,t} = 10 \cdot \sum_i \left( \hat{\eta}_i - \eta_o \right) & \text{for } \hat{\eta}_i \ge \eta_o \\ r_{2,t} = \sum_i \left( \hat{\eta}_i - \eta_o \right) & \text{for } \hat{\eta}_i < \eta_o. \end{array} \tag{31}$$
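The reward rule in Equations (30) and (31) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and argument names are assumptions, and the fixed −100 penalty, the ×10 scaling for cells meeting the SINR requirement, and the negative shortfall term are taken from the text above.

```python
def reward(inr, inr_th, sinr_low5, sinr_req):
    """Per-step reward for the HAPS power control agents (Eqs. 30-31).

    inr       : aggregate INR at the interfered receiver (dB)
    inr_th    : INR protection threshold (dB), e.g. -6 dB
    sinr_low5 : lower-5% downlink SINR of each cell (dB), one entry per cell
    sinr_req  : required SINR eta_o (dB)
    """
    if inr > inr_th:          # incumbent protection violated
        return -100.0         # fixed penalty regardless of SINR
    # r1: scaled bonus from cells meeting the SINR requirement
    r1 = 10.0 * sum(s - sinr_req for s in sinr_low5 if s >= sinr_req)
    # r2: (negative) shortfall from cells below the requirement
    r2 = sum(s - sinr_req for s in sinr_low5 if s < sinr_req)
    return r1 + r2
```

For example, with *INR* below the threshold and one cell 1 dB above the required SINR and one cell 1 dB below it, the reward is 10 · 1 + (−1) = 9.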

Figure 7 shows the structure of the proposed DQL-based HAPS transmission power control algorithm. Each agent learns its own DQN, and one DQN consists of a main network, a target network, and a replay memory. The main network estimates the *Q*-value *Q*(*s*, *a*; *w*) corresponding to the state–action pair through a deep neural network with weights *w*. The main network is fully connected, with an input layer of seven neurons, a hidden layer of 24 neurons, and an output layer of five neurons. *w* is updated at every *t* in the direction that minimizes the loss function *L*(*w*) = E[(*y<sub>j</sub>* − *Q*(*s*, *a*; *w*))<sup>2</sup>]. The target network calculates the target value *y<sub>j</sub>* = *r<sub>j</sub>* + *γ* max<sub>*a*′</sub> *Q̂*(*s*′, *a*′; *w*<sup>−</sup>), where *γ* is the discount factor; *s*′ and *a*′ denote the state and action, respectively, in the next step; and *Q̂*(*s*′, *a*′; *w*<sup>−</sup>) is the *Q*-value estimated through the target network with weights *w*<sup>−</sup>. The agent's transition tuple (*s<sub>t</sub>*, *a<sub>t</sub>*, *r<sub>t</sub>*, *s<sub>t+1</sub>*) is stored in the replay memory, from which a minibatch (512 tuples) is randomly sampled at each step. The minibatch data are used to compute the target value *y<sub>j</sub>*. In a DQL, learning is stabilized and the learning performance is improved through the replay memory and the separate target network [23].
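The per-agent network and the target computation can be sketched with a minimal NumPy forward pass. The 7–24–5 layer sizes and the squared-TD-error loss follow the text; the weight initialization, the ReLU hidden activation, and all variable names are assumptions, since the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in=7, n_hidden=24, n_out=5):
    """Fully connected 7-24-5 network: one hidden layer, linear output."""
    return {
        "W1": rng.normal(0, 0.1, (n_hidden, n_in)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_out, n_hidden)), "b2": np.zeros(n_out),
    }

def q_values(net, s):
    """Estimate Q(s, a; w) for all five actions (ReLU hidden layer assumed)."""
    h = np.maximum(0.0, net["W1"] @ s + net["b1"])
    return net["W2"] @ h + net["b2"]

def td_target(target_net, r, s_next, gamma=0.995):
    """y_j = r_j + gamma * max_a' Q_hat(s', a'; w^-) from the target network."""
    return r + gamma * np.max(q_values(target_net, s_next))

main, target = init_net(), init_net()     # main weights w, target weights w^-
s = rng.normal(size=7)                    # 7-dimensional state, as in Eq. (29)
y = td_target(target, r=1.0, s_next=s)
loss = (y - q_values(main, s)[2]) ** 2    # squared TD error L(w) for action a=2
```

In the paper, the Adam optimizer (learning rate 0.01) updates *w* to minimize this loss over the 512-tuple minibatch; only the forward pass and target value are sketched here.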

**Figure 7.** DQL-based HAPS power control architecture.

Algorithm 1 describes the proposed DQL-based HAPS transmission power control algorithm. For DQN training, *N* was set as 100,000, and the minibatch size was set as 512. *M* was set as 500, and *T* was set as 10. The Adam optimizer was used to minimize *L*(*w*), and the learning rate and *γ* were 0.01 and 0.995, respectively. An *ε*-greedy policy was used to balance exploration and exploitation; *ε* was initially set as 1 and was reduced by 0.01 for every episode.

**Algorithm 1.** Training Process for the DQL-Based HAPS Power Control Algorithm

1: Initialize the replay memory to capacity *N*
2: Initialize the *Q*-function with random weights *w*
3: Initialize the target *Q*-function with the same weights: *w*<sup>−</sup> = *w*
4: **for** episode = 1, *M* **do**
5: Initialize the actions to the minimum power
6: **for** timestep *t* = 1, *T* **do**
7: **if** *t* = 1
8: Calculate *s<sub>t</sub>* via Equations (21) and (25)
9: **end if**
10: With probability *ε*, select a random action *a<sub>t</sub>*
11: Otherwise, select *a<sub>t</sub>* = argmax<sub>*a*</sub> *Q*(*s<sub>t</sub>*, *a*; *w*)
12: Assign the selected power to the *m*th cell and compute the *INR* and SINR values
13: Observe the reward *r<sub>t</sub>* and *s<sub>t+1</sub>*
14: Store the experience (*s<sub>t</sub>*, *a<sub>t</sub>*, *r<sub>t</sub>*, *s<sub>t+1</sub>*) in the replay memory
15: Sample a random minibatch of experiences from the replay memory
16: Set *y<sub>j</sub>* = *r<sub>j</sub>* + *γ* max<sub>*a*′</sub> *Q̂*(*s*′, *a*′; *w*<sup>−</sup>)
17: Perform optimization via *L*(*w*) and update *w*
18: Update the target network with *w*<sup>−</sup> = *w* every 4 steps
19: **end for**
20: **end for**

A DDQL is a reinforcement learning algorithm that mitigates the performance degradation caused by action-value overestimation in the DQL. Action values can be overestimated by the maximization step in line 16 of Algorithm 1. Therefore, the DDQL calculates the target value as *y<sub>j</sub>* = *r<sub>j</sub>* + *γQ̂*(*s*′, argmax<sub>*a*′</sub> *Q*(*s*′, *a*′; *w*); *w*<sup>−</sup>), letting the main network select the action and the target network evaluate it. The DDQL-based HAPS power control algorithm proceeds in the same way as Algorithm 1 except for the calculation of the target value.
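The difference between the two target values can be made concrete with a small sketch; the helper names are illustrative, and `q_main`/`q_target` stand for the five-action *Q*-value vectors of the main and target networks at the next state *s*′.

```python
import numpy as np

def dql_target(q_main, q_target, r, gamma=0.995):
    """DQL target: max over the target network's own estimates.
    (q_main is unused here; kept only for a signature parallel to ddql_target.)"""
    return r + gamma * np.max(q_target)

def ddql_target(q_main, q_target, r, gamma=0.995):
    """DDQL target: the main network selects the action,
    the target network evaluates it."""
    a_star = int(np.argmax(q_main))
    return r + gamma * q_target[a_star]
```

With `q_main = [1, 3, 2]` and `q_target = [5, 0, 4]` (and *r* = 0, *γ* = 1 for clarity), the DQL target takes max(`q_target`) = 5, whereas the DDQL target evaluates the main network's choice (action 1) on the target network, giving 0; decoupling selection from evaluation is what curbs overestimation.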

#### **5. Simulation Results**

#### *5.1. Simulation Configuration*

The simulation was conducted using MATLAB for three positions of the interfered receiver, and the learning order of the agents was randomly set for each *t*. Subsequently, the simulation proceeded according to Algorithm 1. When all *M* episodes were finished, the simulation ended, and the set *P<sup>c</sup>* composed of the power selected by each agent was obtained as the simulation result. Finally, the performance was verified by comparing *P<sup>c</sup>* with the optimal power set *P*<sup>∗</sup> obtained via an exhaustive search algorithm considering all *N<sub>p</sub><sup>N<sub>cell</sub></sup>* cases. The total elapsed time of the DQL and the exhaustive search was about 7500 s and 21,000 s, respectively. The total elapsed time of the exhaustive search increases exponentially with the number of cells, whereas that of the DQL does not. Therefore, the computational advantage of the DQL becomes more pronounced as the number of cells and power levels increases. In this simulation, a performance comparison with the DDQL was additionally performed to check for performance degradation due to overestimation in the DQL.
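The complexity gap is easy to see from the quantities stated in the text: with *N<sub>p</sub>* = 5 power levels and *N<sub>cell</sub>* = 7 cells, the exhaustive search evaluates 5<sup>7</sup> = 78,125 joint power sets, while one DQL run performs on the order of *M* · *T* · *N<sub>cell</sub>* updates (values from Algorithm 1's training setup; the "one update per agent per timestep" count is a rough illustrative assumption).

```python
N_p, N_cell = 5, 7            # selectable power levels, HAPS cells
M, T = 500, 10                # training episodes and timesteps per episode

exhaustive = N_p ** N_cell    # every joint power assignment
dql_updates = M * T * N_cell  # rough per-agent update count for one DQL run

print(exhaustive)             # 78125, the count quoted in Section 5.2.3
```

Adding one cell multiplies the exhaustive count by *N<sub>p</sub>*, but only adds one agent's updates to the DQL run, which is why the gap widens with the network size.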

We applied the HAPS parameters and interfered system parameters, referring to the working document for the HAPS coexistence study performed in preparation for WRC-23 [18,24]. The simulation parameters of the two systems are presented in Tables 1 and 2, respectively.


**Table 1.** HAPS system parameters.

**Table 2.** Interfered system (IMT BS) parameters.




#### *5.2. Numerical Analysis*

Figure 8 shows the SINR maps obtained using *Pmax* = {37, 34, 34, 34, 34, 34, 34} and *Pmin* = {29, 26, 26, 26, 26, 26, 26} for all cells, that is, with no power control. We considered three positions of the interfered receiver that do not satisfy the *INRth* of −6 dB for the use of *Pmax*. In addition, the three locations were designed considering the representative interference power, which can accurately reflect the operating characteristics of the proposed power control algorithm. Interfered receiver ① was located in the main beam direction of *Cell*\_3 and received the highest interference from *Cell*\_3. Therefore, the minimum power use of only *Cell*\_3 satisfied the *INRth* of −6 dB. Interfered receiver ② was placed on the boundary between *Cell*\_3 and *Cell*\_4 and thus received equal (and the strongest) interference from these two cells. Interfered receiver ③ was located in the main beam direction of *Cell*\_3, like interfered receiver ①. However, the minimum power use of only *Cell*\_3 could not satisfy the *INRth* of −6 dB, and at least one other cell had to use less than the maximum power.

**Figure 8.** (**a**) SINR map for *Pmax*; (**b**) SINR map for *Pmin*.

**Table 3.** *INR* and *Pout* for the interfered receiver locations.

| Receiver | Location (km) | *INR* for *Pmax* (dB) | *INR* for *Pmin* (dB) | *Pout* for *Pmax* (%) | *Pout* for *Pmin* (%) |
|---|---|---|---|---|---|
| ① | 100, 0, 0.02 | −3.01 | −11.01 | 0 | 43.7 |
| ② | 77.9, 45, 0.02 | −4.08 | −12.08 | 0 | 43.7 |
| ③ | 65.8, 0, 0.02 | 1.81 | −6.19 | 0 | 43.7 |

Table 3 presents the *INR* and *Pout* for *Pmax* and *Pmin* with varying interfered receiver locations. The results confirm that *Pout* and *INR* had a tradeoff relationship. The same *Pout* is shown regardless of the interfered receiver position because of the absence of power control. Next, we compared the simulation results of the optimal exhaustive search and the proposed DQL-based power control algorithm for the three positions of the interfered receiver.

#### 5.2.1. Simulation Results for Interfered Receiver ①

Figure 9 shows the SINR map based on the *P<sup>c</sup>* acquired using the proposed DQL-based power control algorithm for interfered receiver ①. Table 4 presents a performance comparison of the *P*<sup>∗</sup> values obtained via an exhaustive search and *P<sup>c</sup>* and a comparison of the DQL and DDQL results. As shown, *P<sup>c</sup>* was equal to the optimal value *P*<sup>∗</sup>, providing the same *Pout* and *INR* performance. Because the interfered receiver was located in the azimuth main beam direction of *Cell*\_3, the power of *Cell*\_3 significantly affected the interfered receiver. Even though all other cells used the maximum power, their interference was negligible. Therefore, all the cells except for *Cell*\_3 used the maximum power for minimizing *Pout*, as shown in Table 4.

**Figure 9.** SINR map based on the *P<sup>c</sup>* obtained using the proposed DQL-based power control algorithm for interfered receiver ①.


**Table 4.** Performance comparison for interfered receiver <sup>1</sup> .

| | *Cell*\_1 (dBm) | *Cell*\_2 (dBm) | *Cell*\_3 (dBm) | *Cell*\_4 (dBm) | *Cell*\_5 (dBm) | *Cell*\_6 (dBm) | *Cell*\_7 (dBm) | *INR* (dB) | *Pout* (%) |
|---|---|---|---|---|---|---|---|---|---|
| Optimal | 37 | 34 | 30 | 34 | 34 | 34 | 34 | −6.93 | 0.6 |
| DQL | 37 | 34 | 30 | 34 | 34 | 34 | 34 | −6.93 | 0.6 |
| DDQL | 37 | 34 | 30 | 34 | 34 | 34 | 34 | −6.93 | 0.6 |

Figure 10 presents the *INR* and *Pout* for each learning episode. As shown, the *INR* and *Pout* converged to the optimal values of the exhaustive search algorithm as the number of learning episodes increased. The *INR* started at −11.01 dB, which was the value for the use of *Pmin*, as shown in Table 3, and converged to the optimal value of −6.93 dB. Similarly, *Pout* started at 43.7% and converged to 0.6%. A large variance due to frequent exploration was observed at the beginning of the learning, but it gradually decreased and converged as the learning progressed. Figure 11 presents the cumulative and average rewards for each learning episode. As shown, the reward rapidly increased and then gradually converged at approximately 300 episodes, indicating that the proposed DQL training process allowed the agent to learn the power control algorithm quickly and stably.

**Figure 10.** (**a**) *INR* and (**b**) *Pout* for each learning episode for interfered receiver ①.

**Figure 11.** Reward for each learning episode for interfered receiver ①.

We compared the learning results of the DQL and DDQL. Even when the DDQL was used, the results were the same as in Table 4 and Figures 10 and 11, which shows that overestimation did not occur in the DQL. As a result, it was confirmed that performance degradation due to overestimation did not happen, and sufficient learning is possible with the DQL alone.

#### 5.2.2. Simulation Results for Interfered Receiver ②

Figure 12 shows the SINR map based on *P<sup>c</sup>* acquired using the proposed DQL-based power control algorithm for interfered receiver ②. Table 5 presents a performance comparison of the *P*<sup>∗</sup> values obtained via an exhaustive search and *P<sup>c</sup>* and a comparison of the DQL and DDQL results. As shown, *P<sup>c</sup>* was equal to the optimal value *P*<sup>∗</sup>, providing the same *Pout* and *INR* performance. The interfered receiver was located on the boundary between *Cell*\_3 and *Cell*\_4 and, thus, received equal (and the strongest) interference from these two cells. In addition, even though all the cells other than *Cell*\_3 and *Cell*\_4 used the maximum power, their interference was marginal. Therefore, in the optimal power control, *Cell*\_3 and *Cell*\_4 reduced their power to satisfy the *INRth*, whereas all the other cells used the maximum power for minimizing *Pout*, as shown in Table 5.

**Figure 12.** SINR map based on the *P<sup>c</sup>* obtained using the proposed DQL-based power control algorithm for the interfered receiver ②.



As shown in Figure 13, the *INR* and *Pout* converged to the optimal values of the exhaustive search algorithm. Similar to the case of receiver ①, as the learning progressed, the *INR* converged from −12.08 to −6.08 dB, and the *Pout* converged from 43.7% to 0.2%. Figure 14 shows that the reward gradually converged at approximately 300 episodes, indicating that the proposed DQL training process allowed the agent to quickly and stably learn the power control algorithm. We compared the learning results of the DQL and DDQL. Even when the DDQL was used, the results were the same as in Table 5 and Figures 13 and 14, verifying that the desired learning is attainable with the DQL only.

**Figure 13.** (**a**) *INR* and (**b**) *Pout* for each learning episode for interfered receiver ②.

**Figure 14.** Reward for each learning episode for interfered receiver ②.


#### 5.2.3. Simulation Results for Interfered Receiver ③

Figure 15 shows the SINR map based on *P<sup>c</sup>* obtained using the proposed DQL-based power control algorithm for interfered receiver ③. The interfered receiver was located in the azimuth main lobe direction of *Cell*\_3. It was closer to the HAPS than the receiver considered in Section 5.2.1 and was more severely affected by *Cell*\_3; the *INRth* was not satisfied even for the minimum power of *Cell*\_3. Thus, the optimal power control adjusted the power of *Cell*\_2 and *Cell*\_4, which caused the second-most interference. Table 6 presents a comparison of the *P*<sup>∗</sup> values obtained using an exhaustive search and *P<sup>c</sup>* and a comparison of the DQL and DDQL results. Although the *Pout* of *P<sup>c</sup>* was 0.6% higher than that of *P*<sup>∗</sup>, it corresponded to the third-smallest value among the 78,125 values generated by the exhaustive search algorithm. In summary, the proposed power control algorithm achieved outstanding performance close to the optimal value.

**Figure 15.** SINR map based on *P<sup>c</sup>* obtained using the proposed DQL-based power control algorithm for interfered receiver ③.

**Table 6.** Performance comparison for interfered receiver ③.

As shown in Figure 16, the *INR* and *Pout* converged to the optimal values of the exhaustive search algorithm, with slight gaps. Similar to the results presented in Section 5.2.1, as the learning progressed, the *INR* converged from −6.19 to −6.06 dB, and the *Pout* converged from 43.7% to 5.7%. Figure 17 shows the cumulative and average rewards for each learning episode. The reward exhibited no noticeable improvement until approximately 130 episodes, after which it rapidly increased and then gradually converged at approximately 350 episodes. This is because, to satisfy the *INRth*, more agents had to take action, and the actions had to be more diverse. Nonetheless, the proposed DQL training process allowed the agent to learn the power control algorithm quickly and stably. We compared the learning results of the DQL and DDQL. Even when the DDQL was used, the results were the same as in Table 6 and Figures 16 and 17, verifying that the desired learning is attainable with the DQL only.

**Figure 16.** (**a**) *INR* and (**b**) *Pout* for each learning episode for interfered receiver ③.

**Figure 17.** Reward for each learning episode for interfered receiver ③.

#### **6. Conclusions**


This paper proposed a DQL-based transmission power control algorithm for multicell HAPS communication that involved spectrum sharing with existing services. The proposed algorithm aimed to find a solution to the power control optimization problem for minimizing the outage probability of the HAPS downlink under the interference constraint to protect existing systems. We compared the solution with the optimal solution acquired using the exhaustive search algorithm. The simulation results confirmed that the proposed algorithm was comparable to the optimal exhaustive search.

Future work will include various power levels and expanding to multiple-HAPS communication in spectrum sharing with multiple interference systems. Since the increase in the power level could reveal a value-based algorithm's limit, it is preferable to apply a policy-based algorithm. Given that multiple-HAPS communication could lead to the non-stationarity problem of multi-agent reinforcement learning, its solution would be worth studying.

**Author Contributions:** Conceptualization and methodology, S.J. and H.-S.J.; software, S.J. and W.Y.; validation, formal analysis, and investigation, S.J. and W.Y. and H.-S.J.; resources and data curation, H.K.C. and E.N.; writing—original draft preparation, S.J. and W.Y.; writing—review and editing, S.J., H.-S.J. and J.P.; visualization, W.Y. and H.K.C.; supervision, J.P.; project administration, H.-S.J. and J.P.; funding acquisition, J.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Agency for Defense Development (ADD).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

