Article

Joint Power and Channel Optimization of Agricultural Wireless Sensor Networks Based on Hybrid Deep Reinforcement Learning

Xiao Han, Huarui Wu, Huaji Zhu and Cheng Chen

1 National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
2 Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
3 Key Laboratory of Agri-Informatics, Ministry of Agriculture, Beijing 100097, China
* Author to whom correspondence should be addressed.
Processes 2021, 9(11), 1919; https://doi.org/10.3390/pr9111919
Submission received: 23 August 2021 / Revised: 26 September 2021 / Accepted: 11 October 2021 / Published: 27 October 2021

Abstract

Reducing maintenance costs in agricultural wireless sensor networks (WSNs) requires reducing energy consumption without degrading communication quality or network lifetime. This paper studies a joint optimization algorithm for transmitted power and channel allocation based on deep reinforcement learning. First, an optimization model measuring network reward was established under a signal-to-interference-plus-noise ratio (SINR) threshold constraint; the model includes continuous power variables and discrete channel variables. Second, considering the dynamic changes of agricultural WSNs, network control is described as a Markov decision process with continuous state and action spaces, and a deep deterministic policy gradient (DDPG) reinforcement learning scheme suitable for mixed variables was designed. This method obtains a control scheme that maximizes network reward through black-box optimization over continuous transmitted power and discrete channel allocation. Experimental results indicate that the proposed algorithm converges stably. Compared with traditional protocols, it controls transmitted power and allocates channels more effectively. The joint power and channel optimization provides a reference solution for constructing an energy-balanced network.

1. Introduction

With the development of the Internet of Things (IoT), WSNs enable data interaction between the sensing and application layers, providing a basis for various IoT systems [1]. Applying WSNs to intelligent agriculture is an inevitable trend [2]. A WSN carries out real-time monitoring and scientific regulation of precision agricultural production through sensors in various terminal equipment. It helps to allocate agricultural resources reasonably, reduce production costs, and improve the yield and quality of agricultural products.
Agricultural WSNs are characterized by diverse monitoring events, heterogeneous data types, and large network scales [3]. However, wireless sensor nodes have limited energy, and their batteries are difficult to charge or replace [4]. Limited energy affects network lifetime and communication quality. In addition, the dense deployment of IoT devices makes spectrum resources an important factor hindering the expansion of agricultural WSNs [5]. Optimizing the network structure, reducing node energy consumption, and making full use of network spectrum resources are the keys to promoting WSNs in smart agriculture.
Power control adjusts the transmitted power to reduce energy consumption per unit time. In single-channel networks, power control has mainly been studied from two aspects: improving the network SINR and reducing energy consumption. Constrained by the interference power and SINR of each node, an adaptive power control model for wireless networks was studied in [6]. Its price function introduced an SINR threshold, which reduced peak power and energy consumption but also reduced network capacity. Mehmood et al. considered the lifetime of a single node and the overall network lifetime [7]. Combining SINR and empirical interference power, they proposed a new utility function for reducing network energy consumption based on a stochastic differential game, which was then transformed into a mean-field game model by mean-field approximation; the transmitted power was finally obtained with a finite-difference algorithm. For wireless networks with selfish nodes [8], link rewards and costs depend on node hops, forwarding time, energy consumption, SINR, and so on, and an improved local multi-objective power control algorithm was studied to obtain a network topology with less communication overhead. Du et al. applied the Theil index to measure the energy imbalance of WSNs and constructed a network topology accordingly [9]. To maximize the SINR, the power control problem was modeled as a hybrid game [10], and the existence and uniqueness of the Nash equilibrium in the game model were proved theoretically; however, this method did not consider the energy efficiency of nodes. In addition, some researchers have applied deep learning and intelligent algorithms to power control [11].
When the network scale increases, spectrum resources cannot meet equipment requirements: with only one channel in the network, every node transmits on that channel, nodes interfere with each other, and network performance degrades. Multi-channel technology was proposed to solve the unfair resource competition caused by spectrum shortages; it assigns links to different channels, which effectively reduces interference [12]. Multi-channel allocation combined with power control can further improve energy efficiency. The traditional approach treats power control and channel allocation separately. Shokrnezhad et al. formulated the minimum-transmitted-power problem as a mixed-integer linear programming (MILP) problem that meets SINR requirements [13,14]. It was divided into two subproblems, auction-based channel allocation and non-cooperative-game-based power control, with the power control subproblem solved after channel allocation. Evolutionary algorithms have also been used for the joint optimization of network power and channels: a hybrid particle swarm optimization (HPSO) algorithm was proposed to maximize network capacity [15]. Using the Lambert W function, Zhou et al. derived the transmitted power corresponding to the maximum energy efficiency and allocated channels with the Kuhn–Munkres algorithm [16]. These methods split the joint optimization problem into two subproblems that are solved independently or sequentially, without sufficiently considering the interaction between them. Hao et al. established a reward function based on a non-cooperative game by comprehensively considering node interference, transmitted power, and residual energy [17,18], and then allocated power and channels to the nodes. Although traditional game theory can improve reliability and energy consumption, it mainly targets static optimization and adapts poorly to dynamic environments.
The above studies show that power control can reduce network energy consumption, and that, compared with power control on a single channel, a multi-channel network can further decrease interference and improve network performance. Many researchers have studied the joint optimization of power control and channel allocation by building optimization models for a static network and solving them with optimization or heuristic algorithms [6,19]. However, traditional methods cannot accurately model a complex environment by accumulating experience. To adapt to different environments, reinforcement learning (RL) senses changes in the environment and learns action strategies from experience. An RL algorithm was designed to obtain the optimal joint optimization strategy through offline training, which reduced the consumption of computing resources [20]. Asuhaimi et al. combined RL with deep neural networks [21] and solved the joint power and channel optimization problem with an artificial-intelligence agent. In [22], deep reinforcement learning (DRL) was used to maximize network capacity in a nonconvex, nonlinear setting; the experimental results showed that the DRL method converged well and achieved more energy-efficient joint optimization with less communication information. These DRL approaches use deep neural networks to mine potentially important information from the network, which can improve the performance of wireless networks [23]. However, they either cannot optimize continuous variables or target ad hoc and mesh networks without energy constraints, so they are not directly suitable for energy-limited WSNs.
RL has great potential for automatic environment exploration, self-decision-making, and self-optimization. Its optimization goal is a long-term reward rather than an instant reward. For a farmland WSN with unstable communications, an algorithm that considers only the instant reward may choose the transmitted power and channel that maximize the current network reward, yet this decision may reduce the total network reward over a period. We therefore adopt RL to make decisions according to long-term rewards. In addition, RL can obtain an approximately optimal solution without prior information about the environment: through its exploration mechanism, an RL algorithm gradually understands the environment and makes decisions. This paper combines deep learning, RL, and an agricultural WSN control model to establish a new reward function, realizing black-box optimization for continuous power control and discrete channel selection. We design a joint optimization algorithm based on an improved DDPG to handle a continuous power space and a discrete channel space, and transform the network control problem into a sequential decision problem. Experimental results show that, compared with traditional methods, our model ensures good communication quality with a lower average transmitted power, which is conducive to reducing network energy consumption and makes it suitable for agricultural WSNs.

2. Agricultural WSN Model

We assume that an agricultural wireless network contains N sensors working together, forming a set V = {1, 2, …, N}, randomly distributed in farmland. The set of node energies is E = {E1, E2, …, EN}, where Ei represents the residual energy of node i. U is the set of all links; any link u_ij ∈ U represents a one-way transmission path from node i to node j, and node j is called the next hop of node i. Suppose the number of orthogonal channels in the network is M (N > M), with channel set C = {channel 1, channel 2, …, channel M}. The channels selected by the sensors constitute a set K = {c1, c2, …, cN}; that is, the channel selected by node i is ci, with ci ∈ C. Each node can work on only one channel at any time. A centralized controller coordinates and manages the network nodes, allocates idle channels, and optimizes transmitted power when the network begins to transmit data; the results are then notified to the nodes [6]. Figure 1 shows the power control and channel allocation in an agricultural WSN.
The SINR γi(t) of node i at time instant t can be expressed as follows [24]:
\gamma_i(t) = \frac{W}{R} \cdot \frac{p_i(t)\, g_i}{N_{\sigma} + \sum_{j \in I,\ j \neq i} p_j(t)\, g_j} = \frac{G\, p_i(t)\, g_i}{N_{\sigma} + \sum_{j \in I,\ j \neq i} p_j(t)\, g_j}
where W is the spread-spectrum bandwidth of the network in Hz, R is the transmission rate in bit/s, and the spread-spectrum gain is G = W/R. p_i(t) denotes the transmitted power of node i at time instant t, and its working channel is c_i(t) (c_i(t) ∈ C). The set I consists of the interfering nodes sharing channel c_i(t). g_i is the path gain of node i, and N_σ is the background noise power.
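To make Equation (1) concrete, the following sketch computes the SINR of every node from a power vector, path gains, and channel assignments. It is a minimal Python illustration with hypothetical variable names and a placeholder noise power, not the authors' simulation code; only nodes sharing a channel contribute interference, and the default spreading gain of 500 follows Table 1.

import numpy as np

def sinr(powers, gains, channels, spreading_gain=500.0, noise=1e-9):
    # SINR of each node per Equation (1): only nodes on the same channel interfere.
    powers = np.asarray(powers, dtype=float)
    gains = np.asarray(gains, dtype=float)
    channels = np.asarray(channels)
    received = powers * gains                      # p_i * g_i
    out = np.empty_like(received)
    for i in range(len(powers)):
        same = channels == channels[i]
        same[i] = False                            # exclude node i itself
        out[i] = spreading_gain * received[i] / (noise + received[same].sum())
    return out

# Example: four nodes on two channels (values are arbitrary).
# print(sinr([0.05, 0.04, 0.06, 0.05], [0.02, 0.03, 0.02, 0.01], [0, 1, 0, 1]))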
Increasing the transmitted power and reducing interference can improve the SINR, which in turn influences performance metrics such as network capacity [21]. However, because node energy is limited, a high transmitted power accelerates the death of a node. Therefore, energy consumption must be considered in power control and channel allocation. We design the following price function to quantify the energy efficiency of the network:
\eta(t) = \sum_{i \in V}\left(\frac{1}{p_i(t)}\ln\frac{\gamma_i(t)}{\gamma_{th}} - \beta\, p_i(t)\, g_i\, \frac{E_{ave}}{E_i}\right)
where γ_i(t) is the SINR of node i obtained from Equation (1), γ_th is the SINR threshold, and β is the pricing factor. The initial (maximum) energy of all nodes in the network is defined as E_0, and data transmission consumes energy. The residual energy of node i is E_i, with E_i ∈ E. The other nodes within the communication range of node i are its neighbor nodes, and E_ave is the average residual energy of node i and its neighbors.
When the SINR γ_i(t) of node i is below the threshold, the first term increases sharply as γ_i(t) increases; once γ_i(t) exceeds the threshold, the value of the function is dominated by the second term. The second term shows that as the residual energy decreases or becomes unbalanced, the network reward trends downward, so the transmitted power must be reduced to prolong the network lifetime.
When node i adopts a relatively high transmitted power, it degrades the SINR of other nodes. The neighbor nodes then increase their transmitted power to recover the desired SINR, which causes more serious interference between the network links. Since each node's transmitted power affects its neighbors, this is not a local optimization problem. To optimize the network SINR and keep the energy levels balanced, a joint power and channel optimization model is established with the SINR threshold and the transmitted power range as constraints:
\arg\max_{P,K}\ \frac{1}{T}\sum_{t=1}^{T}\sum_{i \in V}\left(\frac{1}{p_i(t)}\ln\frac{\gamma_i(t)}{\gamma_{th}} - \beta\, p_i(t)\, g_i\, \frac{E_{ave}}{E_i}\right)
\text{s.t.}\quad \text{C1: } c_i(t) \in C,\ \forall i \in V;\quad \text{C2: } p_{\min} \le p_i(t) \le p_{\max},\ \forall i \in V;\quad \text{C3: } \gamma_i(t) \ge \gamma_{th},\ \forall i \in V
where T is the optimization period; the objective of Equation (3) is to maximize the transmitted energy efficiency over the period T. The variables to be optimized are the transmission channels and transmitted powers of the nodes in V at every time t, i.e., the matrices P = {P(1), P(2), …, P(T)} and K = {K(1), K(2), …, K(T)}, where P(t) = {p1(t), p2(t), …, pN(t)} is the power strategy at time instant t and K(t) = {c1(t), c2(t), …, cN(t)} is the channel strategy at time instant t. C1 states that the transmission channel selected by node i belongs to the set C and that each node selects only one channel at time t. C2 gives the range of transmitted power. C3 ensures that the SINR exceeds the threshold; C3 is limited by C2, and C2 takes priority over C3. This is an NP-hard problem, and because transmitted power is a continuous variable it also suffers from the curse of dimensionality, so finding an appropriate method to obtain an effective solution is essential.
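For illustration, the sketch below evaluates the price function of Equation (2) and checks the constraints C1–C3 of Equation (3) for one candidate power/channel assignment. The SINR vector is assumed to be precomputed (for instance with the sinr() helper above), SINR values are taken in linear scale, and all argument names are hypothetical.

import numpy as np

def network_reward(powers, gains, residual_energy, neighbor_avg_energy,
                   gamma, gamma_th, beta=1.0):
    # Equation (2): SINR margin term minus an energy-weighted power price, summed over nodes.
    p = np.asarray(powers, float)
    g = np.asarray(gains, float)
    e = np.asarray(residual_energy, float)
    e_ave = np.asarray(neighbor_avg_energy, float)
    return float(np.sum(np.log(np.asarray(gamma) / gamma_th) / p - beta * p * g * e_ave / e))

def feasible(powers, channels, gamma, num_channels, p_min, p_max, gamma_th):
    # Constraints C1-C3 of Equation (3).
    c = np.asarray(channels)
    p = np.asarray(powers, float)
    c1 = bool(np.all((c >= 0) & (c < num_channels)))   # valid channel index
    c2 = bool(np.all((p >= p_min) & (p <= p_max)))     # power range
    c3 = bool(np.all(np.asarray(gamma) >= gamma_th))   # SINR threshold
    return c1 and c2 and c3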

3. A Joint Optimization Strategy for Power Control and Channel Allocation

RL is widely applied to Markov decision problems. Considering the storage difficulties caused by the large state and action spaces of WSNs, a joint optimization method for power control and channel allocation in farmland WSNs based on DRL was studied by introducing deep learning. In this paper, the dynamic problem of Equation (3) is modeled as a Markov decision process, and an online learning algorithm based on an improved DDPG is proposed. By combining neural networks with RL, the historical experience collected by the algorithm is transformed into data samples, and the neural network model is trained in real time to continuously improve its performance. The system state space, action space, and reward function of the reinforcement learning formulation are described below.

3.1. Reinforcement Learning Function

In this paper, power control and channel allocation are optimized to maximize WSN energy efficiency. The agent needs information about channel allocation, transmitted power, and residual energy in the network. The system state s(t) at time instant t can be expressed as:
s(t) = \{\gamma(t),\ H(t),\ P(t),\ K(t)\}
At time t, γ(t) = {γ1(t), γ2(t), …, γN(t)} collects the SINR of each node, P(t) = {p1(t), p2(t), …, pN(t)} the transmitted power of each node, and K(t) = {c1(t), c2(t), …, cN(t)} the channel selected by each node. H(t) = {h1(t), h2(t), …, hN(t)} indicates whether each node's SINR meets the threshold:
h_i(t) = \begin{cases} 1, & \gamma_i(t) \ge \gamma_{th} \\ 0, & \text{otherwise} \end{cases}
Action space a(t): to obtain better network performance, the agent must determine the channel allocation and transmitted power of each node:
a(t) = \{P(t),\ K(t)\}
Instant reward function: the reward represents the agent's goal. In this problem, the instant reward function r(t) is the sum of the transmitted energy efficiency of all nodes at time instant t:
r(t) = \sum_{i \in V}\left(\frac{1}{p_i(t)}\ln\frac{\gamma_i(t)}{\gamma_{th}} - \beta\, p_i(t)\, g_i\, \frac{E_{ave}}{E_i}\right)
For convenience, we write s(t), a(t), and r(t) with subscripts below, i.e., s_t, a_t, and r_t.
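The state, action, and reward above can be wrapped in a small environment class for the agent to interact with. The sketch below only shows how the state s_t = {γ(t), H(t), P(t), K(t)} is assembled and how the instant reward r(t) is returned; it assumes the hypothetical sinr() and network_reward() helpers from Section 2 and uses the global mean energy as a stand-in for the per-node neighbor average E_ave.

import numpy as np

class WsnEnv:
    # Minimal MDP wrapper: state = [SINR, threshold indicator H, powers P, channels K].
    def __init__(self, gains, energy, gamma_th=5.0, beta=1.0):
        self.gains = np.asarray(gains, float)
        self.energy = np.asarray(energy, float)
        self.gamma_th, self.beta = gamma_th, beta

    def observe(self, powers, channels):
        gamma = sinr(powers, self.gains, channels)
        h = (gamma >= self.gamma_th).astype(float)          # indicator h_i(t)
        return np.concatenate([gamma, h,
                               np.asarray(powers, float),
                               np.asarray(channels, float)])

    def step(self, powers, channels):
        # Apply the action [P(t), K(t)] and return (next state, instant reward).
        gamma = sinr(powers, self.gains, channels)
        e_ave = np.full_like(self.energy, self.energy.mean())   # simplification
        reward = network_reward(powers, self.gains, self.energy, e_ave,
                                gamma, self.gamma_th, self.beta)
        return self.observe(powers, channels), reward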

3.2. Power Control and Channel Allocation Based on DDPG

The strategy π determines the action that the agent performs; it maps each state to a probability distribution over actions, and p_a denotes the state transition probability. The goal of RL is to obtain the optimal strategy π*. Following π*, the maximum cumulative reward G = \sum_{t=1}^{T} \lambda^{t-1} r(t) can be obtained, where λ is the discount factor. The state–action value function V^π(s_t, a_t) represents the expected cumulative reward obtained by the agent taking action a_t in the current state s_t and following the strategy π thereafter:
V^{\pi}(s_t, a_t) = \mathbb{E}\left[\,G_t \mid s = s_t,\ a = a_t\,\right]
The purpose of RL is to find an optimal strategy π* that satisfies V^{π*}(s_t, a_t) ≥ V^π(s_t, a_t) for any strategy π. Q-learning maintains an internal Q(s_t, a_t) table, which stores the cumulative discounted reward for performing action a_t in state s_t [25]. By interacting with the environment and using its feedback, the agent continuously updates the Q table online and finally obtains the optimal strategy. According to the Bellman equation [22], the Q value is updated as follows:
Q(s_t, a_t) \leftarrow (1 - l_e)\, Q(s_t, a_t) + l_e \left(r_t + \lambda \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right)
where l_e is the learning rate.
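As a reference point before the deep variants, the tabular update above can be written as a one-line rule. The dictionary-based Q table and the enumerable action set here are simplifications that only work for small, discrete problems.

def q_update(Q, s, a, r, s_next, actions, lr=0.1, gamma=0.99):
    # Q(s,a) <- (1 - lr) * Q(s,a) + lr * (r + gamma * max_a' Q(s', a'))
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1.0 - lr) * Q.get((s, a), 0.0) + lr * (r + gamma * best_next)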
In farmland, the transmission environment is complex and the number of nodes is large, so the state and action spaces grow exponentially and it becomes impractical to find the optimal strategy by looking up a Q table. Therefore, a deep neural network (DNN) is introduced into the RL framework to form a deep Q-network (DQN) [26]. Through online RL learning and offline DNN training, a DQN can effectively cope with the explosion of the state space.
The RL process generates the training data, and the DNN is trained offline to approximate the optimal Q values. The agent selects actions based on the Q values output by the DNN, yielding the following optimal selection strategy:
\pi^*(s) = \arg\max_{a_t} Q^*(s_t, a_t \mid \theta)
where Q*(s_t, a_t|θ) is the optimal Q value approximated by the DNN and θ denotes the parameters of the main network. To stabilize Q*(s_t, a_t|θ), the error with respect to a target Q-network must be calculated. The approximate target value y_t is defined as:
y_t = r_t + \lambda\, Q\!\left(s_{t+1}, \pi^*(s_{t+1}) \mid \theta'\right)
where y_t is the target value and θ' is the parameter of the target Q-network, which has the same structure as the main DNN. In each training step, the neural network parameters are updated by minimizing the loss function:
L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t \mid \theta)\right)^2\right]
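A minimal PyTorch-style sketch of the target value y_t and the loss L(θ) is given below. The main network q_net, the frozen target network target_net, and the replay batch are placeholders, and the action space is assumed to be a small discrete set so that the maximum can be taken over the network outputs.

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: states s, integer action indices a, rewards r, next states s'
    s, a, r, s_next = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s_t, a_t | theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values    # target value y_t
    return F.mse_loss(q_sa, y)                                  # L(theta) = E[(y_t - Q)^2]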
The agent first collects network environment information to obtain the current state s_t. An action a_t is then selected according to the greedy strategy, which determines the channel allocation and power control in the WSN. After the agent performs the selected power and channel allocation action, it obtains the immediate reward r_t while the network transitions to the next state s_{t+1}. The experience tuple (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and the offline training data are generated through continuous interaction.

The Improved DDPG with Mixed Variables

The DQN method above can handle only discrete, low-dimensional action spaces and cannot be applied directly to continuous actions, because it has no way to output every continuous action. A simple remedy is to discretize the action space, but the number of actions then grows exponentially with the degrees of freedom, which is unrealistic for most tasks. A DQN therefore cannot effectively handle continuous transmitted-power control in agricultural WSNs.
DDPG is based on the actor–critic framework [27]. It combines the experience replay and double-network structure of DQN to improve the deterministic policy gradient (DPG) algorithm [28]. The actor network generates the agent's behavior strategy, while the critic network evaluates the action and guides its update. In DDPG, an actor network with parameters θ^π and a critic network with parameters θ^Q compute the deterministic strategy a = π(s|θ^π) and the action value function, respectively. Both the actor and the critic have an evaluated (online) network and a target network with identical structures, and the target network parameters are updated from the evaluated network at a certain frequency. The loss function of the critic evaluated network is:
L(\theta^Q) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t \mid \theta^Q)\right)^2\right]
where y_t = r_t + \lambda\, Q(s_{t+1}, \pi^*(s_{t+1}) \mid \theta^{Q'}). Q(s_t, a_t|θ^Q) is the action value computed by the critic evaluated network for state s_t and action a_t, and y_t is the target action value computed by the critic target network from the sampled transitions. The loss function of the actor evaluated network is:
L(\theta^{\pi}) = -\mathbb{E}\left[Q(s_t, a_t \mid \theta^Q)\right]
Gradient descent is applied to minimize the loss L(θ^π), which is equivalent to maximizing the action value. The parameters of the critic target network and the actor target network are updated as follows:
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}
\theta^{\pi'} \leftarrow \tau\,\theta^{\pi} + (1-\tau)\,\theta^{\pi'}
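One training step covering the critic loss, the actor loss, and the soft target update can be sketched as follows in PyTorch. The four networks and the two optimizers are assumed to be ordinary nn.Module and optim objects built elsewhere; this is an illustrative sketch of the standard DDPG update, not the authors' implementation.

import torch
import torch.nn.functional as F

def ddpg_step(actor, critic, actor_t, critic_t, a_opt, c_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch
    # Critic: minimize (y_t - Q(s_t, a_t | theta_Q))^2 with y_t from the target networks.
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next)).squeeze(1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(1), y)
    c_opt.zero_grad(); critic_loss.backward(); c_opt.step()
    # Actor: minimize -Q(s_t, pi(s_t | theta_pi) | theta_Q), i.e. maximize the action value.
    actor_loss = -critic(s, actor(s)).mean()
    a_opt.zero_grad(); actor_loss.backward(); a_opt.step()
    # Soft update: theta' <- tau * theta + (1 - tau) * theta'.
    for target, online in ((critic_t, critic), (actor_t, actor)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)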
To obtain more reward from the actions output by the agent, we add an exploration mechanism to the action selection:
a_t = \pi(s_t \mid \theta^{\pi}) + noise_t
noise_t is random noise. Since the DDPG output contains both the WSN's continuous variables (transmitted power) and integer variables (selected channel), this mechanism is used to enhance the exploration of the algorithm. The random noise follows a normal distribution:
noise_t \sim N(\mu, \sigma_t^2)
The expected value of the noise is μ = 0, and the variance σ_t depends on the iteration number: as the number of iterations increases, σ_t gradually decreases. For nodes with large residual energy, the exploration noise should be increased to avoid falling into a local optimum; for nodes with little residual energy, reducing the exploration noise and searching the actions precisely is more conducive to transmitted-power control. Therefore, this paper adopts an exploration noise variance oriented to residual energy:
\sigma_0 = \frac{n_{initial}}{E_0}\,(E - n_{offset})
\sigma_t = (1 - K_{decay})\,\sigma_{t-1}
where E is the set of node energies, n_initial and n_offset are constants that adjust the initial noise variance, and K_decay is the variance decay rate.
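The residual-energy-dependent exploration noise can be sketched as a per-node variance schedule; n_initial, n_offset, and K_decay are the constants from the text, and the default values used below are placeholders.

import numpy as np

def initial_noise_std(residual_energy, e0, n_initial=1.0, n_offset=0.0):
    # sigma_0 = (n_initial / E_0) * (E - n_offset): nodes with more residual energy explore more.
    return n_initial * (np.asarray(residual_energy, float) - n_offset) / e0

def decay_noise_std(sigma_prev, k_decay=0.01):
    # sigma_t = (1 - K_decay) * sigma_{t-1}
    return (1.0 - k_decay) * np.asarray(sigma_prev, float)

def noisy_action(action, sigma):
    # a_t = pi(s_t | theta_pi) + noise_t, with noise_t ~ N(0, sigma_t^2)
    return np.asarray(action, float) + np.random.normal(0.0, sigma, size=np.shape(action))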
The DDPG model is mainly used for continuous variable optimization, which suits power control; however, the channel allocation variables are integers. We therefore designed a hybrid DDPG algorithm, as shown in Figure 2.
In each iteration of the hybrid model, the environment is reset and a new state s_t is obtained. In the inner loop of each iteration, the actor first computes the action value a_t from s_t. The variables [P_{t+1}, K_{t+1}] in the optimization model are then calculated from a_t and s_t, and the action variables are mapped by the projection method to Γ([P_{t+1}, K_{t+1}]) within the feasible region defined by C1~C3; in this step the channel variables are rounded to integers. The action a_t is then executed to obtain the next state s_{t+1}, and the transition [s_t, a_t, r_t, s_{t+1}] is stored in the experience pool. Once the experience pool is full, small batches of training data are randomly sampled to train the target networks until the termination condition is met. The pseudocode of the hybrid DDPG algorithm is shown in Algorithm 1, and a code sketch of the projection step is given after it.
Algorithm 1 Power control and channel allocation based on hybrid DDPG.
1. Input: learning rate le, discount factor λ, variance decay rate Kdecay, initial noise variance, etc.
2. Initialization: randomly generate the parameters θQ, θQ′, θπ, θπ′, and set the experience pool (of size pool_size) to empty
3. for episode = 0 to Max_episode do
4.   Obtain the initial state s0 and set the initial reward r0 = 0
5.   for t = 0 to Max_step do
6.    Compute an action at = [Pt, Kt] from the input st
7.    Project the optimization variables into the feasible region (C1~C3) of Equation (3) and round the channel allocation matrix to integers, obtaining the modified action at = Γ([Pt+1, Kt+1])
8.    Execute action at in the environment, obtaining the reward rt and the next state st+1
9.    Store the state transition (st, at, rt, st+1) in the experience pool
10.   if the experience pool is full do
11.    Sample a small batch of M transitions (si, ai, ri, si+1) from the experience pool for training
12.    Obtain the target Q value yi from the critic target network
13.    Update the critic network by minimizing the loss
14.    Update the actor network through the policy gradient
15.    Update the target networks:
16.    θQ′ ← τθQ + (1 − τ)θQ′
17.    θπ′ ← τθπ + (1 − τ)θπ′
18.   end
19.  end
20. end
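The projection Γ used in step 7 of Algorithm 1 can be sketched as below: the continuous power entries of the raw actor output are clipped to [p_min, p_max] (constraint C2), and the channel entries are rounded and clipped to valid channel indices (constraint C1), while the SINR constraint C3 is handled through the reward rather than the projection. The split of the action vector into a power part followed by a channel part, and the absence of any rescaling of the tanh outputs, are assumptions of this sketch.

import numpy as np

def project_action(raw_action, n_nodes, n_channels, p_min=0.0, p_max=1.0):
    # Gamma([P, K]): clip powers, round channel entries to integer channel indices.
    raw_action = np.asarray(raw_action, float)
    powers = np.clip(raw_action[:n_nodes], p_min, p_max)                    # C2
    channels = np.clip(np.rint(raw_action[n_nodes:]), 0, n_channels - 1)    # C1
    return powers, channels.astype(int)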

4. Experiment and Analysis

In this section, the size of the field where the WSN was deployed was set to 40 × 40 m. The experiment was carried out on the fixed topology shown in Figure 3, where each node corresponds to a transmission link and the dotted circles indicate the communication range. The other experimental parameters are listed in Table 1; d is the Euclidean distance between communicating nodes.

4.1. Influence of Model Parameters on Network Performance

In this paper, the influence of different learning rates on the hybrid DDPG algorithm was tested. Figure 4 shows the actor–critic network structure. The actor network is a fully connected network whose input is the state s_t at time instant t and whose output is the action a_t; the ReLU function is used between the hidden layers and the tanh function in the output layer. In the critic network, the action matrix is fed into the second layer, which has 300 nodes. The state matrix is fed into the third layer, combined with the action features through a linear transformation, and passed to the fourth layer, which also has 300 neurons. Finally, the evaluation index Q of the current strategy is output.
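A PyTorch sketch of one way to realize the structure described above is given below. The hidden width of 300 neurons and the ReLU/tanh choices follow the text; the exact number of layers and the way the state and action are merged in the critic are assumptions, since Figure 4 is not reproduced here.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Fully connected actor: ReLU hidden layers, tanh output in [-1, 1]
    # (scaling to the actual power/channel ranges is left to the caller).
    def __init__(self, state_dim, action_dim, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Critic: the action enters first, the state is merged at a later layer, then Q is output.
    def __init__(self, state_dim, action_dim, hidden=300):
        super().__init__()
        self.action_in = nn.Linear(action_dim, hidden)
        self.merge = nn.Linear(hidden + state_dim, hidden)
        self.q_out = nn.Linear(hidden, 1)

    def forward(self, state, action):
        x = torch.relu(self.action_in(action))
        x = torch.relu(self.merge(torch.cat([x, state], dim=1)))
        return self.q_out(x)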
As can be seen from Figure 5, the algorithm converges for learning rates of 1 × 10−1, 1 × 10−2, and 1 × 10−3. Owing to the high exploration probability, in the early stage the algorithm mainly collects environmental samples by exploring, so it is difficult to obtain good transmitted power and channel allocation results and the reward is low. After 15 episodes, as samples accumulate, the relationship between the reward function and the action strategy is gradually established and the algorithm converges quickly; the WSN adjusts the transmitted power and channel toward the optimal network performance. However, when the learning rate is large, convergence is not stable. When the learning rate is 1 × 10−3, the episode reward is clearly smaller than the others and the algorithm easily falls into a local optimum. As shown in the enlarged part of Figure 5, with a learning rate of 1 × 10−2 the algorithm converges smoothly and obtains a relatively good reward. Therefore, a learning rate of 1 × 10−2 was adopted for the comparison experiments.

4.2. Network Performance Analysis

To verify the performance of the proposed algorithm, we used the RL method of [20] and the HPSO algorithm of [15] to jointly optimize the transmitted power and channel under the same experimental conditions. With a learning rate of 1 × 10−2, the number of links was increased evenly from 12 to 20. Figure 6 shows the average power, power variance, SINR, and network reward of the three algorithms when the number of network channels is 5 and the number of links is 12~20. RL has the largest average power, while our algorithm has the smallest, close to 0.0483 W. The transmitted power is positively correlated with the network SINR. The SINR obtained by our algorithm in Figure 6c is not the best, but it changes relatively smoothly as the number of links varies, and an SINR larger than the threshold is still obtained as the number of links increases; compared with the other two algorithms, this is a clear advantage. In a WSN, data transmission between links is the main source of energy consumption: the lower the average transmitted power, the lower the energy consumption, so the proposed algorithm helps prolong the network lifetime. Besides the smaller average power, the power variance of our algorithm in Figure 6b is also much smaller than that of the other two algorithms, close to 2.67 × 10−5, so its energy consumption is more balanced. According to Figure 6d, the algorithm maintains a good network reward for different numbers of links. Therefore, our algorithm not only consumes less energy but also guarantees network quality during data transmission and balances the energy consumption among nodes. To verify the universality of the algorithm, 20 cases were randomly generated and DDPG training was performed on each. The nodes were uniformly randomly distributed in the simulation area, so every node was equally likely to fall at each position.
It can be seen from Table 2 that the network performances are similar to those in Figure 6 when the number of links is 20, which shows the universality of our algorithm. In addition, the proposed algorithm consistently outperforms the comparison algorithms in average power and average reward, and the average values are close to those obtained with 20 links, confirming that the comparison of simulation results under the same topology is valid. We also compared the performance of the algorithms with 30~100 nodes (i.e., 8 cases) randomly distributed in the farmland, as shown in Figure 7.
As the number of nodes increases, the transmitted power increases. However, since the number of channels is unchanged, more nodes means more interference between them, and the SINR shows a downward trend. In contrast, our algorithm achieves better average power, power variance, and average reward in networks of different scales. For different numbers of sensor nodes, the proposed algorithm has a clear optimization effect on the network and obtains stable solutions.
Keeping the parameters unchanged with 20 links, the number of channels was set to 1, 2, 3, 4, and 5. The average power, power variance, SINR, and network reward obtained by simulation for the different numbers of channels are shown in Figure 8. It can be seen from Figure 8d that when the number of channels increases, the transmitted power of our algorithm increases slightly. This is because the interference between nodes is reduced and the network SINR is easier to improve, so a small increase in transmitted power can significantly improve the first part of the reward function. As the number of channels grows, our algorithm obtains an increasingly better SINR. The three algorithms improve the SINR to different degrees: although the SINR of our algorithm is smaller than that of the other two, its transmitted power is kept in a small range, around 0.0495 W. Compared with the RL and HPSO algorithms, it reduces network energy consumption. Therefore, with the links used in this paper, the algorithm can still reduce network energy consumption while ensuring a certain communication quality.
For different numbers of links and channels, the proposed algorithm has a clear advantage in network reward over the comparison algorithms. It can be seen from Figure 8 and Figure 9 that the algorithm integrates influencing factors such as channel allocation and power control, improves the SINR, and reduces transmission energy. By using the residual network energy and interference to compute the network profit, it adjusts the node transmitted power, and its power variance and energy consumption balance are better than those of the other two algorithms.
Figure 9 shows the channel allocation results of the three algorithms when the number of links is 20 and the number of channels is 5. The number next to each link is the link ID, and the legend indicates the channel selected for each link. Because many variables are involved (the transmitted power and transmission channels of multiple nodes), the search space is large. The HPSO algorithm easily falls into a local optimum and allocates the channels unevenly, leaving channel 2 underutilized. The RL algorithm achieves a more uniform channel allocation but cannot optimize the power variables continuously, resulting in higher average transmitted power and energy consumption. The proposed algorithm makes each link work evenly on different idle channels while keeping the transmitted power small, realizing the joint optimization of transmitted power and channel allocation.
The algorithms were simulated in MATLAB on an Intel® Core™ i7-9700U CPU with 16 GB RAM. For each network, our algorithm requires model training; the training time was about 132 min. Once the agent training converged, the model parameters were fixed, and the approximately optimal solution was calculated in one step by matrix multiplication, with a decision complexity of about O(|U|^2), where |U| is the number of network links. The test time was only 4.15 s, whereas the running times of HPSO and RL were 161 s and 107 s, respectively. The trained model can meet the requirements of a farmland WSN, but once the network layout or the number of nodes changes, the model must be retrained. Enhancing the scalability of the algorithm is therefore a goal of further research.

5. Conclusions

The energy consumption of an agricultural WSN is dominated by data transmission. By analyzing the influence of transmitted power and channel on communication quality and residual energy, a joint optimization method for power control and channel allocation was studied to balance network energy efficiency. We designed an optimization algorithm based on a hybrid DDPG. To give links with less energy a more appropriate transmitted power, we made the action noise depend on the network's residual energy. In addition, since the transmitted power is a continuous variable and the selected channel is an integer variable, this paper makes the DDPG model applicable to mixed variables by standardizing the action space. The simulation results demonstrate that the proposed algorithm effectively reduces network power, increases the network SINR, and improves the network energy balance.
Different WSN performance indicators conflict with each other, and it is difficult to adapt them to a dynamic agricultural sensing network through manual weighting. How to adaptively adjust the weights of different network performance indicators according to network quality, so as to further optimize the utilization of agricultural WSN resources, is a topic for future work. Considering the long training time of the DDPG algorithm, optimizing the joint control algorithm to improve its scalability is also a research goal.

Author Contributions

Conceptualization, X.H. and H.W.; methodology, X.H. and H.W.; validation, H.Z. and C.C.; formal analysis, C.C.; investigation, H.Z.; writing—original draft, X.H., H.W. and H.Z.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61871041, the Beijing Municipal Science and Technology Project, grant number Z191100004019007 and the China Agriculture Research System of MOF and MARA, grant number CARS-23-C06.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous reviewers and editors for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cheng, L.J.; Zhang, Y.B. Analysis of intelligent agricultural system and control mode based on fuzzy control and sensor network. J. Intell. Fuzzy Syst. 2019, 37, 6325–6336.
2. Somov, A.; Shadrin, D.; Fastovets, I.; Nikitin, A.; Matveev, S.; Oseledets, I.; Hrinchuk, O. Pervasive Agriculture: IoT-Enabled Greenhouse for Plant Growth Control. IEEE Pervasive Comput. 2018, 17, 65–75.
3. Musat, G.-A.; Colezea, M.; Pop, F.; Negru, C.; Mocanu, M.; Esposito, C.; Castiglione, A. Advanced services for efficient management of smart farms. J. Parallel Distrib. Comput. 2018, 116, 3–17.
4. Gong, C.; Guo, C.; Xu, H.T.; Zhou, C.C.; Yuan, X.T. A Joint Optimization Strategy of Coverage Planning and Energy Scheduling for Wireless Rechargeable Sensor Networks. Processes 2020, 8, 1324.
5. Jiang, D.D.; Wang, Y.T.; Han, Y.; Lv, H.B. Maximum connectivity-based channel allocation algorithm in cognitive wireless networks for medical applications. Neurocomputing 2017, 220, 41–51.
6. Yang, G.L.; Li, B.; Tan, X.Z.; Wang, X. Adaptive power control algorithm in cognitive radio based on game theory. IET Commun. 2015, 9, 1807–1811.
7. Mehmood, K.; Niaz, M.T.; Kim, H.S. A Power Control Mean Field Game Framework for Battery Lifetime Enhancement of Coexisting Machine-Type Communications. Energies 2019, 12, 3819.
8. Gui, J.S.; Hui, L.H.; Xiong, N.X. A Game-Based Localized Multi-Objective Topology Control Scheme in Heterogeneous Wireless Networks. IEEE Access 2017, 5, 2396–2416.
9. Du, Y.W.; Gong, J.H.; Wang, Z.M.; Xu, N. A Distributed Energy-Balanced Topology Control Algorithm Based on a Noncooperative Game for Wireless Sensor Networks. Sensors 2018, 18, 4454.
10. Najeh, S.; Bouallegue, A. Game theory for SINR-based power control in device-to-device communications. Phys. Commun. 2019, 34, 135–143.
11. Li, Y.Z.; Mehr, A.S.; Chen, T.W. Multi-sensor transmission power control for remote estimation through a SINR-based communication channel. Automatica 2019, 101, 78–86.
12. Hao, X.C.; Liu, J.S.; Xie, L.X.; Chen, B.; Yao, N. Power control and channel allocation optimization game algorithm with low energy consumption for wireless sensor network. Chin. Phys. B 2018, 27, 080102.
13. Shokrnezhad, M.; Khorsandi, S. Joint power control and channel assignment in uplink IoT networks: A non-cooperative game and action based approach. Comput. Commun. 2018, 118, 1–13.
14. Shokrnezhad, M.; Khorsandi, S. A decentralized channel assignment and power control approach using noncooperative game and Vickrey auction in relay-aided MC-CDMA IoT networks. Trans. Emerg. Telecommun. Technol. 2019, 30, e3543.
15. Xu, J.; Guo, C.C.; Zhang, H. Joint channel allocation and power control based on PSO for cellular networks with D2D communications. Comput. Netw. 2018, 133, 104–119.
16. Zhou, L.; Wu, Y.C.; Yu, H.F. A Two-Layer, Energy-Efficient Approach for Joint Power Control and Uplink-Downlink Channel Allocation in D2D Communication. Sensors 2020, 20, 3285.
17. Chen, H.X.; Xia, X.L.; Shuo, L.J.; Bai, C.; Ning, Y.; Yuan, W.L. Topology Control Algorithm and Channel Allocation Algorithm Based on Load Balancing in Wireless Sensor Network. Ad-Hoc Sens. Wirel. Netw. 2018, 40, 191–216.
18. Hao, X.C.; Liu, J.S.; Yao, N.; Xie, L.X.; Wang, L.Y. Research of Network Capacity and Transmission Energy Consumption in WSNs Based on Game Theory. J. Electron. Inf. Technol. 2018, 40, 1715–1722.
19. Chen, J.; Yu, Q.; Cheng, P.; Sun, Y.; Fan, Y.; Shen, X. Game Theoretical Approach for Channel Allocation in Wireless Sensor and Actuator Networks. IEEE Trans. Autom. Control 2011, 56, 2332–2344.
20. Zhao, G.F.; Li, Y.; Xu, C.; Han, Z.Z.; Xing, Y.; Yu, S. Joint Power Control and Channel Allocation for Interference Mitigation Based on Reinforcement Learning. IEEE Access 2019, 7, 177254–177265.
21. Asuhaimi, F.A.; Bu, S.R.; Klaine, P.V.; Imran, M.A. Channel Access and Power Control for Energy-Efficient Delay-Aware Heterogeneous Cellular Networks for Smart Grid Communications Using Deep Reinforcement Learning. IEEE Access 2019, 7, 133474–133484.
22. Ding, H.; Zhao, F.; Tian, J.; Li, D.Y.; Zhang, H.X. A deep reinforcement learning for user association and power control in heterogeneous networks. Ad Hoc Netw. 2020, 102, 102069.
23. Wang, Y.H.; Ye, Z.F.; Wan, P.; Zhao, J.J. A survey of dynamic spectrum allocation based on reinforcement learning algorithms in cognitive radio networks. Artif. Intell. Rev. 2019, 51, 493–506.
24. Xia, W.; Qi, Z. Power Control for Cognitive Radio Base on Game Theory. In Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, 21–25 September 2007.
25. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
26. Hu, B.; Yang, J.; Li, J.X.; Li, S.; Bai, H.T. Intelligent Control Strategy for Transient Response of a Variable Geometry Turbocharger System Based on Deep Reinforcement Learning. Processes 2019, 7, 601.
27. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
28. Konda, V.R.; Tsitsiklis, J.N. On actor-critic algorithms. SIAM J. Control Optim. 2003, 42, 1143–1166.
Figure 1. Schematic diagram of power control and channel allocation in an agricultural WSN.
Figure 2. Flowchart diagram of the hybrid DDPG.
Figure 3. The agricultural WSN topology.
Figure 4. Actor–critic network structure.
Figure 5. Algorithm convergence with different learning rates.
Figure 6. Network performance with different numbers of links.
Figure 7. Network performance with different node distributions.
Figure 8. Network performance with different numbers of channels.
Figure 9. Network channel allocation.
Table 1. Sensor node parameters.

Parameter | Value
Spreading gain G | 500
Maximum transmitted power | 1 W
Path gain | 1/(1 + d)^4
SINR threshold | 5 dB
Maximum node energy | 50 J
Episodes | 90
Experience pool size | 1 × 10^6
Discount factor | 0.99
Table 2. Comparison of network performances under different random distributions.

Algorithm | Average Power (W) | Power Variance | Average SINR (dB) | SINR Variance | Average Reward | Reward Variance
Our algorithm | 0.045 | 3.57 × 10^−5 | 20.60 | 0.90 | 68.00 | 47.00
HPSO | 0.55 | 0.04 | 20.90 | 1.59 | 20.91 | 13.47
RL | 0.16 | 0.005 | 18.74 | 16.73 | 19.07 | 40.67
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
