3.3.2. Deep-Reinforcement Learning

In recent years, much attention has been given to deep reinforcement learning (DRL) in task offloading. Reinforcement learning (RL) is a branch of artificial intelligence in which an agent interacts with its environment and learns from two signals: reward and punishment, a punishment being a negative reward. In RL, learning is not based on a training dataset; instead, the agent interacts with the environment with no prior knowledge and receives immediate feedback based on its performance. The environment is modeled as a Markov Decision Process (MDP). In RL, an experience is defined as the triple (*s<sub>t</sub>*, *a<sub>t</sub>*, *r<sub>t</sub>*), where *s<sub>t</sub>*, *a<sub>t</sub>*, and *r<sub>t</sub>* are, respectively, the state, action, and reward at time *t*. The agent selects its action based on a policy, *π*(*s*). Q-learning is an off-policy algorithm that estimates the optimal action-value function, with guaranteed convergence under standard conditions. The Q-function update at a given time *t* is given by (3) [45]

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right), \tag{3}$$

where *α* is the learning rate, and *γ* is the discount rate. In RL, the agent tends to maximize the rewards. This concept is illustrated in Figure 5.
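The update in (3) can be made concrete with a short sketch. The 4-state chain environment, the ε-greedy exploration scheme, and all hyperparameter values below are illustrative assumptions, not taken from the cited works:

```python
import numpy as np

# Tabular Q-learning following Eq. (3): a minimal sketch on a toy MDP.
# The 4-state chain environment and hyperparameters are illustrative only.

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount rate, exploration rate

Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Toy dynamics: action 1 moves right, action 0 stays; reward 1 on reaching the last state."""
    s_next = min(s + a, n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy behavior policy derived from Q
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == n_states - 1:
            break

print(np.argmax(Q, axis=1))  # greedy action per state after learning
```

After training, the greedy policy extracted from the Q-matrix moves right in every non-terminal state, illustrating how the agent maximizes the discounted reward without a training dataset.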

**Figure 5.** Principles of DRL.

Q-learning-based offloading algorithms have been devised in many published reports, such as [11,43,44,46–48].

In [46], the authors devised a dynamic computation-offloading strategy for an MEC system using Markov decision process theory. The authors considered IoT devices with energy-harvesting capabilities. Optimal offloading is achieved using a low-complexity after-state learning method.

The problem of task offloading in the context of MEC has been formulated in [47] as a constrained Markov decision process (CMDP). The authors applied Lagrangian primal-dual optimization and devised a deep-reinforcement learning algorithm to solve the relaxed CMDP.

Dynamic computational offloading for MEC systems with EH-enabled IoT devices considering multiple offloading servers has been studied and solved in [48]. The authors elaborated an offloading algorithm using deep Q-learning techniques.
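In deep Q-learning, the Q-matrix of (3) is replaced by a function approximator trained toward the same Bellman target. The tiny linear "network", the state dimension, and the example transition below are illustrative assumptions, not the architecture used in [48]:

```python
import numpy as np

# Deep Q-learning sketch: the tabular update of Eq. (3) becomes a
# semi-gradient step on an approximator Q(s, a) = (W s)[a].
# The linear model and dimensions are illustrative assumptions only.

rng = np.random.default_rng(1)
state_dim, n_actions = 3, 2
W = rng.normal(scale=0.1, size=(n_actions, state_dim))  # approximator weights

alpha, gamma = 0.01, 0.9  # learning rate, discount rate

def q_values(s):
    return W @ s

def dqn_step(s, a, r, s_next, done):
    """One semi-gradient update toward the Bellman target r + gamma * max_a' Q(s', a')."""
    global W
    target = r if done else r + gamma * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]
    W[a] += alpha * td_error * s  # gradient of Q(s, a) w.r.t. W[a] is s

# One illustrative experience (s, a, r, s')
s = np.array([1.0, 0.0, 0.0])
s_next = np.array([0.0, 1.0, 0.0])
before = q_values(s)[0]
dqn_step(s, 0, 1.0, s_next, done=False)
after = q_values(s)[0]
print(after - before)  # the estimate moves toward the target
```

In a full deep Q-network, W would be a multi-layer network updated by stochastic gradient descent over a replay buffer of experiences; the target computation, however, is exactly the one shown here.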

Hardware implementation of the Q-learning algorithm has received scant attention. Most of the reported implementations focus on designing an accelerator using FPGA technology. The lowest reported power consumption is 37 mW, for a Q-matrix of eight states by four policies [49].
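The small memory footprint of such a Q-matrix helps explain the low power figure. As a rough illustration, assuming a 16-bit fixed-point representation (a common hardware choice; [49] does not specify the number format), the entire table fits in 64 bytes of on-chip memory:

```python
import numpy as np

# Hardware-oriented view of an 8-state by 4-action Q-matrix as in [49].
# The Q8.8 fixed-point format is an assumption for illustration;
# the cited work does not specify its number representation.

FRAC_BITS = 8  # Q8.8: 8 integer bits, 8 fractional bits

def to_fixed(x):
    """Quantize a float Q-value to 16-bit fixed point."""
    return np.round(np.asarray(x) * (1 << FRAC_BITS)).astype(np.int16)

def to_float(q):
    """Recover the float value from its fixed-point encoding."""
    return q.astype(np.float64) / (1 << FRAC_BITS)

Q = np.zeros((8, 4), dtype=np.int16)  # 32 entries
print(Q.nbytes)  # 64 bytes of storage for the whole table
```

With only 32 entries, the max and argmax operations of the Q-learning update can be computed combinationally in hardware, which is what makes such low-power accelerators feasible.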

The work of [41] considered task offloading with energy harvesting for an IoT MEC system. The offloading problem was formulated as a decentralized partially observable Markov decision process. The authors further reduced the computational complexity by searching for an approximate solution using a decentralized RL offloading algorithm. The results, obtained using MATLAB simulations, showed that the proposed algorithm reduces both the average delay and the average energy consumption.

To this end, it is crucial to identify the power consumption during the different activities of the end devices, in particular data processing, data transmission, and communication. In the following section, specific design considerations for task offloading are presented, which may influence the power consumption and the latency of the software and hardware components of IoMT systems. The main focus is on the choice of communication protocols for IoT devices with consideration of energy consumption.
