**2. Related Work**

IoT is one of the most promising technologies of the current era and relies on sensors to observe the physical world [19–21]. These technologies have expanded into real-time environments and support applications in governing their operations. Recently, many solutions have been presented to optimize transmissions and increase the accuracy of online data retrieval systems. The authors of [22] determined the energy resources required at the base station (BS) for IoT-enabled systems. Numerous agricultural sensors have been utilized in precision agriculture to continually monitor the field and communicate with the smart nodes. They presented a unique product density model for estimating the energy requirements of the BS. Additionally, a method for Improved Duty Cycling was provided that makes use of the residual energy parameter. The routing protocol proposed in [23] employs a region-based static clustering technique to efficiently cover the agricultural area while utilizing threshold-sensitive hybrid routing to send sensed data to the base station. In addition, the protocol uses fuzzy logic to select the optimal cluster head (CH) among all sensor nodes in a given round, minimizing node energy usage during each data transmission period. The energy-efficient protocol is compared with established benchmark protocols, such as energy-efficient heterogeneous clustering (EEHC), developed distributed energy-efficient clustering (DDEEC), and region-based hybrid routing (RBHR). The research and testing findings indicate that user-defined transmission thresholds substantially decrease the data transmission rate. Furthermore, the balanced employment of fuzzy logic, static clustering, and hybrid routing effectively reduces the energy consumption of sensor nodes throughout each data transmission round, thereby extending the network's total lifetime. In [24], the authors proposed PAwCOR, a distributed method for CH selection that uses node energy, latency, and congestion characteristics. Energy saving is accomplished via the use of nodes that are selected depending on sensing inaccuracy. PAwCOR delivers periodic data with the least possible delay by using multiple routes. Additionally, it fulfills the needs of non-delay-tolerant applications by utilizing service differentiation to prioritize time-critical data transmission. By allocating at least one route for both essential and routine data transfers, the suggested method improves on current protocols in terms of latency, average energy consumption, packet delivery ratio, and average residual energy, and thereby attains reliable transmission. The authors of [25] proposed CTEER, an energy-efficient routing protocol based on cluster trees, to address the fast energy loss experienced by ordinary nodes under the conventional static routing tree method. This protocol is a rendezvous-based method with a low-delay characteristic. As a result, the protocol is well-suited for time-critical applications, such as network live broadcast systems, automated railway operation systems, ticketing software, and intelligent home systems.
It creates a cross-routing tree in which the mobile sink serves as the central node. Clustering algorithms are used to group the ordinary nodes and aggregate the data packets based on the routing tree. The suggested approach outperforms RRP in terms of network lifecycle, energy consumption, and data latency. The authors of [26] proposed a deep-reinforcement-learning-based quality-of-service (QoS)-aware secure routing protocol (DQSP). It aims to ensure the QoS and to extract knowledge from the traffic history by interacting with the observed environment. Moreover, the protocol optimizes the routing policies. It achieves significant improvement under different network metrics and has shown high convergence and effectiveness. The authors of [27] presented QL-MAC, which is based on Q-learning; it iteratively tunes the MAC parameters through a trial-and-error process and attains energy-efficient communication. It solves the minimization problem without a predetermined system model and provides a self-adaptive protocol in case of topological changes or other external events. It readjusts the duty cycle of the nodes and explicitly minimizes the energy consumption. Large-scale simulation experiments demonstrate its efficacy over other schemes.

IoT and sensor technologies play a major role in the development of smart communication. Sensors are widely used in different applications, including remote operations, to observe data and respond in a timely manner [28–30]. However, they are constrained in terms of resources, which limits the online services of IoT networks. Moreover, transporting sensitive data from network devices towards the data centers is another important requirement for any IoT-enabled system. Different solutions have been discussed to improve the energy consumption and QoS parameters by using artificial intelligence and machine learning techniques for D2D communication; however, most of the reinforcement learning solutions do not consume resources optimally, especially in the routing phase for mobile devices. In addition, they are not able to cope with the dynamic evaluation of routing links, and in such cases, sensors' data are frequently dropped. Moreover, a few solutions are still vulnerable to external attacks and are not able to maintain data security under mobile nodes. Such solutions cannot provide a robust mutual authentication system, and as a result, communication performance is non-collaborative and uncertain. Table 1 summarizes the discussion of the existing solutions.


**Table 1.** Summary of related discussion.

#### **3. Proposed Multi-Criteria Learning Algorithm Using Secured Sensors**

Sensors integrated with IoT objects are utilized in different domains to gather data and support the community through a smart communication system. The IoT network handles the data collection process and assists the end-users in observing and optimizing the transmission based on environmental conditions. In this section, we present the details of the proposed algorithm and its working flow.

The proposed algorithm is comprised of two stages. In the first stage, D2D authentication is performed, and afterward, the optimal forwarding tables are established using a machine learning approach. The forwarding tables are updated based on the network conditions, which decreases the overhead of determining the optimal routes. The second stage provides trustworthy forwarding in terms of privacy and integrity from the observed field to the network applications. In this stage, the proposed algorithm ensures the accuracy of the collected data and reduces the number of attacks from unknown devices. Additionally, the proposed algorithm imposes a low computing cost and little data diverting while ensuring security between mobile devices, with nominal communication delays. Figure 1 illustrates the development flow of the proposed algorithm.

The contributions of the proposed algorithm are as follows:


thentic and verifiable sessions between devices, gateways, and sink nodes with low-security costs.

**Figure 1.** Development flow of the proposed algorithm.

#### *3.1. D2D Authentication with Multi-Criteria Reinforcement Learning*

In the beginning, the devices build a table containing their neighbor information, which is saved in their memories. We consider that the devices are mobile and that they advertise their current address when they are away from their home network. In the table, each device maintains the neighbors' information, such as identity, *id*, distance, *di*, residual energy, *ei*, and radio coverage limit, *CRi*, to next-level nodes. Moreover, as the devices are mobile, the proposed algorithm initiates the process of authentication using gateways, *wi*, by utilizing the session keys, *Ks*. All the nodes are required to distribute tokens, *Tk*, at the beginning of data forwarding; each token consists of the identity, a timestamp, and the positioning coordinates. Additionally, the token sent from device *x* to device *y* is encrypted using the obtained session key. The session keys are temporary and tied to a specific authentication process, and when the positioning coordinates of a device change, the generated keys are revoked. Afterward, device *x* has to obtain a new session key from the nearby gateway to communicate with its peer devices. Each device sends a request with its id to the nearest gateway for mutual communication with a peer device. Upon receiving this information, the gateway constructs a record inside its table and generates a symmetric key, *sK*, which it delivers to the peer devices over the secured channel. Later, both devices apply an encryption function, *e*, to securely transmit the data packets, *mi*, as defined in Equations (1) and (2):

$$w\_i \to x : e\_{sK}(m\_i) + d' \tag{1}$$

$$w\_i \to y : e\_{sK}(m\_i) + d' \tag{2}$$

where *d′* denotes the digital signature. On the receiving side, the devices first verify the validity of the encrypted blocks using the digital signature, and afterward, the peer nodes perform a decryption function to recover the data packets. In the proposed algorithm, each device updates the information in the constructed table and adds an entry for the authorized device as well. If any device is found to be faulty, its entry is removed from the table by the source device.
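The token exchange described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Gateway` class, `seal_token`, `open_token`, and the SHA-256 counter-mode keystream are all hypothetical stand-ins, since the source does not name a concrete cipher or signature scheme. HMAC-SHA256 plays the role of the signature *d′*, and a change of coordinates revokes the session key.

```python
import hashlib
import hmac
import json
import secrets
import time


class Gateway:
    """Hypothetical gateway that keeps one session-key record per device pair."""

    def __init__(self):
        # (device_x, device_y) -> (session key, position of x when the key was issued)
        self.table = {}

    def issue_key(self, x_id, y_id, x_pos):
        key = secrets.token_bytes(32)
        self.table[(x_id, y_id)] = (key, x_pos)
        return key

    def is_valid(self, x_id, y_id, x_pos):
        record = self.table.get((x_id, y_id))
        if record is None:
            return False
        if record[1] != x_pos:
            # The device moved, so the key is revoked and a new request is needed.
            del self.table[(x_id, y_id)]
            return False
        return True


def keystream(key, nonce, length):
    """SHA-256 in counter mode; an assumed stand-in for the unspecified cipher e."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def seal_token(key, device_id, position):
    """Build a token (identity, timestamp, coordinates), encrypt it, and tag it."""
    token = json.dumps({"id": device_id, "ts": time.time(), "pos": position}).encode()
    nonce = secrets.token_bytes(16)
    cipher = bytes(a ^ b for a, b in zip(token, keystream(key, nonce, len(token))))
    tag = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    return nonce, cipher, tag


def open_token(key, nonce, cipher, tag):
    """Verify the tag first, then decrypt, mirroring the order given in the text."""
    expected = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature check failed")
    plain = bytes(a ^ b for a, b in zip(cipher, keystream(key, nonce, len(cipher))))
    return json.loads(plain)
```

Verifying before decrypting follows the order stated in the text; a deployment would replace the keystream with a vetted authenticated cipher.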

Most of the solutions [31,32] utilize multiple parameters for data aggregation and for routing the data in the network system. The proposed algorithm uses the concept of multi-criteria evaluation for data aggregation and for optimizing the learning procedure under constrained resources. The learning procedure also makes use of radio coverage, node mobility, and link cost to attain an energy-efficient and stable end-to-end communication system. In the proposed algorithm, each node obtains the information of its neighbors and utilizes the reinforcement learning technique to optimize the decision process with nominal resource consumption. The source node initiates the selection of the next hop based on the highest rank. This route rank, *R*(*i*), identifies the most suitable neighbor, *i*, for decreasing the communication delay, energy consumption, and data disturbance, as defined in Equation (3):

$$R(i) = re\_i + \left(\frac{1}{s\_i}\right) + CR\_i + \frac{1}{lcost\_{i,j}} \tag{3}$$

where *rei* is the residual energy, *si* is the speed, *CRi* is the radio coverage, and *lcosti,j* denotes the link cost from node *i* to node *j*. *lcosti,j* combines the packet reception ratio, *PRR*, and the average delay time, *avedtime*. To compute it, the source node distributes *n* probe packets in a fixed time interval, *t*, and as a result, the neighboring node *j* determines the value of *lcosti,j* for node *i*, as defined in Equations (4) and (5):

$$lcost\_{i,j} = \left(PRR\_{(i,j)} + \frac{1}{ave\_{dtime}}\right) + \frac{1}{d\_{err}} \tag{4}$$

$$ave\_{dtime} = \frac{(p\_n - p\_i)}{t} \tag{5}$$

where *pn* and *pi* denote the reception times of the last and first probe packets, respectively, *t* is the given time interval, and *derr* is the data error, used to measure the number of retransmissions.
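As a rough illustration of Equations (3) to (5), the route rank can be computed as below. The `Neighbor` record and its field names are hypothetical, and the sketch assumes that *pn* is the later reception time (so that the delay is positive) and that all inputs are already normalized to comparable scales, which the source does not specify.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    """Hypothetical per-neighbor record; field names are illustrative only."""
    node_id: str
    residual_energy: float   # re_i
    speed: float             # s_i, must be > 0
    coverage: float          # CR_i
    prr: float               # packet reception ratio on link (i, j)
    first_probe_time: float  # p_i
    last_probe_time: float   # p_n
    data_error: float        # d_err, must be > 0


def average_delay(n: Neighbor, interval: float) -> float:
    """Equation (5): (p_n - p_i) / t."""
    return (n.last_probe_time - n.first_probe_time) / interval


def link_cost(n: Neighbor, interval: float) -> float:
    """Equation (4): PRR + 1 / ave_dtime + 1 / d_err."""
    return n.prr + 1.0 / average_delay(n, interval) + 1.0 / n.data_error


def route_rank(n: Neighbor, interval: float) -> float:
    """Equation (3): re_i + 1 / s_i + CR_i + 1 / lcost."""
    return n.residual_energy + 1.0 / n.speed + n.coverage + 1.0 / link_cost(n, interval)
```

For example, a neighbor with `residual_energy=0.8`, `speed=2.0`, `coverage=0.5`, `prr=0.9`, probes received at 0.0 and 2.0, `data_error=4.0`, and an interval of 4.0 has an average delay of 0.5, a link cost of 3.15, and a route rank of roughly 2.117.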

The proposed algorithm utilizes reinforcement learning [33] for computing and selecting the routing states using network conditions and experience. The reinforcement learning model is comprised of agents, states, *S*, and a set of actions, *A*, per state. Using reinforcement learning, node *i* exploits the *R*(*i*) values and selects the next hop using the energy, speed, radio coverage, and link cost metrics. On receiving the data, the next hop recomputes the *R*(*i*) value and forwards the data through its selected routing states. This process continues for each neighbor selection until the network data are received at the sink node. Additionally, when device *i* needs to route data at time *t*0, it performs a set of actions and selects the neighbor node based on the computed route rank. The value of the route rank changes dynamically as the network and node statistics are evaluated. Later, device *i* gains a reward, *Rwd*, and enters the next state, i.e., (*S*, *a*, *Rwd*). A node has only a single reward value at any time. If a node has no reward value at any moment, it is not allowed to participate in the routing phase. On entering the next state, device *i* updates its forwarding table by adding the value of the reward. Moreover, the preceding device retrieves the updated information of device *i*. The proposed algorithm exploits this practice of reinforcement learning to find the most suitable routes for forwarding the IoT data towards the sink node. At the end of the learning period, the entries of the forwarding tables converge to a numeric value that indicates the optimal route from the source device to the sink node. Converged forwarding tables, together with the computation of the route rank, not only decrease unnecessary data diverting but also increase the packet reception ratio over the communication channels in the presence of malicious nodes.
Figure 2 illustrates the flow of reinforcement learning by exploiting the computed route rank. It uses the multiple criteria of the nodes to determine the rank value and assigns the reward accordingly. Based on the updated forwarding tables and reward values, the proposed algorithm converges and increases the route lifetime in terms of energy, speed, and link cost. The convergence level depends on the number of iterations until end-to-end routes are established with an efficient distribution of the constrained resources. Figure 3 shows the message flow for the selection of the next hop between the source node and its neighbors. The source node floods the route request packet within its radio coverage and identifies the nearest neighbors. If no reply is received, it resends the request packet. Once it has the list of neighbors, the process of data discovery is initiated, utilizing the node-level table to fetch the statistics. Based on the fetched data, the proposed algorithm computes the route rank using a multi-criteria process and assigns the reward value by exploiting reinforcement learning. The selected nodes then advertise their status for the connection in the routing phase, and the sensors' data are forwarded to the sink node.
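The selection-and-reward loop can be sketched as follows. The source does not give the exact reward rule, so this sketch makes two assumptions that should be read as illustrative only: the reward equals the observed route rank of the chosen neighbor, and the forwarding-table entry is moved towards the reward with a hypothetical learning rate `alpha`. All function names are illustrative.

```python
def select_next_hop(ranks, table):
    """Pick the neighbor with the highest route rank; ties fall back to the learned value."""
    return max(ranks, key=lambda n: (ranks[n], table.get(n, 0.0)))


def update_table(table, neighbor, reward, alpha=0.5):
    """Move the stored value towards the reward (assumed running-average update)."""
    old = table.get(neighbor, 0.0)
    table[neighbor] = old + alpha * (reward - old)
    return table[neighbor]


def learn(rank_rounds, alpha=0.5):
    """Run one selection and one table update per round of observed route ranks."""
    table = {}
    for ranks in rank_rounds:
        hop = select_next_hop(ranks, table)
        # Assumption: the reward is the route rank of the selected neighbor.
        update_table(table, hop, ranks[hop], alpha)
    return table
```

With stationary ranks, the entry of the selected neighbor approaches its rank geometrically, which gives one simple reading of the statement that forwarding-table entries converge to a numeric value over the learning period.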

**Figure 2.** Route rank using reinforcement learning.

**Figure 3.** Next-hop selection procedure.

The format of the node-level table is presented in Table 2.

**Table 2.** Node level information.


#### *3.2. Secured Data Transmission Using a Secured Session-Oriented Scheme*

The proposed algorithm offers secure IoT-enabled smart data routing by utilizing the exchange of session keys between devices, gateways, and the sink node. This process is comprised of two levels. In the first level, the devices and gateways exchange their session keys and obtain the cipher information over the insecure channel. In the second level, the session keys are shared between the gateway and the sink node. Furthermore, session keys have an expiration time and are revoked once it elapses. Because the devices are mobile, a device may also move to another communication range; in that case, the session key is revoked as well, and the device sends a new request to the nearest gateway to obtain a new session key and executes the authentication process again. The session keys are encrypted using the public key. Let (*ksi*)*n* denote the set of session keys. Then, data encryption, *E*, from the mobile network device *i* to the gateway *j* can be obtained as shown in Equation (8). Before this, device *i* and the gateway *j* perform an authentication function to validate the session key, as defined in Equation (6):

$$i \to j : E\left(ks\_i, \left[n\_i, t\_i\right]\right) \tag{6}$$

where *ti* is a timestamp and *ni* is a nonce, i.e., a random number. The message is encrypted using the symmetric key of mobile device *i*. On receiving the encrypted session key, the gateway *j* includes its own nonce, *nj*, along with the timestamp, *tj*, and sends back the confirmation message, as defined in Equation (7):

$$j \to i : E\left(ks\_i, \left[n\_j, t\_j\right]\right) \tag{7}$$

Accordingly, both devices on the network authenticate themselves, and now the network messages, *mi*, can be ciphered using the encryption function, as provided in Equation (8):

$$i \to j : \text{xor}\left(m\_i, ks\_i\right) \tag{8}$$

Finally, when the data are received by the gateways, they establish separate sessions with the sink nodes using Equations (6) and (7). Afterward, the device data, *M*, are forwarded to the sink node, *sink*, including the digital signature, *MAC*, of the gateway, computed with its private key, *Rj*, over the ciphered data, *E*[*mi*, *ksi*], as shown in Equation (9):

$$M(j, \text{sink}) = MAC(R\_j, E[m\_i, ks\_i])\tag{9}$$
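Equations (6) to (9) can be illustrated with the short sketch below. It rests on several assumptions that are not in the source: the session key is repeated to the message length for the XOR of Equation (8), HMAC-SHA256 with a shared gateway key stands in for the gateway's signature in Equation (9), and the five-second freshness window is arbitrary. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets
import time


def xor_cipher(message: bytes, session_key: bytes) -> bytes:
    """Equation (8): XOR the message with the session key, repeated as needed."""
    return bytes(m ^ session_key[i % len(session_key)] for i, m in enumerate(message))


def make_challenge():
    """Nonce and timestamp of the kind carried in Equations (6) and (7)."""
    return secrets.token_bytes(16), time.time()


def is_fresh(timestamp: float, now: float, max_age: float = 5.0) -> bool:
    """Reject stale or future-dated messages; max_age is an assumed window."""
    return 0.0 <= now - timestamp <= max_age


def gateway_forward(cipher: bytes, gateway_key: bytes):
    """Equation (9), with HMAC-SHA256 standing in for the gateway signature."""
    tag = hmac.new(gateway_key, cipher, hashlib.sha256).digest()
    return cipher, tag


def sink_receive(cipher: bytes, tag: bytes, gateway_key: bytes, session_key: bytes) -> bytes:
    """Check the tag, then undo the XOR to recover the device data."""
    expected = hmac.new(gateway_key, cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC mismatch")
    return xor_cipher(cipher, session_key)
```

Because XOR is its own inverse, the same `xor_cipher` call serves for encryption at the device and decryption at the sink. A signature made with a private key, as the text describes, would let the sink verify without sharing a secret with the gateway; the HMAC here only keeps the sketch within the standard library.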

Figure 4a,b describes the flowcharts of the proposed algorithm. Initially, the network services and mobile devices gather the network data from the smart environment. Network keys are generated for D2D authentication, and after their verification, the devices can take part in the routing. The proposed algorithm determines the value of the route rank based on the multiple criteria and updates the nodes' tables. Afterward, it utilizes reinforcement learning to assign rewards to the nodes. These rewards improve the training process, allowing the devices to extract the optimal neighbors from the set of choices and, accordingly, to obtain energy-efficient delivery paths with the lowest error rate. Moreover, the proposed algorithm also secures the sessions between the gateway and the sink node for data transfer. The gateway and the sink node establish secure sessions for their direct communication, and these sessions are valid for a fixed time interval. After the mutual authentication, the gateways interact with the sink node to forward the network data with nominal communication costs.

**Figure 4.** Flowchart of the proposed algorithm. (**b**) Session generation with mutual forwarding between the gateway and the sink node.

Figure 5 shows the flow of messages between the gateway and sink node for the establishment of a secure session with encrypted data transfer. In the beginning, the gateway device transmits the route request packet along with its *id* towards the sink node. Upon successful verification, the sink node acknowledges it, and later the gateway device requests the session key. If the time expires, the gateway device resends the request for the session key. Once the sink node receives the request for the session key, it generates the key and sends it towards the gateway device in encrypted form. The gateway device decrypts it and sends an acknowledgment message to the sink node that it has received the session key. The sink node confirms the acknowledgment message and afterward, both devices use the same session key for data encryption and decryption.
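The resend-on-timeout step of this message flow can be sketched as a simple retry loop. The `send_request` callable and the `max_attempts` cap are hypothetical; the source states only that the gateway resends the request when the time expires and does not give a retry limit.

```python
from typing import Callable, Optional, Tuple


def request_session_key(
    send_request: Callable[[], Optional[bytes]],
    max_attempts: int = 3,
) -> Tuple[bytes, int]:
    """Resend the session-key request until the sink answers or attempts run out.

    send_request is a hypothetical callable that returns the key, or None when
    the reply timer expired before the sink answered.
    """
    for attempt in range(1, max_attempts + 1):
        key = send_request()
        if key is not None:
            # The gateway would now acknowledge receipt to the sink.
            return key, attempt
    raise TimeoutError(f"no session key after {max_attempts} attempts")
```

Returning the attempt count alongside the key makes it easy to log how often the timer expired before the session was established.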

**Figure 5.** Message flow between the gateway and the sink node.

Algorithm 1 gives the pseudocode for the proposed work. It has two main components: one for the authentication of mobile devices, with the reinforcement learning technique used to assign the rewards, and the other for session-oriented data encryption from the mobile sensors towards the sink node. After the successful verification of the mobile sensors, the proposed algorithm evaluates the route rank of the neighbors using multiple parameters, along with the link cost. Accordingly, the neighbor with the highest route rank is assigned a reward value and selected as a forwarder. Moreover, the proposed algorithm also establishes a secure session from the mobile sensors towards the sink node using gateway services. In this case, only a node that holds a valid session key is allowed to send the route request to the sink node. The secure session key is utilized by the mobile sensor for data encryption and by the sink node for decryption.


## **4. Simulation Setup**

This section presents the simulation configuration used to evaluate the performance of the proposed algorithm. We compared the proposed algorithm with the CTEER [25] and QL-MAC [27] solutions in terms of energy consumption, packet delivery ratio, packet disturbance, data latency, and computational complexity. The experiments were performed under a varying number of rounds and a varying number of nodes using NS-3. Initially, the nodes have homogeneous energy levels of 5 joules. The transmission range was set to 10 m. We deployed a varying number of sensor nodes in a field of 300 m × 300 m with a static sink. The sensor nodes are mobile and equipped with GPS. Additionally, we assumed the number of malicious nodes to be 10. The data traffic between connected devices is Constant Bit Rate (CBR). We adopted the energy model discussed in [34,35]. Equations (10) and (11) define the energy consumed for transmitting and receiving data bits:

$$E\_{tx}(k,d) = \begin{cases} E\_{elect} \cdot k + k \cdot E\_{fs} \cdot d^2, & \text{if } d < d\_0 \\ E\_{elect} \cdot k + k \cdot E\_{amp} \cdot d^4, & \text{if } d \ge d\_0 \end{cases} \tag{10}$$

$$E\_{rx}(k) = E\_{elect} \cdot k \tag{11}$$

where *Etx* and *Erx* are the transmitting and receiving energy, *k* is the number of data bits, *d* is the distance between sensor nodes, *d*0 is the threshold distance, *Eelect* is the energy consumed per data bit, and the energy of the transmitting amplifier is denoted by *Efs* for distances below *d*0 and by *Eamp* otherwise. Table 3 lists the parameters of the simulation configuration.
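Equations (10) and (11) translate directly into code, as in the sketch below. The numeric constants are assumptions chosen only to make the example runnable (they are typical of first-order radio models and are not taken from Table 3), and the threshold distance is assumed to be the point where the two branches of Equation (10) meet.

```python
import math

# Assumed constants for illustration only; they are not the paper's Table 3 values.
E_ELECT = 50e-9      # J per bit for the transceiver electronics
E_FS = 10e-12        # J per bit per m^2, amplifier energy below d0
E_AMP = 0.0013e-12   # J per bit per m^4, amplifier energy at or above d0
D0 = math.sqrt(E_FS / E_AMP)  # assumed threshold distance, about 87.7 m


def tx_energy(k_bits: float, d: float) -> float:
    """Equation (10): energy to transmit k bits over distance d."""
    if d < D0:
        return E_ELECT * k_bits + k_bits * E_FS * d ** 2
    return E_ELECT * k_bits + k_bits * E_AMP * d ** 4


def rx_energy(k_bits: float) -> float:
    """Equation (11): energy to receive k bits."""
    return E_ELECT * k_bits
```

Under these assumed constants, sending a 1000-bit packet over the 10 m transmission range costs about 5.1e-5 J and receiving it costs 5.0e-5 J, so the electronics term dominates at short range.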

**Table 3.** Simulation configuration.

