1. Introduction
With the gradual increase in the demand for the development of marine resources and the rising frequency of various maritime activities, real-time sensing and monitoring of the marine environment, as well as the efficient communication of maritime equipment, have become crucial [
1,
2]. The characteristics of the Internet of Things (IoT), such as the comprehensive sensing, reliable transmission, and intelligent processing, are highly suitable for the requirements of marine environment monitoring and maritime communication. Consequently, the Ocean Mobile Internet of Things (OM-IoT) has gradually piqued the interest of researchers. The traditional OM-IoT mainly refers to Underwater Wireless Sensor Networks (UWSNs) composed of various sensor nodes in the target sea area [
3]. On the other hand, the generalized OM-IoT refers to a network that extends beyond traditional UWSNs, encompassing multiple areas and spaces. This network is established using new-generation information technologies like cloud computing, big data, and mobile Internet, and it is constructed across geographical areas, airspace, and sea areas [
4,
5].
A typical OM-IoT system is illustrated in
Figure 1. It comprises a wide range of sensor nodes, including ships, unmanned submersibles, and traditional underwater sensor nodes. The system integrates different types of IoT systems, such as shore-based networks, satellite networks, and UAV-assisted relays. The underwater segment of the OM-IoT network utilizes hydroacoustic communication as the primary mode of communication. However, due to the intricate and fluctuating oceanic environment, the hydroacoustic channel conditions are quite harsh. Furthermore, the movement of nodes, current movement, noise interference and signal collisions may result in the corruption of packets. Additionally, the transmission of packets along an incorrect path may also result in packet loss. Consequently, the issue of unreliable communication links underwater represents a significant challenge to underwater communications. Enhancing the reliability of underwater data transmission has been a prominent topic [
6,
7,
8]. In addition, there are a large number of mobile nodes in the OM-IoT that are influenced by environmental factors like the ocean currents and the technical characteristics of devices such as unmanned underwater vehicles (UUVs). These factors lead to unstable communication link quality between nodes, resulting in more serious issues such as packet corruption and loss. Therefore, the reliability of OM-IoT data transmission faces a significant challenge.
Researchers have conducted numerous studies to address this issue. Packet retransmission [
9] is one of the more representative methods for improving the reliability of underwater data transmission. When the receiver detects a missing packet, it will send a retransmission request to the transmitter. The transmitter will retransmit the missing packet based on the retransmission request until the missing packet is acknowledged as having been received. In a communication link with poor quality, the packet loss rate is high, leading to frequent data retransmissions. This situation results in the decreased data transmission efficiency of the system. In addition, under channel conditions with more severe noise interference, data packets are often transmitted incorrectly, making it challenging to achieve error correction through the typical retransmission mechanism. In order to ensure the reliability of data transmission, it must be combined with other error correction methods, which also inevitably introduce additional time costs and energy overheads. To address the shortcomings of the general retransmission mechanism, researchers propose another method to enhance the reliability of data transmission, namely redundant transmission [
10]. Redundant transmission can effectively reduce the interference of packet loss and erroneous transmission by transmitting the same packet multiple times. However, it inevitably increases the number of packets to be transmitted. In more complex network scenarios, general redundant transmission can easily cause network congestion and many other problems, reduce the efficiency of the system’s data transmission, and even affect the network lifetime.
Network coding (NC) [11] has been widely studied and applied in underwater data transmission in recent years as an effective way to improve network throughput. However, NC also faces challenges in scenarios with unreliable communication links. We take Figure 2 as an example to illustrate such problems. As shown in Figure 2a, node S sends packets to node D, transmitting one coded packet in each time slot. During time slots $t_1$ to $t_3$, the communication link between S and D is stable, and the coded packets are delivered in sequence. After receiving a coded packet, node D can decode it using the previously received data to recover a new packet. When the channel changes and the link quality deteriorates at the beginning of time slot $t_4$, the packet transmission fails. At this point, node D is unable to decode the coded packet from the previous time slot, which in turn disrupts the subsequent decoding process. In practice, a large number of packets need to be transmitted, so it is often necessary to encode and transmit them in batches. However, forming batches at random can lead to issues such as excessively high or low encoding and decoding complexity, coding redundancy, or missing packets. As illustrated in Figure 2b, when time slot $t_3$ commences, node S transmits a coded packet to node D. In the subsequent time slots, due to inappropriate coding combinations or an abnormal number of zero elements in the coding coefficient matrix, some coded packets may become undecodable, rendering the corresponding packets inaccessible to node D. Moreover, most network coding algorithms rely on random linear network coding (RLNC) [12]. Since the coding coefficients of each packet in RLNC are chosen at random from the finite field $GF(2^q)$, a poorly constructed coding coefficient matrix can also negatively impact data transmission. Therefore, optimizing network coding to achieve the adaptive selection of coding packets and coding coefficients is of great importance for enhancing the performance of underwater data transmission.
In this paper, we propose a data transmission method that integrates reinforcement learning and network coding to address data transmission over unreliable communication links in the OM-IoT and to adaptively optimize underwater network coding. The main contributions of this paper are as follows:
We establish a comprehensive Binary Erasure Channel (BEC) model for unreliable communication links affected by multiple factors, using the channel erasure probability to represent the packet loss resulting from various causes. We also propose a method to estimate the channel erasure probability, channel capacity, and other relevant metrics for the developed BEC model.
We address the issue of batch packet encoding with a method that dynamically adjusts the sliding coding window based on the channel conditions and decoding states. In addition, we introduce an adaptive optimization method for the sliding rule based on the Q-learning algorithm. The method batches packets for transmission and updates the packets to be encoded in real time by adjusting the sliding window size and sliding rules.
We address the excessive randomness of RLNC coding coefficients with a Deep Q-Network (DQN)-based adaptive optimization method. The method adaptively selects the coding coefficients based on the current packets in the window and historical coding information. This approach restricts the complexity of the coding and decoding operations, thereby enhancing the probability of decoding.
We enhance the greedy strategy in the reinforcement-learning algorithm by introducing a time-varying exploration probability to improve the algorithm's operational efficiency. In addition, we propose a sampling period optimization method based on the simulated annealing algorithm to improve the accuracy and timeliness of channel estimation.
The rest of this paper is organized as follows. Section 2 introduces the related work and reviews the research results in recent years. Section 3 describes the system model used in this paper, analyzes the underwater unreliable communication link problem, and establishes the BEC model. Section 4 explains the principles and details of the algorithms proposed in this paper, while Section 5 describes the simulation experiments and analyzes the results. Section 6 summarizes the conclusions of the research in this paper and anticipates future research endeavors.
2. Related Work
Researchers have conducted numerous studies on the application of network coding in UWSNs. Cai et al. [
13] proposed a reliable data transmission protocol for UWSNs based on twin paths and network coding. They established twin paths and transmitted shareable redundant packets to enhance the reliability of data transmission. Feng et al. [
14] introduced an asynchronous duty cycle and network coding MAC protocol for UWSNs. This protocol is based on an asynchronous duty cycle to determine the rendezvous time of the exchanged data. It also suggests a coding node selection strategy and network-coding algorithm to enable the coding and forwarding of packets. Kulhandjian et al. [
15] presented a CDMA-based analog network coding method for UWSNs. This method tackles the mutual interference of different packets at the relay nodes in unidirectional multihop networks by incorporating interference cancellation based on a priori information. In two-way relay networks, it exploits the superposition property of hydroacoustic signals to treat the received interfering packets as naturally coded packets and forward them. Hao et al. [16] proposed a partial network coding-based geographic routing protocol for UWSNs. This protocol employs partial network coding to encode the packets and, based on the positional information of the sensor nodes, adopts a greedy strategy to forward the encoded packets, reducing both the network delay and the transmission energy consumption. Wang et al. [
17] proposed an energy-efficient data transmission protocol based on network coding, hybrid auto-repeat request, and adaptive window size estimation algorithms to ensure the reliability and efficiency and optimize the trade-off between throughput and energy consumption for data transmission in UWSNs. Additionally, Wang et al. [
18] proposed a network coding-based cross-layer routing protocol for UWSNs that takes advantage of multicast transmission to jointly decode coded packets received from multiple potential nodes throughout the network and optimize the transmission power. Zhan et al. [
19] proposed a joint scheduling strategy for network coding and transmission in UWSNs to address coding and transmission conflicts. The solution uses a heuristic approach to resolve conflicts in a conflict-free graph, searching for the maximum independent set to minimize the number of transmission time slots. Zhao et al. [
20] introduced a network coding-aware opportunistic routing protocol and a sliding-window coding algorithm to enhance the data transmission robustness and reduce the decoding overheads in UWSNs. Su et al. [
21] suggested a hybrid coding-aware routing protocol for UWSNs, incorporating inter-flow network coding and a combination of aware and opportunistic routing. They also presented an encoding method that does not depend on opportunistic listening to leverage network encoding opportunities and optimize transmission overheads.
The rise of reinforcement learning (RL) [
22] has brought more possibilities to improve the performance of data transmission in UWSNs. Park et al. [
23] proposed a reinforcement learning-based medium access control protocol for UWSNs to solve the underwater time synchronization problem through asynchronous operation, to improve channel utilization by reducing the number of time slots per frame, and to achieve collision-free scheduling by employing a new random backoff scheme. Chang et al. [
24] proposed a reinforcement learning-based data-forwarding scheme for passive and movable UWSNs to enhance the data transmission performance of UWSNs. Di et al. [
25] proposed a multipath adaptive routing scheme for UWSNs based on channel-aware reinforcement learning by means of a distributed reinforcement-learning framework based on the different underwater channel conditions, adaptively switching between single-path and multipath routing modes to achieve the joint optimization of the routing energy consumption and packet delivery rate. Zhang et al. [
26] proposed a reinforcement learning-based opportunistic routing protocol for UWSNs that selects suitable nodes by comprehensively considering the nodes’ peripheral states. It introduces a recovery mechanism to minimize the impact of routing voids on the data transmission performance. Zhang et al. [
27] introduced a reinforcement learning-based relay selection algorithm for UWSNs, combining RL with a simulated annealing algorithm to enhance the algorithm’s performance. Ye et al. [
28] suggested a deep reinforcement learning-based medium access control protocol for underwater acoustic networks. This protocol maximizes the network throughput by exploiting time slots that are left idle because of propagation delays or that are unused by other nodes. In addition, researchers have begun to study the application of RL to NC optimization. Jadoon et al. [
29] proposed a relay selection algorithm based on Q-learning for cooperative networks employing spatio-temporal network coding. The proposed algorithm maximizes the total capacity of the network by learning the cooperative network environment. Gao et al. [
30] introduced an RL framework to enhance the network capacity through decoder feedback by dynamically adjusting the network coding parameters online to improve the network coding performance for multihop transmission under dynamic sparse network coding. Xiao et al. [
31] suggested a reinforcement learning-based network coding scheme for UAV-assisted secure wireless communication that selects network coding strategies based on the measured interference power, previous transmission performance, and channel loading. Their approach aims to reduce the interception probability, latency, outage probability, and energy consumption, thereby improving the anti-eavesdropping performance. Ali et al. [
32] proposed a reinforcement learning-based selective RLNC framework for the Tactile Internet, which uses network and receiver feedback to choose between block-based RLNC and sliding window-based RLNC so as to enhance the system's data transmission performance.
The aforementioned research is of great significance in improving the performance of OM-IoT data transmission and advancing the use of network coding in underwater networks. Introducing network coding into the OM-IoT can effectively improve network throughput and enhance data transmission reliability and communication efficiency. Underwater communication systems that integrate RL adapt better to the complex marine environment and can therefore transmit data more reliably in it. Optimizing network coding with RL enables the system to adaptively adjust the coding strategy according to its environment, making data processing and transmission within or across systems more intelligent. However, most of the aforementioned studies rely heavily on actual feedback from the receiver side, particularly for the optimization of coding coefficients. Over unreliable communication links, the absence of accurate feedback from the receiver to the sender reduces the adaptability of the network coding coefficients to the channel conditions, which in turn degrades the system's data transmission performance. Meanwhile, the prevailing remedy for erroneous transmission and packet loss caused by unstable link quality is data retransmission or redundant transmission, an approach that is prone to causing further complications such as network congestion. It is therefore essential to strike a better balance between the efficiency and the reliability of data transmission. Moreover, little research has investigated the integration of reinforcement learning and network coding for OM-IoT data transmission.
In order to address the aforementioned issues, this paper proposes a data transmission method for unreliable communication links in the OM-IoT. This method integrates reinforcement learning and network coding, and it is referred to as reinforcement learning-based adaptive network coding (RL-ANC). Firstly, the channel conditions are estimated based on the reception acknowledgment, the channel changes are tracked in real time, and a feedback-independent decoding state estimation method is proposed. Secondly, the sliding coding window is dynamically adjusted in accordance with the estimates of the probability of erasure and the probability of successful decoding. Subsequently, the sliding rule is adaptively determined using a reinforcement learning algorithm and an enhanced greedy strategy. An adaptive optimization method for coding coefficients based on reinforcement learning is proposed to enhance the reliability of underwater data transmission and underwater network coding while reducing the redundancy in coding. Finally, the sampling period and time slot table are updated using the enhanced simulated annealing algorithm to optimize the accuracy and timeliness of the channel estimation in real time. This optimization considers the convergence of the variance of the estimated channel erasure probability and the decoding probability of coded packets.
3. Theory Preparations
3.1. Ocean Mobile Internet of Things Model
In the OM-IoT, the nodes of the underwater network are primarily classified into two types: aggregation nodes, which float on the sea surface and facilitate communication between the underwater network and UAV relays, shore-based networks, and satellite networks, among others; and underwater sensor nodes, which consist of general ocean monitoring sensors and unmanned submarine vehicles that collect and transmit ocean monitoring data. To emphasize the principle and performance of the proposed algorithm, we disregard the inherent characteristics of the nodes that are not relevant to the algorithm's principle and do not significantly impact its performance. Therefore, in this paper, the underwater segment of the OM-IoT network depicted in Figure 1 is simplified as a three-dimensional stochastic network model, as illustrated in Figure 3. Assume that a set of OM-IoT nodes is randomly deployed in a finite 3D sea area. These nodes form the total node set $N$, with the total number of nodes $n(N)$. The location of node $N_i \in N$ ($i = 1, 2, \ldots, n(N)$) is described by the 3D coordinates $(x_i, y_i, z_i)$, where $x_i > 0$, $y_i > 0$, and $z_i < 0$. The depth of the node is $dep(N_i) = |z_i|$. During any data transmission, the set of source nodes is denoted as $S$, the set of destination nodes as $D$, and the set of intermediate nodes as $R$. Therefore, $(S \cup R \cup D) \subset N$.
As the node mobility and communication range have a significant impact on the performance of the algorithm, we introduce the movable node model [33], as shown in Figure 4. The velocity of a node $N_i \in N$ ($i = 1, 2, \ldots, n(N)$) is given by $v_i = [v_{xi}, v_{yi}, v_{zi}]$, where $v_{xi}$, $v_{yi}$, and $v_{zi}$ are the velocity components of $N_i$ in the $x$, $y$, and $z$ directions, respectively. The communication radius of node $N_i$ is denoted as $\varphi(N_i)$, and the necessary condition for nodes $N_i$ and $N_j$ to be able to transmit data is that $d(N_i, N_j) \le \min\{\varphi(N_i), \varphi(N_j)\}$, where $d(N_i, N_j)$ is the Euclidean distance between $N_i$ and $N_j$.
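The node model and the communication condition above can be illustrated with a short Python sketch. The class name `Node`, the helpers `distance` and `can_communicate`, and the constant-velocity `moved` method are hypothetical names introduced only for illustration; the movable node model in [33] may describe motion differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """A node N_i with position (x, y, z), z < 0, and communication radius phi(N_i)."""
    x: float
    y: float
    z: float
    radius: float
    vx: float = 0.0
    vy: float = 0.0
    vz: float = 0.0

    @property
    def depth(self) -> float:
        # dep(N_i) = |z_i|
        return abs(self.z)

    def moved(self, dt: float) -> "Node":
        # Assumption: constant velocity over dt; used only to show how motion breaks a link.
        return Node(self.x + self.vx * dt, self.y + self.vy * dt,
                    self.z + self.vz * dt, self.radius, self.vx, self.vy, self.vz)


def distance(a: Node, b: Node) -> float:
    """Euclidean distance d(N_i, N_j)."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def can_communicate(a: Node, b: Node) -> bool:
    """Necessary condition: d(N_i, N_j) <= min{phi(N_i), phi(N_j)}."""
    return distance(a, b) <= min(a.radius, b.radius)


a = Node(0.0, 0.0, -10.0, radius=50.0)
b = Node(30.0, 40.0, -10.0, radius=60.0, vx=1.0)
in_range_now = can_communicate(a, b)               # distance is exactly 50
in_range_later = can_communicate(a, b.moved(10.0))  # b has drifted away
```

Taking the smaller of the two radii makes the condition symmetric, so a link is counted only when both nodes can reach each other.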
3.2. Underwater Data Transmission Mechanisms
We are primarily concerned with the multihop transmission of data from an underwater source node to a surface sink node, and we focus on the scenario in which there is a single sink node. The underwater data transmission mechanism based on the node depth information [34] is employed as the fundamental transmission model. Figure 5 serves as an illustrative example of this process.
In the underwater network depicted in Figure 5, each sensor node is assigned a unique ID, and node $N_3$ is to transmit data packets to the sink node. Once transmission has commenced, node $N_3$ transmits the packet to $N_6$, which offers the greater change in depth, in accordance with the depth priority principle. $N_6$ then selects the next-hop node. Since $N_7$, $N_8$, and $N_9$ have the same depth, the forwarding probability must be determined based on other indicators, such as the node's residual energy and the number of neighboring nodes. Should $N_6$ select $N_7$ as the next-hop node, the packet continues to be transmitted in accordance with the same rules until it is received by the sink node.
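The depth-priority forwarding rule in this example can be sketched as follows. This is a simplified illustration, not the mechanism of [34]: the names `Candidate` and `select_next_hop` are hypothetical, and the probabilistic choice among equal-depth nodes is replaced by a deterministic tie-break on residual energy and then neighbor count, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    node_id: str
    depth: float            # dep(N_i) = |z_i|
    residual_energy: float  # assumed tie-break indicator (normalized)
    neighbor_count: int     # assumed tie-break indicator


def select_next_hop(current_depth: float,
                    candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Pick the neighbor with the largest depth decrease; break ties by the other indicators."""
    # Only neighbors closer to the surface make progress toward the sink.
    shallower = [c for c in candidates if c.depth < current_depth]
    if not shallower:
        return None
    return max(
        shallower,
        key=lambda c: (current_depth - c.depth, c.residual_energy, c.neighbor_count),
    )


# N6 (depth 200) choosing among N7, N8, N9 at equal depth, plus one deeper neighbor.
neighbors = [
    Candidate("N7", 100.0, 0.4, 3),
    Candidate("N8", 100.0, 0.9, 2),
    Candidate("N9", 100.0, 0.9, 5),
    Candidate("N5", 250.0, 1.0, 9),
]
chosen = select_next_hop(200.0, neighbors)
```

Returning `None` when no shallower neighbor exists corresponds to a routing void, which a real protocol would have to handle separately.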
The underwater multihop data transmission method enables communication between nodes over long distances, and it is now widely used in UWSNs. Nevertheless, in the event of unreliable communication links, the possibility of packet mis-transmission or loss between nodes at each hop cannot be discounted. This issue is particularly pronounced when the nodes in question are mobile. Classical data retransmission and redundant transmission mechanisms are considered effective ways to address this problem, but they also introduce additional data transmission burdens, which will affect the overall data transmission performance of the system. Consequently, there is a necessity to achieve a further equilibrium between data transmission efficiency and reliability.
3.3. Network Coding and Decoding
The method proposed in this paper employs RLNC [35] as the foundation for the coding algorithm. To facilitate the description of the RLNC encoding and decoding process, the initial data packets and the encoded packets are described as symbol matrices. Assume that the matrix consisting of the $n$ packets to be encoded is $M = [P_1, P_2, \ldots, P_n]$ and that the total encoded packet $M_{ec} = [P_{ec}(1), P_{ec}(2), \ldots, P_{ec}(n)]$ is obtained after the RLNC. Therefore,

$$M_{ec} = G \cdot M,$$

where $G = [g_{ij}]$ denotes the matrix of coding coefficients, and it is generated by randomly sampling elements from the finite field $GF(2^q)$. In other words, the $i$-th coded packet $P_{ec}(i)$ is defined as

$$P_{ec}(i) = \sum_{j=1}^{n} g_{ij} P_j .$$
The decoding of the coded packet is performed in accordance with the Gaussian elimination method. Consequently, the decoding probability of RLNC is contingent upon the rank of the coefficient matrix. The decoding probability of RLNC with respect to the degrees of freedom required for decoding is defined as follows.
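Before the formal definition, the encoding step and the Gaussian-elimination decoding described above can be illustrated with a minimal Python sketch. It assumes the binary field GF(2) (the $q = 1$ case, where addition is XOR) and represents each packet as an integer bit pattern; the function names are hypothetical and the sketch is not the implementation used in this paper.

```python
import random
from typing import List, Optional, Tuple


def random_coefficients(n_coded: int, n_src: int, rng: random.Random) -> List[List[int]]:
    """Draw a coding coefficient matrix G uniformly at random over GF(2)."""
    return [[rng.randint(0, 1) for _ in range(n_src)] for _ in range(n_coded)]


def encode(packets: List[int], coeffs: List[List[int]]) -> List[int]:
    """Each coded packet is the XOR of the source packets whose coefficient is 1."""
    coded = []
    for row in coeffs:
        acc = 0
        for g, p in zip(row, packets):
            if g:
                acc ^= p
        coded.append(acc)
    return coded


def decode(coeffs: List[List[int]], coded: List[int],
           n_src: int) -> Tuple[int, Optional[List[int]]]:
    """Gauss-Jordan elimination over GF(2); returns (rank, packets or None)."""
    rows = [(list(r), c) for r, c in zip(coeffs, coded)]
    rank = 0
    for col in range(n_src):
        # Find a row at or below `rank` with a 1 in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][0][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        p_row, p_val = rows[rank]
        # Clear this column from every other row.
        for i in range(len(rows)):
            if i != rank and rows[i][0][col]:
                rows[i] = ([a ^ b for a, b in zip(rows[i][0], p_row)], rows[i][1] ^ p_val)
        rank += 1
    if rank < n_src:
        return rank, None  # rank-deficient: some source packets stay unrecoverable
    return rank, [rows[i][1] for i in range(n_src)]


source = [0b1010, 0b0111, 0b1100]
G_full = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]      # full rank over GF(2)
G_deficient = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]  # third row = first XOR second
rank_ok, recovered = decode(G_full, encode(source, G_full), 3)
rank_bad, failed = decode(G_deficient, encode(source, G_deficient), 3)
```

The rank-deficient case shows why randomly drawn coefficients can leave packets undecodable: the receiver holds three coded packets but only two independent equations.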
Definition 1. (Decoding probability and degrees of freedom required for decoding.) Assuming that the rank of the coding coefficient matrix $G$ is $\mathrm{rank}(G)$ and the number of distinct packets in the coded packet $P_{ec}$ is $n(P)$, the decoding probability $\eta$ and the degrees of freedom $\chi$ required for decoding are defined as follows:
3.4. Unreliable Communication Link Model
In underwater data transmission systems, numerous factors contribute to link unreliability, including the seawater temperature, current movement, marine biological activity interference, marine equipment noise interference, and sensor node position changes. Modeling underwater unreliable communication links from the perspective of the physical mechanism is inherently complex. From the perspective of the transmission outcome, however, packet loss is the dominant undesirable event and significantly impacts the performance of underwater data transmission. Consequently, in this paper, the unreliable communication link is modeled as an erasure channel. Typical erasure channels include the Binary Erasure Channel (BEC) and the Gilbert-Elliott Channel (GEC). Without loss of generality, the BEC is used in this paper as the base model for the underwater unreliable communication link.
It is assumed that each transmission of $n_{max}$ packets constitutes one transmission round, where $n_{max}$ is the maximum transmission limit. In particular, if the number of packets to be transmitted does not exceed $n_{max}$, then the completion of transmission of all the packets is recorded as one transmission round. The channel erasure probability and channel capacity are defined as follows.
Definition 2. (Channel erasure probability and channel capacity.) In the $k$-th round, the current node $N_k$ transmits $n(k)$ packets to the next-hop node $N_{k+1}$, assuming that each packet within the same round has the same size and that only one packet is transmitted per transmission time slot $t_i \in T_k$. For each packet $P_i(k)$, node $N_{k+1}$ sends an acknowledgement packet $ACK_i(k)$ to node $N_k$ after reception. Assuming that node $N_k$ receives a total of $n'(k)$ acknowledgement packets from $N_{k+1}$ after the $T_k$ transmission, the channel erasure probability $p_e(k)$ and the channel capacity $c(k)$ are calculated as follows:
4. Reinforcement Learning-Based Adaptive Network Coding Algorithm
4.1. General Process of RL-ANC
The overall flow of the RL-ANC algorithm is given in Algorithm 1, in accordance with the descriptions presented in Sections 4.2 to 4.8. For the sake of clarity, the overall flow of the proposed RL-ANC calculation is also depicted in Figure 6. In Algorithm 1, the maximum period of the $k$-th round is denoted as $\Gamma_k$, the sampling period is denoted as $\tau$, and the transmission time slot is denoted as $t$. For each $\tau$, the set of packets in node $N_i$ is denoted as $M_{N_i}(\tau)$, and the set of transmitted packets in node $N_i$ is denoted as $M_{N_i}'(\tau)$.
Algorithm 1 RL-ANC Algorithm
while node D has not recovered M from node S do
    for each node Ni ∈ N do
        Select the next node based on Section 3.2
        if MNi(τ) ≠ ∅ then
            while τ ∈ Γk do
                while t ∈ τ do
                    Estimate the channel erasure probability and channel capacity via (5) and (6)
                    Resize the sliding window and determine the maximum repeatability via (7)
                    Estimate the decoding probability via (8)
                    while MNi(τ)\MNi’(τ) ≠ ∅ do
                        Determine the slide rule via (10) to (14)
                        Optimize the coding coefficients via (15) to (19)
                        Encode packets to obtain encoded packets based on Section 3.3
                        Send Pec(τ) to next node
                    end while
                    Refresh the time slot table τ ← τ + 1
                end while
                Optimize the sampling period via (20) to (22)
            end while
        end if
    end for
end while
4.2. Channel Estimation
In the actual transmission process, the estimation of the channel erasure probability and channel capacity is based on the number of packets and acknowledgement packets. However, this introduces a significant error. Furthermore, the timeliness of the resulting erasure probability and channel capacity estimates using the maximum transmission period as the channel condition update period is inadequate in the context of the complex underwater environment and node mobility faced by the OM-IoT. Consequently, the proposed algorithm estimates the channel erasure probability in terms of the percentage of time slots where erasure occurs during the sampling period.
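The idea of estimating the erasure probability per sampling period, rather than once per maximum transmission period, can be sketched in Python as follows. The sketch assumes one packet per time slot, treats a slot without an acknowledgement as an erasure, and uses the standard BEC capacity of one minus the erasure probability (in packets per slot); the function names are hypothetical, and the paper's own estimators may differ in detail.

```python
from typing import List, Sequence, Tuple


def estimate_channel(acked: Sequence[bool]) -> Tuple[float, float]:
    """Return (erasure probability, capacity) for one sampling period.

    acked[i] is True when the ACK for the packet sent in slot i was received.
    """
    if not acked:
        raise ValueError("sampling period contains no transmissions")
    n_sent = len(acked)
    n_acked = sum(acked)
    erasure = 1.0 - n_acked / n_sent  # fraction of slots in which erasure occurred
    capacity = 1.0 - erasure          # BEC capacity, assumed unit: packets per slot
    return erasure, capacity


def per_period_estimates(acked: Sequence[bool], period: int) -> List[Tuple[float, float]]:
    """Split a round into consecutive sampling periods of `period` slots each."""
    return [estimate_channel(acked[i:i + period]) for i in range(0, len(acked), period)]


# One round of 8 slots in which the link degrades in the second half.
round_acks = [True, True, True, False, False, False, True, True]
whole_round = estimate_channel(round_acks)
by_period = per_period_estimates(round_acks, period=4)
```

Here the whole-round estimate averages over the change in link quality, while the per-period estimates expose it, which is the timeliness argument made above.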
It is assumed that the sampling period within round $T_k$ is $\tau_k$ and that the maximum transmission period is $\Gamma_k$, with $\tau_k \le \Gamma_k$. In $\tau_k$, node $N_k$ transmits $n(\tau_k)$ packets and receives $n'(\tau_k)$ acknowledgements. The channel erasure probability $p_e^*(k)$ and the channel capacity $c^*(k)$ within sampling period $\tau_k$ are then given by
During the actual transmission process, the channel erasure probability and the channel capacity of the sampling period are estimated from the data transmission in the sampling period $(\xi - 1)\tau_k$ ($\xi = 0, 1, 2, \ldots$; $\xi \le T_k/\tau_k$). Consequently, the estimated values of the channel erasure probability and the channel capacity are given by
4.3. Dynamic Adjustment of the Sliding Code Window
The proposed algorithm implements batch network coding via the sliding coding window. When the time slot is updated, the coding window slides a certain distance in a certain direction to update the packets involved in the coded packet. Consequently, both the size of the sliding coding window and the minimum number of packets that may be repeated between consecutive windows must be designed. The proposed algorithm uses the minimum repetition degree to describe how many of the packets currently to be coded may also appear in the window of the previous time slot. At the $i$-th time slot $t_i \in \xi\tau_k$, the sliding window size $H_i(k)$ and the minimum repetition degree $O_i(k)$ are respectively given by
4.4. Pre-Estimation on the State of Decoding
The proposed algorithm estimates the current decoding state in terms of the decoding probability and the degrees of freedom required for decoding. During the $j$-th time slot $t_j \in T_k$, the current node $N_k$ sends a coded packet $P_{ec}(t_j)$ containing $n_{ec}(t_j)$ packets to the next-hop node $N_{k+1}$. Assuming that the rank of the total coding coefficient matrix $G_j(k)$ at time slot $t_j$ is $\mathrm{rank}[G_j(k)]$, the estimated decoding probability and the estimated degrees of freedom required for decoding are respectively given by
where $O_i(k)$ is the minimum repetition degree of the sliding coding window in the $i$-th time slot.
4.5. Adaptive Optimization of Sliding Rules for Coded Windows
The sliding rules for the coding window of the proposed algorithm encompass both the direction and the distance of the sliding movement. The proposed algorithm assumes that the coding window never slides backward (no backtracking), so the sliding direction is fixed, and the optimization of the sliding rule therefore concerns only the sliding distance. In underwater data transmission, encoding all the packets at once can give rise to significant problems, including reduced decoding efficiency and limited fault tolerance, so batch encoding is frequently employed. As a result, the number of encoded packets that can be updated in each time slot is bounded, and the algorithm faces a limited state space and action space. Q-learning, a classical reinforcement-learning algorithm, suits such a problem and minimizes the influence of technical factors on the study. Q-learning nevertheless has limitations, and this paper examines its exploration strategy with a view to enhancing computational efficiency. This section presents the Q-learning-based optimization method for the sliding rule; Section 4.6 presents the improvement to the exploration strategy.
The proposed algorithm is based on the Q-learning algorithm for the adaptive optimization of sliding rules. In the sliding rule optimization phase, the Q-value update formula is as follows:

$$Q_W(s, a) \leftarrow Q_W(s, a) + \alpha \left[ r_W(s, a) + \gamma \max_{a' \in A_W} Q_W(s', a') - Q_W(s, a) \right],$$

where the Q value $Q_W(s, a)$ represents the value of selecting action $a$ in state $s$, and $s'$ is the state reached after taking action $a$. The reward $r_W(s, a)$ is the immediate return obtained by selecting action $a$ in state $s$. The learning rate $\alpha$ determines how quickly the value of the selected action changes. The discount factor $\gamma \in [0, 1]$ determines the relative importance of future rewards. In the proposed algorithm, the coded packets and uncoded packets as of the current slot $t_i$ are used as the state $s$. The proposed algorithm updates the coded packets through the sliding coding window. The sliding direction of the window is deterministic when the total coding window is deterministic. Consequently, with the sliding distance taken as the action $a$, the action space is $A_W = \{0, 1, \ldots, H_i(k) - O_i(k)\}$. In the unreliable communication link scenario, the proposed algorithm aims to improve the decoding probability, and thus the reward $r_W \propto \Delta\eta(t_i)$, where $\Delta\eta(t_i)$ is the incremental decoding probability under time slot $t_i$, that is:

$$\Delta\eta(t_i) = \eta(t_i) - \eta(t_{i-1}).$$
However, if the increment of the decoding probability is the sole reward, the window can easily stop sliding, so that no new packets take part in encoding over consecutive time slots. The algorithm then becomes stuck in repetitive encoding, which lowers the efficiency of data transmission in the OM-IoT. To prevent this, new packets must enter the sliding coding window in every update time slot until the window has traversed all the packets to be transmitted. Conversely, if no restriction is placed on the number of new packets within the sliding window, the degree of freedom required for decoding may still be greater than zero after the current round of transmission: the coding coefficient matrix is then not of full rank, and the resulting coded packets cannot be decoded successfully. Therefore, the proposed algorithm determines the choice of the final action a, i.e.,
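As a minimal illustration of the sliding-distance optimization described in this section, the following Python sketch applies the standard tabular Q-learning update to an action space of the form A_W = {0, 1, …, H_i(k) − O_i(k)}. The state encoding (a pair of coded and uncoded packet counts), the numeric values of α, γ and the reward, and the function names `make_action_space` and `q_update` are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def make_action_space(h, o):
    """Sliding distances 0..h-o, mirroring A_W = {0, 1, ..., H_i(k) - O_i(k)}."""
    return list(range(h - o + 1))


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update and return the new Q-value."""
    # Best value reachable from the next state (0.0 if nothing is known yet).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: window size 8, minimum repetition 3.
q_table = defaultdict(float)
actions = make_action_space(h=8, o=3)

# Assumed state encoding: (number of coded packets, number of uncoded packets).
# The reward stands in for the increment of the decoding probability.
new_q = q_update(q_table, state=(3, 5), action=2, reward=0.25,
                 next_state=(5, 3), next_actions=actions)
```

A dictionary-backed table is used here because the text notes that the state and action spaces are limited; a larger state space would call for a function approximator instead.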
4.6. Improved Greedy Strategies
The Q-learning algorithm selects actions according to the ε-greedy strategy. Under the general ε-greedy strategy, the action with the highest Q-value is chosen with probability 1 − ε.
In the general ε-greedy strategy, the exploration probability is constant. As the number of explorations accumulates, however, the exploration probability no longer needs to be maintained at its initial level. Consequently, the greedy strategy needs to be improved.
The proposed algorithm applies a decay function to the exploration probability ε, so that ε decreases as the elapsed time Δt grows. The decay also takes into account how much exploration is warranted by the difference between the Q-values of different actions. Consequently, the improved exploration probability is given by
where ε_0 is the initial value of the exploration probability and is defined as a real number from zero to one. As the algorithm progresses, ε gradually decays, which allows the algorithm to avoid unnecessary exploration and improve its overall efficiency. Consequently, the enhanced greedy strategy is as follows: when a′ ∈ A_W\{0}, if Q_W(s, a′) > Q_W(s, a), then a′ is the subsequent action. Otherwise, a′ is re-selected with the probability indicated in Equation (13).
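The decaying exploration strategy can be sketched in Python as follows. The exponential form of the decay, the `decay_rate` parameter, and the function names are assumptions made for illustration only; they stand in for Equation (13), whose exact form is not reproduced here. The selection rule is one possible reading of the strategy described above: a candidate a′ drawn from A_W\{0} is adopted if its Q-value exceeds that of the current action, and otherwise a new candidate is drawn with probability ε.

```python
import math
import random


def decayed_epsilon(eps0, elapsed, decay_rate=0.05):
    """Exploration probability after `elapsed` time units.

    Exponential decay is an assumed form, not the paper's Equation (13).
    """
    return eps0 * math.exp(-decay_rate * elapsed)


def select_action(q_row, current, eps, rng):
    """Pick the next sliding distance.

    q_row maps each sliding distance to its Q-value in the current state.
    """
    candidates = [a for a in q_row if a != 0]  # A_W without the zero move
    if not candidates:
        return current
    candidate = rng.choice(candidates)
    if q_row[candidate] > q_row[current]:
        return candidate                # better candidate: take it
    if rng.random() < eps:
        return rng.choice(candidates)   # explore: re-select a candidate
    return current                      # otherwise keep the current action


rng = random.Random(0)
eps_now = decayed_epsilon(eps0=0.5, elapsed=10)
next_action = select_action({0: 0.0, 1: 1.0}, current=0, eps=eps_now, rng=rng)
```

Because ε shrinks over time, later calls re-select less often, which is the efficiency gain the section attributes to the improved strategy.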
4.7. Adaptive Optimization of Coding Coefficients
The proposed algorithm uses RLNC as its fundamental coding algorithm. However, a highly random network coding coefficient matrix could lead to an uncontrollable decoding probability, so we optimize the network coding coefficients with a DQN. The fundamental framework of the coding coefficient optimization method is illustrated in Figure 7. In the proposed algorithm, each node is regarded as an agent with an embedded DQN, which is responsible for optimizing the coding coefficients. In the multi-node scenario, all the DQNs are trained centrally in order to simplify and accelerate the training process. At each discrete decision step, the sender estimates the coding sparsity of the packet. The environmental information encompasses the channel state and the historical packets stored in the node’s cache.
During the execution of the algorithm, in each decision step j, the sender takes an action a_j in state s_j, with the objective of using the DQN to optimize the coding coefficients of the j-th coded packet in the packet. Upon the action a_j, the state transitions from s_j to s_{j+1}, and the sender obtains the reward r_j from the environment. Thereafter, the sender stores the experience (s_j, a_j, r_j, s_{j+1}) in the replay buffer. In the multi-node scenario, during training, the centralized optimizer randomly draws a small batch of experience data from the replay buffer and updates the parameters θ of the DQN by minimizing the loss function. After the parameters θ are updated, the optimizer sends the updated parameters θ_j to each node, and each node then updates the parameters of its own DQN.
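The experience storage and mini-batch sampling described above can be sketched as a small bounded buffer (commonly called a replay buffer). This is a minimal Python illustration under assumed names (`Experience`, `ReplayBuffer`) and an assumed eviction policy (oldest experience dropped first); the paper does not specify the capacity or the eviction rule.

```python
import random
from collections import deque, namedtuple

# One transition (s_j, a_j, r_j, s_{j+1}) as stored by the sender.
Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Bounded experience store with uniform random mini-batch sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently discards the oldest item when full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._items.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw up to batch_size distinct experiences uniformly at random."""
        size = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), size)

    def __len__(self):
        return len(self._items)


# Hypothetical usage: five transitions pushed into a buffer of capacity 3.
buffer = ReplayBuffer(capacity=3, seed=0)
for j in range(5):
    buffer.store(state=j, action=j % 2, reward=1.0, next_state=j + 1)
batch = buffer.sample(batch_size=2)
```

In a centralized training setup, the optimizer would call `sample` to obtain each mini-batch before a gradient step on the DQN parameters.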
During the optimization of the coding coefficients in the proposed algorithm, the Q-value update formula is given as follows:
where the Q-value, denoted by Q_G(s, a; θ), is the value of selecting action a in state s, and θ represents the network estimation parameter. The value of θ is updated by minimizing the loss function, i.e.,
As shown in Equation (16), at each decision step j, the state s comprises two parts: the packet P_j and the information P_ec(m) of m coded packets from the historical packets in the cache, i.e., s = [P_j, P_ec(m)]. The action a_j ∈ A_G, where A_G = {0, 1, …, 2^q} is the action space and q is the size of the random domain. For each participating packet P_j, the coding coefficient is g_j = a_j. The objective of the coding coefficient optimization is to prevent linear correlation between different coding coefficient vectors, which could result in an uncontrolled decoding probability. Consequently, the reward of the proposed algorithm is the increment of the rank of the total coefficient matrix, Δrank[G(t_i)], which is given by
The selection of the coding coefficients g(s_j) in the final state s_j is performed with reference to the improved greedy strategy presented in Section 4.6, i.e.,
where ε is given by
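The rank-increment reward used in this section can be illustrated with a short Python sketch. As a simplifying assumption, the sketch works over the binary field GF(2), with each coefficient vector stored as an integer bitmask; the paper's field size depends on q, and a general finite field would need proper field arithmetic. The function names `gf2_rank` and `rank_increment` are hypothetical.

```python
def gf2_rank(rows):
    """Rank over GF(2) of coefficient vectors given as integer bitmasks."""
    basis = []  # reduced vectors, each with a distinct leading bit
    for row in rows:
        for b in basis:
            # XOR with b only if that clears b's leading bit in row.
            row = min(row, row ^ b)
        if row:
            basis.append(row)  # row is independent of the current basis
    return len(basis)


def rank_increment(existing_rows, new_row):
    """Reward signal: how much new_row raises the rank of the matrix."""
    before = gf2_rank(existing_rows)
    after = gf2_rank(list(existing_rows) + [new_row])
    return after - before


# Hypothetical coefficient vectors of length 3.
matrix = [0b101, 0b011]
dependent = rank_increment(matrix, 0b110)    # 0b110 == 0b101 ^ 0b011
independent = rank_increment(matrix, 0b100)
```

A reward of zero flags a coefficient vector that is linearly dependent on those already chosen, which is exactly the case the optimization tries to avoid.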
4.8. Optimization of the Sampling Period
The selection of the sampling period τ_k in the proposed algorithm affects the quality of the channel estimation. When τ_k is too high, the estimate is closer to the actual value, but the channel estimation delay rises and the timeliness of the algorithm decreases, which in turn affects the performance of data transmission. Conversely, when τ_k is too low, the statistical significance of the estimate is weakened and the coded transmission degenerates into a retransmission mechanism. Consequently, we employ the simulated annealing algorithm to optimize the sampling period.
In the proposed algorithm, the temperature decay equation is given by
where T_0 is the initial temperature, T_max is the highest temperature reached during the process, T_min is the lowest temperature reached, and the annealing factor µ is a value between zero and one. When optimizing the sampling period, the initial sampling period τ_k is randomly generated. After each round, a random perturbation is applied to the current sampling period, generating a new sampling period τ_k′ in its neighborhood. The update probability of the sampling period is then calculated as follows:
where σ(τ_k) represents the channel estimation standard deviation and is given by
Equation (21) indicates that when the standard deviation of the channel estimation under the new sampling period τ_k′ is superior to that under τ_k, the sampling period is updated to τ_k′. Otherwise, the sampling period is updated to τ_k′ only with the probability determined by Equation (21).
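The annealing procedure described in this section can be sketched in Python as follows. Several choices are assumptions made for illustration and are not taken from the paper: geometric cooling T ← µT, the Metropolis acceptance probability exp(−Δσ/T), the perturbation range `step`, the bounds on τ, and the toy function `toy_sigma` standing in for the channel estimation standard deviation σ(τ_k).

```python
import math
import random


def anneal_sampling_period(sigma, tau0, t0=1.0, t_min=1e-3, mu=0.9,
                           step=0.5, tau_bounds=(0.1, 10.0), seed=0):
    """Search for a sampling period with a low sigma by simulated annealing."""
    rng = random.Random(seed)
    tau, temp, best = tau0, t0, tau0
    while temp > t_min:
        # Random perturbation in the neighborhood of tau, clipped to bounds.
        candidate = tau + rng.uniform(-step, step)
        candidate = min(max(candidate, tau_bounds[0]), tau_bounds[1])
        delta = sigma(candidate) - sigma(tau)
        # Accept improvements always, worse moves with Metropolis probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            tau = candidate
        if sigma(tau) < sigma(best):
            best = tau          # remember the best period seen so far
        temp *= mu              # assumed geometric cooling, 0 < mu < 1
    return best


def toy_sigma(tau):
    """Hypothetical stand-in for the channel estimation standard deviation."""
    return (tau - 2.0) ** 2 + 0.1


best_tau = anneal_sampling_period(toy_sigma, tau0=8.0)
```

Returning the best period seen, rather than the last accepted one, is a design choice of this sketch; it guarantees the result is never worse than the starting period.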
6. Conclusions
In the OM-IoT, underwater data transmission faces a number of challenges, including a high packet loss rate, low reliability, and low communication efficiency. These issues are compounded by the complex and changeable marine environment and the movement of nodes. To address the impact of unreliable communication links on the reliability of underwater data transmission in the OM-IoT, this paper proposes an adaptive network coding algorithm based on reinforcement learning and then simulates and analyzes the underwater data transmission process under the proposed algorithm. RL-ANC introduces reinforcement learning on the premise of estimating the channel conditions and decoding states, achieves adaptive packet batching by dynamically adjusting the sliding coding window size and sliding rules, and keeps the complexity of coding and decoding operations within a reasonable range through a DQN-based adaptive optimization of the coding coefficients. Furthermore, RL-ANC improves the efficiency of the greedy strategy by dynamically adjusting the exploration probability and optimizes the sampling period in real time, thereby enhancing the reliability and communication performance of coded underwater networks. A simulation was conducted to compare the proposed algorithm with common data retransmission, redundant transmission, and general network coding. The results demonstrated that the proposed algorithm outperforms the other three algorithms in terms of the packet delivery rate, average retransmission rate, and redundant transmission rate, and that its decoding probability converges faster. Furthermore, we analyzed in detail the impact of the improved greedy strategy and the optimized sampling period on the algorithm. Our findings demonstrate that both the improvement and the optimization lead to better outcomes.
It should be noted that the algorithms proposed in this paper consider a node-depth-based framework for underwater multihop data transmission and mainly consider in-stream network coding. In large-scale data transmission scenarios, e.g., multi-source and multi-sink networks, multi-flow intersection networks, and large-scale IoT data transmission, the use of network coding raises multi-level concurrent network coding scheduling and optimization problems. Consequently, the next stage of this research will investigate network coding algorithms in more intricate network scenarios, employing the RL-ANC framework. Furthermore, this paper primarily addresses the underwater network in the OM-IoT. Given the growing trend toward integration in the OM-IoT, which combines multiple networks such as satellite networks, further research is warranted. In other forms of networks, research on network coding optimization based on reinforcement learning holds considerable potential, as various forms of media, such as electromagnetic signals and optical signals, can be widely used. Consequently, we will also refine and extend the proposed RL-ANC framework to other networks, which represents a key area of future research.