A UoI-Optimal Policy for Timely Status Updates with Resource Constraint

Wang, Lehan; Sun, Jingzhou; Sun, Yuxuan; Zhou, Sheng; Niu, Zhisheng

doi:10.3390/e23081084

Open AccessArticle

A UoI-Optimal Policy for Timely Status Updates with Resource Constraint

by

Lehan Wang

,

Jingzhou Sun

,

Yuxuan Sun

,

Sheng Zhou

^*

and

Zhisheng Niu

Beijing National Research Center for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(8), 1084; https://doi.org/10.3390/e23081084

Submission received: 20 June 2021 / Revised: 12 August 2021 / Accepted: 16 August 2021 / Published: 20 August 2021

(This article belongs to the Special Issue Age of Information: Concept, Metric and Tool for Network Control)

Download

Browse Figures

Versions Notes

Abstract

:

Timely status updates are critical in remote control systems such as autonomous driving and the industrial Internet of Things, where timeliness requirements are usually context dependent. Accordingly, the Urgency of Information (UoI) has been proposed beyond the well-known Age of Information (AoI) by further including context-aware weights which indicate whether the monitored process is in an emergency. However, the optimal updating and scheduling strategies in terms of UoI remain open. In this paper, we propose a UoI-optimal updating policy for timely status information with resource constraint. We first formulate the problem in a constrained Markov decision process and prove that the UoI-optimal policy has a threshold structure. When the context-aware weights are known, we propose a numerical method based on linear programming. When the weights are unknown, we further design a reinforcement learning (RL)-based scheduling policy. The simulation reveals that the threshold of the UoI-optimal policy increases as the resource constraint tightens. In addition, the UoI-optimal policy outperforms the AoI-optimal policy in terms of average squared estimation error, and the proposed RL-based updating policy achieves a near-optimal performance without the advanced knowledge of the system model.

Keywords:

age of information; constrained Markov decision process; reinforcement learning; context-awareness; timely status updates

1. Introduction

With the development of 5G and the Internet of Things (IoT), requirements for wireless communication have shifted from merely providing communication channels to covering the entire process of various IoT applications, e.g., autonomous vehicle [1] and virtual reality (VR) [2], where sensing, communication, computation, and control form a closed loop. Therefore, in addition to the communication delay, it is necessary to consider the information delay counted from the generation of the state information to the execution, namely the timeliness of information. For this purpose, Age of Information (AoI) has been proposed, which is defined as the time elapsed since the generation time of the latest received packets [3]. Due to its concise definition and clear physical meaning, AoI has been widely used for the design of scheduling and updating policies in remote estimation [4,5,6] and wireless communication networks [7,8,9,10,11,12]. Most existing works focus on optimizing average AoI or peak age. In [13], the authors claim that minimizing average age cannot satisfy the requirements for ultra-reliable low-latency communication (URLLC) and study the tail distribution of AoI. The violation probability for peak age is derived in [14] and the stationary distribution of AoI is studied in [15].

Nevertheless, the AoI still has some limitations. First, it fails to measure the nonlinear performance degradation caused by information staleness. In [16,17,18,19], nonlinear age penalty functions were introduced to solve this problem. Meanwhile, the Age of Synchronization (AoS) [20] and Age of Incorrect Information (AoII) [21] are defined to associate information freshness with the content of information. AoS is the time elapsed since the information at the receiver becomes desynchronized with the actual status of the monitored process. AoII is defined as the product of an increasing time penalty function and a penalty function of the estimation error. In addition, the status of heterogeneous data sources may change at different rates. A fast-changing process may require information with a lower age. However, age is independent of the changing rate and thus is not proper in the cases when heterogeneous data sources are jointly considered. To solve this problem, weighted age was introduced in [22,23] to distinguish important monitored processes. In [24], the metric based on information theory is proposed as a replacement of the time-based metric, AoI, to characterize the changing rate. In [5], the authors claim that minimizing age is not equivalent to minimizing the estimation error in a remote estimation problem and propose an effective age to solve this problem [25].

Practical systems (e.g., V2X-communication systems) may have different requirements for information freshness with different contexts. The context refers to all environmental factors that affect the requirement for information freshness. Therefore, resources should be reserved for frequent status updates in emergency to ensure safety.

However, the timeliness metrics mentioned above pay no attention to the significance of context information. To solve this problem, Urgency of Information (UoI) has been proposed in [26,27,28] to measure the influence of inaccurate information on performance under different contexts. To be specific, UoI uses a time-variant context-aware weight

ω (t)

to distinguish different contexts. A higher

ω (t)

indicates that the system is in more urgent situations (e.g., when a vehicle is approaching an intersection or overtaking) and therefore requires frequent updates. For example, when a vehicle passes through an intersection, the context-aware weight increases as the distance between the vehicle and the center of the intersection decreases. Meanwhile, the estimation error

Q (t)

is introduced to measure the information inaccuracy, which is defined as the difference between the actual status and the estimated status at the receiver. The larger the absolute value of

Q (t)

is, the less accurate the estimated status is. Therefore, UoI is defined as the product of context-aware weight and a cost function of the estimation error

Q (t)

:

F (t) = ω (t) δ (Q (t)) .

(1)

In discrete-time systems, the estimation error

Q (t)

is:

Q (t) = \sum_{τ = g (t)}^{t - 1} A (τ),

(2)

where

g (t)

is the generation time of the latest status update at the receiver and

A (t)

is the increment in estimation error in time slot t. Specifically, if the context-aware weight is time-invariant (i.e.,

ω (t) = 1

), and

A (t) = 1

as well as

δ (Q (t)) = Q (t)

, UoI is the same as AoI. If the context-aware weight is process-dependent, UoI can represent weighted age. If the cost function

δ (Q (t))

is nonlinear, UoI can represent the nonlinear age penalty function. For example, when the outdated information is worthless, e.g., the information is about sales that expire after some time [29], then the shifted unit step cost function

δ (Q (t)) = u (Q (t) - τ)

,

τ > 0

is recommended. For the unit step function,

u (x) = 1

when

x \geq 0

and otherwise

u (x) = 0

.

In this work, we considered a single-user remote monitoring system, and the objective was to find an updating policy minimizing the average UoI over time under the constraint on average update frequency. To solve this problem, Refs. [27,30] proposed update-index-based adaptive schemes with Lyapunov optimization but did not conduct a theoretical analysis of their optimality. In addition, the constrained Markov decision process (CMDP) formulation was only used in the simulation for a numerically solved benchmark. Based on the existing works, in this paper, we theoretically analyzed the structure of the UoI-optimal policy and focused on how to derive an updating policy in an unknown environment.

The main contributions of this paper are summarized as follows.

In contrast to [27,30], we assumed that the context-aware weight is a first-order irreducible positive recurrent Markov process or independent and identically distributed (i.i.d.) over time. We formulated the updating problem as a CMDP problem and proved the single threshold structure of the UoI-optimal policy. We then derived the policy through LP with the threshold structure and discussed the conditions that the monitored process needs to satisfy for the threshold structure.
When the distributions of the context-aware weight and the increment in estimation error were unknown, we used model-based RL method to learn the state transitions of the whole system and derive a near-optimal RL-based updating policy.
Simulations were conducted to verify the theoretical analysis of the threshold structure and show the near-optimal performance of the RL-based updating policy. The results indicate that: (i) the update thresholds decrease when the maximum average update frequency becomes large; (ii) the update threshold for emergency can actually be larger than that for ordinary states when the probability of transferring from emergency to ordinary states tends to 1.

The rest of this paper is organized as follows. The system model and the problem formulation are described in Section 2. In Section 3, we obtain the CMDP formulation of the problem with the given distribution of context-aware weight and prove the threshold structure of the UoI-optimal policy. The proposed model-based RL updating policy is obtained in Section 4. In Section 5, the simulation results are shown and discussed while the conclusions are drawn in Section 6.

2. System Model and Problem Formulation

In this paper, we considered a remote monitoring system, in which a fusion center collects the status information (e.g., current location, velocity, information of surrounding) from a vehicle of interest via a wireless channel with limited resources, as shown in Figure 1. The whole system is considered as a discrete-time system and the status can be generated at will. Due to the limitations on the wireless resources and energy supply, there is a constraint on the average update frequency of the vehicle. The update decision in time slot t is denoted by

U (t) \in {0, 1}

, where

U (t) = 1

means that the vehicle decides to transmit the current status to the center, and

U (t) = 0

denotes that the vehicle decides to stay idle.

The wireless channel is assumed as a block fading channel with successful transmission probability

p_{s}

. Let

S (t) \in {0, 1}

be the state of the channel.

S (t) = 0

represents that the channel is in deep fading, and no packet can be successfully transmitted.

S (t) = 1

means the packets can be successfully transmitted to the center through the channel. If the center receives an update, then

U (t) S (t) = 1

and an ACK will be sent to the vehicle.

Let

x (t)

and

\hat{x} (t)

denote the current status of the monitored vehicle and the estimated status of the vehicle at the center, and

Q (t) = x (t) - \hat{x} (t)

denotes the estimation error. Similar to [26], we further assume that the time period of a packet transmission is less than a time slot and the estimation at the center equals the latest status information received by the center. This estimation scheme is easy to implement, theoretically tractable and has been proven to be an optimal policy that can minimize the average squared error of status estimation in a remote estimation system under energy constraints when the monitored process is a Wiener process [31]. Then, the recurrence relation of the estimation error

Q (t)

is:

Q (t + 1) = (1 - U (t) S (t)) Q (t) + A (t) .

(3)

Equation (3) indicates that the estimation error will be the amount of variation of the monitored process from the generation time of the latest received status to the current time. The increment

A (t)

represents the variation of the monitored process. For example, when

A (t)

follows a Gaussian distribution with a mean of zero and variance of

σ^{2}

, represented by

N (0, σ^{2})

, the monitored status follows a Wiener process. When

A (t)

takes values from

{0, 1, - 1}

with a probability of

{1 - 2 p_{r w}, p_{r w}, p_{r w}}

, where

0 < p_{r w} < \frac{1}{2}

, then the status of the monitored source will be a one-dimensional random walk. In this paper, we assumed that the monitored status of the vehicle is a Wiener process and

A (t)

is i.i.d. over time. However, the increment in estimation error during a single slot cannot be infinite in practical systems. Therefore, in contrast to [27,30], we assumed that increment

A (t)

obeys a truncated Gaussian distribution, i.e., the probability density function (PDF) of

A (t)

is:

\begin{matrix} f_{A (t)} (a) = \frac{\frac{1}{σ} ϕ (\frac{a - μ}{σ})}{Φ (\frac{A_{m a x} - μ}{σ}) - Φ (\frac{- A_{m i n} - μ}{σ})}, \end{matrix}

(4)

where

μ

and

σ

are the expectation and standard deviation of increment

A (t)

.

ϕ

and

Φ

are the PDF and the cumulative distribution function (CDF) of standard normal distribution. We also assumed

A (t) \in [- A_{m i n}, A_{m a x}], A_{m a x} = A_{m i n} > 0

and

μ = 0

.

Meanwhile, the scheduling policy of information updates should also be related to the situation and environment of the system. For example, when the system is in an emergency, it should be very sensitive to the accuracy and the delay of the status information, thus the status should be updated more frequently. Therefore, our objective is to find a policy telling the vehicle whether to transmit status information or not in each slot for a minimum average UoI over time under the constraint:

\begin{matrix} min_{U (t)} & \underset{T \to \infty}{lim sup} \frac{1}{T} E [\sum_{t = 0}^{T - 1} w (t) Q {(t)}^{2}] \\ s . t . & \underset{T \to \infty}{lim sup} \frac{1}{T} \sum_{t = 0}^{T - 1} E [U (t)] \leq ρ, \end{matrix}

(5)

where

ω (t) > 0

is the context-aware weight, which is independent with

Q (t)

.

ρ \in (0, 1]

is the maximum average update frequency. The cost function of the estimation error used here is

δ (Q (t)) = {(Q (t))}^{2}

, which is inspired by the squared error of status estimation.

3. Scheduling with CMDP-Based Approach

In this section, we start by formulating problem (5) into a constrained Markov decision process (CMDP) with assumptions on the distribution of the context-aware weight. We will prove the threshold structure of the UoI-optimal updating policy and derive the optimal policy through a linear programming (LP) formulation.

3.1. Constrained Markov Decision Process Formulation

In the remote monitoring system, the context may be related to the distance between adjacent vehicles/mobile devices, the unexpected maneuver of the neighboring vehicles, etc. In [32], the authors prove that whether the distance between two mobile wireless devices with Ornstein–Uhlenbeck mobility is less than a certain threshold follows a first-order Markov process. When the two devices are closer, they are more interested in each other’s status information, communication and computing resources to facilitate cooperation, share resources, and avoid collisions. At this time, the transmission of status information is more urgent than when the two devices are far apart. As for the unexpected maneuver of the neighboring vehicles, it is very challenging to find a proper formulation. Instead, we assumed that such emergencies occur independently in each slot according to a certain probability. Therefore, in contrast to [27,30], we assumed that the context-aware weight

ω (t)

is i.i.d. over time or a first-order irreducible positive recurrent Markov process and formulated the problem (5) as a CMDP problem. The irreducible positive recurrent Markov formulation guarantees the existence of the UoI-optimal policy (see Appendix A). In this section, we will first focus on the situation where

ω (t)

is a first-order Markov process:

State space: The state of the vehicle in slot t, denoted by $s (t) = (Q (t), ω (t))$ , includes the current estimation error and the context-aware weight. Then, we discretize $Q (t)$ with the step size $Δ_{Q} > 0$ , i.e., the estimation error $Q (t) \in Q = {0, \pm Δ_{Q}, \pm 2 Δ_{Q}, \dots, \pm n Δ_{Q}, \dots}$ . For example, when $Q (t) \in [n Δ_{Q} - \frac{1}{2} Δ_{Q}, n Δ_{Q} + \frac{1}{2} Δ_{Q})$ , its value will be taken as $n Δ_{Q}$ . The smaller the step size $Δ_{Q}$ , the smaller the performance degradation caused by discretization. In addition, the value set of the context-aware weight is denoted by $W$ . Then, the state space $S = {Q \times W}$ is thus countable but infinite.
Action space: At each slot, the vehicle can take two actions, namely $U (t) \in U = {0, 1}$ , where $U (t) = 1$ denotes the vehicle deciding to transmit updates in slot t and $U (t) = 0$ denotes the vehicle deciding to wait.
Probability transfer function: After taking action U at state $s = (Q, ω)$ , the next state is denoted by $s^{'} = (Q^{'}, ω^{'})$ . When the vehicle decides not to transmit or the transmission fails, the probability of the estimation error transferring from Q to $Q^{'}$ is written as $\Pr {Q^{'} - Q = a} = p_{a}$ . Due to the discretization of the estimation error, the increment $a \in A = {0, \pm Δ_{Q}, \pm 2 Δ_{Q}, \dots, \pm A_{m}}$ , where $A_{m} = ⌊ \frac{A_{m a x}}{Δ_{Q}} ⌋ Δ_{Q} > 0$ . In addition, $p_{a} = F_{A} (a + \frac{1}{2} Δ_{Q}) - F_{A} (a - \frac{1}{2} Δ_{Q})$ , where $F_{A} (a)$ is the CDF of increment $A (t)$ . In addition, the probability of the context-aware weight transferring from $ω$ to $ω^{'}$ is written as $\Pr {ω \to ω^{'}} = p_{ω ω^{'}}$ . Based on the assumption that the context-aware weight $ω (t)$ is independent with the estimation error $Q (t)$ , then the probability of the state transferring from $s = (Q, ω)$ to $s^{'} = (Q^{'}, ω^{'})$ given action U is:

$\begin{matrix} \Pr {s \to s^{'} | U} = \Pr {(Q, ω) \to (Q^{'}, ω^{'}) | U} \\ = \{\begin{matrix} p_{ω ω^{'}} p_{Q^{'} - Q} & , U = 0, \\ p_{ω ω^{'}} ((1 - p_{s}) p_{Q^{'} - Q} + p_{s} p_{Q^{'} - 0}) & , U = 1 . \end{matrix} \end{matrix}$

(6)
One-step cost: The cost caused by taking action U in state $(Q, ω)$ is:

$\begin{matrix} C (Q, ω, U) = ω Q^{2}, \end{matrix}$

(7)

while the one-step updating penalty only depends on the chosen action:

$\begin{matrix} D (Q, ω, U) = U . \end{matrix}$

(8)

The average cost caused under a certain policy

π

is the average UoI, which is defined as

{\bar{C}}^{π}

and the average updating penalty under

π

is defined as

{\bar{D}}^{π}

. We aimed to find the UoI-optimal policy which minimizes the average cost under the resources constraint. Therefore, problem (5) can be formulated into the following CMDP problem:

\begin{matrix} min_{π} & {\bar{C}}^{π} = lim_{T \to \infty} \frac{1}{T} E_{π} [\sum_{t = 1}^{T} C (Q (t), ω (t), U (t))] \\ s . t . & {\bar{D}}^{π} = lim_{T \to \infty} \frac{1}{T} E_{π} [\sum_{t = 1}^{T} D (Q (t), ω (t), U (t))] \leq ρ . \end{matrix}

(9)

3.2. Threshold Structure of the Optimal Policy

We start from some basic definitions in [33] and show the properties of problem (9).

Definition 1.

A stationary deterministic policy is a policy that takes the same action whenever in a given state

s = (Q, ω)

, while a stationary randomized policy chooses to update or not in state s with a certain probability.

Theorem 1.

There exists an optimal stationary randomized policy for problem (9). The optimal policy is a probabilistic combination of two stationary deterministic policies. The two deterministic policies only differ on at most one state and each policy minimizes the unconstrained cost in (10) with a different Lagrange multiplier λ:

\begin{matrix} L_{λ}^{π} = lim_{T \to \infty} \frac{1}{T} E_{π} [\sum_{t = 1}^{T} [C (Q (t), ω (t), U (t)) + λ D (Q (t), ω (t), U (t))]] . \end{matrix}

(10)

Proof of Theorem 1.

The proof is shown in Appendix A. □

We denote the optimal policy that minimizes the unconstrained cost in (10) with a given

λ

by

π^{★}

and the cost obtained under policy

π^{★}

by

L_{λ}^{π^{★}}

, namely

L_{λ}^{π^{★}} = {min}_{π} L_{λ}^{π}

. Then, there exists a differential cost function

V (Q, ω)

that satisfies the Bellman Equation [34]:

\begin{matrix} V (Q, ω) + L_{λ}^{π^{★}} = min {C (Q, ω, 1) + λ D (Q, ω, 1) \\ + (1 - p_{s}) \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V (Q + a, ω^{'}) + p_{s} \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V (a, ω^{'}), \\ C (Q, ω, 0) + \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V (Q + a, ω^{'})} . \end{matrix}

(11)

To solve problem (5), we first prove that with a given

λ

, the optimal stationary deterministic policy

π^{★}

has a threshold structure. We then introduce a discounted problem with a discount factor

α

and the discounted cost starting from state

(Q, ω)

under a certain policy

π

is:

\begin{matrix} J_{α, π} (Q, ω) = & lim_{T \to \infty} E_{π} [\sum_{t = 0}^{T} α^{t} [C (Q (t), ω (t), U (t)) \\ + λ D (Q (t), ω (t), U (t))] ∣ (Q (0) = Q, ω (0) = ω)] . \end{matrix}

(12)

Denote the minimum cost starting from state

(Q, ω)

by

V_{α} (Q, ω) = {min}_{π} J_{α, π} (Q, ω)

. Then, we have:

\begin{matrix} V_{α} (Q, ω) = & min {C (Q, ω, 1) + λ D (Q, ω, 1) + (1 - p_{s}) α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (Q + a, ω^{'}) \\ + & p_{s} α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (a, ω^{'}), C (Q, ω, 0) \\ + & α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (Q + a, ω^{'})} . \end{matrix}

(13)

Define

Δ (Q, ω)

as the difference between the value functions by taking the two different actions

U = 0, 1

, meaning that:

\begin{matrix} Δ (Q, ω) = C (Q, ω, 0) + α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (Q + a, ω^{'}) \\ - & C (Q, ω, 1) - λ D (Q, ω, 1) - p_{s} α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (a, ω^{'}) \\ - & (1 - p_{s}) α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (Q + a, ω^{'}) \\ = & p_{s} α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} \{V_{α} (Q + a, ω^{'}) - V_{α} (a, ω^{'})\} - λ . \end{matrix}

(14)

Define

\sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (Q + a, ω)

as a function

f_{α} (Q, ω)

. Then we will prove that for

\forall | Q_{1} | < | Q_{2} |

, we have

f_{α} (Q_{1}, ω) < f_{α} (Q_{2}, ω)

. To this end, we first prove the following Lemma 1.

Lemma 1.

For a given discount factor α and a fixed context-aware weight ω, the value function for Q equals the value function for

- Q

, namely:

V_{α} (Q, ω) = V_{α} (- Q, ω) .

Proof of Lemma 1.

The Lemma is proven by induction. Define

V_{α}^{(k)} (Q, ω)

as the value function obtained after the

k^{th}

iteration. Assume that for

\forall Q

, we have:

V_{α}^{(k)} (Q, ω) = V_{α}^{(k)} (- Q, ω)

. If action U is taken in the

k^{th}

iteration, then the expected discounted cost is defined as

J_{α, U}^{(k)} (Q, ω)

. Therefore,

V_{α}^{(k + 1)} (Q, ω) = {min}_{U} J_{α, U}^{(k)} (Q, ω)

. We have:

\begin{matrix} J_{α, 0}^{(k)} (Q, ω) & = C (Q, ω, 0) + α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α}^{(k)} (Q + a, ω^{'}) \\ = ω {(- Q)}^{2} + α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α}^{(k)} (- Q - a, ω^{'}) \\ = C_{X} (- Q, ω, 0) + α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α}^{(k)} (- Q + a, ω^{'}) = J_{α, 0}^{(k)} (- Q, ω) . \end{matrix}

(15)

Similarly, we can further prove that

J_{α, 1}^{(k)} (Q, ω) = J_{α, 1}^{(k)} (- Q, ω)

. Notice that the value function obtained in

{(k + 1)}^{th}

iteration is obtained by:

V_{α}^{(k + 1)} (Q, ω) = {min}_{U} J_{α, U}^{(k)} (Q, ω),

and for any action U,

J_{α, U}^{(k)} (Q, ω) = J_{α, U}^{(k)} (- Q, ω)

. Thus,

V_{α}^{(k + 1)} (Q, ω) = V_{α}^{(k + 1)} (- Q, ω)

. By letting

k \to \infty

,

V_{α}^{(k)} (Q, ω) \to V_{α} (Q, ω)

. Hence,

V_{α} (Q, ω) = V_{α} (- Q, ω)

. □

Lemma 2.

For a given discount factor α and a fixed context-aware weight ω, function

f_{α} (Q, ω)

for Q increases monotonically with the absolute value of Q, namely: for

\forall | Q_{1} | < | Q_{2} |, f_{α} (Q_{1}, ω) < f_{α} (Q_{2}, ω)

.

Proof of Lemma 2.

Using the induction method, we first assume that for

\forall | Q_{1} | < | Q_{2} |

, we have

f_{α}^{(k)} (Q_{1}, ω) < f_{α}^{(k)} (Q_{2}, ω)

. Therefore:

\begin{matrix} J_{α, 0}^{(k)} (Q_{1}, ω) & = C (Q_{1}, ω, 0) + α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α}^{(k)} (Q_{1} + a, ω^{'}) \\ = ω Q_{1}^{2} + α \sum_{ω^{'} \in W} p_{ω ω^{'}} f_{α}^{(k)} (Q_{1}, ω^{'}) \\ < C (Q_{2}, ω, 0) + α \sum_{ω^{'} \in W} p_{ω ω^{'}} f_{α}^{(k)} (Q_{2}, ω^{'}) \\ = J_{α, 0}^{(k)} (Q_{2}, ω) . \end{matrix}

(16)

Similarly, we can obtain

J_{α, 1}^{(k)} (Q_{1}, ω) < J_{α, 1}^{(k)} (Q_{2}, ω)

. Meanwhile,

V_{α}^{(k + 1)} (Q, ω) = {min}_{U} J_{α, U}^{(k)} (Q, ω)

, then we have

V_{α}^{(k + 1)} (Q_{1}, ω) < V_{α}^{(k + 1)} (Q_{2}, ω)

, for

\forall | Q_{1} | < | Q_{2} |

. Obviously, if we want to use induction to complete the proof of Lemma 2, we have to prove that:

f_{α}^{(k + 1)} (Q_{1}, ω) < f_{α}^{(k + 1)} (Q_{2}, ω)

, for

\forall | Q_{1} | < | Q_{2} |

. To simplify the proof, it is assumed that

Q_{2} > Q_{1} > 0

. The discussion will be divided into the following three situations.

When $A_{m} \leq | Q_{1} |$ , then $| Q_{1} + a | < | Q_{2} + a |$ , for $\forall a \in [- A_{m}, A_{m}]$ , we can derive that:

$\begin{matrix} f_{α}^{(k + 1)} (Q_{1}, ω) = & \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α}^{(k + 1)} (Q_{1} + a, ω) \\ < & \sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α}^{(k + 1)} (Q_{2} + a, ω) = f_{α}^{(k + 1)} (Q_{2}, ω) . \end{matrix}$

(17)
When $A_{m} > | Q_{2} |$ , there exists an increment $a^{'} \in A^{'} = {a | a \in [- A_{m}, - \frac{1}{2} (Q_{1} + Q_{2})}$ , such that $| Q_{1} + a^{'} | > | Q_{2} + a^{'} |$ , and $V_{α}^{(k + 1)} (Q_{1} + a^{'}, ω^{'}) > V_{α}^{(k + 1)} (Q_{2} + a^{'}, ω^{'})$ . Notice that $- Q_{1} - a^{'} \in (\frac{1}{2} (Q_{2} - Q_{1}), A_{m} - Q_{1}]$ and $Q_{2} + a \in [- A_{m} + Q_{2}, A_{m} + Q_{2}]$ , then $p_{- Q_{1} - a^{'} - Q_{2}} V_{α}^{(k + 1)} (- Q_{1} - a^{'}, ω)$ is a term in the summation $f_{α}^{(k + 1)} (Q_{2}, ω)$ , namely $\sum_{a = - A_{m}}^{A_{m}} p_{a} V_{α} (Q + a, ω)$ . Similarly, $p_{- Q_{2} - a^{'} - Q_{1}} V_{α}^{(k + 1)} (- Q_{2} - a^{'}, ω)$ is a term in the summation $f_{α}^{(k + 1)} (Q_{1}, ω)$ . We further define $A^{″} = {a | a = - Q_{1} - Q_{2} - a^{'}}$ , since $- Q_{1} - Q_{2} - a^{'} \in (- \frac{1}{2} (Q_{1} + Q_{2}), A_{m} - Q_{1} - Q_{2}]$ , then $A^{'} \cap A^{″} = ⌀$ .
Furthermore, the probability of the estimation error transferring from $Q_{1}$ to $- Q_{2} - a^{'}$ , i.e., $p_{- Q_{2} - a^{'} - Q_{1}}$ equals $p_{- Q_{1} - a^{'} - Q_{2}}$ , the probability of the estimation error transferring from $Q_{2}$ to $- Q_{1} - a^{'}$ . Since $- a^{'} \in (\frac{1}{2} (Q_{1} + Q_{2}), A_{m}]$ , then $| a^{'} | > | - Q_{1} - Q_{2} - a^{'} |$ . According to our assumption of the increment, we can prove that for any $a^{'} \in A^{'}$ , $p_{a^{'}} < p_{- Q_{1} - Q_{2} - a^{'}}$ . Then, we can derive:

$\begin{matrix} f_{α}^{(k + 1)} (Q_{1}, ω) - f_{α}^{(k + 1)} (Q_{2}, ω) \\ = & \sum_{a \in A^{'}} p_{a} V_{α}^{(k + 1)} (Q_{1} + a, ω) + \sum_{a \in A^{″}} p_{a} V_{α}^{(k + 1)} (Q_{1} + a, ω) \\ - & \sum_{a \in A^{'}} p_{a} V_{α}^{(k + 1)} (Q_{2} + a, ω) - \sum_{a \in A^{″}} p_{a} V_{α}^{(k + 1)} (Q_{2} + a, ω) + M (Q_{1}, Q_{2}) \\ = & \sum_{a \in A^{'}} p_{a} {V_{α}^{(k + 1)} (Q_{1} + a, ω) - V_{α}^{(k + 1)} (Q_{2} + a, ω)} \\ + & \sum_{a \in A^{'}} p_{- Q_{1} - Q_{2} - a} {V_{α}^{(k + 1)} (Q_{2} + a, ω) - V_{α}^{(k + 1)} (Q_{1} + a, ω)} + M (Q_{1}, Q_{2}) \\ = & \sum_{a \in A^{'}} (p_{a} - p_{- Q_{1} - Q_{2} - a}) {V_{α}^{(k + 1)} (Q_{1} + a, ω) - V_{α}^{(k + 1)} (Q_{2} + a, ω)} + M (Q_{1}, Q_{2}) < 0, \end{matrix}$

(18)

where $M (Q_{1}, Q_{2}) = \sum_{a \notin A^{'} \cup A^{″}} p_{a} (V_{α}^{(k + 1)} (Q_{1} + a, ω) - V_{α}^{(k + 1)} (Q_{2} + a, ω)) < 0$ .
When $| Q_{2} | > A_{m} > | Q_{1} |$ , since $a^{'} \in [- A_{m}, - \frac{1}{2} (Q_{1} + Q_{2}))$ , we only need to consider the case when $A_{m} > \frac{1}{2} (Q_{1} + Q_{2})$ , in this case $- Q_{1} - a^{'} > \frac{1}{2} (Q_{2} - Q_{1}) > Q_{2} - A_{m}$ . Therefore, $p_{- Q_{1} - a^{'} - Q_{2}} V_{α}^{(k + 1)} (- Q_{1} - a^{'}, ω)$ is a term in the summation $f_{α}^{(k + 1)} (Q_{2}, ω)$ . Similarly, we can also prove that $f_{α}^{(k + 1)} (Q_{1}, ω) < f_{α}^{(k + 1)} (Q_{2}, ω)$ when $| Q_{2} | > A_{m} > | Q_{1} |$ .

According to Lemma 1, the conclusions above can be easily generalized to the cases without the condition

Q_{2} > Q_{1} > 0

. Finally, by letting

k \to \infty

,

V_{α}^{(k + 1)} (Q, ω) \to V_{α} (Q, ω)

, therefore:

f_{α}^{(k + 1)} (Q, ω) \to f_{α} (Q, ω)

. Hence:

f_{α} (Q_{1}, ω) < f_{α} (Q_{2}, ω)

. □

Remark 1.

Lemma 2 holds when

f_{A} (a)

, i.e., the PDF of increment

A (t)

satisfies the following conditions:

$f_{A} (a) = f_{A} (- a)$ , $μ = 0$ ;
$f_{A} (a_{2}) \leq f_{A} (a_{1}), \forall a_{2} \geq a_{1} \geq 0$ .

Then, with Lemmas 1 and 2, we can prove the threshold structure of the optimal stationary deterministic policy which minimizes

L_{λ}^{π}

in (10).

Theorem 2.

For a given λ, the optimal stationary deterministic policy which minimizes

L_{λ}^{π}

in (10) has a threshold structure when the context-aware weight is a first-order irreducible positive recurrent Markov process.

Proof of Theorem 2.

Let

s_{α}^{*} (Q, ω)

denote the optimal action which minimizes the discounted cost

V_{α} (Q, ω)

at state

(Q, ω)

. If the optimal action

s_{α}^{*} (Q, ω) = 1

, then the vehicle will transmit its status update to the center at state

(Q, ω)

and

Δ (Q, ω) \geq 0

. Thus, we have:

\begin{matrix} Δ (Q, ω) = p_{s} α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} \{V_{α} (Q + a, ω^{'}) - V_{α} (a, ω^{'})\} - λ \geq 0 . \end{matrix}

(19)

According to Lemma 2, for any

| Q^{'} | > | Q |

,

Δ (Q^{'}, ω)

can be lower bounded by

\begin{matrix} Δ (Q^{'}, ω) & = p_{s} α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} \{V_{α} (Q^{'} + a, ω^{'}) - V_{α} (a, ω^{'})\} - λ \\ \geq p_{s} α \sum_{ω^{'} \in W} p_{ω ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} \{V_{α} (Q + a, ω^{'}) - V_{α} (a, ω^{'})\} - λ \geq 0 . \end{matrix}

(20)

If

Δ (Q, ω) > 0

, then for any states with

| Q^{'} | > | Q |

, the optimal policy is to transmit the status to the center. If

Δ (Q, ω) < 0

, then for any states with

| Q^{'} | < | Q |

, the optimal action is not to transmit. In addition, the optimal policy will not be choosing to wait in all the slots. Therefore, for each context-aware weight

ω

, there must be a threshold

τ_{ω} \geq 0

. For any state

(Q, ω)

with

| Q | > τ_{ω}

, the optimal choice is to transmit the status update. We can then conclude that for a given weight

ω

, the optimal policy with a discount factor

α

has a threshold structure.

Let

{α_{1}, α_{2}, \dots, α_{k}}

denote a sequence of discount factors and

α_{k}

converges to 1 when

k \to \infty

. Then, the optimal deterministic policy for

α = 1

will also converge to the optimal policy with a discount factor which is less than 1 [35]. Similar derivation is also applied in [12]. Therefore, we can prove the threshold structure of the optimal stationary deterministic policy which minimizes

L_{λ}^{π}

. □

Similarly, when the context-aware weight is i.i.d. over time, we can obtain the following theorem:

Theorem 3.

For a given λ, the optimal stationary deterministic policy which minimizes

L_{λ}^{π}

in (10) has a threshold structure when the context-aware weight is i.i.d. over time. The thresholds are the same for each state of the context-aware weight.

Proof of Theorem 3.

If the context-aware weight is i.i.d. over time, then we have:

\begin{matrix} Δ (Q, ω) = p_{s} α \sum_{ω^{'} \in W} p_{ω^{'}} \sum_{a = - A_{m}}^{A_{m}} p_{a} \{V_{α} (Q + a, ω^{'}) - V_{α} (a, ω^{'})\} - λ = Δ (Q), \end{matrix}

(21)

where

p_{ω}

is the probability of the value of the context-aware weight being in state

ω

. Therefore, in this case, the state will be reduced to one dimension and the thresholds will be the same for all the states of the context-aware weight. □

According to Theorems 2 and 3, we proved the threshold structure of the two stationary deterministic policies that compose the UoI-optimal policy. Since the UoI-optimal policy for problem (9) is a probabilistic combination of two deterministic policies with threshold structures, we can finally draw the conclusion that the UoI-optimal policy also has a threshold structure.

3.3. Numerical Solution of Optimal Strategy

Based on Theorem 2, we only need to consider the policy that chooses to update with a probability of 1 in state

(Q, ω)

, for

\forall | Q | \geq Q_{max} = {max}_{ω} τ_{ω}

. Let

μ_{Q, ω}

denote the probability that the state of the vehicle is

(Q, ω)

.

y_{Q, ω}

denotes the probability that the state is

(Q, ω)

and the vehicle chooses to transmit an update. Therefore, we have:

Theorem 4.

When the context-aware weight is a first-order irreducible positive recurrent Markov process, the UoI-optimal policy can be derived by solving the following LP problem:

\begin{matrix} {μ_{Q, ω}^{*}, y_{Q, ω}^{*}} = & arg min_{{μ_{Q, ω}, y_{Q, ω}}} \sum_{ω \in W} \sum_{Q = - Q_{max}}^{Q_{max}} ω Q^{2} μ_{Q, ω}, \end{matrix}

(22a)

\begin{matrix} s . t . & \sum_{ω \in W} \sum_{Q = - Q_{max}}^{Q_{max}} μ_{Q, ω} = 1, \end{matrix}

(22b)

\begin{matrix} \sum_{ω \in W} \sum_{Q = - Q_{max}}^{Q_{max}} y_{Q, ω} \leq ρ, \end{matrix}

(22c)

\begin{matrix} y_{Q, ω} \leq μ_{Q, ω}, \forall Q, ω, \end{matrix}

(22d)

\begin{matrix} 0 \leq y_{Q, ω} \leq 1, 0 \leq μ_{Q, ω} \leq 1, \forall Q, ω, \\ μ_{Q, ω} = \sum_{ω^{'} \in W} \sum_{Q^{'} = - Q_{max}}^{Q_{max}} y_{Q^{'}, ω^{'}} p_{s} p_{Q} p_{ω ω^{'}} \end{matrix}

(22e)

\begin{matrix} + \sum_{ω^{'} \in W} \sum_{Q^{'} = - Q_{max}}^{Q_{max}} (μ_{Q^{'}, ω^{'}} - y_{Q^{'}, ω^{'}} p_{s}) p_{Q^{'} - Q} p_{ω ω^{'}} . \end{matrix}

(22f)

Proof of Theorem 4.

We first derive the average UoI

{\bar{C}}^{π}

as a function of

μ_{Q, ω}

and

y_{Q, ω}

. The vehicle is in state

(Q, ω)

and produces a cost of

C (Q, ω, u) = ω Q^{2}

with a probability of

μ_{Q, ω}

. Therefore, the average UoI is:

\begin{matrix} \sum_{ω \in W} \sum_{Q = - Q_{max}}^{Q_{max}} ω Q^{2} μ_{Q, ω} . \end{matrix}

(23)

As for the constraints, (22b) means that the sum of the probabilities of all the states should be 1. To explain (22c), we note that

y_{Q, ω}

is the probability of the vehicle being in state

(Q, ω)

and choosing to transmit the update, then the expectation of a one-step updating penalty for state

(Q, ω)

in (8) is

μ_{Q, ω}

. Therefore, the constraint on average update frequency

{\bar{D}}^{π}

can be illustrated by

\begin{matrix} \sum_{ω \in W} \sum_{Q = - Q_{max}}^{Q_{max}} μ_{Q, ω} \leq ρ . \end{matrix}

(24)

Then, we introduce

ξ_{Q, ω} \in [0, 1]

to represent that the probability of the vehicle choosing to transmit updates in state

(Q, ω)

and (22d) can be obtained by the fact that

y_{Q, ω} = μ_{Q, ω} ξ_{Q, ω}

, while (22e) is derived by the nature of probability.

The right-hand side of (22f) can be viewed as two terms. The first term is the sum of transition probability from all the states to state

(Q, ω)

when the vehicle chooses to update and the transmission of status is successful. The second term is the sum of transition probability from all the states to state

(Q, ω)

when the transmission is failed or the vehicle chooses to wait. Therefore, we can prove that the optimal solution of problem (5) equals the solution of the LP problem. □

When

ω (t)

is i.i.d. over time, we can also obtain the UoI-optimal policy through the LP problem proposed in Theorem 4 and only need to use

p_{ω^{'}}

as a replacement of

p_{ω ω^{'}}

.

4. Scheduling in Unknown Contexts

To make decisions, the UoI-optimal updating policy obtained in Section 3 still needs the distributions of the context-aware weight

ω (t)

, the increment

A (t)

and the successful transmission probability, which may not be available in advance or may change over time in most practical systems. To solve this problem, we will assume that the distribution of the context-aware weight is not pre-determined and the vehicle has to learn it. In this section, we use the reinforcement learning (RL) algorithm to learn the dynamic of the context and the characteristic of the wireless channel.

To solve this problem, we turn to the model-based RL framework proposed in [36]. We only consider the cases when the UoI-optimal policy has a threshold structure. This assumption makes the optimal policy based on the truncated state space equal the optimal policy of the original problem.

We use the 3-tuple

(s, s^{'}, U)

to formulate the proposed RL-based updating policy. The states in the current slot and next slot are denoted by s and

s^{'}

, respectively. U denotes the action chosen in the current slot. The settings of the discretized state space and the action space are the same as the settings proposed in Section 3.1. The smaller the step size used in the discretization is, the closer our results are to those in continuous state space. In addition, the selection of the step size only affects the accuracy of the update threshold. Therefore, the performance loss caused by discretization can be reduced by choosing a smaller step size.

We display details about the proposed RL-based updating policy in Algorithm 1. At the beginning of episode k, we randomly decide whether to explore or exploit.

l \in [0, 1]

represents the trade-off between exploration and exploitation during the following episode. A larger l means a higher frequency of exploration and vice versa. If the algorithm chooses to explore during this episode, a random policy

π_{r a n d} (s)

will be used, i.e., we randomly choose to update or not in each state to find more valuable actions. If the algorithm chooses to exploit, then we have to obtain the probability transfer functions

{\tilde{p}}_{k} (s^{'} | s, U)

for each state transmission pair. In Algorithm 1,

N (s, U)

and

N (s, U, s^{'})

represent the number of occurrences of state–action pair

s, U

and state transition from s to

s^{'}

given action U, respectively. Based on the assumption that the optimal policy has a threshold structure, the policy

π (k)

which can minimize the average UoI with the estimated probability transfer functions, can be directly solved through the LP problem proposed in Theorem 4. Then, the vehicle will use policy

π_{k}

to derive state–action pairs and the state transitions in the following

⌈ L_{k} ⌉

slots. Here,

L > 0

is defined to control the number of state transitions observed in each episode. At the end of each episode, the model will be updated according to the state–action pairs and the state transitions observed during the episode. Finally, after K episodes, the algorithm will output the RL-based updating policy

π^{★} (s)

, which is derived based on

{\tilde{p}}_{K} (s^{'} | s, U)

.

Algorithm 1 RL-based Updating Policy

Input:

l \in [0, 1], L > 0, K > 0

1:: for episodes $k = 1, 2, \dots, K$ do
2:: Set $L_{k} = L \sqrt{k}, ϵ_{k} = l / \sqrt{k}$ , uniformly draw $α \in [0, 1]$ .
3:: if $α < ϵ_{k}$ then
4:: Set $π_{k} (s) = π_{r a n d} (s),$
5:: else
6:: for each state $s, s^{'} \in S$ and $U \in U$ do
7:: if $N (s, U) > 0$ then
8:: Let ${\tilde{p}}_{k} (s^{'} | s, U) = N (s, U, s^{'}) / N (s, u),$
9:: else
10:: ${\tilde{p}}_{k} (s^{'} | s, U) = 1 / | S | .$
11:: end if
12:: end for
13:: obtain policy $π_{k} (s)$ by solving the estimated CMDP
14:: end if
15:: Randomly choose an initial state $s (1)$ .
16:: for slots $t = 1, 2, \dots, ⌈ L_{k} ⌉ - 1$ do
17:: Choose action $U (t)$ as $π_{k} (s (t))$ .
18:: Observe the next state $s (t + 1)$ .
19:: $N (s (t), U (t), s (t + 1)) = N (s (t), U (t), s (t + 1)) + 1$ .
20:: $N (s (t), U (t)) = N (s (t), U (t)) + 1$ .
21:: $s (t) \leftarrow s (t + 1)$ .
22:: end for
23:: end for
24:: obtain policy $π^{★} (s)$ by solving the estimated CMDP based on ${\tilde{p}}_{k} (s^{'} | s, U)$ , $s, s^{'} \in S, U \in U$ .

Output: output the RL-based updating policy

π^{★} (s)

5. Simulation Results and Discussion

5.1. Simulation Setup

To facilitate the simulation, we consider the case where the context-aware weight of the vehicle only has two different states: the ’normal’ state and ’urgent’ state. The ’normal’ state means that the vehicle is in ordinary situations and the significance of accuracy of status information is relatively low. We set

ω (t)

as 1 in ’normal’ state while

ω (t)

is set as a constant much larger than 1,

ω_{e}

, in ’urgent’ state to show that the vehicle is in emergencies. Two different distributions of the context-aware weight are taken into consideration to conform to the assumptions about

ω (t)

used in Section 3.1:

The context-aware weight $ω (t)$ has the first-order Markov property. The state transition diagram of $ω (t)$ is shown in Figure 2 and $ω (t)$ is irreducible and positive recurrent. $p_{1}$ is the probability of the context-aware weight transferring from the normal state to the urgent state, while $p_{2}$ is the probability of the weight transferring from the urgent state to the normal state;
The context-aware weight $ω (t)$ is i.i.d. over time. The probability of the weight being in the urgent state and the normal state are denoted by $p_{h}$ and $p_{l}$ , respectively.

As for the increment

A (t)

,

A_{m a x}

is set to a large enough positive number to simplify the simulations.

5.2. Numerical Results

Figure 3 shows the structure of the UoI-optimal updating policy. For the discretization of the estimation error, the step size used is 1. It can be seen that under the two different distributions of the context-aware weight mentioned above, the optimal updating policies all have threshold structures. Especially when the context-aware weight is i.i.d. over time, Figure 3b shows that thresholds for all the states of the context-aware weight are the same, which matches well with theoretical analysis. From Figure 3c, we can find that the UoI-optimal policy also has threshold structure when increment

A (t)

obeys a uniform distribution

U n i f (- 3, 3)

, which verifies Remark 1. We then simulate the UoI-optimal policy under the contexts with more states to show the policy is generic. We consider a three-state context-aware weight which takes value from

ω_{1} = 1, ω_{2} = 50, ω_{3} = 100

. The state transition matrix

P_{3}

of the three-state context-aware weight is:

P_{3} = [\begin{matrix} 0.997 & 0.002 & 0.001 \\ 0.02 & 0.97 & 0.01 \\ 0.2 & 0.1 & 0.7 \end{matrix}],

(25)

where the j-th element on the i-th row indicates the probability that the context transfers from state

ω_{i}

to state

ω_{j}

. The numerical results (Figure 3d) show that when the context-aware weight has more states, the UoI-optimal policy still has a threshold structure, which verifies our theoretical results.

Then, we will focus on the results obtained when the context-aware weight is a first-order irreducible positive recurrent Markov process, as shown in Figure 2. Figure 4 shows the average UoI of the UoI-optimal policy, the AoI-optimal policy derived by CMDP, the RL-based updating policy, and the update-index-based adaptive scheme [27]. In the RL-based updating policy,

L = 8000

,

l = 1

and

K = 50

. All the numerical results of the RL-based policy are averaged over 100 runs.

First of all, the UoI-optimal policy can only be obtained based on advanced information about the system dynamics. However, the RL-based policy achieves near-optimal without knowing the system dynamics, which indicates that Algorithm 1 learns relatively accurate probability transfer functions from the observed state–action pairs and state transitions during the training.

Secondly, according to Figure 4, the AoI-optimal policy yields a much higher UoI than the three UoI-based policies, namely the UoI-optimal policy, the RL-based updating policy, and the update-index-based adaptive scheme. On the one hand, AoI is one special case of UoI. When the context-aware weight

ω (t) = 1

, the increment

A (t) = 1

, and the cost function

δ (Q (t)) = Q (t)

, then UoI equals AoI. Therefore, the AoI-optimal policy ignores the fact that different contexts have different requirements for information freshness. In the proposed UoI-based updating policies, different contexts have different policies and update thresholds, while the AoI-optimal updating policies for different contexts are the same. On the other hand, Figure 5 reveals that the AoI-optimal policy leads to a much higher estimation error, which results in worse performance in terms of UoI. The AoI-optimal policy is an oblivious policy, which is independent of the monitored process. Since AoI increases linearly with time, the AoI-optimal policy can only minimize the linear performance degradation in terms of time. However, the UoI-based policies (the cost function

δ (Q (t)) = {(Q (t))}^{2}

) considered in this paper are process-dependent, which are called non-oblivious policies, and can benefit from both age and process realization [37]. These policies can directly minimize the nonlinear impact exerted by information staleness and the gap between the actual status and the estimated status.

Thirdly, our updating policies outperform the update-index-based adaptive scheme [27] in terms of UoI. Under the adaptive scheme, the vehicle will derive an update index as a function of the current estimation error and the context-aware weight for the next slot. If the index is larger than the adaptive update threshold, then the vehicle is supposed to transmit its status information to the center. If the vehicle transmits an update in slot t, then the adaptive threshold will increase in the next slot; otherwise, the adaptive threshold will decrease. The adaptive scheme will cause an overuse of the resource in `urgent’ states and lead to the fact that the vehicles cannot receive resources in `normal’ states. However, the UoI-optimal policy and the trained RL-updating policy are fixed schemes, which can avoid the extremely unbalanced resource allocation between the two contexts and achieve better performance.

Figure 6 shows the influence of the maximum average update frequency

ρ

and the context weight for emergency,

ω_{e}

, on update threshold of UoI-optimal policy. In order to obtain more accurate results, the step size used here is 0.25. The solid curves show update thresholds for the normal state while the dashed curves show update thresholds for the urgent state. When the constraint on update resources is strict, the update thresholds fall faster. Furthermore, a larger

ω_{e}

results in a lower update threshold for the urgent state and a higher threshold for the normal state. This phenomenon indicates that the value of

ω_{e}

means the tolerance of estimation error in the emergency. When

ρ < 0.1

, the influence of

ω_{e}

on the update threshold for the normal state is larger than the urgent state. For the cases where the maximum average update frequency is relatively large,

ω_{e}

has little effect on update thresholds for both normal state and urgent state.

Figure 7 shows that the update thresholds also depend on the dynamic of context-aware weight when the weight has first-order Markov property. When

p_{2}

is approaching

1 - p_{1}

, the gap between update thresholds for the urgent state and the normal state becomes smaller for the context-aware weight which tends to be i.i.d. over time. When

p_{2} = 1

, the update threshold for the urgent state exceeds the threshold for the normal state. Therefore, the update threshold for the urgent state is not necessarily lower than the update threshold for the normal state.

Figure 8 shows the performance of the RL-based updating policy with different values of L. According to Algorithm 1, the number of state transitions observed in episode k is

⌈ L \sqrt{k} ⌉

. Therefore, L denotes the number of state transitions observed during the whole learning process. Generally speaking, a larger L reduces the randomness of the performance and achieves a better UoI. The performance of the RL-based updating policy depends on the accuracy of the model obtained through training, namely whether the estimated probability transfer function of the system is accurate. A larger L means that the algorithm can collect more data or state transitions and obtain a more accurate model.

Figure 9 shows the influence of the number of episodes, i.e., K, on the performance of the RL-based updating policy. A larger K leads to a lower average UoI and smaller randomness over 100 runs. On the one hand, the more episodes and the more data the algorithm observes, the more accurate the model obtained will be and the better the performance of the updating policy will be. On the other hand, the value of K is the number of iterations for the policy obtained through the estimated CMDP. The policy

π_{k} (s)

used in episode k is derived based on the state–action pairs and the state transitions observed in the previous

k - 1

episodes. Therefore, more frequent iterations of the updating policy can obtain more valuable state–action pairs and better performance.

6. Conclusions

In this work, we studied how to minimize the performance degradation caused by outdated information in terms of UoI, which is a new metric jointly considering context and information freshness. We proved that the UoI-optimal updating policy for the considered single-user remote monitoring system has a single threshold structure. Then, the policy was obtained through linear programming by assuming that the state transition probability of the system is known in advance. In unknown contexts, we further used a reinforcement learning algorithm to learn the dynamics of the system. Simulations verified the threshold structure of the UoI-optimal policies and showed that the update thresholds decrease as the maximum average update frequency increases. In addition, a larger context-aware weight in emergencies resulted in a lower update threshold for urgent states. However, since the state transition probability also influenced the update thresholds, the update threshold for emergencies was not necessarily higher than the update threshold for normal states, especially when the probability of transferring from urgent states to normal states tended towards 1. Furthermore, the numerical results showed that the proposed RL-based updating policy achieved a near-optimal performance without advanced knowledge of the system model.

In fact, determining the context-aware weight in practical systems, where the models of the context are often very complicated and difficult to obtain in advance, remains open. As for future work, we plan to use deep RL algorithms to learn the models of the context variation. We believe that UoI can provide a new performance metric for information timeliness measurement in the future V2X scenario. In addition, we believe the proposed UoI metric and the context-aware scheduling policy can shed some light on low-latency and ultra-reliable wireless communication in the future 5G/6G systems.

Author Contributions

Conceptualization, L.W.; formal analysis, L.W.; methodology, L.W.; software, L.W.; supervision, S.Z. and Z.N.; writing—original draft, L.W.; writing—review and editing, J.S., Y.S., S.Z. and Z.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work is sponsored in part by the National Key R&D Program of China No. 2020YFB1806605, by the Nature Science Foundation of China (No. 62022049, No. 61871254, No. 61861136003), by the China Postdoctoral Science Foundation No. 2020M680558, and Hitachi Ltd.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AoI	Age of Information
AoII	Age of Incorrect Information
AoS	Age of Synchronization
CDF	Cumulative Distribution Function
CMDP	Constrained Markov Decision Process
CoUD	Cost of Update Delay
i.i.d.	Independent and Identically Distributed
IoT	Internet of Things
LP	Linear Programming
PDF	Probability Density Function
RL	Reinforcement Learning
UoI	Urgency of Information
URLLC	Ultra-Reliable Low-Latency Communication
VR	Virtual Reality
V2X	Vehicle to Everything

Appendix A. Proof of Theorem 1

Given a state

s = (Q, ω) \in S

and a nonempty subset of the state space,

G \subset S

, let

R (s, G)

denote the class of policies

θ

such that the probability

P^{θ} (s (t) \in G for some t \geq 1 | s (0) = s) = 1

and the expected time

m_{s, G} (θ)

of the first passage from s to

G

under policy

θ

is finite. Then, let

R^{★} (s, G)

denote the class of policies

θ

such that the expected average UoI

c_{s, G} (θ)

and the expected transmission cost

d_{s, G} (θ)

of the first passage from s to

G

are finite and

θ

\in R (s, G)

. To prove Theorem 1, we then introduce Assumptions A1–A5 in [33]:

Assumption A1.

For all

b > 0

, the set

G (b) ≜ {s | there exists an action U such that C (s, U) + D (s, U) \leq b}

is finite.

Assumption A2.

There exists a stationary deterministic policy π that induces a Markov chain with the following properties: the state space consists of a single (nonempty) positive recurrent class

R^{π}

and a set

T^{π}

of transient states such that

π \in R^{★} (s, R^{π})

, for any

s \in T^{π}

, and both the average UoI

{\bar{C}}^{π}

and the average transmission cost

{\bar{D}}^{π}

on

R^{π}

are finite.

Assumption A3.

Given any two states

s, s^{'} \in S

and

s \neq s^{'}

, there exists a policy π (a function of s and

s^{'}

) such that

π \in R^{★} (s, {s^{'}})

.

Assumption A4.

If a stationary deterministic policy has at least one positive recurrent state, then it has a single positive recurrent class, and this class contains the state

(Q, ω)

with

Q = 0

.

Assumption A5.

There exists a policy π such that the average UoI

{\bar{C}}^{π} < \infty

and average transmission cost

{\bar{D}}^{π} < ρ

.

Furthermore, the problem (9) has the following property:

Lemma A1.

Assumptions A1–A5 hold for problem (9).

Proof of Lemma A1.

First of all, we focus on the cases where the context-aware weight is assumed as a first-order irreducible positive recurrent Markov process:

Assumption A1: In this problem, $C (s, U)$ is the UoI at state s, namely $C (Q, ω, U) = ω Q^{2}$ . $D (s, U)$ is 1 if the vehicle chooses to transmit its status and $D (s, U)$ is 0 otherwise, namely $D (Q, ω, U) = U$ . Therefore, Assumption A1 holds, for any $b > 0$ , the number of states $(Q, ω)$ with $ω Q^{2} \leq b$ is finite.
Assumption A2: Due to the current high-level wireless communication technology, we reasonably assumed that the successful transmission probability $p_{s}$ is relatively close to 1. Based on the assumptions mentioned above, the Markov chain of context-aware weight obviously satisfies Assumption A2. Define the probability of the context-aware weight transferring from $ω$ to $ω^{'}$ in k steps for the first time as $P_{ω, ω^{'}, k}$ . Then, we consider the policy $π (Q, ω) = 1$ for all $(Q, ω) \in S$ , namely this policy chooses to transmit in all the states.
Since the evolution of the context-aware weight is independent with the evolution of the estimation error and the updating policy. Therefore, we first focused on the estimation error, which can be formulated as a one-dimensional irreducible Markov chain with state space $Q = {0, \pm Δ_{Q}, \pm 2 Δ_{Q}, \dots, \pm n Δ_{Q}, \dots}$ . We denote the set of states which can transfer to state Q in a single step by $Z_{Q}$ . The probability of the estimation error transferring from state Q to state $Q^{'}$ at the k-th step without an arrival to state $Q = 0$ is defined as $P_{Q, Q^{'}, k}^{'}$ . Obviously, $\sum_{Q^{'} \in Q} P_{Q, Q^{'}, k}^{'} < {(1 - p_{s})}^{k}$ . Then, the probability of the first passage from state $Q (Q \neq 0)$ to 0 taking $k + 1$ steps is $\sum_{Q^{'} \notin Z_{0}} P_{Q, Q^{'}, k}^{'} p_{s} + \sum_{Q^{'} \in Z_{0}} P_{Q, Q^{'}, k}^{'} (p_{s} + (1 - p_{s}) p_{0 - Q^{'}}) < {(1 - p_{s})}^{k}$ , where $p_{0 - Q^{'}}$ is the probability that the increment in estimation error is $- Q^{'}$ . Therefore, the expected time of the first passage from $Q (Q \neq 0)$ to 0 is finite.
For state $Q = 0$ , the estimation error will stay in this state in the next step with a probability of $p_{s} + p_{0 - 0}$ and will first return to state $Q = 0$ in the second transition with a probability smaller than $(1 - p_{s} - p_{0 - 0})$ . Then, starting from state $Q = 0$ , the estimation error will first return to state $Q = 0$ in the $k + 1$ -th ( $k > 2$ ) step will be smaller than $(1 - p_{s} - p_{0 - 0}) {(1 - p_{s})}^{k - 1}$ . Therefore, we can prove that state $Q = 0$ is a positive recurrent state, and $R_{Q}^{π} = {Q = 0}$ is a positive recurrent class of the induced Markov chain of the estimation error. Furthermore, for any states in $T_{Q}^{π} = Q ∖ R_{Q}^{π}$ , the expected time of the first passage from the state in $T_{Q}^{π}$ to state $Q = 0$ under $π$ is finite and the probability of the states in $T_{Q}^{π}$ not getting to state $Q = 0$ in k steps is smaller than ${(1 - p_{s})}^{k}$ .
Define the probability of state Q transferring to state $Q^{'}$ in k steps for the first time as $P_{Q, Q^{'}, k}$ . Then, the probability of state $(Q, ω)$ transferring to state $(Q^{'}, ω^{'})$ in k steps for the first time is $P_{Q, Q^{'}, k} P_{ω, ω^{'}, k}$ . Since $\sum_{k = 1}^{\infty} P_{Q, Q^{'}, k} k < \infty$ and $\sum_{k = 1}^{\infty} P_{ω, ω^{'}, k} k < \infty$ , then $\sum_{k = 1}^{\infty} P_{Q, Q^{'}, k} P_{ω, ω^{'}, k} k < \infty$ . Therefore, the set of states $R^{π} = {(Q, ω) | Q \in R_{Q}^{π}, ω \in W}$ is a positive recurrent class. Similarly, we can prove that $T^{π} = S ∖ R^{π}$ satisfies Assumption A2. Finally, ${\bar{D}}^{π} = 1 < \infty$ , ${\bar{C}}^{π} = E [ω] \frac{1}{p_{s}} σ^{2} < \infty$ .
Assumption A3: Define $P_{Q, min} = {min}_{Q^{'}} p_{Q - Q^{'}}$ , $P_{Q, m a x} = {max}_{Q^{'}} p_{Q - Q^{'}}$ . Consider the policy $π^{'} (Q, ω) = 0$ for all states $(Q, ω) \in S$ , notably that this policy chooses not to transmit in any states. Similarly, we first focus on the Markov chain of estimation error. Starting from state Q, the probability of transferring to state $Q^{'}$ in $k + 1$ -th $(k \geq 2)$ steps for the first time is smaller than $(1 - p_{Q^{'} - Q}) P_{Q^{'}, m a x} {(1 - P_{Q^{'}, m i n})}^{k - 1}$ . Then, the expected time of the first passage from state Q to state $Q^{'}$ under policy $π^{'}$ is finite. Similarly, since the Markov chain of context-aware weight is irreducible positive recurrent and independent with the updating policy, we can therefore prove that the expected time of the first passage from state $(Q, ω)$ to state $(Q^{'}, ω^{'})$ under policy $π^{'}$ is finite.
Assumption A4: For the Markov chain of the estimation error, any state will return to state $Q = 0$ if a successful transmission occurs. For the policy without transmission, namely $π^{'} (Q, ω) = 0$ , state $Q = 0$ still exists in only one positive recurrent class. For each positive recurrent class containing state $Q = 0$ , we can prove that there is only one positive recurrent class. Since the Markov chain of the context-aware weight is irreducible positive recurrent, we can similarly prove Assumption A4.
Assumption A5: The policy $π_{ρ}$ that updates the status with a probability of $ρ - δ$ satisfies Assumption A5. Here, $δ$ is a small positive number. Under this policy, ${\bar{D}}^{π} = ρ - δ < ρ$ and ${\bar{C}}^{π} = E [ω] \frac{1}{p_{s} (ρ - δ)} σ^{2} < \infty$ .

Similarly, we can prove that Assumptions A1–A5 also holds for problem (9) when the context-aware weight is i.i.d. over time. □

Since Assumptions A1–A5 hold for problem (9), then according to Theorem 2.5 in [33], there exists an optimal stationary randomized policy for problem (9). Meanwhile, the optimal policy is a probabilistic combination of two stationary deterministic policies which only differ on at most one state.

Furthermore, according to Lemma 3.9 in [33], the two stationary deterministic policies each optimize the unconstrained cost in (10) with a different

λ

.

References

Talak, R.; Karaman, S.; Modiano, E. Speed limits in autonomous vehicular networks due to communication constraints. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 4998–5003. [Google Scholar]
Hou, I.H.; Naghsh, N.Z.; Paul, S.; Hu, Y.C.; Eryilmaz, A. Predictive Scheduling for Virtual Reality. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 1349–1358. [Google Scholar]
Kaul, S.; Yates, R.; Gruteser, M. Real-time status: How often should one update? In Proceedings of the 2012 Proceedings IEEE INFOCOM, Orlando, FL, USA, 25–30 March 2012; pp. 2731–2735. [Google Scholar]
Sun, Y.; Polyanskiy, Y.; Uysal-Biyikoglu, E. Remote estimation of the Wiener process over a channel with random delay. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 321–325. [Google Scholar]
Sun, Y.; Polyanskiy, Y.; Uysal, E. Sampling of the wiener process for remote estimation over a channel with random delay. IEEE Trans. Inf. Theory 2019, 66, 1118–1135. [Google Scholar] [CrossRef]
Jiang, Z.; Zhou, S. Status from a random field: How densely should one update? In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1037–1041. [Google Scholar]
Bedewy, A.M.; Sun, Y.; Singh, R.; Shroff, N.B. Optimizing information freshness using low-power status updates via sleep-wake scheduling. In Proceedings of the Twenty-First International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, New York, NY, USA, 11–14 October 2020; pp. 51–60. [Google Scholar]
Ceran, E.T.; Gündüz, D.; György, A. Average age of information with hybrid ARQ under a resource constraint. IEEE Trans. Wirel. Commun. 2019, 18, 1900–1913. [Google Scholar] [CrossRef] [Green Version]
Sun, J.; Jiang, Z.; Krishnamachari, B.; Zhou, S.; Niu, Z. Closed-form Whittle’s index-enabled random access for timely status update. IEEE Trans. Commun. 2019, 68, 1538–1551. [Google Scholar] [CrossRef]
Yates, R.D.; Kaul, S.K. Status updates over unreliable multiaccess channels. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 331–335. [Google Scholar]
Sun, J.; Wang, L.; Jiang, Z.; Zhou, S.; Niu, Z. Age-Optimal Scheduling for Heterogeneous Traffic with Timely Throughput Constraints. IEEE J. Sel. Areas Commun. 2021, 39, 1485–1498. [Google Scholar] [CrossRef]
Tang, H.; Wang, J.; Song, L.; Song, J. Minimizing age of information with power constraints: Multi-user opportunistic scheduling in multi-state time-varying channels. IEEE J. Sel. Areas Commun. 2020, 38, 854–868. [Google Scholar] [CrossRef] [Green Version]
Abdel-Aziz, M.K.; Samarakoon, S.; Liu, C.F.; Bennis, M.; Saad, W. Optimized age of information tail for ultra-reliable low-latency communications in vehicular networks. IEEE Trans. Commun. 2019, 68, 1911–1924. [Google Scholar] [CrossRef] [Green Version]
Devassy, R.; Durisi, G.; Ferrante, G.C.; Simeone, O.; Uysal-Biyikoglu, E. Delay and peak-age violation probability in short-packet transmissions. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 2471–2475. [Google Scholar]
Inoue, Y.; Masuyama, H.; Takine, T.; Tanaka, T. A general formula for the stationary distribution of the age of information and its application to single-server queues. IEEE Trans. Inf. Theory 2019, 65, 8305–8324. [Google Scholar] [CrossRef] [Green Version]
Sun, Y.; Uysal-Biyikoglu, E.; Yates, R.D.; Koksal, C.E.; Shroff, N.B. Update or wait: How to keep your data fresh. IEEE Trans. Inf. Theory 2017, 63, 7492–7508. [Google Scholar] [CrossRef]
Zheng, X.; Zhou, S.; Jiang, Z.; Niu, Z. Closed-form analysis of non-linear age of information in status updates with an energy harvesting transmitter. IEEE Trans. Wirel. Commun. 2019, 18, 4129–4142. [Google Scholar] [CrossRef] [Green Version]
Kosta, A.; Pappas, N.; Ephremides, A.; Angelakis, V. Non-linear age of information in a discrete time queue: Stationary distribution and average performance analysis. In Proceedings of the ICC 2020—2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6. [Google Scholar]
Kosta, A.; Pappas, N.; Ephremides, A.; Angelakis, V. The cost of delay in status updates and their value: Non-linear ageing. IEEE Trans. Commun. 2020, 68, 4905–4918. [Google Scholar] [CrossRef] [Green Version]
Zhong, J.; Yates, R.D.; Soljanin, E. Two freshness metrics for local cache refresh. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 1924–1928. [Google Scholar]
Maatouk, A.; Kriouile, S.; Assaad, M.; Ephremides, A. The age of incorrect information: A new performance metric for status updates. IEEE/ACM Trans. Netw. 2020, 28, 2215–2228. [Google Scholar] [CrossRef]
Kadota, I.; Sinha, A.; Uysal-Biyikoglu, E.; Singh, R.; Modiano, E. Scheduling policies for minimizing age of information in broadcast wireless networks. IEEE/ACM Trans. Netw. 2018, 26, 2637–2650. [Google Scholar] [CrossRef] [Green Version]
Song, J.; Gunduz, D.; Choi, W. Optimal scheduling policy for minimizing age of information with a relay. arXiv 2020, arXiv:2009.02716. [Google Scholar]
Sun, Y.; Cyr, B. Information aging through queues: A mutual information perspective. In Proceedings of the 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Kalamata, Greece, 25–28 June 2018; pp. 1–5. [Google Scholar]
Kam, C.; Kompella, S.; Nguyen, G.D.; Wieselthier, J.E.; Ephremides, A. Towards an effective age of information: Remote estimation of a markov source. In Proceedings of the IEEE INFOCOM 2018—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Honolulu, HI, USA, 15–19 April 2018; pp. 367–372. [Google Scholar]
Zheng, X.; Zhou, S.; Niu, Z. Context-aware information lapse for timely status updates in remote control systems. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [Google Scholar]
Zheng, X.; Zhou, S.; Niu, Z. Beyond age: Urgency of information for timeliness guarantee in status update systems. In Proceedings of the 2020 2nd IEEE 6G Wireless Summit (6G SUMMIT), Levi, Finland, 17–20 March 2020; pp. 1–5. [Google Scholar]
Zheng, X.; Zhou, S.; Niu, Z. Urgency of Information for Context-Aware Timely Status Updates in Remote Control Systems. IEEE Trans. Wirel. Commun. 2020, 19, 7237–7250. [Google Scholar] [CrossRef]
Ioannidis, S.; Chaintreau, A.; Massoulié, L. Optimal and scalable distribution of content updates over a mobile social network. In Proceedings of the IEEE INFOCOM 2009, Rio de Janeiro, Brazil, 19–25 April 2009; pp. 1422–1430. [Google Scholar]
Wang, L.; Sun, J.; Zhou, S.; Niu, Z. Timely Status Update Based on Urgency of Information with Statistical Context. In Proceedings of the 2020 32nd IEEE International Teletraffic Congress (ITC 32), Osaka, Japan, 22–24 September 2020; pp. 90–96. [Google Scholar]
Nayyar, A.; Başar, T.; Teneketzis, D.; Veeravalli, V.V. Optimal strategies for communication and remote estimation with an energy harvesting sensor. IEEE Trans. Autom. Control 2013, 58, 2246–2260. [Google Scholar] [CrossRef] [Green Version]
Cika, A.; Badiu, M.A.; Coon, J.P. Quantifying link stability in Ad Hoc wireless networks subject to Ornstein-Uhlenbeck mobility. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
Sennott, L.I. Constrained average cost Markov decision chains. Probab. Eng. Inf. Sci. 1993, 7, 69–83. [Google Scholar] [CrossRef]
Bertsekas, D.P. Dynamic Programming and Optimal Control; Athena Scientific: Belmont, MA, USA, 2000. [Google Scholar]
Sennott, L.I. Average cost optimal stationary policies in infinite state Markov decision processes with unbounded costs. Oper. Res. 1989, 37, 626–633. [Google Scholar] [CrossRef]
Liu, B.; Xie, Q.; Modiano, E. Rl-qn: A reinforcement learning framework for optimal control of queueing systems. arXiv 2020, arXiv:2011.07401. [Google Scholar]
Chen, X.; Liao, X.; Bidokhti, S.S. Real-time Sampling and Estimation on Random Access Channels: Age of Information and Beyond. arXiv 2020, arXiv:2007.03652. [Google Scholar]

Figure 1. Remote control and monitoring model. The vehicle of interest is shown in red.

Figure 2. The state transition diagram of

ω (t)

.

Figure 2. The state transition diagram of

ω (t)

.

Figure 3. Threshold structure of the UoI-optimal updating policy when: (a) the context-aware weight is a first-order Markov process,

ρ = 0.05, p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

; (b) the context-aware weight is i.i.d. over time,

ρ = 0.05, p_{l} = 0.999, p_{h} = 0.001, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

; (c) the context-aware weight is a first order Markov process,

ρ = 0.05, p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, ω_{e} = 100

, increment in the estimation error during one slot, i.e.,

A (t) \sim U n i f (- 3, 3)

, for

\forall t

; and (d) the context-aware weight is a three-state first-order Markov process, which takes value from

ω_{1} = 1, ω_{2} = 50, ω_{3} = 100

and evolves according to the state transition matrix

P_{3}

,

ρ = 0.05, p_{s} = 0.9, σ^{2} = 1

.

Figure 3. Threshold structure of the UoI-optimal updating policy when: (a) the context-aware weight is a first-order Markov process,

ρ = 0.05, p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

; (b) the context-aware weight is i.i.d. over time,

ρ = 0.05, p_{l} = 0.999, p_{h} = 0.001, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

; (c) the context-aware weight is a first order Markov process,

ρ = 0.05, p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, ω_{e} = 100

, increment in the estimation error during one slot, i.e.,

A (t) \sim U n i f (- 3, 3)

, for

\forall t

; and (d) the context-aware weight is a three-state first-order Markov process, which takes value from

ω_{1} = 1, ω_{2} = 50, ω_{3} = 100

and evolves according to the state transition matrix

P_{3}

,

ρ = 0.05, p_{s} = 0.9, σ^{2} = 1

.

Figure 4. Average UoI of the UoI-optimal updating policy, the RL-based updating policy, the update-index-based adaptive scheme [27], and the AoI-optimal updating policy when

p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

.

Figure 4. Average UoI of the UoI-optimal updating policy, the RL-based updating policy, the update-index-based adaptive scheme [27], and the AoI-optimal updating policy when

p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

.

Figure 5. Average squared estimation error of the UoI-optimal updating policy and the AoI-optimal updating policy when

p_{1} = 0.001, p_{2} = 0.01, σ^{2} = 1, ω_{e} = 100

.

Figure 5. Average squared estimation error of the UoI-optimal updating policy and the AoI-optimal updating policy when

p_{1} = 0.001, p_{2} = 0.01, σ^{2} = 1, ω_{e} = 100

.

Figure 6. Update thresholds of the UoI-optimal updating policy with different values of

ω_{e}

when

p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, σ^{2} = 1

.

Figure 6. Update thresholds of the UoI-optimal updating policy with different values of

ω_{e}

when

p_{1} = 0.001, p_{2} = 0.01, p_{s} = 0.9, σ^{2} = 1

.

Figure 7. Update thresholds of the UoI-optimal updating policy with different values of

p_{2}

when

p_{1} = 0.01, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

.

Figure 7. Update thresholds of the UoI-optimal updating policy with different values of

p_{2}

when

p_{1} = 0.01, p_{s} = 0.9, σ^{2} = 1, ω_{e} = 100

.

Figure 8. Average UoI of the RL-based updating policy with different values of L when

p_{1} = 0.001, p_{2} = 0.01, σ^{2} = 1, ω_{e} = 100

.

Figure 8. Average UoI of the RL-based updating policy with different values of L when

p_{1} = 0.001, p_{2} = 0.01, σ^{2} = 1, ω_{e} = 100

.

Figure 9. Average UoI of the RL-based updating policy with different values of K when

p_{1} = 0.001, p_{2} = 0.01, σ^{2} = 1, ω_{e} = 100

.

Figure 9. Average UoI of the RL-based updating policy with different values of K when

p_{1} = 0.001, p_{2} = 0.01, σ^{2} = 1, ω_{e} = 100

.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, L.; Sun, J.; Sun, Y.; Zhou, S.; Niu, Z. A UoI-Optimal Policy for Timely Status Updates with Resource Constraint. Entropy 2021, 23, 1084. https://doi.org/10.3390/e23081084

AMA Style

Wang L, Sun J, Sun Y, Zhou S, Niu Z. A UoI-Optimal Policy for Timely Status Updates with Resource Constraint. Entropy. 2021; 23(8):1084. https://doi.org/10.3390/e23081084

Chicago/Turabian Style

Wang, Lehan, Jingzhou Sun, Yuxuan Sun, Sheng Zhou, and Zhisheng Niu. 2021. "A UoI-Optimal Policy for Timely Status Updates with Resource Constraint" Entropy 23, no. 8: 1084. https://doi.org/10.3390/e23081084

APA Style

Wang, L., Sun, J., Sun, Y., Zhou, S., & Niu, Z. (2021). A UoI-Optimal Policy for Timely Status Updates with Resource Constraint. Entropy, 23(8), 1084. https://doi.org/10.3390/e23081084

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A UoI-Optimal Policy for Timely Status Updates with Resource Constraint

Abstract

1. Introduction

2. System Model and Problem Formulation

3. Scheduling with CMDP-Based Approach

3.1. Constrained Markov Decision Process Formulation

3.2. Threshold Structure of the Optimal Policy

3.3. Numerical Solution of Optimal Strategy

4. Scheduling in Unknown Contexts

5. Simulation Results and Discussion

5.1. Simulation Setup

5.2. Numerical Results

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Proof of Theorem 1

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI