Article

An Algorithm for Making Regime-Changing Markov Decisions

Juri Hinz
School of Mathematical and Physical Sciences, University of Technology Sydney, P.O. Box 123, Ultimo, NSW 2007, Australia
Algorithms 2021, 14(10), 291; https://doi.org/10.3390/a14100291
Submission received: 23 August 2021 / Revised: 27 September 2021 / Accepted: 30 September 2021 / Published: 4 October 2021
(This article belongs to the Special Issue Machine Learning Applications in High Dimensional Stochastic Control)

Abstract

In industrial applications, the processes of optimal sequential decision making are naturally formulated and optimized within a standard setting of Markov decision theory. In practice, however, decisions must be made under incomplete and uncertain information about parameters and transition probabilities. This situation occurs when a system may suffer a regime switch changing not only the transition probabilities but also the control costs. After such an event, the effect of the actions may turn into its opposite, meaning that all strategies must be revised. Due to the practical importance of this problem, a variety of methods have been suggested, ranging from incorporating regime switches into Markov dynamics to numerous concepts addressing model uncertainty. In this work, we suggest a pragmatic and practical approach: by a natural re-formulation of this problem as a so-called convex switching system, we make efficient numerical algorithms applicable.

1. Introduction

Decision-theoretic planning is naturally formulated and solved using Markov Decision Processes (MDPs, see [1]). This theory provides a fundamental and intuitive formalism not only for sequential decision optimization, but also for diverse learning problems in stochastic domains. A typical goal in this framework is to model an environment as a set of states and actions that can be performed to control these states. Thereby, the aim is to drive the system so as to maximize specific performance criteria.
The methodologies of MDPs have been successfully applied to (stochastic) planning, learning, robot control, and game playing problems. In fact, MDPs nowadays provide a standard toolbox for learning techniques in sequential decision making. To explain our contribution to this traditional and widespread area, let us consider a simplified example of motion control. Suppose that a robot is moving in two horizontal directions on a rectangular grid whose cells are identified with the states of the system. At any time step, there are four actions to guide the robot from the current position to one of the neighboring cells. These actions are UP, DOWN, RIGHT, and LEFT, which command a move in the corresponding direction. However, the success of these actions is uncertain and state-dependent: For instance, a command (UP) may not always cause a transition to the intended (upper) cell, particularly if the robot is at the (upper) boundary. The controller aims to reach a pre-specified target cell at minimal total costs. These costs are accumulated each time a control is applied and depend on the current position—the cell currently occupied—accounting for obstacles and other adverse circumstances that may be encountered at some locations. In mathematical terms, such a motion is determined by a so-called controlled discrete-time discrete-space Markov chain. In a more realistic situation, the environment is dynamic: The target can suddenly change its location, or navigation through certain cells can become more difficult, changing the control costs and transitions. In principle, such problems can be addressed in terms of so-called partially observable Markov decision processes (POMDPs, see [2]), but this approach may turn out to be cumbersome due to its higher complexity compared with ordinary MDPs. Instead, we suggest a technique which overcomes this difficulty by a natural and direct modeling in terms of a finite number of Markov decision processes (sharing the same sets of states and actions), each active in a specific regime; when the regime changes, another Markov decision process takes over. Thereby, the regime is not directly observable, so the controller must guess which of these Markov decision processes is valid at the current decision time: Determining an optimal control becomes challenging due to regime switches. Surprisingly, one can re-formulate such control problems as a convex switching system [3] in order to take advantage of efficient numerical schemes and advanced error control. As a result, we obtain sound algorithms for the solution of regime-modulated Markov decision problems.
This work models the random environment as a selection of finitely many ordinary Markov Decision Processes which are mixed by uncertain observations. On this account, we follow a traditional path, facing the well-known difficulties originating from high-dimensional state spaces and from incorporating an information flow into a centralized decision process. However, let us emphasize that there is an approach which aims to bypass some of these problems through alternative modeling. Namely, a game-theoretic framework has attracted significant attention (see [4,5]) as an alternative to the traditional centralized decision optimization within a random and dynamic environment. Here, individual agents aiming at selfishly maximizing their own wealth act in a game whose equilibrium replaces a centralized decision optimization. In such a context, the process of gathering and utilizing information is easier to model and manage, since the individual strategy optimization is simpler and more efficient in collecting and processing all relevant information, which may be private, noisy and highly dispersed.
Before we turn to technical details, let us summarize the notations and abbreviations used in this work in Table 1.

2. Discrete-Time Stochastic Control

First, let us review finite-horizon control theory. Consider a random dynamics within a time horizon $\{0, 1, \dots, T\} \subset \mathbb{N}$ whose state $x$ evolves in a state space $X$ (a subset of a Euclidean space) and is controlled by actions $a$ from a finite action set $A$. The mapping $\pi_t: X \to A$, describing the action $\pi_t(x)$ that the controller takes at time $t$ in the situation $x \in X$, is referred to as a decision rule. A sequence of decision rules $\pi = (\pi_t)_{t=0}^{T-1}$ is called a policy. Given a family of
stochastic kernels $K_t^a(x, dx')$, $t = 0, \dots, T-1$, $a \in A$, on $X$,
there exists an appropriately constructed probability space which supports a stochastic process $(X_t)_{t=0}^{T}$ such that for each initial point $x_0 \in X$ and each policy $\pi = (\pi_t)_{t=0}^{T-1}$ there exists a measure $P_{x_0, \pi}$ such that
$$P_{x_0,\pi}(X_0 = x_0) = 1, \qquad P_{x_0,\pi}(X_{t+1} \in B \mid X_0, \dots, X_t) = K_t^{\pi_t(X_t)}(X_t, B) \qquad (1)$$
holds for $t = 0, \dots, T-1$ and each measurable $B \subseteq X$. Such a system is called a controlled Markovian evolution. The interpretation is that if at time $t$ the process state is $X_t$ and the action $\pi_t(X_t)$ is applied, then the distribution $K_t^{\pi_t(X_t)}(X_t, \cdot)$ randomly changes the state from $X_t$ to $X_{t+1}$.
Stochastic kernels are equivalently described in terms of transition operators. The transition operator associated with the stochastic kernel $K_t^a$ will be denoted by the same letter; it acts on functions $v: X \to \mathbb{R}$ by
$$(K_t^a v)(x) = \int_X v(x')\, K_t^a(x, dx'), \qquad x \in X, \quad t = 0, \dots, T-1,$$
whenever the above integrals are well-defined.
Having introduced such controlled Markovian dynamics, let us turn to control costs now. Suppose that for each time $t = 0, \dots, T-1$, we are given the t-step reward function $r_t: X \times A \to \mathbb{R}$, where $r_t(x, a)$ represents the reward for applying an action $a \in A$ when the state of the system is $x \in X$ at time $t$. At the end of the time horizon, at time $T$, it is assumed that no action can be taken. Here, if the system is in a state $x$, a scrap value $r_T(x)$, described by a pre-specified scrap function $r_T: X \to \mathbb{R}$, is collected. The expectation of the cumulative reward from following a policy $\pi$ is referred to as the policy value:
$$v_0^{\pi}(x) = E_{x, \pi}\Bigl[\sum_{t=0}^{T-1} r_t(X_t, \pi_t(X_t)) + r_T(X_T)\Bigr], \qquad x \in X,$$
where $E_{x,\pi}$ denotes the expectation over the controlled Markov chain whose distribution is defined by (1). For $t = 0, \dots, T-1$, introduce the backward induction operator
$$\mathcal{T}_t^a v(x) = r_t(x, a) + K_t^a v(x), \qquad x \in X, \ a \in A,$$
which acts on each measurable function $v: X \to \mathbb{R}$ for which $K_t^a v$ is well-defined. The policy value is obtained as the result of the recursive procedure
$$v_T^{\pi} = r_T, \qquad v_t^{\pi}(x) = \mathcal{T}_t^{\pi_t(x)} v_{t+1}^{\pi}(x), \quad x \in X, \quad \text{for } t = T-1, \dots, 0,$$
which is referred to as the policy iteration. The value of a “best possible” policy is addressed using
$$v_0^{*}(x) = \sup_{\pi} v_0^{\pi}(x), \qquad x \in X,$$
where $\pi$ runs through the set of all policies; the function $v_0^{*}$ is called the value function. The goal of the optimal control is to find a policy $\pi^* = (\pi_t^*)_{t=0}^{T-1}$ at which the above maximization is attained:
$$v_0^{\pi^*}(x) = v_0^{*}(x) \qquad \text{for each } x \in X.$$
Such a policy optimization is well-defined (see [6], p. 199). Thereby, the calculation of an optimal policy $\pi^*$ is performed in the following framework: For $t = 0, \dots, T-1$, introduce the Bellman operator
$$\mathcal{T}_t v(x) = \max_{a \in A}\bigl[r_t(x, a) + K_t^a v(x)\bigr], \qquad x \in X,$$
which acts on each measurable function $v: X \to \mathbb{R}$ for which $K_t^a v$ is well-defined for all $a \in A$. Further, consider the Bellman recursion (also called backward induction)
$$v_T^{*} = r_T, \qquad v_t^{*} = \mathcal{T}_t v_{t+1}^{*} \qquad \text{for } t = T-1, \dots, 0.$$
The solution $(v_t^{*})_{t=0}^{T}$ to the above Bellman recursion returns the value function $v_0^{*}$ and determines an optimal policy $\pi^*$ via
$$\pi_t^{*}(x) = \operatorname*{argmax}_{a \in A}\bigl[r_t(x, a) + K_t^a v_{t+1}^{*}(x)\bigr], \qquad x \in X, \quad t = 0, \dots, T-1.$$
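To make this recursion concrete, here is a minimal R sketch of the backward induction for a finite state and action space with time-independent rewards and kernels; all names are illustrative and not part of the formal development.

```r
## Backward induction for a finite MDP: a minimal illustrative sketch.
## alpha  : list over actions a of |P| x |P| stochastic matrices
## reward : |P| x |A| matrix of one-step rewards r(p, a)
## scrap  : vector of terminal rewards r_T(p)
bellman_recursion <- function(alpha, reward, scrap, T) {
  P <- nrow(reward); A <- ncol(reward)
  value  <- matrix(0, nrow = P, ncol = T + 1)   # column t + 1 holds v_t
  policy <- matrix(NA_integer_, nrow = P, ncol = T)
  value[, T + 1] <- scrap
  for (t in T:1) {
    q <- sapply(1:A, function(a) reward[, a] + alpha[[a]] %*% value[, t + 1])
    value[, t]  <- apply(q, 1, max)        # Bellman operator
    policy[, t] <- apply(q, 1, which.max)  # optimal decision rule
  }
  list(value = value, policy = policy)
}
```

A discounted version as in Remark 1 below is obtained by multiplying the term `alpha[[a]] %*% value[, t + 1]` by a factor `kappa`.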
Remark 1.
In practice, discounted versions of the above stochastic control are popular. These are obtained by replacing the stochastic kernel $K_t^a$ by $\kappa K_t^a$ in the backward induction, where $\kappa \in [0, 1]$ is a discount factor. The advantage of this approach is that for long time horizons, the optimal policy becomes stationary, provided that all rewards and transition kernels are time-independent.

3. Markov Decisions under Partial Observation

In view of the general framework from Section 2, the classical Markov Decision Processes (MDPs) are obtained by specifying the controlled Markovian evolution $(X_t)_{t=0}^{T}$ in terms of assumptions on the state space and on the transition kernels. Here, the state space is given by a finite set $P$ such that the states are driven in terms of
stochastic matrices $(\alpha_{p, p'}^{a})_{p, p' \in P}$, $a \in A$, (9)
indexed by a finite number of actions $a \in A$, with the interpretation that $\alpha_{p,p'}^{a} \in [0,1]$ stands for the transition probability from $p \in P$ to $p' \in P$ if the action $a \in A$ was taken. In this setting, the stochastic kernels $K_t^a$ act on functions $v: P \to \mathbb{R}$ by
$$K_t^a v(p) = \sum_{p' \in P} \alpha_{p, p'}^{a}\, v(p'), \qquad p \in P, \quad t = 0, \dots, T-1.$$
There is no specific assumption on the scrap and reward functions; they are given by the following mappings on the state space $P$:
$$p \mapsto r_T(p), \qquad (p, a) \mapsto r_t(p, a), \qquad t = 0, \dots, T-1, \ a \in A. \qquad (11)$$
In this work, we consider control problems where a finite number of Markov decision problems are involved, sharing the same state space: each is activated by a certain regime. For this reason, we introduce a selection of MDPs indexed by a finite set S of regimes. Here, we assume that stochastic matrices as in (9) and control costs as in (11) are now indexed by S:
$$\bigl(\alpha_{p,p'}^{a}(s)\bigr)_{p, p' \in P}, \qquad p \mapsto r_T(p)(s), \qquad (p, a) \mapsto r_t(p, a)(s), \qquad t = 0, \dots, T-1, \ a \in A, \ s \in S. \qquad (12)$$
Let us now consider decision making under incomplete information. Given a family of Markov decision problems as in (12), the controller deals with a dynamic mixture of these problems in the sense that each of these MDPs becomes valid under a certain regime which changes exogenously in an uncontrolled way and is not observed directly. More precisely, interpreting each probability distribution $\hat{s} = (\hat{s}(s))_{s \in S}$ on $S$ as the controller's belief about the current regime, we introduce the following convex mixtures of the ingredients (12):
$$\alpha_{p,p'}^{a}(\hat{s}) = \sum_{s \in S} \hat{s}(s)\, \alpha_{p,p'}^{a}(s), \ \ p, p' \in P, \qquad p \mapsto \sum_{s \in S} \hat{s}(s)\, r_T(p)(s), \qquad (p, a) \mapsto \sum_{s \in S} \hat{s}(s)\, r_t(p, a)(s), \qquad t = 0, \dots, T-1, \ a \in A, \ \hat{s} \in \hat{S}, \qquad (13)$$
where $\hat{S}$ stands for the simplex of all probability distributions on $S$. That is, the transition kernels and the control costs are now modulated by an external variable $\hat{s} \in \hat{S}$ which represents the evolution of the controller's belief about the current situation. To describe its dynamics, we suppose that the information is updated dynamically through the observation of another process $(y_t)_{t=1}^{T}$ which follows the so-called hidden Markov dynamics. Let us introduce this concept before we finish the definition of our regime-changing Markov decision problem.
Consider a time-homogeneous Markov chain $(s_t)_{t=0}^{T}$ evolving on a
finite space $S$ which is identified with the set of orthonormal basis vectors of a finite-dimensional Euclidean space. (14)
Assume that this Markov chain is governed by the stochastic matrix $\Gamma = (\Gamma_{s, s'})_{s, s' \in S}$. It is supposed that the evolution of $(s_t)_{t=0}^{T}$ is unobservable, representing a hidden regime. The available information is realized by another stochastic process $(y_t)_{t=0}^{T}$ taking values in a space $Y$, such that at any time $t$, the distribution of the next observation $y_{t+1}$ depends on the past $\{y_0, \dots, y_t\}$ through the recent state $s_t$ only. More precisely, we suppose that $((s_t, y_t))_{t=0}^{T}$ follows a Markov process whose transition operators act on functions $v: S \times Y \to \mathbb{R}$ as
$$(s, y) \mapsto \sum_{s' \in S} \int_Y v(s', y')\, \Gamma_{s, s'}\, \mu_{s'}(dy'). \qquad (15)$$
Here, $(\mu_s)_{s \in S}$ stands for the family of distributions of the next-time observation $y_{t+1}$, conditioned on $s_t = s \in S$. For each $s \in S$, we assume that the distribution $\mu_s$ is absolutely continuous with respect to a reference measure $\mu$ on $Y$ and introduce the densities
$$\nu_s(y) = \frac{d\mu_s}{d\mu}(y) \qquad \text{for } y \in Y \text{ and } s \in S.$$
Since the state evolution $(s_t)_{t=0}^{T}$ is not available, one must rely on the belief distribution $\hat{s}_t$ of the state $s_t$, conditioned on the observations $y_0, \dots, y_t$. With this, the hidden state estimates $(\hat{s}_t)_{t=0}^{T}$ yield a process that takes values in the set $\hat{S}$ of all probability measures on $S$, which is identified with the convex hull of $S$ due to (14). It turns out that although the observation process $(y_t)_{t=0}^{T}$ is non-Markovian, it can be augmented by the belief process to obtain $(z_t = (\hat{s}_t, y_t))_{t=0}^{T}$, which follows a Markov process on the state space $Z = \hat{S} \times Y$. This process is driven by the transition kernels acting on functions $v: \hat{S} \times Y \to \mathbb{R}$ as
$$K_t v(\hat{s}, y) = \int_Y v\Bigl(\frac{\Gamma V(y')\,\hat{s}}{\|V(y')\,\hat{s}\|}, y'\Bigr)\, \|V(y')\,\hat{s}\|\, \mu(dy'), \qquad (\hat{s}, y) \in \hat{S} \times Y, \qquad (16)$$
for all $t = 0, \dots, T-1$. In this formula, $V(y)$ stands for the diagonal matrix whose diagonal elements are given by $(\nu_s(y))_{s \in S}$, $y \in Y$, whereas $\|\cdot\|$ represents the usual $l_1$ norm.
Remark 2.
In the Formula (16), the quantity
$$\hat{s}(y) := \frac{\Gamma V(y)\,\hat{s}}{\|V(y)\,\hat{s}\|} \qquad (17)$$
represents the updated belief state under the assumption that prior to the observation $y \in Y$, the belief state was $\hat{s} \in \hat{S}$. On this account, this vector must be an element of $\hat{S}$, which is seen as follows (the author would like to thank an anonymous referee for pointing this out): Having observed that all entries of $\Gamma V(y)\hat{s}$ and of $V(y)\hat{s}$ are non-negative (ensured by the multiplication of $\hat{s}$ by matrices with non-negative entries), we have to verify that they sum up to one, so that (17) represents a probability distribution on $S$. For this, we introduce the vector $\mathbf{1} = (1, 1, \dots, 1)$ of ones of the belief state dimension, in order to verify that the scalar product is
$$\mathbf{1}^{\top} \hat{s}(y) = \frac{\mathbf{1}^{\top}\Gamma V(y)\,\hat{s}}{\|V(y)\,\hat{s}\|} = \frac{\mathbf{1}^{\top} V(y)\,\hat{s}}{\|V(y)\,\hat{s}\|} = 1;
$$
in other words, all entries of $\hat{s}(y)$ sum up to one. Here, we have used that $\mathbf{1}^{\top}\Gamma = \mathbf{1}^{\top}$, since all rows of the stochastic matrix $\Gamma$ are probability vectors whose $l_1$ norm is representable by a scalar product with $\mathbf{1}$.
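For illustration, the update (17) can be coded directly; the following R lines are a sketch in which the names `Gamma`, `nu` and `y` are placeholders.

```r
## One step of the belief update (17): Gamma V(y) s_hat, renormalized.
## Gamma : |S| x |S| stochastic matrix of the hidden chain
## nu    : function returning the density vector (nu_s(y))_{s in S}
update_belief <- function(s_hat, y, Gamma, nu) {
  weighted <- nu(y) * s_hat            # V(y) s_hat, with V(y) diagonal
  updated  <- Gamma %*% weighted       # Gamma V(y) s_hat
  as.vector(updated / sum(weighted))   # divide by the l1 norm of V(y) s_hat
}
```

By the argument of Remark 2, the returned vector is again a probability distribution on $S$.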
The optimal control for such a modulated MDP is formulated in terms of transition operators, rewards and scrap functions, which we define now. With the notations introduced above, consider the state space $X = P \times Z$ with $Z = \hat{S} \times Y$. First, introduce a controlled Markovian dynamics in terms of the transition operators acting on functions $v: P \times Z \to \mathbb{R}$ as
$$K_t^a v(p, \hat{s}, y) = \sum_{p' \in P} \alpha_{p,p'}^{a}(\hat{s}) \int_Y v\Bigl(p', \frac{\Gamma V(y')\,\hat{s}}{\|V(y')\,\hat{s}\|}, y'\Bigr)\, \|V(y')\,\hat{s}\|\, \mu(dy') =: K_t^a v(p, \hat{s}) \qquad (18)$$
for all $p \in P$, $(\hat{s}, y) \in \hat{S} \times Y$, $t = 0, \dots, T-1$. This kernel describes the following evolution: Based on the current situation $(p, \hat{s}, y)$, a transition to the next decision state $p' \in P$ occurs according to the mixture $\alpha_{p,p'}^{a}(\hat{s})$ of transition probabilities introduced in (13). Furthermore, the belief state evolves due to the information update based on the new observation $y'$, as described in (16). Obviously, this transformation does not depend on the current observation $y \in Y$, which justifies the abbreviation introduced by the last equality of (18). Finally, we introduce the control costs, expressed by scrap and reward functions in terms of the mixtures (13):
$$r_T(p, \hat{s}, y) = \sum_{s \in S} \hat{s}(s)\, r_T(p)(s) =: r_T(p, \hat{s}), \qquad (19)$$
$$r_t(p, \hat{s}, y, a) = \sum_{s \in S} \hat{s}(s)\, r_t(p, a)(s) =: r_t(p, \hat{s}, a), \qquad (20)$$
for $t = 0, \dots, T-1$, $a \in A$, $\hat{s} \in \hat{S}$, and $y \in Y$. Note that (18)–(20) uniquely define a sequential decision problem in terms of specific instances of controlled dynamics and control costs. In what follows, we show that this problem seamlessly falls under the umbrella of a general scheme that allows an efficient numerical treatment.
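As a small illustration of the mixtures (13) and the rewards (19) and (20), the following R sketch forms the belief-mixed ingredients from regime-indexed arrays; the data layout is our own choice.

```r
## Belief-mixed ingredients (13): illustrative data layout.
## alpha_s : list over regimes s, each a list over actions a of |P| x |P| matrices
## reward_s: array of dimension (|P|, |A|, |S|) holding r_t(p, a)(s)
## s_hat   : belief vector on the regimes
mix_transition <- function(alpha_s, a, s_hat) {
  Reduce(`+`, Map(function(regime, w) w * regime[[a]], alpha_s, s_hat))
}
mix_reward <- function(reward_s, s_hat) {
  apply(reward_s, c(1, 2), function(r) sum(r * s_hat))   # gives r_t(p, s_hat, a)
}
```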

4. Approximate Algorithmic Solutions

Although there are a number of theoretical and computational methods to solve stochastic control problems, many industrial applications exhibit a complexity and size driving numerical techniques to their computational limits. For an overview of this topic we refer the reader to [7]. One of the major difficulties originates from high dimensionality. Here, approximate methods have been proposed based on state and action space discretization or on approximations of functions on these spaces. Among function approximation methods, the least-squares Monte Carlo approach represents a traditional way to approximate the value functions, see [8,9,10,11,12,13]. However, function approximation methods have also been used to capture the local behavior of value functions, and advanced regression methods such as kernel methods [14,15], local polynomial regression [16], and neural networks [17] have been suggested.
Considering partial observability, several specific approaches are studied in [18], with bound estimation presented in [19]. The work [20] provides an overview of modern algorithms in this field with the main focus on the so-called point-based solvers. The main aspect of any point-based POMDP algorithm (see [21]) is a dynamical adaptation of the state-discretized grid.
In this work, we treat our regime-switching Markov decision problem in terms of efficient numerical schemes which deliver an approximate solution along with its diagnostics. The following section presents this methodology and elaborates on the specific assumptions required in this setting. The numerical solution method is based on function approximations which require convexity and linear state dynamics. For technical details, we refer the interested reader to [22]. Furthermore, there are applications to pricing financial options [23], natural resource extraction [3], battery management [24] and optimal asset allocation under hidden state dynamics [25]; many applications are illustrated using R in [26].
Suppose that the state space $X = P \times \mathbb{R}^d$ is a Cartesian product of a finite set $P$ and the Euclidean space $\mathbb{R}^d$. Consider a controlled Markovian process $(X_t)_{t=0}^{T} := (P_t, Z_t)_{t=0}^{T}$ that consists of two parts. The discrete-space component $(P_t)_{t=0}^{T}$ describes the evolution of a finite-state controlled Markov chain, taking values in a finite set $P$, while the continuous-space component $(Z_t)_{t=0}^{T}$ follows an uncontrolled evolution with values in $\mathbb{R}^d$. More specifically, we assume that at any time $t = 0, \dots, T-1$ in an arbitrary state $(p, z) \in X$ the controller chooses an action $a$ from $A$ in order to trigger the one-step transition from the mode $p \in P$ to the mode $p' \in P$ with probability $\alpha_{p,p'}^{a}(z)$, given in terms of pre-specified transition probability matrices $(\alpha_{p,p'}^{a}(z))_{p, p' \in P}$ indexed by actions $a \in A$. Note that these transition probabilities may depend on the continuous state component $z \in \mathbb{R}^d$. For the continuous-state process $(Z_t)_{t=0}^{T}$, we assume an uncontrolled evolution which is governed by the linear state dynamics
$$Z_{t+1} = W_{t+1} Z_t, \qquad t = 0, \dots, T-1, \qquad (21)$$
with independent disturbance matrices $(W_t)_{t=1}^{T}$; thus the transition operators $K_t^a$ are
$$K_t^a v(p, z) = \sum_{p' \in P} \alpha_{p,p'}^{a}(z)\, E\bigl(v(p', W_{t+1} z)\bigr), \qquad p \in P, \ z \in \mathbb{R}^d, \ t = 0, \dots, T-1, \qquad (22)$$
acting on all functions $v: P \times \mathbb{R}^d \to \mathbb{R}$ for which the required expectations exist. Furthermore, we suppose that the reward and scrap functions
$$r_t: P \times \mathbb{R}^d \times A \to \mathbb{R}, \qquad r_T: P \times \mathbb{R}^d \to \mathbb{R} \qquad \text{are convex in the second argument}.$$
The numerical treatment aims at determining approximations to the true value functions $(v_t^{*})_{t=0}^{T-1}$ and to the corresponding optimal policies $\pi^* = (\pi_t^*)_{t=0}^{T-1}$. Under some additional assumptions, the value functions turn out to be convex and can be approximated by piecewise linear and convex functions.
To obtain an efficient (approximative) numerical treatment of these operations, the concept of the so-called sub-gradient envelopes was suggested in [22]. A sub-gradient $\nabla_g f$ of a convex function $f: \mathbb{R}^d \to \mathbb{R}$ at a point $g \in \mathbb{R}^d$ is an affine-linear functional which supports $f$ at this point, $\nabla_g f(g) = f(g)$, from below, $\nabla_g f \le f$. Given a finite grid $G = \{g_1, g_2, \dots, g_m\} \subset \mathbb{R}^d$, the sub-gradient envelope $\mathcal{S}_G f$ of $f$ on $G$ is defined as the maximum of its sub-gradients,
$$\mathcal{S}_G f = \bigvee_{g \in G} (\nabla_g f),$$
which provides a convex approximation of the function $f$ from below, $\mathcal{S}_G f \le f$, and enjoys many useful properties. Using the sub-gradient envelope operator, define the double-modified Bellman operator as
$$\mathcal{T}_t^{m,n} v(p, z) = \max_{a \in A}\Bigl[\mathcal{S}_G r_t(p, z, a) + \sum_{p' \in P} \sum_{k=1}^{n} \nu_{t+1}^{(k)}\, \mathcal{S}_G\bigl[\alpha_{p,p'}^{a}(\cdot)\, v(p', W_{t+1}^{(k)}\,\cdot)\bigr](z)\Bigr], \qquad (25)$$
where the probability weights $(\nu_{t+1}^{(k)})_{k=1}^{n}$ correspond to the distribution sampling $(W_{t+1}^{(k)})_{k=1}^{n}$ of each disturbance matrix $W_{t+1}$. The corresponding backward induction
$$v_T^{m,n}(p, z) = \mathcal{S}_G r_T(p, z),$$
$$v_t^{m,n}(p, z) = \mathcal{T}_t^{m,n} v_{t+1}^{m,n}(p, z), \qquad t = T-1, \dots, 0,$$
yields the so-called double-modified value functions $(v_t^{m,n})_{t=0}^{T}$. Under appropriate assumptions on increasing grid density and disturbance sampling, the double-modified value functions converge uniformly to the true value functions on compact sets (see [22]). The crucial point of our algorithm is the treatment of piecewise linear convex functions in terms of matrices. To address this aspect, let us agree on the following notation: Given a function $f$ and a matrix $F$, we write $f \sim F$ whenever $f(z) = \max(F z)$ holds for all $z \in \mathbb{R}^d$, and call $F$ a matrix representative of $f$. To be able to capture a sufficiently large family of functions by matrix representatives, an appropriate embedding of the actual state space into a Euclidean space might be necessary: For instance, to include constant functions, one adds a dimension to the space and amends all state vectors by a constant 1 in this dimension.
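The following R lines illustrate this matrix-representative convention on a toy function (the numbers are arbitrary); the constant parts of the affine pieces are carried in an extra coordinate as described above.

```r
## A convex piecewise linear function f(z) = max(F z), stored via its matrix F.
## The state z is augmented by a trailing 1, so a row (b, c) encodes z -> b * z + c.
F <- rbind(c(-1, 0), c(0.5, -0.25), c(2, -2))   # three illustrative affine pieces
f <- function(z) max(F %*% c(z, 1))
f(0.3)   # maximum of the three pieces evaluated at z = 0.3
```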
It turns out that the sub-gradient envelope operation $\mathcal{S}_G$ acting on convex piecewise linear functions corresponds to a certain row-rearrangement operator $\Upsilon_G$ acting on the matrix representatives of these functions, in the sense that
$$f \sim F \quad \Longrightarrow \quad \mathcal{S}_G f \sim \Upsilon_G[F].$$
Such a row-rearrangement operator $\Upsilon_G$, associated with the grid
$$G = \{g_1, \dots, g_m\} \subset \mathbb{R}^d,$$
acts on each matrix $F$ with $d$ columns as follows:
$$(\Upsilon_G[F])_{i, \cdot} = F_{\operatorname{argmax}(F g_i), \cdot} \qquad \text{for all } i = 1, \dots, m.$$
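In R, a direct (unoptimized) sketch of this operator could read as follows, with the grid stored as a matrix whose rows are the grid points.

```r
## Row-rearrangement operator: row i of the result is the row of F attaining max(F g_i).
## F    : a matrix representative with d columns
## grid : an m x d matrix whose rows are g_1, ..., g_m
rearrange <- function(F, grid) {
  scores <- F %*% t(grid)                          # (rows of F) x (grid points)
  F[apply(scores, 2, which.max), , drop = FALSE]   # pick the maximizing row per grid point
}
```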
If piecewise linear and convex functions $(f_k)_{k=1}^{n}$ are given in terms of their matrix representatives $(F_k)_{k=1}^{n}$, such that
$$f_k \sim F_k, \qquad k = 1, \dots, n,$$
then it holds that
$$\mathcal{S}_G\Bigl(\sum_{k=1}^{n} f_k\Bigr) \sim \sum_{k=1}^{n} \Upsilon_G[F_k], \qquad (29)$$
$$\mathcal{S}_G\Bigl(\bigvee_{k=1}^{n} f_k\Bigr) \sim \Upsilon_G\Bigl[\bigsqcup_{k=1}^{n} F_k\Bigr], \qquad (30)$$
$$\mathcal{S}_G\bigl(f_k(W\,\cdot)\bigr) \sim \Upsilon_G[F_k W], \qquad k = 1, \dots, n, \qquad (31)$$
where the operator $\sqcup$ denotes the binding of matrices by rows (for details, we refer the reader to [22,23]). The algorithms presented there use the properties (29)–(31) to calculate approximate value functions in terms of their matrix representatives as follows:
  • Pre-calculations: Given a grid $G = \{g_1, \dots, g_m\}$, implement the row-rearrangement operator $\Upsilon_G$ and the row maximization operator $\bigvee_{a \in A}$. Determine a distribution sampling $(W_t^{(k)})_{k=1}^{n}$ of each disturbance $W_t$ with corresponding weights $(\nu_t^{(k)})_{k=1}^{n}$ for $t = 1, \dots, T$. Given the reward functions $(r_t)_{t=0}^{T-1}$ and the scrap value $r_T$, assume that the matrix representatives of their sub-gradient envelopes are given by
    $$R_t(p, a) \sim \mathcal{S}_G r_t(p, \cdot, a), \qquad R_T(p) \sim \mathcal{S}_G r_T(p, \cdot)$$
    for $t = 0, \dots, T-1$, $p \in P$ and $a \in A$. The matrix representatives of each double-modified value function,
    $$v_t^{(m,n)}(p, \cdot) \sim V_t(p) \qquad \text{for } t = 0, \dots, T, \ p \in P,$$
    are obtained via the following matrix form of the approximate backward induction (also depicted in Algorithm 1).
  • Initialization: Start with the matrices
    $$V_T(p) = R_T(p) \qquad \text{for all } p \in P.$$
  • Recursion: For $t = T-1, \dots, 0$ and for $p \in P$, calculate
    $$V_{t+1}^{E}(p, a) = \sum_{p' \in P} \sum_{k=1}^{n} \nu_{t+1}^{(k)}\, \alpha_{p,p'}^{a}(\cdot)\, V_{t+1}(p')\, W_{t+1}^{(k)},$$
    $$V_t(p) = \bigvee_{a \in A}\Bigl[R_t(p, a) + V_{t+1}^{E}(p, a)\Bigr].$$
Algorithm 1: Value Function Approximation.
Here, the term $\alpha_{p,p'}^{a}(\cdot)\, V_{t+1}(p')\, W_{t+1}^{(k)}$ stands for the matrix representative of the sub-gradient envelope of the product function
$$\mathcal{S}_G\Bigl[z \mapsto \alpha_{p,p'}^{a}(z) \cdot \max\bigl(V_{t+1}(p')\, W_{t+1}^{(k)}\, z\bigr)\Bigr],$$
which must be calculated from both factors using a product rule. This product rule has to be adapted to the convention that, in our context, the sub-gradient $\nabla_g f$ of a function $f$ at a point $g$ is an affine-linear function, namely the first-order Taylor approximation of $f$ developed at $g$. For such sub-gradients, the product is given by
$$\nabla_g(f_1 f_2) = (\nabla_g f_1)\, f_2(g) + f_1(g)\,(\nabla_g f_2) - f_1(g)\, f_2(g).$$
The concrete implementation of this operation depends on how the matrix representative of a constant function is expressed. In Section 6.1, we provide code which realizes such a product rule, based on the assumption that the state space is represented by probability vectors.
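A possible R realization of this product rule, under the assumption that the state is a probability vector (so that a constant $c$ corresponds to the row $c\,\mathbf{1}^{\top}$), is sketched below; the function name and data layout are ours.

```r
## Product rule for sub-gradient envelopes on the unit simplex.
## F1, F2 : row-rearranged representatives (row i is the sub-gradient at grid point g_i)
## grid   : m x d matrix of grid points (probability vectors)
subgradient_product <- function(F1, F2, grid) {
  f1g <- rowSums(F1 * grid)    # f1(g_i), since each row supports f1 at g_i
  f2g <- rowSums(F2 * grid)    # f2(g_i)
  ## row i: (grad f1) f2(g_i) + f1(g_i) (grad f2) - f1(g_i) f2(g_i) * (1, ..., 1)
  F1 * f2g + f1g * F2 - f1g * f2g
}
```

The subtracted constant is encoded as a multiple of the all-ones row, which represents it exactly on the simplex of probability vectors.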
Having calculated the matrix representatives $V_{t+1}^{E}(p, a)$, approximations to the expected value functions are obtained as
$$v_{t+1}^{E}(p, z, a) = \max\bigl(V_{t+1}^{E}(p, a)\, z\bigr)$$
for all $z \in \mathbb{R}^d$, $t = 0, \dots, T-1$, $a \in A$ and $p \in P$. Furthermore, an approximately optimal strategy $(\pi_t)_{t=0}^{T-1}$ is obtained for $t = 0, \dots, T-1$ by
$$\pi_t(p, z) = \operatorname*{argmax}_{a \in A}\bigl(r_t(p, z, a) + v_{t+1}^{E}(p, z, a)\bigr), \qquad p \in P, \ z \in \mathbb{R}^d.$$
In what follows, we apply this technique to our regime-switching Markov decision problems.

5. HMM-Modulated MDP as a Convex Switching Problem

Now, we turn to the main step—an appropriate extension of the state space $P \times \hat{S}$ (as introduced in Section 3). This procedure will allow us to treat our regime-switching Markov decision problems by the numerical methodologies described in the previous section.
With the notations and conventions of Section 3, we consider so-called positively homogeneous function extensions: A function $\tilde{v}: P \times \mathbb{R}_+^{d} \to \mathbb{R}$ is called positively homogeneous if $\tilde{v}(p, cx) = c\, \tilde{v}(p, x)$ holds for all $c \in \mathbb{R}_+$ and $(p, x) \in P \times \mathbb{R}_+^{d}$. Obviously, for each $v: P \times \hat{S} \to \mathbb{R}$ the definition
$$\tilde{v}(p, z) = \|z\|\, v\Bigl(p, \frac{z}{\|z\|}\Bigr), \qquad (p, z) \in P \times \mathbb{R}_+^{S},$$
yields a positively homogeneous extension of $v$.
Given the stochastic kernel (18), we construct a probability space supporting a sequence $(Y_t)_{t=1}^{T}$ of independent identically distributed random variables, each following the same distribution $\mu$, in order to introduce the random matrices (disturbances)
$$W_t = \Gamma V(Y_t), \qquad t = 1, \dots, T.$$
These disturbances are used to define the following stochastic kernels on $P \times \mathbb{R}_+^{S}$:
$$\tilde{K}_t^a \tilde{v}(p, z) = \sum_{p' \in P} \alpha_{p,p'}^{a}\Bigl(\frac{z}{\|z\|}\Bigr)\, E\bigl(\tilde{v}(p', W_{t+1} z)\bigr), \qquad (p, z) \in P \times \mathbb{R}_+^{S}, \ t = 0, \dots, T-1, \qquad (36)$$
acting on all functions $\tilde{v}: P \times \mathbb{R}_+^{S} \to \mathbb{R}$ for which the above expectations are well-defined. A direct verification shows that for each $a \in A$ and $t = 0, \dots, T-1$ it holds that
if $\tilde{v}$ is a positively homogeneous extension of $v$, then $\tilde{K}_t^a \tilde{v}$ is a positively homogeneous extension of $K_t^a v$. (37)
The kernels (36) satisfy the linear dynamics assumption (21) required in (22) and define a control problem of convex switching type whose value functions also solve the underlying regime-switching Markov decision problem.
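A disturbance sampling of these random matrices in the sense of Section 4 can be obtained, for instance, by plain Monte Carlo sampling from $\mu$; the following R sketch uses hypothetical names.

```r
## Monte Carlo disturbance sampling for W = Gamma V(Y) with Y ~ mu.
## Gamma   : |S| x |S| stochastic matrix
## nu      : function returning the density vector (nu_s(y))_{s in S}
## draw_mu : function drawing n samples from the reference measure mu
sample_disturbances <- function(n, Gamma, nu, draw_mu) {
  ys <- draw_mu(n)
  list(W       = lapply(ys, function(y) Gamma %*% diag(nu(y))),  # W^(k) = Gamma V(y_k)
       weights = rep(1 / n, n))                                   # equal weights
}
```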
Proposition 1.
Given a regime-modulated Markov decision problem whose dynamics are defined by the stochastic kernel (18) with control costs given by (19) and (20), consider the value functions $(v_t)_{t=0}^{T}$ returned by the corresponding backward induction
$$v_T(p, \hat{s}) = r_T(p, \hat{s}), \qquad v_t(p, \hat{s}) = \max_{a \in A}\bigl[r_t(p, \hat{s}, a) + K_t^a v_{t+1}(p, \hat{s})\bigr], \qquad p \in P, \ \hat{s} \in \hat{S}, \ t = T-1, \dots, 0. \qquad (38)$$
Moreover, consider the functions $(\tilde{v}_t)_{t=0}^{T}$ returned by
$$\tilde{v}_T(p, z) = \tilde{r}_T(p, z), \qquad \tilde{v}_t(p, z) = \max_{a \in A}\bigl[\tilde{r}_t(p, z, a) + \tilde{K}_t^a \tilde{v}_{t+1}(p, z)\bigr], \qquad p \in P, \ z \in \mathbb{R}_+^{S}, \ t = T-1, \dots, 0, \qquad (39)$$
where $\tilde{r}_t, \tilde{r}_T$ are positively homogeneous extensions of $r_t, r_T$ and $\tilde{K}_t^a$ is from (36). Then, for $t = 0, \dots, T$, it holds that
$\tilde{v}_t$ is a positively homogeneous extension of $v_t$. (40)
Proof. 
Let us prove (40) inductively. Starting at $t = T$, assertion (40) holds since $\tilde{r}_T$ is a positively homogeneous extension of $r_T$. Having assumed that (40) holds for $t+1$ with a positively homogeneous $\tilde{v}_{t+1}$, we apply observation (37) to conclude that
$\tilde{K}_t^a \tilde{v}_{t+1}$ is a positively homogeneous extension of $K_t^a v_{t+1}$;
thus adding $\tilde{r}_t(\cdot, \cdot, a)$, which is a positively homogeneous extension of $r_t(\cdot, \cdot, a)$, and maximizing over $a \in A$ yields (40).
To finish the proof, we verify (37). Given $v$ with a positively homogeneous extension $\tilde{v}$, for $p \in P$ and $z \in \mathbb{R}_+^{S}$ it holds that
$$\tilde{K}_t^a \tilde{v}(p, z) = \sum_{p' \in P} \alpha_{p,p'}^{a}\Bigl(\frac{z}{\|z\|}\Bigr)\, E\bigl(\tilde{v}(p', \Gamma V(Y_{t+1}) z)\bigr)$$
$$= \sum_{p' \in P} \alpha_{p,p'}^{a}\Bigl(\frac{z}{\|z\|}\Bigr) \int_Y \tilde{v}\bigl(p', \Gamma V(y) z\bigr)\, \mu(dy)$$
$$= \sum_{p' \in P} \alpha_{p,p'}^{a}\Bigl(\frac{z}{\|z\|}\Bigr) \int_Y v\Bigl(p', \frac{\Gamma V(y) z}{\|V(y) z\|}\Bigr)\, \|V(y) z\|\, \mu(dy). \qquad (41)$$
From the expression (41) we conclude that $\tilde{K}_t^a \tilde{v}$ is indeed positively homogeneous. Setting $z = \hat{s} \in \hat{S}$, we observe $\tilde{K}_t^a \tilde{v}(p, \hat{s}) = K_t^a v(p, \hat{s})$, meaning that $\tilde{K}_t^a \tilde{v}$ is a function extension of $K_t^a v$. □

6. Algorithm Implementations and Performance Analysis

The stylized algorithm presented in Algorithm 1 is appropriate for problems whose scale is similar to that of the illustration provided in the next section. In this example, although we have used a relatively slow scripting language, R, all calculations are performed within a few seconds. This shows the practical relevance of such implementations for small and medium-size applications. However, to address larger problems, a significant increase in calculation performance is needed, which can be achieved by approximations based on a standard technique from big data analysis, the so-called nearest-neighbor search. A realization of this concept within a package for the statistical language R is described in [26]. With all critical parts of the algorithm written in C, this implementation shows a reasonable performance, which is examined and discussed in [26]. For technical details, we refer the reader to [23,27]. Let us merely highlight the main idea here. The point is that the computational performance of our approach suffers from the fact that most of the calculation time is spent on the matrix rearrangements required by the operator $\Upsilon_G$. Namely, in order to calculate an expression
$$\sum_{k=1}^{n} \nu_{t+1}^{(k)}\, \Upsilon_G\bigl[V_{t+1}(p) \cdot W_{t+1}^{(k)}\bigr] \qquad (43)$$
as in (25), the row-rearrangement $\Upsilon_G$ must be performed $n$ times, once for each disturbance matrix multiplication. This task becomes increasingly demanding for larger values of the disturbance sampling size $n$, particularly in high state space dimensions. Let us omit $t+1$ and $p$ in (43) to clarify the idea of the efficiency improvement developed in [23]. This approach focuses on the two major computational problems:
the rearrangement $\Upsilon_G[V W^{(k)}]$ of the large matrices $V \cdot W^{(k)}$ (44)
and
the summation of the matrices $\Upsilon_G[V \cdot W^{(k)}]$ over a large index range $k = 1, \dots, n$. (45)
It turns out that one can approximate the procedure in (44) by replacing the row-rearrangement operation with an appropriate matrix multiplication using nearest-neighbor techniques. To address (45), each random disturbance matrix $W_t$ is represented as the linear combination
$$W_t = \bar{W} + \sum_{j=1}^{J} \varepsilon_j(t)\, E^{(j)}$$
with non-random matrices $\bar{W}$ and $(E^{(j)})_{j=1}^{J}$ and random coefficients $(\varepsilon_j(t))_{j=1}^{J}$, whose number $J$ is preferably significantly lower than the dimension of the state space. Both techniques are applicable and save a significant amount of calculation. Moreover, the disturbances $(W_t)_{t=1}^{T}$ are identically distributed, so that all pre-calculations have to be done only once. However, since a detailed discussion of this approximation and its performance gain is out of the scope of this work, we refer the reader to [23,26,27].
Note that an application of convex switching techniques under partial observations requires some adaptations, due to the non-observable nature of the hidden regime. More precisely, the user must realize a recursive filter which extracts belief states from the current information flow before optimal decisions can be made. In other words, filtering must be followed by the application of the decision policy returned from the optimization. A stylized realization of this approach is depicted in Figure 1.

6.1. An Illustration

Let us consider a typical application of Markov decision theory to agricultural management [28], which addresses a stochastic forest growth model in the context of timber harvesting optimization. The work [28] re-considers the classical results of Faustmann in the framework of random growth and potential ecological hazards (for a derivation of Faustmann's result from 1849 and its discussion, we refer the reader to [28]). The idea is that, as a reasonable approximation, one supposes that the only stochastic element is the growth of the trees: That is, the timber volume per hectare defines the state. To facilitate the analysis, this timber volume (in cubic meters per hectare) is discretized as shown in the first row of Table 2. Furthermore, the costs of the action (in USD per hectare) are shown in the second row, assuming that the action means a timber harvest for all states (state 2 to state 6) followed by re-forestation, with the exception of the bare land (state 1), where there is re-forestation only. There are no costs if no action is taken, as seen from the third row of Table 2.
Given the state space $P = \{1, \dots, 6\}$ and control costs as defined in Table 2, define the action space $A = \{1, 2\}$, where 1 means being idle and 2 means acting. Assuming decision intervals of 20 years, a Markov decision problem is determined by the action-dependent stochastic matrices
$$\alpha^{1} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0.1 & 0.1 & 0.7 & 0.1 & 0 & 0 \\ 0.1 & 0 & 0.1 & 0.7 & 0.1 & 0 \\ 0.1 & 0 & 0 & 0.1 & 0.7 & 0.1 \\ 0.1 & 0 & 0 & 0 & 0.1 & 0.8 \\ 0.1 & 0 & 0 & 0 & 0 & 0.9 \end{pmatrix}, \qquad
\alpha^{2} = \begin{pmatrix} 0.1 & 0.9 & 0 & 0 & 0 & 0 \\ 0.1 & 0.9 & 0 & 0 & 0 & 0 \\ 0.1 & 0.9 & 0 & 0 & 0 & 0 \\ 0.1 & 0.9 & 0 & 0 & 0 & 0 \\ 0.1 & 0.9 & 0 & 0 & 0 & 0 \\ 0.1 & 0.9 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
Note that the transitions to state 1 (bare land) from a higher state in $\alpha^{1}$ may be interpreted as the result of a natural disaster, whereas those to a neighboring state describe randomness in the forest growth. The matrix $\alpha^{2}$ describes the timber harvest followed by re-forestation. The work [28] discusses an optimal policy for this model assuming an infinite time horizon with a discounting which is inferred from interest rate effects. The optimal policy is compared to that of a non-random growth model, represented by the same parameters with the exception of the free-growth transition $\alpha^{1}$, which is replaced by
$$\alpha^{1} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$
This non-random model resembles the well-known result of Faustmann, claiming that the only optimal strategy is a roll-over of harvesting and re-planting after a fixed number of years. In the above Markov decision problem, the discount factor is set to $\kappa = (1+g)^{-20} \approx 0.61$ (corresponding to an annual interest rate of $g = 2.5\%$). Here, it turns out that it is optimal to reforest in state 1, do nothing in states 2 and 3, and cut and reforest in states 4, 5, and 6. That is, the resulting roll-over is three times the decision period, 60 years. In contrast, the random model suggests that in the presence of ecological risks, it is optimal to reforest in state 1, do nothing in state 2, and cut and reforest in states 3, 4, 5, and 6, giving a shorter roll-over of two decision periods, 40 years.
Using our techniques, the optimal forest management can be refined significantly. Major improvement can be achieved by incorporating all adverse ecological situations into a selection of transition matrices representing a random forest evolution within appropriate regimes. The regime switch can be monitored and estimated using stochastic filtering techniques as described above. For instance, the potential climate change with an anticipated increase of average temperature and extended drought periods can be managed adaptively. In what follows, we consider a simplified numerical example based on the stochastic forest growth model presented above.
First, let us replicate the results of [28] with the following code.
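A minimal R sketch of such a replication could look as follows (a value iteration using the matrices and rewards given above and $\kappa \approx 0.61$; the variable names are ours).

```r
## Stochastic forest growth model of [28]: discounted value iteration (a sketch).
alpha1 <- matrix(c(1,   0,   0,   0,   0,   0,
                   0.1, 0.1, 0.7, 0.1, 0,   0,
                   0.1, 0,   0.1, 0.7, 0.1, 0,
                   0.1, 0,   0,   0.1, 0.7, 0.1,
                   0.1, 0,   0,   0,   0.1, 0.8,
                   0.1, 0,   0,   0,   0,   0.9), nrow = 6, byrow = TRUE)
alpha2 <- matrix(rep(c(0.1, 0.9, 0, 0, 0, 0), times = 6), nrow = 6, byrow = TRUE)
reward <- cbind(idle = rep(0, 6),
                act  = c(-494, -117, 3068, 6396, 8970, 10790))  # USD per hectare
kappa  <- (1 + 0.025)^(-20)          # discount factor, approximately 0.61
value  <- rep(0, 6)
for (i in 1:300) {                   # contraction: iterate until numerically stationary
  q <- cbind(reward[, "idle"] + kappa * alpha1 %*% value,
             reward[, "act"]  + kappa * alpha2 %*% value)
  value <- apply(q, 1, max)
}
policy <- apply(q, 1, which.max)     # 1 = be idle, 2 = harvest and/or reforest
value
policy
```

For the deterministic growth model, one simply replaces `alpha1` by the degenerate matrix given above.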
Indeed, the resulting policy shows that in the stochastic forest growth model, it is optimal to do nothing in state 2, but to harvest and replant in all other states—an optimal roll-over of two periods, 40 years. Running the same algorithm for the deterministic growth model returns value functions and policies corresponding to an optimal roll-over of three periods, 60 years.
Now let us illustrate an adaptive forest management in the presence of regime-changing risk. First, let us specify the hidden Markovian dynamics of the information process $(Y_t)_{t=1}^{T}$. For this, a stochastic matrix $\Gamma = (\Gamma_{s,s'})_{s, s' \in S}$ and a family of measures $(\mu_s)_{s \in S}$ must be specified, according to (15). For simplicity, we suppose that the state comprises two regimes, $S = \{s_1, s_2\}$, and that an indirect information $(Y_t)_{t=1}^{T}$ is observed within the interval $[0, 1]$, after an appropriate transformation. For instance, $Y_t$ could measure the percentage of rainy days over a pre-specified sliding window of past dates. Alternatively, $Y_t$ may stand for a specific quantity (say, the average temperature), transformed by the distribution function which is typical for this quantity in the normal regime. With this assumption, we consider $s_1$ as the regular regime, under which the observations follow a beta distribution $\beta(s_1)$ with specific parameters defined for the state $s_1$, whereas in the presence of environmental hazards, in the regime $s_2$, the observations follow a beta distribution $\beta(s_2)$ with other parameters typical for $s_2$. Finally, let us suppose that the regime change matrix
$$\Gamma = \begin{pmatrix} \gamma_1 & 1 - \gamma_1 \\ 1 - \gamma_2 & \gamma_2 \end{pmatrix}$$
describes a situation where the regular regime is more stable: $1 \geq \gamma_1 > \gamma_2 > 0$. The following code illustrates a simulation of such observations and the corresponding filtered information, with the results of the filtering procedure depicted in Figure 2.
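A minimal sketch of such a simulation, together with a recursive HMM filter in the standard predict–update form (a variant of the update discussed in Remark 2), is given below; all parameter values are purely illustrative.

```r
## Simulate a two-regime hidden Markov chain with beta-distributed observations
## and compute the filtered beliefs (parameter values are illustrative).
set.seed(1)
n_steps <- 100
P_trans <- matrix(c(0.95, 0.05,     # row i: distribution of the next regime given regime i
                    0.20, 0.80), nrow = 2, byrow = TRUE)
beta_par <- list(c(5, 2),           # regime 1 (regular): beta(5, 2) observations
                 c(2, 5))           # regime 2 (hazard) : beta(2, 5) observations
s <- integer(n_steps); y <- numeric(n_steps)
s[1] <- 1
for (t in 1:n_steps) {
  if (t > 1) s[t] <- sample(1:2, 1, prob = P_trans[s[t - 1], ])
  y[t] <- rbeta(1, beta_par[[s[t]]][1], beta_par[[s[t]]][2])
}
## Recursive filter: predict with the transition matrix, weight by the densities, renormalize.
belief <- matrix(NA, nrow = n_steps, ncol = 2)
s_hat  <- c(0.5, 0.5)
for (t in 1:n_steps) {
  nu_y  <- c(dbeta(y[t], beta_par[[1]][1], beta_par[[1]][2]),
             dbeta(y[t], beta_par[[2]][1], beta_par[[2]][2]))
  s_hat <- nu_y * as.vector(t(P_trans) %*% s_hat)
  s_hat <- s_hat / sum(s_hat)
  belief[t, ] <- s_hat
}
plot(y, type = "l", col = "red", ylim = c(0, 1), xlab = "t", ylab = "")
lines(as.numeric(s == 2), col = "black")   # true regime indicator
lines(belief[, 2], col = "blue")           # filtered probability of regime 2
```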
Let us now address an implementation of Algorithm 1. Having defined the appropriate operators, we introduce arrays as data containers, which are then filled with the corresponding quantities.
After such an initialization, the backward induction is run.
Figure 3 depicts the value function for the state $p = 2$ depending on the belief state, such that the x-axis is interpreted as the conditional probability that the system is in the deterministic growth regime. That is, the end points of the graph represent the values of the expected return of the optimal strategy under the condition that it is known with certainty that the system starts in random growth (left end) or in deterministic growth (right end). Observe that these values are close to 2660.412 and 3012.010 obtained in the classical Markov decision setting. Naturally, uncertainty about the current regime yields intermediate values, connected by a convex curve.
Figure 4 illustrates the choice of the optimal action for the state $p = 3$ depending on the belief state, depicted in the same way as in Figure 3. Again, the end points of the graph represent optimal actions conditioned on certainty about the current regime. As expected, the optimal decision switches from $a = 2$ (act) to $a = 1$ (be idle) at a belief probability of approximately 0.55.

7. Conclusions

Having utilized a number of specific features of our problem class, we suggest a simple, reliable, and easy-to-implement algorithm that can provide a basis for rational sequential decision making under uncertainty. Our results can be useful if there are no historical data about what could go wrong, or when high requirements on risk assessment are imposed. In such situations, we suggest encoding all relevant worst-case scenarios as potential regime changes, making conservative a priori assumptions on the regime change probabilities. Using our algorithm, all optimal strategies can be efficiently examined in such a context, giving useful insights into parameter sensitivity and the model risk associated with this sensitivity. The author believes that the suggested algorithm can help to gain a better understanding of risks and opportunities in this context.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Puterman, M. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley: New York, NY, USA, 1994.
2. Smallwood, R.D.; Sondik, E.J. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon. Oper. Res. 1973, 21, 1071–1088.
3. Hinz, J.; Tarnopolskaya, T.; Yee, T. Efficient algorithms of pathwise dynamic programming for decision optimization in mining operations. Ann. Oper. Res. 2020, 286, 583–615.
4. Tsiropoulou, E.E.; Kastrinogiannis, T.; Symeon, P. Uplink Power Control in QoS-aware Multi-Service CDMA Wireless Networks. J. Commun. 2009, 4.
5. Huang, X.; Hu, F.; Ma, X.; Krikidis, I.; Vukobratovic, D. Machine Learning for Communication Performance Enhancement. Wirel. Commun. Mob. Comput. 2018, 2018, 3018105:1–3018105:2.
6. Bäuerle, N.; Rieder, U. Markov Decision Processes with Applications to Finance; Springer: Heidelberg, Germany, 2011.
7. Powell, W.B. Approximate Dynamic Programming: Solving the Curses of Dimensionality; Wiley: Hoboken, NJ, USA, 2007.
8. Belomestny, N.; Kolodko, A.; Schoenmakers, J. Regression methods for stochastic control problems and their convergence analysis. SIAM J. Control Optim. 2010, 48, 3562–3588.
9. Carriere, J.F. Valuation of the Early-Exercise Price for Options Using Simulations and Nonparametric Regression. Insur. Math. Econ. 1996, 19, 19–30.
10. Egloff, D.; Kohler, M.; Todorovic, N. A dynamic look-ahead Monte Carlo algorithm. Ann. Appl. Probab. 2007, 17, 1138–1171.
11. Tsitsiklis, J.N.; Van Roy, B. Regression Methods for Pricing Complex American-Style Options. IEEE Trans. Neural Netw. 2001, 12, 694–703.
12. Tsitsiklis, J.N.; Van Roy, B. Optimal Stopping of Markov Processes: Hilbert Space, Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives. IEEE Trans. Automat. Contr. 1999, 44, 1840–1851.
13. Longstaff, F.; Schwartz, E. Valuing American options by simulation: A simple least-squares approach. Rev. Financ. Stud. 2001, 14, 113–147.
14. Ormoneit, D.; Glynn, P. Kernel-Based Reinforcement Learning. Mach. Learn. 2002, 49, 161–178.
15. Ormoneit, D.; Glynn, P. Kernel-Based Reinforcement Learning in Average-Cost Problems. IEEE Trans. Automat. Contr. 2002, 47, 1624–1636.
16. Fan, J.; Gijbels, I. Local Polynomial Modelling and Its Applications; Chapman and Hall: London, UK, 1996.
17. Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Nashua, NH, USA, 1996.
18. Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. 1998, 101, 99–134.
19. Lovejoy, W.S. Computationally Feasible Bounds for Partially Observed Markov Decision Processes. Oper. Res. 1991, 39, 162–175.
20. Shani, G.; Pineau, J.; Kaplow, R. A survey of point-based POMDP solvers. Auton. Agents Multi-Agent Syst. 2013, 27, 1–51.
21. Pineau, J.; Gordon, G.; Thrun, S. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI'03), Acapulco, Mexico, 9–15 August 2003; pp. 1025–1032.
22. Hinz, J. Optimal stochastic switching under convexity assumptions. SIAM J. Control Optim. 2014, 52, 164–188.
23. Hinz, J.; Yap, N. Algorithms for optimal control of stochastic switching systems. Theory Probab. Appl. 2015, 60, 770–800.
24. Hinz, J.; Yee, J. Optimal forward trading and battery control under renewable electricity generation. J. Bank. Financ. 2017.
25. Hinz, J.; Yee, J. Stochastic switching for partially observable dynamics and optimal asset allocation. Int. J. Control 2017, 90, 553–565.
26. Hinz, J.; Yee, J. Rcss: R package for optimal convex stochastic switching. R J. 2018.
27. Hinz, J.; Yee, J. Algorithmic Solutions for Optimal Switching Problems. In Proceedings of the 2016 Second International Symposium on Stochastic Models in Reliability Engineering, Life Science and Operations Management (SMRLO), Beer Sheva, Israel, 15–18 February 2016; pp. 586–590.
28. Buongiorno, J. Generalization of Faustmann's Formula for Stochastic Forest Growth and Prices with Markov Decision Process Models. For. Sci. 2001, 47, 466–474.
Figure 1. Stylized implementation of the control flow within a realistic application.
Figure 2. Belief state evolution: observations depicted by the red line, the probability that the system is in regime 2 by the black (true state) and blue (filtered belief) lines.
Figure 3. Value function for state 2 depending on the belief probability.
Figure 4. Optimal action for state 3 depending on the belief probability.
Table 1. Notations and measurement units.

stochastic kernel $K_t^a$ | vector with unit entries $\mathbf{1}$ | volume: cubic meter
controlled probability $P_{x,\pi}$ | $l_1$-norm $\|\cdot\|$ | area: hectare
controlled expectation $E_{x,\pi}$ | maximum $\bigvee$ | currency: USD
Bellman operator $\mathcal{T}_t$ | binding by rows $\sqcup$ | costs: USD per hectare
Table 2. Discrete states and their control costs.

           | State 1 | State 2 | State 3 | State 4 | State 5 | State 6
volume     | 0       | 29      | 274     | 530     | 728     | 868
acting     | −494    | −117    | 3068    | 6396    | 8970    | 10,790
being idle | 0       | 0       | 0       | 0       | 0       | 0