Article

Modeling the Decision and Coordination Mechanism of Power Battery Closed-Loop Supply Chain Using Markov Decision Processes

School of Business, Jiangnan University, Wuxi 214122, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(11), 4329; https://doi.org/10.3390/su16114329
Submission received: 6 March 2024 / Revised: 10 May 2024 / Accepted: 17 May 2024 / Published: 21 May 2024
(This article belongs to the Section Economic and Business Aspects of Sustainability)

Abstract

With the rapid growth of the new energy vehicle market, efficient management of the closed-loop supply chain of power batteries has become an important issue. Effective closed-loop supply chain management is critical because it bears on the efficient utilization of resources, environmental responsibility, and the realization of economic benefits. In this paper, the Markov Decision Process (MDP) is used to model the decision-making and coordination mechanism of the closed-loop supply chain of power batteries in order to cope with challenges in the management process such as cost, quality, and technological progress. By constructing MDP models for the different supply chain participants, this paper investigates optimization strategies for the supply chain and applies two solution methods: dynamic programming and reinforcement learning. The case study results show that the model can effectively identify optimized supply chain decisions, improve the overall efficiency of the supply chain, and coordinate the interests of the parties. The contribution of this study is to provide a new modeling framework for power battery recycling and to demonstrate the practicality and effectiveness of the method with empirical data. This study demonstrates that the Markov decision process can be a powerful tool for closed-loop supply chain management, promotes a deeper understanding of the complex decision-making environment of the supply chain, and provides a new solution path for decision-making and coordination in the supply chain.

1. Introduction

Against the background of globalized energy structure transformation and rising awareness of environmental protection, the new energy vehicle industry has ushered in unprecedented development opportunities. As a core component of new energy vehicles, the performance and life of power batteries directly affect the promotion and sustainable development of new energy vehicles. However, with the popularization of new energy vehicles, the treatment and recycling of used power batteries have gradually become the focus of social concern, and effective solutions are urgently needed. Taking a new energy vehicle manufacturer in China as a case study, this study discusses in depth the decision-making and coordination mechanism of the closed-loop supply chain of power batteries by collecting and analyzing relevant data, aiming to provide new ideas for scientific management of this environmental and resource issue.
This paper is built on the Markov Decision Process (MDP), a powerful theoretical tool that has been widely used in fields such as supply chain management, inventory control, and transportation scheduling. By describing decision making in an uncertain environment, the MDP model offers a new perspective on supply chain management problems such as the high recycling cost of power batteries, underdeveloped recycling channels, and unstable quality. In this study, it is applied to the decision analysis of the closed-loop supply chain of power batteries, not only to design an effective recycling network, pricing strategy, reuse decision, and inventory control scheme but also to maximize the interests of all supply chain parties in a dynamic and uncertain environment.
The purpose of this study is to construct a scientific decision-making model to provide theoretical and practical guidance for the management of the closed-loop supply chain of power batteries. Through case studies, this study verifies the validity and practicability of the constructed model, which provides decision-making support for relevant enterprises and helps to promote the efficient recycling and utilization of used power batteries. Meanwhile, the innovation of this study is that multiple decision makers and objectives in the supply chain are considered comprehensively, a decentralized decision-making mechanism and centralized coordination mechanism are designed, and the overall performance of the supply chain under different strategies is evaluated through simulation experiments. These research results not only enrich the theoretical system in the field of supply chain management but also provide new ideas and methods for the optimization and management of the closed-loop supply chain of power batteries, which is of great theoretical and practical significance.

2. Literature Review

This section reviews the state of research on closed-loop supply chain management for power batteries and on the application of the Markov decision process in supply chain management, and it points out the shortcomings of existing work and the room for improvement that this paper addresses.

2.1. Research Status of Power Battery Closed-Loop Supply Chain Management

Closed-loop supply chain management for power batteries covers the entire life cycle of power batteries, including production, use, recycling, and reuse to final disposal, and involves multiple participants, such as manufacturers, distributors, consumers, recyclers, and reusers. The goal of managing the closed-loop supply chain of power batteries is to simultaneously meet consumer demand, achieve efficient reuse of resources, and minimize environmental impacts.
The research provides insights into the following aspects of closed-loop supply chain management for power batteries:
In terms of recycling network design, the goal is to minimize recycling cost and maximize recycling efficiency through mathematical programming, heuristic algorithms, and meta-heuristic algorithms, as in the study of Jia Xianmin and Li Shaonan (2022) [1]. In terms of recycling pricing strategies, theories such as game theory and mechanism design have been used to determine the prices or subsidies that recyclers offer to consumers in order to increase the recycling rate of used power batteries, as exemplified by the research of Cai Xiaoqian and Lin Yiyi (2023) [2]. In the area of reuse decision making, multi-attribute decision-making methods are used to select the best reuse route for used power batteries to ensure that the reuse value is maximized, as in the work of Yang Shuai (2023) [3]. The inventory control problem, in turn, balances inventory costs and service levels through a Markov decision process to improve the efficiency of inventory management.
Current research focuses more on the discrete aspects of closed-loop supply chains and lacks coordination and optimization from a holistic perspective. Meanwhile, many studies are based on deterministic or static assumptions, ignoring dynamic and uncertain factors such as demand fluctuations, price changes, and technological advances in the supply chain. In view of this, this paper adopts the Markov decision-making process to model the decision-making and coordination mechanism of the closed-loop supply chain of power batteries, aiming to analyze the impact of these uncertainties and dynamics on the interests of all parties in the supply chain and the overall efficiency, in order to promote the theoretical and practical development in the field of closed-loop supply chain management.

2.2. Research on the Application of the Markov Decision Process in Supply Chain Management

The Markov Decision Process (MDP) is a mathematical model that describes the process of decision making in uncertain environments. An MDP consists of a set of states, a set of actions, a transfer probability, and a reward function. Solving an MDP yields the optimal strategy, i.e., the action to take in each state to maximize the expected reward. MDP has a wide range of applications in supply chain management, such as inventory control, demand management, and transportation scheduling.
Inventory control is the determination of inventory levels and replenishment strategies to balance inventory costs and service levels. The main approaches to inventory control are the economic order quantity model, the newsvendor model, and the Markov decision process. Both the economic order quantity model and the newsvendor model are based on deterministic or static assumptions and ignore the uncertainty and dynamics of demand and supply. MDP can consider the stochastic and time-varying nature of demand and supply and solve for the optimal replenishment strategy by methods such as dynamic programming or reinforcement learning. For example, Zhou Yang et al. (2023) [4] proposed a dynamic supplier classification management model based on the Markov decision process, which can dynamically adjust the inventory level and replenishment strategy according to changes in suppliers; Huang Shuai-Bo et al. (2022) [5] proposed an energy management strategy model based on the Markov decision process, which dynamically determines the energy inventory level and replenishment strategy of charging stations according to factors such as the grid price, charging demand, storage devices, and renewable energy sources; Riccardo A et al. (2023) [6], based on an Italian survey, analyzed the impact of Industry 4.0 technologies on the performance of closed-loop supply chains, including the impact on inventory control, e.g., real-time data sharing, intelligent forecasting, and adaptive adjustments can improve the efficiency and accuracy of inventory control.
Demand management is the process of forecasting and influencing consumer demand for products or services in order to improve the efficiency and competitiveness of the supply chain. The main methods of demand management are statistical forecasting, collaborative forecasting, and demand signaling. Statistical forecasting and collaborative forecasting are based on historical data or information sharing to predict future demand, while ignoring the uncertainty and variability of consumer behavior. MDP can consider the stochastic and time-varying nature of consumer behavior and solve the optimal demand management strategy through methods such as dynamic programming or reinforcement learning. For example, Liu Zhengyuan et al. (2023) [7] established a framework for supply chain reliability analysis based on a propagation dynamics model, in which the effects of two mechanisms, information propagation and influence propagation, on supply chain demand management are considered, e.g., information propagation improves supply chain adaptability and synergy, and influence propagation improves supply chain stability and resilience; Zhang Xuelong et al. (2019) [8] established a Markov-chain-based supply chain trust evolution game model, which can analyze the impact of the trust relationship between the nodes in the supply chain on the supply chain demand management, showing that the trust relationship can promote the behavior of information sharing, demand coordination, risk sharing, etc., so as to improve the supply chain’s demand satisfaction rate and the demand response rate; Feng Z et al. (2023) [9] investigated the participation of an “Internet+Recycling” platform in the selection strategy of a two-tier remanufacturing closed-loop supply chain. The strategy involves demand management issues, such as how to determine the appropriate “Internet+Recycling” platform participation methods according to the demand of different types of consumers for recycled products or services, so as to improve the attractiveness of recycling demand and recycling efficiency.
Transportation scheduling refers to determining the allocation and scheduling of transportation resources to minimize transportation costs and maximize transportation efficiency. The main methods of transportation scheduling are mathematical programming, heuristic algorithms, and meta-heuristic algorithms. Mathematical programming and heuristic algorithms are based on deterministic or static assumptions and ignore the uncertainty and dynamics of the transportation demand and transportation environment. MDP can consider the stochastic and time-varying nature of the transportation demand and transportation environment and solve for the optimal transportation scheduling strategy by methods such as dynamic programming or reinforcement learning. For example, Ali P et al. (2023) [10] incorporated the vehicle routing problem into the optimization model of a closed-loop supply chain network, considered factors such as product demand, recycling volume, transportation cost, and environmental impacts, developed a mixed-integer linear programming model to minimize the total cost of the closed-loop supply chain network, and proposed an effective solution algorithm; Mehrnaz B et al. [11] designed a new closed-loop supply chain network with a location–allocation and routing model that considers simultaneous recycling and distribution and optimizes under uncertainty. The model involves problems in transportation scheduling, such as how to determine appropriate location, allocation, and routing schemes according to the recycling and distribution demands in different regions and time periods, so as to optimize transportation cost, transportation time, transportation distance, and other metrics; Hao G et al. [12] proposed a hybrid differential evolutionary algorithm for solving the location–inventory problem in a closed-loop supply chain with product recycling. The problem involves transportation scheduling aspects, such as how to determine the appropriate location, inventory level, and transportation resource allocation scheme according to the recycling and distribution demand in different regions and time periods, so as to optimize transportation cost, transportation time, transportation distance, and other metrics.
The above studies mainly focus on single or partial links in the supply chain and less on the coordination and optimization of the whole supply chain system. In addition, most of these studies are based on a single decision maker or a single objective, while ignoring the existence of multiple decision makers or multiple objectives in the supply chain, such as profit, cost, service, environment, and so on. Therefore, this paper attempts to model the decision-making and coordination mechanism of a closed-loop supply chain for power batteries from a holistic and multi-objective perspective using a Markov decision process and analyze its impact on the interests of all parties in the supply chain and the overall efficiency.

2.3. Research Progress in Closed-Loop Supply Chain and Reverse Logistics

Investigation into closed-loop supply chains and reverse logistics is increasingly crucial given the growing significance of environmental and resource conservation. Closed-loop supply chains are an advanced form of the traditional supply chain that facilitate the return of products from consumers and their subsequent reintegration into manufacturing cycles. Within these closed-loop systems, reverse logistics is the critical component that handles product circulation.
Reverse logistics serves a distinct function within conventional supply chains, concentrating on the efficient retrieval and treatment of returned items and waste, followed by their reintegration as resources into manufacturing and consumer activities. In the research conducted by Kolyaei et al. [13], they implemented an integrated robust optimization strategy for designing closed-loop supply chain networks. This approach addresses the challenge of optimizing supply chain configurations amidst uncertainty, thereby effectively minimizing risks and enhancing operational efficiency. The current body of literature indicates that proficient management of closed-loop supply chains can lead to beneficial outcomes in economic, environmental, and societal spheres. In another study, Gu et al. [14] examined the influence of governmental incentives on the recycling strategies for electric vehicle batteries. They assessed the economic repercussions of these recycling initiatives through the incorporation of policy analysis.
Reverse logistics research also covers a wide range of thematic areas, including return processing, remanufacturing, product repair, and material recovery. In this study, reverse logistics, as a component of the closed-loop supply chain, focuses on how to maximize the efficiency benefits at the supply chain level, especially in the recycling of power batteries for new energy vehicles, aiming to solve the problem of how to reintroduce discarded power batteries into the production process as a kind of resource to achieve the sustainable use of resources. The research in this paper draws on existing literature and further analyzes and explores new paths for decision making in uncertain and dynamic environments within the framework of Markov decision-making process applications.

3. Markov Decision Process Model

3.1. Fundamentals of the Markov Decision Process

The Markov Decision Process (MDP) is a mathematical model that describes the process of making decisions in uncertain environments. MDP consists of a state set, an action set, a transfer probability, and a reward function. The set of states is all possible situations faced by the decision maker, the set of actions is all possible actions that the decision maker can take in each state, the transfer probability is the probability of transferring to the next state after each action is taken in each state, and the reward function is the instantaneous reward obtained after each action is taken in each state.
The solution methods of MDP mainly include two categories: dynamic programming and reinforcement learning. Dynamic programming methods are based on the known information of the model and find the optimal policy through value iteration or policy iteration algorithms. The value iteration algorithm derives the optimal policy by continuously updating the long-term expected reward value of the state until the optimal value function is found; the policy iteration algorithm directly optimizes the policy and finds the optimal policy by iterating through the policy evaluation and improvement steps. Reinforcement learning methods, on the other hand, do not rely on complete information about the model but learn the optimal policy through interaction with the environment. Monte Carlo algorithms and temporal difference algorithms are the two main approaches to reinforcement learning, with the former learning based on complete rounds of data and the latter updating the policy through immediate feedback at each step.
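For reference, the value iteration update described above can be stated compactly as the Bellman optimality recursion (standard MDP notation, with state set $S$, action set $A$, transition probability $P$, reward $R$, and discount factor $\gamma \in [0,1)$):

$$V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V_k(s')\bigr], \qquad \pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]$$

Iteration stops once $\max_s |V_{k+1}(s) - V_k(s)|$ falls below a chosen convergence threshold, and the greedy policy $\pi^*$ is then read off from the converged value function $V^*$.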
In the context of the closed-loop supply chain of power batteries, the application of the MDP model has significant advantages. Wei Guoxin et al. (2019) [15], through research based on a Markov model, proposed a battery life prediction method, which provides a theoretical basis for the maintenance and replacement of power batteries. Yang Zhe (2018) [16], on the other hand, utilized a strategy based on the Markov algorithm to smooth the power allocation of hybrid vehicles and improve the energy utilization efficiency. These studies show that the MDP model can effectively deal with the uncertainty and dynamics in the closed-loop supply chain of power batteries and provides a powerful tool for decision making and coordination in the supply chain.

3.2. Modeling the Markov Decision Process of Power Battery Closed-Loop Supply Chain

In this paper, we consider a closed-loop supply chain for power batteries consisting of manufacturers, distributors, consumers, recyclers, and reusers. The manufacturer is responsible for producing new power batteries and selling them to the distributor, the distributor is responsible for providing new power batteries and recycling services to the consumer, the consumer is responsible for using the power batteries and delivering them to the distributor or recycler, the recycler is responsible for recovering used power batteries from the consumer or the distributor and selling them to the reuse vendor, and the reuse vendor is responsible for reusing the used power batteries and selling them to the manufacturer or the distributors. In this paper, it is assumed that all parties in the supply chain behave rationally and self-interestedly, i.e., each participant tries to maximize its own profit.
In order to model the Markov decision process of the closed-loop supply chain of power batteries, the state set, action set, transfer probability, and reward function need to be determined. Since there are multiple decision makers in the supply chain, this paper adopts the framework of the Multi-Agent Markov Decision Process (MAMDP), i.e., each decision maker has its own set of states, set of actions, transfer probabilities, and reward functions, but their decisions affect each other. The MDP model for each decision maker is explained separately below.
Manufacturer’s MDP model:
State set: The state of the manufacturer consists of two variables, the current inventory of new power cells held by the manufacturer and the inventory of reused power cells. Assume that both the manufacturer’s inventory of new power cells and the inventory of reused power cells are discrete and have upper bounds, which are denoted as $N_{\max}$ and $R_{\max}$, respectively. Then, the manufacturer’s state set is $S_M = \{(n, r) \mid n = 0, 1, \ldots, N_{\max};\ r = 0, 1, \ldots, R_{\max}\}$.
Action set: The manufacturer’s action consists of two variables, the number of new power cells to be produced by the manufacturer in the next period and the number of reused power cells to be purchased from the reutilizer. Assume that the number of new power cells to be produced and the number of reused power cells to be purchased by the manufacturer are discrete and have upper bounds, which are denoted as $P_{\max}$ and $B_{\max}$, respectively. Then, the set of the manufacturer’s actions is $A_M = \{(p, b) \mid p = 0, 1, \ldots, P_{\max};\ b = 0, 1, \ldots, B_{\max}\}$.
Transfer probability: The transfer probability of a manufacturer is given by the following equation:
$$P_M(n', r' \mid n, r, p, b) = \begin{cases} P_D(n - p - b + n')\, P_R(r + b - r'), & \text{if } 0 \le n - p - b + n' \le N_{\max} \text{ and } 0 \le r + b - r' \le R_{\max} \\ 0, & \text{otherwise} \end{cases}$$
where $P_D(x)$ denotes the probability that the distributor orders $x$ new or reused power cells from the manufacturer in the next period, and $P_R(x)$ denotes the probability that the reuser provides $x$ reused power cells to the manufacturer in the next period. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson or normal distribution.
Reward function: The manufacturer’s reward function is given by the following equation:
$$R_M(n, r, p, b) = c_p p + c_b b - c_n n - c_r r$$
where $c_p$ denotes the manufacturer’s unit cost of producing a new power cell, $c_b$ denotes the unit price at which the manufacturer purchases a reused power cell, $c_n$ denotes the manufacturer’s unit inventory cost of holding a new power cell, and $c_r$ denotes the manufacturer’s unit inventory cost of holding a reused power cell. This paper assumes that these parameters are known and constant.
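To make the manufacturer’s model concrete, the sketch below shows one way to enumerate $S_M$ and $A_M$ and to assemble the transition probability and reward arrays consumed by the value iteration code in Section 4.3. The inventory bounds, Poisson rates, and cost parameters are illustrative placeholders only (not the case-study values in Table 1), and the inventory-balance expressions simply mirror the transition and reward equations given above.

import numpy as np
from itertools import product
from scipy.stats import poisson

# Illustrative bounds and parameters (placeholders, not the Table 1 values)
N_MAX, R_MAX = 3, 2            # inventory caps for new / reused power cells
P_MAX, B_MAX = 2, 2            # per-period production / purchase caps
LAMBDA_D, LAMBDA_R = 1.5, 1.0  # assumed Poisson rates for P_D and P_R
c_p, c_b, c_n, c_r = 5.0, 3.0, 0.5, 0.3

states = list(product(range(N_MAX + 1), range(R_MAX + 1)))   # S_M = {(n, r)}
actions = list(product(range(P_MAX + 1), range(B_MAX + 1)))  # A_M = {(p, b)}

num_s, num_a = len(states), len(actions)
transition_probs = np.zeros((num_s, num_a, num_s))
rewards = np.zeros((num_s, num_a, num_s))

for si, (n, r) in enumerate(states):
    for ai, (p, b) in enumerate(actions):
        for sj, (n2, r2) in enumerate(states):
            d = n - p - b + n2   # argument of P_D in the transition equation above
            x = r + b - r2       # argument of P_R in the transition equation above
            if 0 <= d <= N_MAX and 0 <= x <= R_MAX:
                transition_probs[si, ai, sj] = poisson.pmf(d, LAMBDA_D) * poisson.pmf(x, LAMBDA_R)
            rewards[si, ai, sj] = c_p * p + c_b * b - c_n * n - c_r * r  # reward function above
        row_sum = transition_probs[si, ai].sum()
        if row_sum > 0:
            # Renormalize each (state, action) row so the truncated distribution sums to one
            transition_probs[si, ai] /= row_sum

# These arrays can be passed directly to value_iteration(transition_probs, rewards) in Algorithm 1.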
Distributor’s MDP model:
State set: The state of a distributor consists of four variables, namely the current inventory of new power cells held by the distributor, the inventory of reused power cells, the inventory of used power cells, and the consumer demand for new power cells. It is assumed that the distributor’s inventory of new power batteries, inventory of reused power batteries, and inventory of used power batteries are discrete and have upper bounds, which are denoted as $N_{\max}$, $R_{\max}$, and $W_{\max}$, respectively. Suppose the consumer demand for new power batteries is also discrete and has an upper bound, denoted as $D_{\max}$. Then, the set of states of the distributor is $S_D = \{(n, r, w, d) \mid n = 0, 1, \ldots, N_{\max};\ r = 0, 1, \ldots, R_{\max};\ w = 0, 1, \ldots, W_{\max};\ d = 0, 1, \ldots, D_{\max}\}$.
Action set: The distributor’s action consists of four variables, namely, the number of new power cells and the number of reused power cells that the distributor will order from the manufacturer, the number of used power cells that the distributor will sell to the recycler, and the number of used power cells that the distributor will buy from the recycler in the next period. It is assumed that the number of new power cells ordered and the number of reused power cells ordered by the distributor are discrete and have upper bounds, denoted as $O_{\max}$ and $U_{\max}$, respectively. Assume that the number of used power cells sold and the number of used power cells purchased by the distributor are discrete and have upper bounds, denoted as $S_{\max}$ and $T_{\max}$, respectively. Then, the set of actions of the distributor is $A_D = \{(o, u, s, t) \mid o = 0, 1, \ldots, O_{\max};\ u = 0, 1, \ldots, U_{\max};\ s = 0, 1, \ldots, S_{\max};\ t = 0, 1, \ldots, T_{\max}\}$.
Transfer probability: The transfer probability of a distributor is given by the following equation:
$$P_D(n', r', w', d' \mid n, r, w, d, o, u, s, t) = \begin{cases} P_M(o + u - n')\, P_C(d - n - r + d')\, P_R(s + t - w'), & \text{if } 0 \le o + u - n' \le N_{\max},\ 0 \le d - n - r + d' \le D_{\max},\ \text{and } 0 \le s + t - w' \le W_{\max} \\ 0, & \text{otherwise} \end{cases}$$
where $P_M(x)$ denotes the probability that the manufacturer provides new or reused power cells to the distributor in a quantity of $x$ in the next period, $P_C(x)$ denotes the probability that consumers purchase new power cells from, or deliver used power cells to, the distributor in a quantity of $x$ in the next period, and $P_R(x)$ denotes the probability that the recycler acquires from, or provides to, the distributor used power cells in a quantity of $x$ in the next period. In this paper, it is assumed that all these probabilities are known and obey some known probability distribution, such as the Poisson or normal distribution.
Reward function: The distributor’s reward function is given by the following equation:
$$R_D(n, r, w, d, o, u, s, t) = c_o o + c_u u + c_s s - c_n n - c_r r - c_w w$$
where $c_o$ denotes the unit price at which the distributor orders new power cells from the manufacturer, $c_u$ denotes the unit price at which the distributor orders reused power cells from the manufacturer, $c_s$ denotes the unit price at which the distributor sells used power cells to the recycler, $c_t$ denotes the unit price at which the distributor buys used power cells from the recycler, $c_n$ denotes the unit inventory cost of new power cells held by the distributor, $c_r$ denotes the unit inventory cost of reused power cells held by the distributor, and $c_w$ denotes the unit inventory cost of used power cells held by the distributor. This paper assumes that these parameters are known and constant.
MDP modeling for consumers:
State set: The consumer’s state consists of two variables, the remaining capacity of the power battery currently used by the consumer and the consumer’s demand for a new power battery. Assume that both the remaining capacity of the consumer’s power battery and the demand are discrete and have upper bounds, which are denoted as $C_{\max}$ and $D_{\max}$, respectively. Then, the set of consumer states is $S_C = \{(c, d) \mid c = 0, 1, \ldots, C_{\max};\ d = 0, 1, \ldots, D_{\max}\}$.
Action set: The consumer’s action consists of two variables, the number of new power cells the consumer will purchase from the distributor and the number of used power cells the consumer will deliver to the distributor or recycler in the next period. Assume that the number of new power batteries to be purchased and the number of used power batteries to be delivered by the consumer are discrete and have upper bounds, which are denoted as $B_{\max}$ and $F_{\max}$, respectively. Then, the set of consumer actions is $A_C = \{(b, f) \mid b = 0, 1, \ldots, B_{\max};\ f = 0, 1, \ldots, F_{\max}\}$.
Transfer probability: The transfer probability of a consumer is given by the following equation:
$$P_C(c', d' \mid c, d, b, f) = \begin{cases} P_U(c - b + f - c')\, P_D(d - b + d'), & \text{if } 0 \le c - b + f - c' \le C_{\max} \text{ and } 0 \le d - b + d' \le D_{\max} \\ 0, & \text{otherwise} \end{cases}$$
where $P_U(x)$ denotes the probability that the number of power batteries used by the consumer in the next period is $x$, and $P_D(x)$ denotes the probability that the consumer’s demand for new power batteries in the next period is $x$. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson or normal distribution.
Reward function: The reward function of the consumer is given by the following equation:
$$R_C(c, d, b, f) = c_v c + c_b b - c_f f - c_d d$$
where $c_v$ denotes the unit value of the power battery used by the consumer, $c_b$ denotes the unit price at which the consumer purchases a new power battery from a distributor, $c_f$ denotes the unit price at which the consumer delivers a used power battery to a distributor or recycler, and $c_d$ denotes the unit cost of the consumer’s demand for a new power battery. This paper assumes that these parameters are known and constant.
Recycler’s MDP model:
State set: The state of the recycler consists of one variable, i.e., the current inventory of used power batteries held by the recycler. Suppose the recycler’s inventory of used power batteries is discrete and has an upper limit, denoted as $W_{\max}$. Then, the state set of the recycler is $S_R = \{w \mid w = 0, 1, \ldots, W_{\max}\}$.
Action set: The recycler’s action consists of two variables, the number of used power cells the recycler wants to sell to the reuse vendor and the number of used power cells the recycler wants to buy from the distributor or consumer in the next period. Assume that the number of used power cells to be sold and the number of used power cells to be purchased by the recycler are discrete and have upper bounds, denoted as $S_{\max}$ and $T_{\max}$, respectively. Then, the action set of the recycler is $A_R = \{(s, t) \mid s = 0, 1, \ldots, S_{\max};\ t = 0, 1, \ldots, T_{\max}\}$.
Transfer probability: The transfer probability of a recycler is given by the following formula:
$$P_R(w' \mid w, s, t) = \begin{cases} P_U(s - w + w')\, P_D(t - w + w'), & \text{if } 0 \le s - w + w' \le W_{\max} \text{ and } 0 \le t - w + w' \le W_{\max} \\ 0, & \text{otherwise} \end{cases}$$
where $P_U(x)$ denotes the probability that the number of used power batteries the reuse vendor acquires from the recycler in the next period is $x$, and $P_D(x)$ denotes the probability that the number of used power batteries supplied by the distributor or consumer to, or acquired by, the recycler in the next period is $x$. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson or normal distribution.
Reward function: The reward function of the recycler is given by the following formula:
$$R_R(w, s, t) = c_s s + c_t t - c_w w$$
where $c_s$ denotes the unit price at which the recycler sells used power batteries to the reuse vendor, $c_t$ denotes the unit price at which the recycler buys used power batteries from the distributor or the consumer, and $c_w$ denotes the unit inventory cost of used power batteries held by the recycler. This paper assumes that these parameters are known and constant.
The MDP model for reutilizers:
State set: The state of the reuser consists of one variable, i.e., the current inventory of used power cells held by the reuser. Assume that the reutilizer’s inventory of used power cells is discrete and has an upper limit, denoted as $W_{\max}$. Then, the set of states of the reutilizer is $S_U = \{w \mid w = 0, 1, \ldots, W_{\max}\}$.
Action set: The action of a reuser consists of two variables, the number of reused power cells that the reuser wants to sell to a manufacturer or distributor in the next period and the number of used power cells that the reuser wants to buy from a recycler. Assume that the number of reused power cells to be sold and the number of used power cells to be purchased by the reuse vendor are discrete and have upper bounds, denoted as $U_{\max}$ and $T_{\max}$, respectively. Then, the set of actions of the reutilizer is $A_U = \{(u, t) \mid u = 0, 1, \ldots, U_{\max};\ t = 0, 1, \ldots, T_{\max}\}$.
Transfer probability: The transfer probability of the reutilizer is given by the following equation:
$$P_U(w' \mid w, u, t) = \begin{cases} P_M(u - w + w')\, P_R(t - w + w'), & \text{if } 0 \le u - w + w' \le W_{\max} \text{ and } 0 \le t - w + w' \le W_{\max} \\ 0, & \text{otherwise} \end{cases}$$
where $P_M(x)$ denotes the probability that the quantity of reused power batteries ordered from the reuse vendor by a manufacturer or distributor in the next period is $x$, and $P_R(x)$ denotes the probability that the quantity of used power batteries provided by the recycler to the reuse vendor in the next period is $x$. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson or normal distribution.
Reward function: The reward function of the reutilizer is given by the following equation:
$$R_U(w, u, t) = c_u u + c_t t - c_w w$$
where $c_u$ denotes the unit price at which the reuser sells reused power cells to the manufacturer or distributor, $c_t$ denotes the unit price at which the reuser buys used power cells from the recycler, and $c_w$ denotes the unit inventory cost of used power cells held by the reuser. This paper assumes that these parameters are known and constant.
So far, this paper has modeled the Markov decision-making process of each decision maker in the closed-loop supply chain of power batteries.

3.3. Methods for Solving the Model

Since there are multiple decision makers in the closed-loop supply chain of power batteries and their decisions affect each other, this paper adopts the framework of the Multi-Agent Markov Decision Process (MAMDP), which considers each decision maker in the supply chain as an agent, and each agent has its own set of states, set of actions, transfer probabilities, and reward function, but their transfer probabilities and reward functions are affected by the other agents’ actions. Therefore, this paper needs to consider the game and coordination problem in the supply chain, i.e., how to make each agent maximize its own interests and also maximize the overall efficiency of the supply chain.
In this paper, the following two methods are used to solve the MAMDP model:
Dynamic-programming-based approach: This approach is model-based, i.e., it requires knowledge of the transfer probability and reward function of each agent. The method has two steps: the first step is to use the concept of Nash Equilibrium (NE) to solve for the optimal combination of actions in each state, i.e., in each state, no agent can improve its long-term expected reward by changing its action alone. The second step is to use algorithms such as value iteration or policy iteration to solve for the optimal value function in each state, i.e., the maximum long-term expected reward that can be obtained after making decisions according to the optimal combination of actions in each state. The advantage of this method is that it is guaranteed to find the globally optimal solution, but the disadvantages are its high computational complexity and its need for complete and symmetric information.
Reinforcement-learning-based approach: This approach is based on data, i.e., instead of knowing the transfer probability and reward function of each agent, the optimal policy is learned by interacting with the environment to generate data. The approach has two steps: the first step is to use the concept of the Multi-Armed Bandit (MAB) to design an exploration–exploitation algorithm that allows each agent to balance the trade-off between exploring new actions and exploiting known ones during the learning process. The second step is to use algorithms such as Monte Carlo or temporal differencing to estimate the value function for each state or state–action pair and determine the optimal policy based on the estimates. The advantage of this method is that it can adapt to dynamically changing environments and does not require complete and symmetric information, but the disadvantage is that it may fall into local optimal solutions and has a slow convergence rate.

4. Case Studies

This section focuses on a specific case study to validate the effectiveness and practicality of the Markov decision process model of the closed-loop supply chain of power batteries established in this paper and to compare the performance and effectiveness of the two solution methods based on dynamic programming and based on reinforcement learning under different parameters and scenarios.

4.1. Case Selection and Data Collection

This paper takes a new energy vehicle manufacturer in China as the subject of the case study; the manufacturer produces a variety of new energy vehicles using lithium-ion power batteries and has established a closed-loop supply chain for power batteries involving multiple distributors, consumers, recyclers, and reusers. In this paper, based on the data provided by this manufacturer and the related literature, each parameter in the model is reasonably set and estimated, as shown in Table 1.
To simplify the model, the following points are assumed in this paper:
In this paper, only one manufacturer, one distributor, one consumer, one recycler, and one reuse provider are considered, i.e., competition and diversity in the supply chain are ignored.
In this paper, only one variety of power battery is considered, i.e., product differences and diversity in the supply chain are ignored.
In this paper, only one period of decision making is considered, i.e., temporal differences and diversity in the supply chain are ignored.
In this paper, we assume that all probability distributions are known and obey the Poisson distribution, i.e., uncertainty and complexity in the supply chain are ignored.
These assumptions are made to facilitate the solution and analysis of the model and do not affect the generality and scalability of the model. Future research could relax these assumptions to increase the adaptability and usefulness of the model.

4.2. Analysis Using Markov Decision Process Models

In this paper, two solution methods based on dynamic programming and based on reinforcement learning are implemented using the Python programming language and related mathematical and statistical libraries, and the models are simulated and analyzed. The following metrics are used in this paper to evaluate the performance and effectiveness of the models and methods:
Profit for each party in the supply chain: The total revenue minus the total cost earned by each decision maker over a period of time.
Overall supply chain efficiency: The average of the sum of the profits of the supply chain parties over a period of time.
Overall supply chain utility: The weighted average of the sum of the profits of the supply chain parties over a period of time, where the weights reflect the importance that the supply chain parties place on profits, and this can be set by the decision makers themselves or determined by the coordination mechanism.
Convergence of the model: Refers to whether the model is able to reach a stable state or strategy within a limited number of iterations or interactions.
Model robustness: Refers to the ability of the model to adapt to different parameters and scenarios, such as demand fluctuations, price changes, technological advances, etc.
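As a small illustration of the profit, efficiency, and utility indicators just listed, the helper below computes the overall efficiency and the weighted overall utility from per-party profits; the party names, profit figures, and weights in the example are hypothetical placeholders rather than values reported in this study.

def supply_chain_metrics(profits, weights=None):
    """Compute overall supply chain efficiency and utility from per-party profits.
    :param profits: dict mapping each supply chain party to its profit over the period.
    :param weights: optional dict of importance weights; equal weights are used if omitted.
    """
    parties = list(profits)
    if weights is None:
        weights = {p: 1.0 / len(parties) for p in parties}
    efficiency = sum(profits.values()) / len(parties)        # average of the parties' profits
    utility = sum(weights[p] * profits[p] for p in parties)  # weighted average of the profits
    return efficiency, utility

# Example with hypothetical profit figures
efficiency, utility = supply_chain_metrics(
    {'manufacturer': 120.0, 'distributor': 80.0, 'consumer': 30.0,
     'recycler': 25.0, 'reuser': 40.0})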
In this paper, we first compare the performance and effectiveness of the dynamic-programming-based and reinforcement-learning-based methods under different discount factors. The dynamic-programming-based approach is guaranteed to find the globally optimal solution, i.e., to maximize the overall efficiency and utility of the supply chain under any discount factor. In contrast, the reinforcement-learning-based method may fall into a local optimum, i.e., the overall supply chain efficiency and utility cannot be maximized under certain discount factors. In addition, the dynamic-programming-based method requires fewer iterations to converge, while the reinforcement-learning-based method requires more interactions to converge. Therefore, the dynamic-programming-based approach outperforms the reinforcement-learning-based approach when the information is complete and symmetric.

4.3. Python Implementation of the Model

4.3.1. Environment and Library Configuration

This study was implemented in Python 3.8, and the following key libraries were used:
numpy: For creating and manipulating arrays and matrices for efficient linear algebra operations.
scipy: Provides rich numerical computation tools, including linear programming solvers, etc., for numerical analysis and optimization in models.
gym: An open source toolkit for reinforcement learning environments that provides multiple test problems and algorithmic interfaces for easy simulation and evaluation.
stable_baselines3: A library of reinforcement learning algorithms built on PyTorch, used for implementing and testing different reinforcement learning strategies.
matplotlib: For data visualization, plotting graphs, and presenting results.

4.3.2. Implementation of the Dynamic Programming Algorithm

The implementation of the dynamic programming algorithms in this study consists of two main methods: value iteration and policy iteration (Algorithms 1 and 2):
Algorithm 1. Python code for the value iteration algorithm.

import numpy as np

def value_iteration(transition_probs, rewards, gamma=0.9, threshold=0.01):
    """
    Value iteration algorithm implementation.
    :param transition_probs: state transition probabilities, transition_probs[s][a][s'] = P(s' | s, a).
    :param rewards: reward function, rewards[s][a][s'] = immediate reward.
    :param gamma: discount factor.
    :param threshold: convergence threshold.
    :return: optimal policy and state values.
    """
    num_states = len(transition_probs)
    V = np.zeros(num_states)
    policy = np.zeros(num_states, dtype=int)

    while True:
        delta = 0
        for s in range(num_states):
            v = V[s]
            # Bellman optimality update: best expected one-step reward plus discounted value
            V[s] = max(
                sum(p * (rewards[s][a][s_prime] + gamma * V[s_prime])
                    for s_prime, p in enumerate(transition_probs[s][a]))
                for a in range(len(rewards[s]))
            )
            delta = max(delta, abs(v - V[s]))
        if delta < threshold:
            break

    # Extract the greedy policy from the converged value function
    for s in range(num_states):
        policy[s] = np.argmax([
            sum(p * (rewards[s][a][s_prime] + gamma * V[s_prime])
                for s_prime, p in enumerate(transition_probs[s][a]))
            for a in range(len(rewards[s]))
        ])
    return policy, V

# Example use (assuming state transition probabilities and rewards are defined)
# policy, V = value_iteration(transition_probs, rewards)
Value iteration: The value function of each state is updated by iteration until it converges to the optimal value function. Thereafter, the optimal policy is derived from the value function.
Policy iteration: A policy is randomly initialized, and then the optimal policy is found through iterations of policy evaluation and policy improvement, from which the optimal value function is derived (Algorithm 2).
Algorithm 2. Python code for the policy iteration algorithm.

def policy_evaluation(policy, transition_probs, rewards, gamma=0.9, threshold=0.01):
    """
    Policy evaluation: compute the value function of a fixed policy.
    """
    num_states = len(transition_probs)
    V = np.zeros(num_states)
    while True:
        delta = 0
        for s in range(num_states):
            v = V[s]
            a = policy[s]
            V[s] = sum(p * (rewards[s][a][s_prime] + gamma * V[s_prime])
                       for s_prime, p in enumerate(transition_probs[s][a]))
            delta = max(delta, abs(v - V[s]))
        if delta < threshold:
            break
    return V

def policy_iteration(transition_probs, rewards, gamma=0.9):
    """
    Policy iteration algorithm implementation.
    """
    num_states = len(transition_probs)
    policy = np.random.choice(len(rewards[0]), size=num_states)
    while True:
        V = policy_evaluation(policy, transition_probs, rewards, gamma)
        policy_stable = True
        # Policy improvement: make the policy greedy with respect to V
        for s in range(num_states):
            old_action = policy[s]
            policy[s] = np.argmax([
                sum(p * (rewards[s][a][s_prime] + gamma * V[s_prime])
                    for s_prime, p in enumerate(transition_probs[s][a]))
                for a in range(len(rewards[s]))
            ])
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable:
            break
    return policy, V

# Example use (assuming state transition probabilities and rewards are defined)
# policy, V = policy_iteration(transition_probs, rewards)
Both methods are implemented through Python loops and conditional statements, which ensure that the algorithm stops iterating once the convergence condition is reached.

4.3.3. Implementation of Reinforcement Learning Algorithm

Applications of reinforcement learning in this study include round-based and step-based algorithms:
Monte Carlo Algorithm: This is a round-based algorithm that estimates the value of a state or state–action by sampling complete rounds and optimizes it step by step.
Temporal difference algorithms, e.g., SARSA and Q-learning: These algorithms update state or state–action value estimates in real time based on single-step reward and state information (Algorithm 3).
Algorithm 3. Example of an OpenAI-Gym-based reinforcement learning setup.

import gym
from stable_baselines3 import PPO, DQN

# Create the environment
env = gym.make('YourMDPEnv-v0')  # assuming 'YourMDPEnv-v0' is a customized, registered environment

# Train with the PPO algorithm
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Train with the DQN algorithm
model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Test the trained model
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
These algorithms were implemented through Python programming and tested through the MDP environment interface provided by OpenAI Gym. Meanwhile, the PPO and DQN algorithms from the stable_baselines3 library were used for training to automate reinforcement learning.
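The PPO and DQN training above relies on library implementations and leaves the learning loop implicit. For completeness, the sketch below shows a minimal tabular Q-learning loop with epsilon-greedy exploration, corresponding to the temporal difference algorithms described earlier; the gym-style environment interface (integer state observations and the four-tuple step return) and the hyperparameter values are illustrative assumptions rather than the settings used in the case study.

import numpy as np

def q_learning(env, num_states, num_actions, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Minimal tabular Q-learning with epsilon-greedy exploration.
    Assumes a gym-style environment whose observations are integer state indices
    in [0, num_states) and whose actions are integers in [0, num_actions).
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy trade-off between exploring new actions and exploiting known ones
            if np.random.rand() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Temporal difference update of the state-action value estimate
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    policy = np.argmax(Q, axis=1)
    return policy, Q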

4.3.4. Analysis and Validation of Results

The results are analyzed mainly by calculating the profit metrics of each party in the supply chain. Learning curves were also drawn in this study to reflect the convergence of different algorithms. The impact of the discount factor on the performance of the algorithms was evaluated through parameter sensitivity analysis. The following is the Python code implementation of these steps.
1. Calculation of key performance indicators
Performance metrics typically include profit, efficiency, etc. Below is a sample function to calculate and output these metrics (Algorithm 4).
Algorithm 4. Python code for calculating key performance indicators.

def calculate_performance_metrics(states, rewards, policy, V):
    """
    Calculate and output performance metrics.
    :param states: collection of state indices.
    :param rewards: reward function indexed as rewards[s][a].
    :param policy: policy mapping each state to an action.
    :param V: state values.
    """
    total_profit = sum(rewards[s][policy[s]] for s in states)
    average_profit = total_profit / len(states)
    efficiency = sum(V[s] for s in states) / len(states)
    print("Total profit:", total_profit)
    print("Average profit:", average_profit)
    print("Efficiency:", efficiency)

# Example use (assumes states, rewards, policy, and V already exist)
# calculate_performance_metrics(states, rewards, policy, V)
2. Plotting the learning curve
Learning curves are an important tool for observing the progress and performance of model training. Below is a function to plot the learning curve (Algorithm 5).
Algorithm 5. Python code for drawing learning curves.

import matplotlib.pyplot as plt

def plot_learning_curve(rewards, title="Learning Curve"):
    """
    Plot the learning curve.
    :param rewards: reward obtained in each step or round.
    :param title: chart title.
    """
    plt.figure(figsize=(10, 5))
    plt.plot(rewards)
    plt.title(title)
    plt.xlabel('Number of rounds')
    plt.ylabel('Reward')
    plt.show()

# Example use (assuming a list of rewards already exists)
# plot_learning_curve(rewards)
3. Parameter sensitivity analysis
Parameter sensitivity analysis is the process of evaluating the impact of different parameter settings on model performance. The following is an example of a basic parameter sensitivity analysis (Algorithm 6).
Algorithm 6. Python code for parameter sensitivity analysis.

def sensitivity_analysis(param_range, env, model_class):
    """
    Perform sensitivity analysis over different parameter values.
    :param param_range: range of parameter values (here, discount factors).
    :param env: reinforcement learning environment.
    :param model_class: reinforcement learning model class (e.g., PPO or DQN).
    """
    performance_metrics = []
    for param in param_range:
        model = model_class('MlpPolicy', env, gamma=param, verbose=0)
        model.learn(total_timesteps=10000)
        # Evaluate model performance
        performance = evaluate_model(model, env)  # assumes evaluate_model is defined
        performance_metrics.append(performance)
    # Plot the results of the parameter sensitivity analysis
    plt.figure(figsize=(10, 5))
    plt.plot(param_range, performance_metrics)
    plt.title("Parameter sensitivity analysis")
    plt.xlabel('Parameter value')
    plt.ylabel('Performance indicator')
    plt.show()

# Example usage (assuming env and model_class already exist)
# sensitivity_analysis(np.linspace(0.1, 0.9, 9), env, PPO)

5. Decision-Making and Coordination Mechanisms

5.1. Decision-Making Mechanisms Based on Markovian Decision Process Models

The decision-making mechanism based on the Markov decision process model refers to the supply chain parties dynamically adjusting their actions to maximize their own interests based on the current state and future expectations. The decision mechanism has the following characteristics:
  • The decision-making mechanism is decentralized, i.e., each decision maker can make decisions independently without the need to communicate or consult with other decision makers.
  • The decision-making mechanism is adaptive, i.e., each decision maker can continuously update his/her state and strategy in response to changes and feedback from the environment in order to adapt to uncertainty and dynamics.
  • The decision mechanism is intelligent, i.e., each decision maker can learn and optimize to find the optimal or near-optimal action to improve his or her long-term desired reward.
The steps for the implementation of this decision-making mechanism are set out below:
Step 1: Initialization. Each decision maker needs to initialize its state, action, value function, and strategy. The state and action can be set according to the actual situation, the value function can be initialized randomly or zero, and the strategy can be initialized randomly or uniformly.
Step 2: Observation. Each decision maker needs to observe the current state and the actions of the other decision makers and calculate its own immediate reward based on the transfer probability and reward function.
Step 3: Learning. Each decision maker needs to update its value function and policy based on the observed data. Algorithms such as value iteration or policy iteration can be used if dynamic-programming-based methods are used; algorithms such as Monte Carlo or temporal differencing can be used if reinforcement-learning-based methods are used.
Step 4: Execution. Each decision maker needs to select an action and execute it based on the current state and the updated strategy.
Step 5: Repetition. Each decision maker needs to repeat steps 2 through 4 until the termination conditions are met, such as convergence, maximum number of iterations reached, or number of interactions.
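Read together, Steps 1 through 5 amount to the interaction loop sketched below. The agent class and the multi-agent environment interface (reset and step keyed by agent name) are hypothetical abstractions introduced only to illustrate the flow of the decentralized mechanism; the learning update shown is the same temporal difference rule discussed in Section 4.3.3.

import numpy as np

class TabularAgent:
    """Illustrative agent: tabular Q-learning with epsilon-greedy action choice."""
    def __init__(self, name, num_states, num_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.name = name
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = np.zeros((num_states, num_actions))        # Step 1: initialize value estimates and policy
        self.state = 0

    def act(self, state):
        self.state = state                                   # Step 2: observe the current state
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(self.Q.shape[1]))   # explore a new action
        return int(np.argmax(self.Q[state]))                 # Step 4: execute the current greedy policy

    def learn(self, action, reward, next_state):             # Step 3: update the value function
        td_target = reward + self.gamma * np.max(self.Q[next_state])
        delta = td_target - self.Q[self.state, action]
        self.Q[self.state, action] += self.alpha * delta
        return abs(delta)

def decentralized_decision_loop(agents, env, rounds=1000, tolerance=1e-3):
    """Each agent observes, acts, and learns independently; no central coordinator is involved."""
    states = {a.name: env.reset(a.name) for a in agents}     # hypothetical multi-agent environment interface
    for _ in range(rounds):                                  # Step 5: repeat until convergence
        actions = {a.name: a.act(states[a.name]) for a in agents}
        feedback = env.step(actions)                         # joint outcome of all agents' actions
        max_change = 0.0
        for a in agents:
            reward, next_state = feedback[a.name]
            max_change = max(max_change, a.learn(actions[a.name], reward, next_state))
            states[a.name] = next_state
        if max_change < tolerance:
            break
    return {a.name: np.argmax(a.Q, axis=1) for a in agents}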
This decision-making mechanism allows supply chain parties to maximize their own interests without a central coordinator or information sharing. However, it also has drawbacks: it may reduce the overall efficiency and utility of the supply chain, create an imbalance of interests among the supply chain parties, and leave the parties without a basis for mutual trust. Therefore, in the next section of this paper, a coordination mechanism will be designed to ameliorate these problems.

5.2. Coordination Mechanisms Based on Markov Decision Process Models

The coordination mechanism based on the Markov decision process model refers to the rational allocation of benefits and the use of incentives so that all parties in the supply chain take the maximization of the overall efficiency and utility of the supply chain into account while pursuing their own interests. This coordination mechanism has the following characteristics:
  • The coordination mechanism is centralized, i.e., a central coordinator is needed to design and implement the coordination mechanism, as well as to communicate or consult with all parties in the supply chain.
  • The coordination mechanism is contractual in nature, i.e., it requires a contract or agreement to bind the supply chain parties to their behaviors and responsibilities, as well as to specify the benefits and risks for each party in the supply chain.
  • The coordination mechanism is incentive-based, i.e., it needs to provide incentives or penalties to motivate supply chain parties to comply with the contract or agreement, as well as to promote the overall efficiency and utility of the supply chain.
The specific steps for the implementation of this coordination mechanism are set out below:
Step 1: Determine the objectives. The central coordinator needs to determine the objective function for the overall efficiency and utility of the supply chain and the degree of importance, i.e., the weights, that each party in the supply chain places on profit. The objective function can be linear or non-linear, and the weights can be fixed or variable.
Step 2: Design contracts. The central coordinator needs to design contracts or agreements to specify the actions that each supply chain party should take in each state, as well as the rewards or penalties to be assigned based on the actions and outcomes. The contract or agreement can be complete or incomplete, i.e., whether it contains all possible states and actions.
Step 3: Enforce the contract. The central coordinator needs to monitor whether supply chain parties are making decisions in accordance with the contract or agreement and enforce rewards or penalties based on decisions and results. If supply chain parties are found to have violated the contract or agreement, the central coordinator can take appropriate measures, such as termination of the contract, claims, and lawsuits.
Step 4: Update contracts. The central coordinator needs to update the objective function, weights, contracts or agreements, etc., in response to changes and feedback from the environment in order to adapt to uncertainty and dynamics and to improve the overall efficiency and utility of the supply chain.
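As an illustration of Step 1 (a sketch in generic MDP notation rather than the paper's formal specification), the coordinator's objective can be written as a weighted sum of the parties' expected discounted profits, where the weights w_i, per-period profit functions r_i, discount factor γ, and horizon T are introduced here only for exposition:

\[
\max_{\pi}\; \sum_{i=1}^{n} w_i\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_i(s_t, a_t)\right],
\qquad \sum_{i=1}^{n} w_i = 1,\quad w_i \ge 0,
\]

where \(\pi\) denotes the joint policy of all supply chain parties and \(n\) is the number of parties. With fixed weights the objective is linear in the parties' expected profits; letting the weights depend on the state or period is one way to obtain a non-linear, variable-weight objective.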
This coordination mechanism enables supply chain parties to maximize the overall efficiency and utility of the supply chain with the help of a central coordinator and information sharing. However, it also faces challenges, such as how to determine reasonable objective functions, weights, and contracts or agreements, and how to ensure the truthful reporting and good faith of supply chain parties. Therefore, the next section evaluates the effectiveness of this coordination mechanism and offers suggestions for improvement.
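Before turning to that evaluation, the following is a minimal, hypothetical sketch of the settlement logic referenced in Steps 2 and 3, in which the coordinator pays a bonus when the contracted action is taken and charges a penalty proportional to the deviation otherwise. The party, rates, and profit figures are placeholders for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical sketch of contract settlement (Steps 2 and 3); all numbers are illustrative.

@dataclass
class ContractTerm:
    contracted_action: int    # action the party agreed to take in a given state
    bonus_per_unit: float     # incentive paid per unit when the contracted action is met
    penalty_per_unit: float   # penalty charged per unit of deviation from the contract

def settle(term: ContractTerm, observed_action: int, outcome_profit: float) -> float:
    """Return the party's profit after the coordinator's contractual transfer."""
    deviation = abs(observed_action - term.contracted_action)
    if deviation == 0:
        transfer = term.bonus_per_unit * term.contracted_action
    else:
        transfer = -term.penalty_per_unit * deviation
    return outcome_profit + transfer

# Example: a recycler contracted to buy back 8 used batteries but bought back only 5.
term = ContractTerm(contracted_action=8, bonus_per_unit=50.0, penalty_per_unit=120.0)
print(settle(term, observed_action=5, outcome_profit=2000.0))   # 2000 - 3 * 120 = 1640.0
```

Step 4 would then correspond to re-estimating these contract terms from observed outcomes and updating them over time.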

5.3. Assessment of the Effectiveness of Coordination Mechanisms and Recommendations for Improvement

This paper evaluates the effectiveness of the coordination mechanism based on the Markov decision process model through simulation experiments and offers suggestions for improvement. The following indicators are used to evaluate the effectiveness of the coordination mechanism:
  • Rate of increase in the overall efficiency of the supply chain: the percentage increase in the overall efficiency of the supply chain under the coordination mechanism relative to the decentralized decision-making mechanism alone.
  • Rate of increase in the overall utility of the supply chain: the percentage increase in the overall utility of the supply chain under the coordination mechanism relative to the decentralized decision-making mechanism alone.
  • Equity of profit distribution among supply chain parties: whether, after the coordination mechanism is used, the distribution of profits among supply chain parties is in line with their contributions and expectations, and whether any imbalance or exploitation arises in the distribution.
  • Contract compliance rate of supply chain parties: whether, after the coordination mechanism is used, supply chain parties make decisions in accordance with the contract or agreement, and how often the contract or agreement is violated.
The first two indicators are formalized below. Based on different objective functions, weights, and contracts or agreements, different coordination mechanisms are designed, and their effects are compared under different parameters and scenarios.
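For concreteness, the first two indicators can be written as relative improvement rates; the symbols E and U for overall efficiency and utility, with subscripts for the decentralized (dec) and coordinated (coord) settings, are notation introduced here for illustration rather than taken from the model:

\[
\Delta E = \frac{E_{\mathrm{coord}} - E_{\mathrm{dec}}}{E_{\mathrm{dec}}} \times 100\%,
\qquad
\Delta U = \frac{U_{\mathrm{coord}} - U_{\mathrm{dec}}}{U_{\mathrm{dec}}} \times 100\%.
\]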
The simulation results show that, after the coordination mechanism is used, the overall efficiency and utility of the supply chain improve to varying degrees, but some problems remain, such as unfair profit distribution and a low contract compliance rate. Therefore, this paper puts forward the following suggestions for improvement:
  • When determining the objective function, the multiple objectives of the supply chain parties, such as profit, cost, service, and environment, should be considered, weighed, and balanced according to the actual situation and priorities.
  • When determining the weights, the benefit preferences and risk preferences of all parties in the supply chain should be taken into account, and the weights should be allocated and adjusted according to the actual situation and the principle of fairness.
  • When designing contracts or agreements, the incomplete and asymmetric information of the supply chain parties should be taken into account, and the contracts should be designed and optimized according to the actual situation and incentive principles.
  • When enforcing contracts or agreements, the truthfulness and good faith of the supply chain parties should be taken into account, and compliance should be monitored and enforced according to the actual situation and the principle of constraint.

5.4. Limitations of the Study and Future Prospects

In the research process of constructing a decision-making and coordination mechanism for the closed-loop supply chain of power batteries, the model in this paper relies on a series of assumptions, which has a certain impact on the research results. While the simplification of the model helps to focus on the analysis of core decision-making issues, it also limits the generalizability of the results. The model limits the number of supply chain participants to a single manufacturer, distributor, consumer, recycler, and reuse provider, excluding the phenomenon of competition in the supply chain. The assumption of no other competitors may deviate from the complex reality of the business environment. When markets are competitive, supply chain participants must make adjustments to their strategic decisions to remain competitive, a dynamic that the current study fails to encompass.
In addition, the types of power batteries involved are simplified to a single type. In practice, the supply chain needs to deal with multiple types of batteries, each of which may have different recycling, storage, and transportation needs, which adds additional complexity. This study focuses on a single decision cycle and fails to fully explore strategy changes under the influence of long-term and complex uncertainties. Future work could explore the adaptability and sustainability of supply chain decisions in long-term operations by incorporating multiple decision cycles.
In response to these limitations, future research could extend the model to reflect more competitors in the supply chain and consider different types of batteries. This would improve the utility and broad applicability of the model. At the same time, extending the model to a long-term multi-period decision-making environment and combining multiple probability distributions to consider more complex uncertainties can bring the model closer to actual business operations. These improvements are expected to provide finer and more comprehensive strategic recommendations for closed-loop supply chain management of power batteries, helping practitioners to optimize supply chain operations and achieve the dual goals of environmental sustainability and economic efficiency.

6. Conclusions and Outlook

This study successfully constructs and analyzes the decision-making and coordination mechanism of a closed-loop supply chain for power batteries by applying a Markov Decision Process (MDP) model. The core contribution of the study is the development of a comprehensive modeling framework that can effectively address the uncertainty and dynamics in the supply chain and provide coordination strategies for the multiple decision makers involved. Through case studies, this study validates the effectiveness of the proposed model and demonstrates the application of two solution methods, dynamic programming and reinforcement learning, under different conditions.
The research results show that the constructed MDP model can effectively improve the overall efficiency and effectiveness of the closed-loop supply chain of power batteries, which provides new perspectives and solutions for supply chain management. The model solution results reveal the optimal strategies that should be adopted by all parties in the supply chain under different decision cycles and probability distributions and how to maximize the overall efficiency through the coordination mechanism.
However, there are some limitations to the research. The current model is based on a series of simplifying assumptions, such as a single participant of each type in the supply chain and a single battery type, which limits the generalizability of the model. Future research can improve the utility and adaptability of the model by introducing more supply chain participants, multiple battery types, and a long-term multi-period decision-making environment. In addition, exploring different probability distributions and using data-driven approaches to learn about uncertainty in the supply chain would further enhance the accuracy and applicability of the model.
In summary, this study provides an innovative solution to the problem of decision-making and coordination in the closed-loop supply chain of power batteries and points out the direction for future research. The research results are not only theoretically significant but also provide a valuable reference for supply chain management in practice. Despite the limitations, the results of this study lay a solid foundation for further exploration in the field of supply chain management.

Author Contributions

Conceptualization, N.L.; Data curation, J.L.; Writing—original draft, N.L.; Writing—review & editing, H.Z.; Project administration, N.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Table 1. Model parameter settings.

Parameter | Meaning | Value | Unit
Nmax | Inventory cap on new power batteries for manufacturers and distributors | 100 | —
Rmax | Inventory cap on reused power batteries for manufacturers and distributors | 100 | —
Wmax | Inventory cap on used power batteries for distributors, recyclers, and reusers | 100 | —
Cmax | Maximum remaining capacity of consumers' power batteries | 100 | kWh
Dmax | Cap on consumer demand for new power batteries | 10 | —
Pmax | Cap on the manufacturer's production quantity of new power batteries | 10 | —
Bmax | Cap on the number of reused power batteries purchased by manufacturers and consumers | 10 | —
Omax | Maximum number of new power batteries ordered by distributors | 10 | —
Umax | Maximum number of reused power batteries sold by distributors and reusers | 10 | —
Smax | Maximum quantity of used power batteries sold by distributors and recyclers | 10 | —
Tmax | Maximum quantity of used power batteries purchased by distributors and recyclers | 10 | —
Bmax | Cap on the number of new power batteries purchased by consumers | 10 | —
Fmax | Cap on the number of used power batteries delivered (returned) by consumers | 10 | —
cp | Manufacturer's unit cost of producing a new power battery | 1000 | CNY
cb | Manufacturer's unit purchase price for a reused power battery | 5000 | CNY
cn | Manufacturer's unit inventory cost for holding new power batteries | 50 | CNY
cr | Manufacturer's unit inventory cost for holding reused power batteries | 50 | CNY