### *3.1. Analysis Methodology*

### 3.1.1. Degree Centrality Analysis Methodology

Degree centrality is an indicator of how many nodes are connected to a given node; in social network theory, it is defined as the number of nodes linked directly to that node. This study measured the out-degree and in-degree centrality of each region (node) according to its level of connectivity and used them as indicators in the reinforcement learning model. Centrality can be calculated as follows:

$$C'_D(N_i) = \frac{\sum_{j=1}^{g} x_{ij}}{g - 1}, \quad i \neq j, \tag{1}$$

where $C'_D(N_i)$ is the standardized degree centrality of node *i*, $\sum_{j=1}^{g} x_{ij}$ is the degree centrality of node *i* (the number of its direct links $x_{ij}$ to other nodes *j*), and *g* is the number of nodes. If the network has no direction, the equation simply gives the degree of each node. If directions are present, the equation distinguishes out-degree centrality ($C_{outD}$) from in-degree centrality ($C_{inD}$). Out-degree centrality is defined as the level of connections going out from a certain node to other nodes, and in-degree centrality is defined as the level of connections coming in from other nodes to a certain node. In this study, in-degree centrality referred to the number of vehicles coming into a certain zone (Gu), indicating a region frequently selected as a destination by passengers (usually residential areas during late-night periods). Conversely, out-degree centrality referred to the number of vehicles traveling out of a certain zone, indicating a region preferred as a destination by drivers (central and subcentral regions). Therefore, this study identified indicators by region considering both out-degree and in-degree centrality.
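To make the regional indicator concrete, the following is a minimal sketch (in Python with NumPy) of how the standardized out-degree and in-degree centralities of Equation (1) could be computed from an origin-destination trip matrix. The 4-zone matrix and its trip counts are hypothetical illustrations, not the study's data.

```python
import numpy as np

# Hypothetical origin-destination (OD) matrix for 4 zones (Gu):
# od[i, j] = number of late-night taxi trips from zone i to zone j
# (illustrative numbers only, not the study's data).
od = np.array([
    [0, 5, 2, 1],
    [3, 0, 4, 0],
    [1, 2, 0, 6],
    [0, 1, 3, 0],
])

g = od.shape[0]                    # g: number of zones (nodes)
adj = (od > 0).astype(int)         # binary adjacency: a link exists if any trip occurred

# Standardized centralities per Equation (1): degree divided by g - 1.
c_out = adj.sum(axis=1) / (g - 1)  # C_outD: links out of each zone (driver-preferred)
c_in = adj.sum(axis=0) / (g - 1)   # C_inD: links into each zone (passenger-selected)

for i in range(g):
    print(f"zone {i}: C_outD = {c_out[i]:.2f}, C_inD = {c_in[i]:.2f}")
```

Because the study counts vehicles rather than links, the same computation could also be run on the raw OD counts (`od.sum(axis=1)` and `od.sum(axis=0)`) to obtain weighted variants of the two indicators.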

### 3.1.2. Reinforcement Learning Methodology

Reinforcement learning is a learning method through which an agent chooses an action to take in an environment so as to maximize reward. An action affects not only the immediate reward received by the agent but also long-term rewards. The main characteristics of reinforcement learning are trial-and-error search and delayed reward (Figure 1). This study used the Q-learning algorithm for analysis.

**Figure 1.** Conceptual diagram of reinforcement learning.

In the Q-learning algorithm [35,36], Q initially has an arbitrary fixed value. When the agent selects an action (*a*) at learning step (*t*), an immediate reward (*r*) is observed before entering a new state (*s*), and the Q value is updated. The key characteristic of this algorithm is the value iteration update, which uses a weighted average of the old value and the new information. The equation of Q-learning can be expressed as follows:

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\,Q(s_t, a_t) + \alpha\Big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\Big), \tag{2}$$

where α is the learning rate, between 0 and 1, which determines how strongly the current Q value is updated with the immediate reward and/or the expected future value. When α is 0, the current Q value is kept without any update as learning proceeds; when α is 1, the previous Q value is ignored and completely replaced by the new information. γ is the discount factor, also between 0 and 1, which weighs immediate rewards against future rewards. When γ is 0, the agent acts myopically, updating only with the immediate reward; when γ is 1, the agent leans toward future rewards at the expense of the immediate reward. In this study, on the basis of existing studies, we experimentally set the learning rate to 0.1 and the discount factor to 0.9.
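As an illustration of how the update in Equation (2) drives learning with these settings, the following is a minimal, self-contained Q-learning sketch in Python. The state/action sizes, the random-reward environment, and the ε-greedy exploration scheme are assumptions made for demonstration, not the paper's simulation environment (which is described in Section 3.2).

```python
import numpy as np

# Minimal sketch of the Q-learning update in Equation (2); the environment
# below is a hypothetical placeholder, not the paper's simulation setup.
N_STATES, N_ACTIONS = 25, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # alpha and gamma as set in this study

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))     # Q starts at an arbitrary fixed value (here 0)

def step(state, action):
    """Hypothetical environment: returns (next_state, reward)."""
    return rng.integers(N_STATES), rng.normal()

state = rng.integers(N_STATES)
for t in range(10_000):
    # Trial-and-error search: explore with probability EPSILON, else exploit.
    if rng.random() < EPSILON:
        action = int(rng.integers(N_ACTIONS))
    else:
        action = int(np.argmax(Q[state]))

    next_state, reward = step(state, action)

    # Equation (2): weighted average of the old value and the new information.
    Q[state, action] = (1 - ALPHA) * Q[state, action] + ALPHA * (
        reward + GAMMA * np.max(Q[next_state])
    )
    state = next_state
```

With `ALPHA = 0.1`, each update blends 10% of the new target into the stored value; with `GAMMA = 0.9`, a reward *k* steps in the future is weighted by $0.9^k$, so the agent still values delayed rewards while discounting them gradually.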

### *3.2. Environment Setting for Reinforcement Learning Simulation*
