3.2.4. Q-Learning Algorithm

A total of 20,000 Q-learning iterations were conducted, taking the centrality index value of each zone in the late-night period into account (Figure 3). In each episode, the goal was to identify the optimal surge price for each time slot over 6 h of operation for each OD matrix. The Q-table stored the state (current time slot and price) and the corresponding action values for each time slot, thereby incorporating the concept of time. The travel time and price required to calculate the reward value (operating profit) for each state were drawn from a separate OD table. The centrality indices for each time slot and zone served as the weights applied to driver preference. The reward function was defined separately depending on whether the time value was applied.

**Figure 3.** Q-learning algorithm pseudo code.
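The setup above can be sketched as tabular Q-learning over (time slot, price level) states. This is a minimal illustrative sketch, not the paper's implementation: the number of price levels, the learning-rate and discount values, and the `reward` stub (which in the paper is computed from the OD travel-time/price table and weighted by zone centrality) are all assumptions introduced here for illustration.

```python
import random

# Assumed problem sizes: 6 h of operation as hourly time slots and a
# small discrete set of surge-price levels (values not from the paper).
N_SLOTS = 6
N_PRICES = 5
ACTIONS = range(N_PRICES)        # action = surge level chosen for the slot

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed hyperparameters
N_EPISODES = 20_000                     # matches the 20,000 iterations in the text

# Q-table: (time slot, current price level) -> value of each action.
Q = {(t, p): [0.0] * N_PRICES for t in range(N_SLOTS) for p in range(N_PRICES)}

def reward(t, price_level):
    """Placeholder operating-profit reward. In the paper this comes from a
    separate OD table (travel time, price) weighted by the centrality index
    of each zone and time slot; here it is an illustrative stub."""
    centrality_weight = 1.0 + 0.1 * t          # assumed time-slot weighting
    return centrality_weight * (price_level + 1)  # assumed profit proxy

def run_episode():
    p = 0  # start each episode at the base price level
    for t in range(N_SLOTS):
        s = (t, p)
        # Epsilon-greedy action selection.
        if random.random() < EPSILON:
            a = random.choice(list(ACTIONS))
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        r = reward(t, a)
        # Standard Q-learning update; terminal slot has no future value.
        future = max(Q[(t + 1, a)]) if t + 1 < N_SLOTS else 0.0
        Q[s][a] += ALPHA * (r + GAMMA * future - Q[s][a])
        p = a

for _ in range(N_EPISODES):
    run_episode()

# Greedy surge level per time slot after training (the "optimal surge"
# for one episode's 6 h horizon).
policy, p = [], 0
for t in range(N_SLOTS):
    a = max(ACTIONS, key=lambda x: Q[(t, p)][x])
    policy.append(a)
    p = a
```

After training, `policy` holds one surge level per time slot, mirroring the per-slot optimal surge the episode is meant to identify; swapping the stub `reward` for an OD-table lookup would reproduce the paper's profit-based reward.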
