3.2.3. Reward Function

The reward depended on whether the driver was matched. When a driver was matched with a passenger, the reward was the base fare ($P_{fare}$) for the trip between origin and destination multiplied by the surcharge coefficient ($S_a$). When a driver was not matched, a negative reward was applied based on the waiting time ($W_t$). The waiting time was set to 8 min, because the average waiting time during late-night periods when a surcharge applies is 8.1 min according to a study of Seoul Metropolitan City [38]. The time value ($V_t$) was calculated from the taxi fare for 1 min in the OD matrix. The surcharge therefore raised profitability for matched drivers while also scaling the penalty for unmatched drivers: the negative reward was the time value of the 8 min wait multiplied by the surcharge coefficient. The reward is given in Equation (6). Learning was conducted separately for the case in which the negative waiting-time reward was applied (alt1) and the case in which it was not (alt2), and the two were compared to identify the appropriate reward function.

$$r = \begin{cases} P_{fare} \times S_a & \text{if matched} \\ -W_t \times V_t \times S_a & \text{if unmatched (alt1)} \\ 0 & \text{if unmatched (alt2)} \end{cases} \tag{6}$$
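
As an illustration, the piecewise reward in Equation (6) can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the study: the function name `reward`, its parameter names, and the `variant` switch are assumptions introduced here, and the numeric values in the usage note are hypothetical. Only the 8 min waiting time comes from the text.

```python
from typing import Literal

# W_t: waiting time in minutes, fixed at 8 min as stated in the text.
WAITING_TIME_MIN = 8.0


def reward(
    matched: bool,
    base_fare: float,           # P_fare: base fare for the origin-destination trip
    surcharge: float,           # S_a: surcharge coefficient
    time_value_per_min: float,  # V_t: taxi fare for 1 min from the OD matrix
    variant: Literal["alt1", "alt2"] = "alt1",  # hypothetical switch between the two cases
) -> float:
    """Return the reward of Equation (6) for one driver at one step."""
    if matched:
        # Matched: base fare scaled by the surcharge coefficient.
        return base_fare * surcharge
    if variant == "alt1":
        # Unmatched, alt1: penalty equal to the time value of the wait, scaled by the surcharge.
        return -WAITING_TIME_MIN * time_value_per_min * surcharge
    if variant == "alt2":
        # Unmatched, alt2: no penalty for waiting.
        return 0.0
    raise ValueError(f"unknown variant: {variant!r}")
```

With hypothetical inputs of a base fare of 10000, a surcharge coefficient of 1.5, and a time value of 200 per minute, a matched driver receives 15000, an unmatched driver under alt1 receives -2400, and an unmatched driver under alt2 receives 0.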
