#### *2.2. Deep Q-Networks for Optimal Path Planning*

In this study, DQNs were applied to find optimal paths, represented as Q-tables, on virtual 2D grid maps through simulation. Each target uses its own virtual grid map, and many optimal paths are found so that the wheelchair can reach that target from any start position. In the DQNs, we set variables related to the operation of the wheelchair and the real environment. In particular, the wheelchair is called the Agent and moves on a virtual 2D grid map (the Environment) consisting of obstacles and free spaces. Given the start and target positions, the Agent's task is to reach the target cell. The Agent interacts with the Environment through Actions (Left, Right, Up, Down). After each Action, the Environment returns to the Agent the State *St* = (*xt, yt*), which is the wheelchair position at time *t* in (*x, y*) grid coordinates, together with the reward points (Reward, *R*) corresponding to that State. The Agent has a limited State space, *St* ∈ *S*, with a pre-defined size *m* × *n* of *S*, and the Agent is often placed in the middle of the grid cells so that it can move in all four directions.

In this algorithm, the States are of three types: obstacle *So*, free space *Sf*, and target *Sg*. At each moment *t*, the Agent is in State *St* and needs to select an Action from a fixed set of possible Actions. The decision of which Action to select depends only on the current State, not on the Action history, which is irrelevant. The Action *at* at time *t* causes the transition from the current State *St* to the new State *St*+1 at time (*t* + 1), and the immediate Reward collected after each Action, *R*(*st, at*) ∈ [−1, 1], is calculated using the following rule:

$$R(s\_t, a\_t) = \begin{cases} R\_f & \text{if } a\_t = s\_t \to s\_f \\ R\_g & \text{if } a\_t = s\_t \to s\_g \\ R\_o & \text{if } a\_t = s\_t \to s\_o \end{cases} \tag{1}$$

Each movement of the wheelchair from one cell to an adjacent free cell loses *Rf* points; this prevents it from wandering around and encourages it to reach the desired target along the shortest path. The maximum Reward of *Rg* points is given when the wheelchair reaches the target. When the wheelchair tries to enter an obstacle cell, *Ro* points are subtracted. This is a severe punishment (penalty), so the wheelchair learns to avoid it completely; the attempted move into an obstacle cell is invalid and is not performed. The same rule applies to an attempt to move outside the map boundary, with a punishment of *Rb* points. In addition, the wheelchair loses *Rp* points for any movement into a cell that has already been visited. Moreover, to avoid infinite loops during training with the DQNs, the wheelchair can keep moving normally only as long as the total Reward stays above the negative threshold (*thr* × *m* × *n*). Otherwise, the episode is considered lost, and training has to be carried out again until the total Reward is sufficient.
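For illustration, the following sketch implements the Reward rule of Equation (1) together with the boundary and revisit penalties described above. The concrete point values and the grid encoding (0 for free, 1 for obstacle, 2 for target) are assumptions for this sketch, not the paper's trained parameters.

```python
# Assumed point values within [-1, 1]; the paper takes these from its
# training parameters (Table 1), so the numbers here are placeholders.
R_F, R_G, R_O, R_B, R_P = -0.04, 1.0, -0.75, -0.8, -0.25

def reward(grid, visited, next_pos):
    """Reward for attempting to move into next_pos on an m x n grid.

    grid    -- 2D list: 0 = free cell, 1 = obstacle, 2 = target (assumed)
    visited -- set of (row, col) cells already passed
    """
    m, n = len(grid), len(grid[0])
    x, y = next_pos
    if not (0 <= x < m and 0 <= y < n):
        return R_B                 # attempt to leave the map boundary
    if grid[x][y] == 1:
        return R_O                 # attempt to enter an obstacle cell
    if grid[x][y] == 2:
        return R_G                 # target reached: maximum Reward
    if next_pos in visited:
        return R_P                 # re-entering an already visited cell
    return R_F                     # normal move to an adjacent free cell
```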

In this DQN, the main learning model is a Feedforward Neural Network (FWNN) trained with the backpropagation algorithm, in which the environmental States are the input of the network and the Rewards are fed back for each Action vector. The goal of the Agent is to move over the map following a Policy that obtains the maximum Reward from the Environment. Therefore, the Policy *π* at State *st* produces an Action *at* such that the total Reward *Q* the Agent receives is the largest, calculated by the following equations:

$$\pi(s\_t) = \arg\max\_{i=0,1,\ldots,n-1} Q(s\_t, a\_i), \tag{2}$$

$$Q(s\_t, a\_t) = R(s\_t, a\_t) + \gamma \max\_{i=0,\ldots,n-1} Q(s\_{t+1}, a\_i), \tag{3}$$

in which *Q*(*st, ai*) are the *Q*-values of the Actions *ai* (*i* = 0, 1, ..., (*n*−1)), *n* denotes the number of Actions, *st*+1 is the next State, and Equation (3) is the Bellman equation [35]. The discount coefficient *γ* ensures that Rewards received far from the target contribute less to the *Q*-value.

For approximating *Q*(*st, at*), the FWNN takes a State as input and outputs the vector *Q*, whose values correspond to the *n* Actions; that is, *Qi* approximates the value of *Q*(*st, ai*) for each Action *ai*. When the network is fully and accurately trained, it is used in the optimal path planning model to select the Policy *π* as follows:

$$\pi(s\_t) = a\_j, \tag{4}$$

$$j = \arg\max\_{i=0,\ldots,n-1} (Q\_i), \tag{5}$$

in which the index *j* is determined by the maximum *Q*-value.
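As a minimal sketch, the greedy selection of Equations (4) and (5) amounts to an argmax over the predicted *Q*-vector. The `q_network.predict` interface below is an assumed placeholder for the trained FWNN, not the paper's actual API.

```python
import numpy as np

def select_action(q_network, state):
    """Greedy Policy of Equations (4) and (5): pick the Action index j
    whose predicted Q-value is largest for the current State."""
    q_values = q_network.predict(state)  # vector of n Q-values, one per Action
    j = int(np.argmax(q_values))         # Equation (5)
    return j                             # Equation (4): pi(s_t) = a_j
```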

The purpose of the neural network model is to learn to estimate the *Q*-values of the Actions accurately, so the objective function applied here is the error *Loss* between the target and predicted *Q*-values, described by the following equation:

$$\text{Loss} = \left( R(s\_t, a\_t) + \gamma \max\_{a\_{t+1}} Q(s\_{t+1}, a\_{t+1}) - Q(s\_t, a\_t) \right)^2. \tag{6}$$

The FWNN model thus takes the current State as input and outputs the values *Q*. However, if each State is pushed into the FWNN as it occurs, the network overfits very easily because consecutive States are highly correlated. To eliminate this overfitting problem, a technique called experience replay [23] is applied. In particular, instead of updating the network once per State, each State is saved into memory and then sampled in small batches that are fed to the FWNN input for training. This diversifies the FWNN input and avoids the overfitting problem. Old samples that are no longer useful for the training process are forgotten, i.e., deleted from memory.
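A minimal sketch of such a replay memory is shown below, assuming a fixed-capacity buffer from which random mini-batches are drawn; the capacity and the tuple layout are illustrative, not the paper's exact settings.

```python
import random
from collections import deque

class ReplayMemory:
    """Experience replay [23]: store transitions and sample small random
    batches so that consecutive, highly correlated States are not fed
    to the network in order."""

    def __init__(self, capacity=10000):
        # When the buffer is full, the oldest samples are dropped,
        # which "forgets" transitions no longer useful for training.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```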

The FWNN model used in the training system has two hidden layers, each with a number of nodes equal to the number of cells in the virtual 2D grid map built for the indoor environment. The size of the input layer equals that of the hidden layers, since the States of the virtual map are used as the input. The output layer has a number of neurons equal to the number of Actions (four in this paper), since it predicts the *Q*-value of each Action. Finally, the FWNN model chooses the largest *Q*-value to perform an Action for the next State. In this research, the Parametric Rectified Linear Unit (PReLU) activation function in Equation (7), the RMSProp optimization method, and the Mean Squared Error (MSE) loss function are applied in the model for optimal path planning:

$$f(y\_i) = \max(0, y\_i) + a\_i \min(0, y\_i), \tag{7}$$

in which *yi* is any input on the *i*th layer and *ai* is the negative slope, which is a learnable parameter.
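The sketch below puts these pieces together, assuming PyTorch and the 8 × 11 map: an FWNN with two hidden layers of *m* × *n* nodes, PReLU activations as in Equation (7), the RMSProp optimizer, and one training step minimizing the MSE loss of Equation (6) on a replayed batch. The layer sizes follow the text; the learning rate, *γ*, and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

M, N, NUM_ACTIONS, GAMMA = 8, 11, 4, 0.9  # e.g., the 8 x 11 grid map (assumed gamma)

# Input and both hidden layers have m*n nodes; the output has one neuron
# per Action, as described in the text.
model = nn.Sequential(
    nn.Linear(M * N, M * N), nn.PReLU(),   # hidden layer 1, PReLU (Eq. (7))
    nn.Linear(M * N, M * N), nn.PReLU(),   # hidden layer 2
    nn.Linear(M * N, NUM_ACTIONS),         # one Q-value per Action
)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(states, actions, rewards, next_states, dones):
    """One update with the loss of Equation (6) on a replayed batch.

    states/next_states: float tensors (B, M*N); actions: long tensor (B,);
    rewards/dones: float tensors (B,).
    """
    q_pred = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = model(next_states).max(dim=1).values
        target = rewards + GAMMA * q_next * (1 - dones)  # TD target of Eq. (6)
    loss = mse(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```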

#### *2.3. Wheelchair Navigation in Real Environment*

In the optimal path planning, the simulated 2D grid map plays a very important role because it encodes the optimal paths for navigating the electric wheelchair to targets. In particular, the 2D grid map is simulated from information about the real environment, so that the wheelchair, when moving in the real environment, can use the same parameters and State values for navigation. The simulated 2D grid map is divided into many cells; each cell corresponds to an actual area of the real environment and can be either a free space or an occupancy (obstacle). We assume that the wheelchair can be driven through the free-space areas to reach the desired target.

Figure 5 describes the 2D grid maps with occupancies and cells, comprising *m* × *n* cells in the indoor environment, through which the wheelchair can move to reach targets. In particular, the real environment with objects (blue) is measured and divided into cells of the wheelchair's size to create the map shown in Figure 5a. The map with the divided cells is then converted into a 2D grid map by filling the cells related to obstacles (yellow), and the 2D grid map in Figure 5b is simulated to create the virtual 2D grid map described in Figure 5c. The cells in the virtual 2D grid map are assigned a value of 1 to represent the occupied workspace (obstacles) and 0 for the free workspace. Therefore, this virtual 2D grid map can be considered a binary map with black and white cells, whose origin is the first location (0,0) in the top-left corner. This virtual map provides all the cell locations used to find optimal paths with the DQN algorithm.
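As a small illustration, the virtual binary map can be built as follows, with 1 for occupied cells, 0 for free cells, and the origin at the top-left corner; the grid size and the obstacle list here are assumptions.

```python
import numpy as np

m, n = 8, 11                              # grid size (rows x columns), assumed
obstacles = [(2, 3), (2, 4), (5, 7)]      # assumed occupied cells

grid = np.zeros((m, n), dtype=np.uint8)   # 0 = free workspace (white cells)
for row, col in obstacles:
    grid[row, col] = 1                    # 1 = occupied workspace (black cells)
```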

**Figure 5.** The occupancy 2D grid map of the real environment. (**a**) Environmental grid map with real obstacles and cells; (**b**) occupied cells related to the real obstacles; (**c**) virtual 2D grid map with black occupancy cells.

In this model, the wheelchair is located on the map through landmarks that give the location and direction of the wheelchair. The position of the wheelchair is updated when it starts moving for the first time. In this method, only the wheelchair location is fed to the input of the MP block to determine the optimal path, which then provides the specific Actions for that State of the wheelchair.

One of the most important parts of the wheelchair control system is locating the wheelchair in the real indoor environment for navigation. Natural landmarks are automatically collected to create a database for locating the moving wheelchair. In particular, the Features from Accelerated Segment Test (FAST) method is used to extract features from images captured by the camera system. The objects in the image with the largest density of feature points are chosen as natural landmarks, and the Speeded-Up Robust Features (SURF) algorithm is then applied to identify these landmarks [39]. In this research, when the wheelchair is in the real environment described in Figure 6, its initial location is determined from three landmarks captured by a camera system installed on the wheelchair. Assume that the wheelchair moves in the flat space OXY with unknown coordinates *W*(*x*, *y*) and that the landmarks have known coordinates in the real indoor environment. Obstacles selected as landmarks have distinctive characteristics, different from other landmarks, with coordinates *A*(*xA, yA*), *B*(*xB, yB*), and *C*(*xC, yC*) [40]. The wheelchair position can be determined when the coordinates of the landmarks and the corresponding distances from the wheelchair to the landmarks are known. Based on the wheelchair location determined as above, the wheelchair position on the real grid map with square cells of size (*a* × *a*) is *SW*(⌊*x*/*a*⌋, ⌊*y*/*a*⌋).
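A sketch of this localization step is given below: it recovers *W*(*x*, *y*) from the three landmark coordinates and the measured distances by standard 2D trilateration (linearizing the three circle equations), and then maps the position to a grid cell of size *a* × *a*. This is a generic reconstruction; the paper's own localization pipeline may differ in detail.

```python
import numpy as np

def locate_wheelchair(landmarks, distances, cell_size):
    """Estimate W(x, y) from landmarks A, B, C with known coordinates and
    measured distances, then map it to a grid cell S_W(x//a, y//a)."""
    (xa, ya), (xb, yb), (xc, yc) = landmarks
    da, db, dc = distances
    # Subtracting the circle equation of A from those of B and C yields a
    # linear system in (x, y). The landmarks must not be collinear.
    A = np.array([[2 * (xb - xa), 2 * (yb - ya)],
                  [2 * (xc - xa), 2 * (yc - ya)]])
    b = np.array([da**2 - db**2 + xb**2 - xa**2 + yb**2 - ya**2,
                  da**2 - dc**2 + xc**2 - xa**2 + yc**2 - ya**2])
    x, y = np.linalg.solve(A, b)
    # Wheelchair cell on the grid map: S_W(floor(x/a), floor(y/a)).
    return (x, y), (int(x // cell_size), int(y // cell_size))
```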

The starting point *SW*(1,0) ∈ *Sf* and the target *Ti*(3,2) ∈ *T* are obtained based on the map pre-trained for this target, in which *Sf* is the set of free cells and *T* is the set of known targets. The MP then gives one optimal path as a set of Actions, namely Right, Right, Down, Down, as shown in Figure 6. However, the wheelchair cannot execute these Actions directly, because the wheelchair model in this research is not an omnidirectional control model. In Figure 7a, the two inputs of the converting block are the Action *a*, determined from the MP output, and the initial direction *d* of the wheelchair, which takes one of the four directions (Up, Down, Left, Right) described in Figure 7b. The output of the converting block, *aw*, is an Action that is suitable for the wheelchair's orientation/direction in the real environment.

**Figure 6.** Coordinates of the wheelchair, landmarks, and target in simulated 2D grid map.

**Figure 7.** The representation of converting actual control commands from the simulation. (**a**) Converter with the simulated inputs and the actual outputs; (**b**) representation of four control directions.

The training process for finding the optimal path produces a series of Actions for different States, and these Actions form many optimal paths depending on the initial position of the wheelchair. After each Action *a*, the wheelchair direction *d* changes into a new direction *d*′. For the movement of the wheelchair, we propose a novel algorithm based on the WAC described in Figure 7. In particular, the wheelchair Action *aw* and the new direction *d*′ = *a* during movement in the real environment need to be determined, and this algorithm is expressed as follows:

$$a\_w = \begin{cases} \text{Forward} & \text{if } a = \text{Up} \\ \text{Backward} & \text{if } a = \text{Down} \\ \text{Left-Forward} & \text{if } a = \text{Left} \\ \text{Right-Forward} & \text{if } a = \text{Right} \end{cases} \quad \text{if } d = \text{Up}, \tag{8a}$$

$$a\_w = \begin{cases} \text{Forward} & \text{if } a = \text{Down} \\ \text{Backward} & \text{if } a = \text{Up} \\ \text{Left-Forward} & \text{if } a = \text{Right} \\ \text{Right-Forward} & \text{if } a = \text{Left} \end{cases} \quad \text{if } d = \text{Down}, \tag{8b}$$

$$a\_w = \begin{cases} \text{Forward} & \text{if } a = \text{Left} \\ \text{Backward} & \text{if } a = \text{Right} \\ \text{Left-Forward} & \text{if } a = \text{Down} \\ \text{Right-Forward} & \text{if } a = \text{Up} \end{cases} \quad \text{if } d = \text{Left}, \tag{8c}$$

$$a\_w = \begin{cases} \text{Forward} & \text{if } a = \text{Right} \\ \text{Backward} & \text{if } a = \text{Left} \\ \text{Left-Forward} & \text{if } a = \text{Up} \\ \text{Right-Forward} & \text{if } a = \text{Down} \end{cases} \quad \text{if } d = \text{Right}, \tag{8d}$$

in which *a* and *d* are determined from the Action and direction in the MP. In Equations (8a)–(8d), the wheelchair Actions *aw* are defined as follows: Forward and Backward move the wheelchair without changing its heading, whereas Left-Forward and Right-Forward first rotate the wheelchair 90° to the left or right, respectively, and then move it forward.
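Since the four cases of Equations (8a)–(8d) differ only by a 90° rotation, they can be collapsed into one lookup on the relative turn between *a* and *d*, as in the sketch below; the function name and the direction encoding are our own.

```python
# Wheelchair Action Converter (WAC) of Equations (8a)-(8d): the grid Action
# a and the current direction d determine the executable command a_w, and,
# per the text, the new direction after the move is d' = a.

DIRECTIONS = ["Up", "Right", "Down", "Left"]   # clockwise order

def convert_action(a, d):
    """Return (a_w, new_direction) for grid Action a and direction d."""
    turn = (DIRECTIONS.index(a) - DIRECTIONS.index(d)) % 4
    a_w = {0: "Forward",
           1: "Right-Forward",   # rotate 90 degrees right, then move forward
           2: "Backward",
           3: "Left-Forward"}[turn]  # rotate 90 degrees left, then forward
    return a_w, a

# Example from Figure 6: facing Up, the MP Actions Right, Right, Down, Down
# become Right-Forward, Forward, Right-Forward, Forward.
```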


## **3. Results and Discussion**

#### *3.1. Simulation of Path Training for the Wheelchair Based on 2D Grid Map*

We constructed two grid maps depicting the indoor environment as shown in Figure 8, where the white cells are free spaces, the black cells are obstacles, and the red cells are targets. The proposed structure was trained and tested on a Windows PC with an Intel(R) Core(TM) i5-6300U CPU at 2.4 GHz and 16 GB RAM. In each training run, the starting position is randomly selected on the map and guaranteed not to overlap with any obstacle cell. Table 1 describes the parameters trained for the cases shown in Figure 8.

**Figure 8.** Training environment simulated using the proposed model. (**a**) An 8 × 11 grid map; (**b**) 11 × 33 grid map.



To evaluate the effectiveness of the DQN method, we performed experiments with different steps and different environments; the stable results of the DQN method are shown in Figures 9 and 10 for each environment. In particular, we carried out experiments with the proposed DQN model using the two activations PReLU and ReLU to compare their performance, where the horizontal axis is the number of episodes and the vertical axis is the Win rate. The Win rate is calculated as the number of Wins divided by the total number of positions selected to start a game in an episode. From Figure 9, we can see that the Win rate can increase, decrease, or stay the same after each episode.

**Figure 9.** The comparison of Win rates when training the DQN model with two activation types in the case of the 8 × 11 grid map. (**a**) The DQN model with PReLU activation; (**b**) the DQN model with ReLU activation.

According to the results in Figure 9 for the small environment, the two models DQNs-PReLU and DQNs-ReLU have the same Win rate growth path and both reach the maximum Win rate threshold of 1 after about 600 episodes. Figure 10 shows the Win rate growth in the large environment for the two selected models. For the DQNs-PReLU model in Figure 10a, the Win rate starts increasing sharply after 7000 episodes and reaches the maximum threshold at episode 15,000. The Win rate then saturates, which shows that the model meets the training requirements, and training ends. In contrast, according to the results shown in Figure 10b for the DQNs-ReLU model, the Win rate starts increasing sharply after 25,000 episodes and reaches the maximum threshold at episode 240,000, after which it saturates and training ends. Thus, in a large environment, the DQNs-PReLU model reaches the maximum score much more quickly than DQNs-ReLU.

In addition, the training time and the number of episodes of the DQN model with the two types of activation are compared in Table 2. In the small 8 × 11 environment, the difference in training time is not large: 36.3 s compared to 42.3 s for the ReLU and PReLU activations, respectively. The numbers of episodes used for training the DQNs-ReLU and DQNs-PReLU models in this environment are also similar, at 601 and 607, respectively. However, in the larger 11 × 33 environment, there is a big difference in training time and number of episodes between the two models. In particular, the training time of the DQNs-ReLU model is nearly 4 times that of the DQNs-PReLU model, and the average number of episodes per training run using the DQNs-ReLU model is 15 times that of the DQNs-PReLU model. This means that the DQNs-PReLU model performs better than DQNs-ReLU in this environment.

**Figure 10.** The comparison of Win rates when training the DQN model with two activation types in the case of the 11 × 33 grid map. (**a**) The DQN model with PReLU activation; (**b**) the DQN model with ReLU activation.


**Table 2.** The Relative Performance of Proposed DQN Models.

Table 3 describes the comparison of episodes and training time between the DQN models with the two activations and previous models in training the two environments (small and large). In all experiments with randomly trained models, we trained each case 10 times and took the average training time and the average number of episodes. The Traditional Q-Learning model uses a table to record the value of each (State, Action) pair, in which the Action with the highest value for a State indicates the most desirable Action; these values are constantly refined during training, which is a quick way to learn a Policy. The second model, the SARSA model, uses a setup similar to the previous model but takes fewer risks during learning. Depending on whether the environment is small or large, the training time and the number of episodes differ.


**Table 3.** The Relative Performance of Previous Models.

In particular, for the small environment, the training time and the number of episodes are smaller than those for the large environment, as shown in Tables 2 and 3. Furthermore, in Table 3, the models need few episodes but a lot of time, because Traditional Q-Learning works by finding the maximum reward for each step, and the larger the number of States, the larger the Q-table, so the calculation takes a lot of time. Meanwhile, in Table 2, the DQN needs many episodes but less computation time, because the DQN makes some random and risky decisions to obtain a high reward quickly and accepts losing a certain number of episodes.

According to the statistical results in Tables 2 and 3, although the number of episodes in the training process is much larger than that of the Q-table-based models in Table 3, the DQNs-PReLU model in Table 2 takes a shorter training time in both training cases, i.e., for both the small and large environments. In particular, for the small environment, the DQNs-PReLU model needs about 10 times more episodes than the Traditional Q-Learning and SARSA models, but its training time is almost 5 times less than theirs. For the large environment, DQNs-PReLU needs a large number of episodes, about 16,015, nearly 60 times more than Traditional Q-Learning and nearly 70 times more than the SARSA model. However, the training time is significantly reduced, about 35.24 min compared to 1.45 h and 57.23 min, respectively, for the two models in Table 3. As an extra feature, the trained model is saved to disk after learning so that it can be loaded later for the next run. This is how a neural network needs to be used in a real-world situation, where training is separated from actual use.
