1. Introduction
Reinforcement learning is a branch of machine learning, the purpose of which is to study how an agent learns through trial and error. Reinforcement learning uses reward or punishment mechanisms to make agents more inclined toward certain behaviors in the future [1]. Compared with traditional machine learning algorithms, reinforcement learning has no supervisor, only reward signals.
The human brain understands and recognizes things visually and builds and improves its own cognitive system. When traditional reinforcement learning uses images as the training state, reinforcement learning algorithms, such as Q-Learning, SARSA, etc., cannot find the optimal value function because they are limited by the enormous size of the state space. Convolutional neural networks (CNNs) have achieved considerable success in the field of computer vision, so their combination with reinforcement learning has been proposed. The first deep reinforcement learning algorithm, Deep Q-Network (DQN), was proposed by DeepMind [2]. Since its combination with deep learning, reinforcement learning has made considerable breakthroughs. Like humans, deep reinforcement learning agents receive information from high-dimensional input, such as vision, to train their own neural networks and obtain higher scores in each learning environment under the specific reward function.
Deep reinforcement learning is currently one of the most popular areas of artificial intelligence (AI) and has achieved amazing results in many games. It does not require additional information, such as data tags, and is worthy of further exploration for future applications. If a neural network uses a single frame as the input, it is difficult for the neural network to analyze the dynamic information of the image. In deep reinforcement learning, neural networks usually use four consecutive frames as the input to learn the dynamic information contained in the image. Such a design allows the CNN to learn more information.
However, the four consecutive frames retain dynamic information at the cost of a certain amount of redundancy. If this portion of the redundant information could be reduced, the training speed and memory usage of the reinforcement learning agent could be improved. An influence map is an AI decision-making tool that is regularly used in games to describe the current state. In addition, the dynamic influence map can express the motion information of the agent. If reinforcement learning could learn the dynamic information from the influence map, it would be of great help to the learning process of the agent.
The main research goal of this study is to explore the combination of dynamic influence maps and deep reinforcement learning. If a dynamic influence map can replace four consecutive frames as the neural network input, then all deep reinforcement learning algorithms in the same learning environment will be improved to a certain extent. In deep reinforcement learning, there are many scenarios with sparse rewards, i.e., the agent can only receive rewards after the conclusion of the game, such as in Go and chess. In this situation, the agent does not receive any external reward most of the time, which may result in the inability to optimize the policy. In this paper, we propose that in tasks with sparse rewards, an influence map can be used to optimize the agent policy and improve the overall performance of deep reinforcement learning.
To verify the possibility of combining a dynamic influence map with reinforcement learning, Ms. Pac-Man, a popular test environment in the field of AI [3,4,5,6,7,8], is used as the learning and evaluation environment. In this kind of environment, the complete capabilities of the dynamic influence map, which represents the dynamic information of the current state of the game, can be displayed. It is these capabilities that facilitate its combination with reinforcement learning.
The contributions of this article are as follows.
1. One frame of an original image superimposed on the influence map is used to express dynamic information. Our method inputs less data into the neural network than the conventional method of using four consecutive frames in deep reinforcement learning, and it achieves better performance.
2. In the sparse reward task, the use of an influence map is proposed to generate an intrinsic reward when there is no external reward. Sparse reward is an issue that must be solved in refined intelligent decision making.
3. The experiments conducted in this study prove that the combination of a dynamic influence map and deep reinforcement learning is effective. The experimental results show that the proposed method exhibits a greater improvement than traditional deep reinforcement learning in terms of agent performance, training speed, and memory usage. The influence map method proposed in this paper can be used in any algorithm. For some training tasks, if the algorithm cannot be further improved, the influence map can be added for training to improve the upper performance limit.
3. Dynamic Influence Map
3.1. Decision Making with an Influence Map
An influence map is a decision-making tool used in AI games. In the game, props such as food, wealth, and tools have a positive impact on the current position, and the agent tends to move to places with greater positive influence. Enemies and negative props have a negative impact on the location, so the agent needs to avoid these locations. Various objects in the game exert varying influences on their surrounding positions, and the influence map records, for each position on the map, the sum of the influences exerted on that position.
When an object spreads its influence outward, the influence value at each location on the map depends on the spread and attenuation modes, so the calculated influence map changes with these choices. For example, the Euclidean distance, Manhattan distance, etc., can be used to measure the distance over which the influence value spreads. An adjacent spread mode, such as the flood-fill algorithm, can also be used for spreading; in this case, the spread of the influence value is no longer measured by an explicit distance. In addition, the influence cannot spread infinitely: the influence value gradually attenuates as spreading occurs, and when a threshold is reached, the influence value is set to 0. The attenuation process can be calculated using linear or exponential attenuation.
A traditional influence map usually has the same influence attenuation speed in all directions around the source, but this convention cannot reflect the mobility of moving game units. For some moving game objects, their influence values differ depending on the direction. Specifically, moving units exert greater influence on other objects in the direction of their movement, and locations in the opposite direction of their movement receive less influence. This is the core concept of dynamic influence maps.
3.2. Spread and Attenuation Modes
3.2.1. Spread Modes
When the starting point spreads influence to different locations on the map, the spread mode refers to the method used to measure the distance from the starting point to the destination point. The most commonly used methods include the Euclidean distance, the Manhattan distance, the path length between two points, and adjacent spread. Among these methods, the use of the two-point path length as the spread mode requires all paths to be calculated in advance, because real-time pathfinding reduces the spread efficiency.
The coordinate system is established with the upper-left corner of the game scene as the coordinate origin.
- (1) Euclidean Distance
When the Euclidean distance is used as the spread mode, the distance between the spread source and each point in the map is expressed as follows:
$d_E(a,b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$  (1)

where $d_E(a,b)$ represents the Euclidean distance between point a and point b. The coordinates of point a are ($x_a$, $y_a$), and the coordinates of point b are ($x_b$, $y_b$).
This spread mode calculates the distance quickly because it only needs two coordinates to obtain a relatively acceptable result. However, if there are obstacles or other information in the path, this spread mode cannot reflect the true distance.
- (2) Manhattan Distance
When the Manhattan distance is used as the spread mode, the distance between the spread source and the target point is measured as follows:
$d_M(a,b) = |x_a - x_b| + |y_a - y_b|$  (2)

where $d_M(a,b)$ represents the Manhattan distance between point a and point b. The coordinates of point a are ($x_a$, $y_a$), and the coordinates of point b are ($x_b$, $y_b$).
- (3) Path Distance
In addition, the path between two points can be used to represent the distance between these points. The measurement method is as follows:
$d_P(a,b) = \mathrm{PathLength}(a,b)$  (3)

where $\mathrm{PathLength}(a,b)$ represents the path length between a and b obtained using the pathfinding algorithm. Compared with the Euclidean and Manhattan distance spread modes, this method can take into account the existence of obstacles. Because the pathfinding algorithm usually finds the shortest path, the measurement of distance is often quite accurate. However, in an actual game, especially real-time strategy games such as war games, the response time is an extremely important factor, and the calculation speed of the pathfinding algorithm seriously affects the real-time response ability.
- (4) Adjacent Spread
The adjacent spread method only needs to be executed once to spread the initial influence from the starting point to all points on the map. Using this method, a considerable amount of time can be saved as compared to the use of the pathfinding algorithm. Compared with the use of the Euclidean or Manhattan distance as the spread mode, the adjacent spread mode considers obstacles and other factors. Moreover, when a game unit moves, the influence map can be dynamically calculated and changed accordingly, which is more adaptable to dynamic games.
Adjacent propagation does not calculate the path length between any two points directly. Instead, it propagates the influence of the adjacent grids of the game object. If the influence values of some of the adjacent grids are changed, then an AI agent propagates the new influence values from these grids with the same method. The influence value can be propagated in four or eight directions in square grids (as shown in Figure 1a,b, respectively) or in six directions in hexagonal grids (as shown in Figure 1c). This process is similar to the flood-fill algorithm.
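As a rough illustration of the adjacent spread mode, the following Python sketch propagates an initial influence over a 4-connected square grid in the manner of a flood fill. The grid representation, the `attenuate` callback (an attenuation function of the spread distance, as described in the attenuation modes below), and the threshold value are assumptions of this sketch rather than details taken from the implementation used in this study.

```python
from collections import deque

def adjacent_spread(grid_shape, walls, source, initial_influence,
                    attenuate, threshold=1e-3):
    """Flood-fill style adjacent spread over a 4-connected square grid.

    `attenuate(initial_influence, distance)` returns the influence exerted
    `distance` spread steps away from the source; a cell stops spreading once
    its value falls below `threshold`.  `walls` is a set of blocked
    (row, col) cells."""
    rows, cols = grid_shape
    influence = [[0.0] * cols for _ in range(rows)]
    influence[source[0]][source[1]] = initial_influence
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # four directions
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in walls and (nr, nc) not in visited):
                value = attenuate(initial_influence, dist + 1)
                if abs(value) >= threshold:   # stop spreading below the threshold
                    influence[nr][nc] = value
                    visited.add((nr, nc))
                    queue.append(((nr, nc), dist + 1))
    return influence
```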
3.2.2. Attenuation Modes
Attenuation modes are generally divided into exponential and linear attenuation modes. As the distance between the starting point and the target point increases, the influence gradually decreases. In addition, when the influence is attenuated to a certain degree, it is usually set to 0 to reduce the amount of calculation.
Consider point a as the starting point of spreading and point b as a target point on the map. $I_a$ represents the initial influence of point a, and $I_{ab}$ represents the influence exerted from point a on point b.
Exponential attenuation: The relationship between the influence and distance is given by Equation (4).

$I_{ab} = I_a \cdot \gamma^{\,d(a,b)}$ for $d(a,b) \le d_{max}$, and $I_{ab} = 0$ otherwise.  (4)

where γ is the exponential attenuation parameter, $d(a,b)$ represents the distance from point a to point b, and $d_{max}$ represents the maximum spread distance. After exceeding this distance, the influence value drops to 0. The value of the exponential attenuation parameter (γ) is usually between 0 and 1. The lower the value, the faster the attenuation of the initial influence.
Linear attenuation: The relationship between influence and distance is given by Equation (5).

$I_{ab} = I_a \cdot \left(1 - \beta \, d(a,b)\right)$ for $d(a,b) \le d_{max}$, and $I_{ab} = 0$ otherwise.  (5)

where β is the linear attenuation parameter.
When these two attenuation modes are used with adjacent spread, spreading stops once the influence value at a point falls below a preset threshold, rather than stopping after the maximum spread distance is reached.
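For concreteness, the two attenuation modes can be written as small functions that plug into the `adjacent_spread` sketch above. The default parameter values, and the exact form chosen for the linear case, are illustrative assumptions rather than the settings used in this study.

```python
def exponential_attenuation(initial_influence, distance, gamma=0.8, d_max=10):
    """Equation (4): the influence shrinks by a factor of gamma per unit of
    distance and is cut to zero beyond the maximum spread distance."""
    if distance > d_max:
        return 0.0
    return initial_influence * (gamma ** distance)

def linear_attenuation(initial_influence, distance, beta=0.1, d_max=10):
    """Equation (5): the influence decreases linearly with distance
    (one plausible reading of the formula) and is cut to zero beyond d_max."""
    if distance > d_max:
        return 0.0
    factor = max(0.0, 1.0 - beta * distance)
    return initial_influence * factor
```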
3.3. Dynamic Influence Map
In this study, the adjacent spread mode and linear attenuation are used to calculate the spread of influence. Compared with the Manhattan and Euclidean distances, adjacent spread can take obstacles into account, making the distance measurement more accurate. Compared with the distance obtained by the pathfinding algorithm, the calculation speed of adjacent spread is faster. Both exponential attenuation and linear attenuation can be used; considering the computational complexity, the exponential operation consumes more time, so linear attenuation is selected.
3.3.1. Dynamic Influence Map Calculation
Because the object is in motion, its influence decays at different rates in different directions. When the target point lies ahead of the object in its direction of movement, the influence of the object decays slowly. When the target point lies behind the object, the influence declines faster. One way to adjust the rate at which an object's influence decays is to modify the distance between the object and the target point that is used in the attenuation formula.
As shown in Figure 2, the spread source point is a, and the target point is b. $\vec{v}_a$ is the motion direction of point a, and $\vec{u}_{ab}$ is the direction from a to b. The specific method for adjusting the distance is given by Equation (6) [41].
$d'(a,b) = d(a,b) \cdot \left(1 - \alpha \, \dfrac{\vec{u}_{ab} \cdot \vec{v}_a}{\lVert \vec{u}_{ab} \rVert \, \lVert \vec{v}_a \rVert}\right)$  (6)

where $d(a,b)$ indicates the actual distance from point a to point b, $d'(a,b)$ represents the adjusted distance, $\vec{u}_{ab}$ represents the direction vector from point a to point b, and $\vec{v}_a$ represents the current movement direction of spread source point a. Moreover, α represents the parameter of distance adjustment, which is between 0 and 1. The larger the value of α, the greater the spreading influence of source point a in the direction of its movement. In this study, $d'(a,b)$ is used to calculate the influence exerted on b according to Equation (5).
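A small sketch of the distance adjustment in Equation (6): the dot product between the direction to the target and the unit's movement direction gives the cosine term, so the distance shrinks ahead of the unit and stretches behind it. Positions and the velocity are assumed to be 2D tuples, and the default α is arbitrary.

```python
import math

def adjusted_distance(a, b, velocity, alpha=0.5):
    """Equation (6): shrink the distance in front of a moving unit and stretch
    it behind it, so influence decays more slowly in the movement direction.
    `a` and `b` are (x, y) positions and `velocity` is the unit's movement
    vector."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    dist = math.hypot(dx, dy)
    speed = math.hypot(velocity[0], velocity[1])
    if dist == 0.0 or speed == 0.0:
        return dist                 # no adjustment for a static unit or a == b
    cos_theta = (dx * velocity[0] + dy * velocity[1]) / (dist * speed)
    return dist * (1.0 - alpha * cos_theta)
```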
Let $I(x)$ indicate the influence value at point x in the influence map. The calculation formula of the influence map can then be expressed as Equation (7).

$I(x) = \sum_{i} I_{ix}$  (7)

where $I_{ix}$ represents the magnitude of influence exerted by game unit i on point x.
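Putting the pieces together, Equation (7) simply sums the per-unit influence maps. The sketch below reuses the `adjacent_spread` helper from Section 3.2.1 and assumes each unit is given as a (position, initial influence) pair, with the directional adjustment of Equation (6) folded into the attenuation function; these conventions are illustrative.

```python
def global_influence_map(grid_shape, walls, units, attenuate):
    """Equation (7): the value at every point is the sum of the influence
    spread by each game unit.  `units` is a list of
    (position, initial_influence) pairs (e.g., negative values for ghosts and
    positive values for pellets); each per-unit spread reuses the
    adjacent_spread sketch above."""
    rows, cols = grid_shape
    total = [[0.0] * cols for _ in range(rows)]
    for position, initial_influence in units:
        partial = adjacent_spread(grid_shape, walls, position,
                                  initial_influence, attenuate)
        for r in range(rows):
            for c in range(cols):
                total[r][c] += partial[r][c]
    return total
```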
3.3.2. Dynamic Influence Map in Ms. Pac-Man
By setting the influence values of different units and using different modes for spreading and attenuation, the influence map can express the current state information, trend, and direction of movement using numerical values. In this study, Ms. Pac-Man is used as the learning environment, and an initial influence value is set for each game unit, such as ghosts, pellets, and Ms. Pac-Man. At each moment, the locations of these units are taken as the starting points, the directions of movement are considered, the influence values are spread to the whole map, and these influence values are added to form the global dynamic influence map. Such a dynamic influence map contains motion information and can be used to replace the four consecutive frames as the state space in reinforcement learning.
4. Using the Influence Map as the State Space in Reinforcement Learning
4.1. Deep Reinforcement Learning Algorithm Ape-x
Ape-x is a distributed deep reinforcement learning algorithm that achieves state-of-the-art performance in many reinforcement learning tasks. Ape-x is characterized by a distributed architecture for large-scale deep reinforcement learning, which separates data collection from training. Multiple actors interact with multiple copies of the environment simultaneously using their local neural networks. Each actor's network selects actions, and the actor collects the interaction data and stores them in the global experience replay. The learner is responsible for sampling data from the global experience replay and training the core neural network. The actors' neural networks share the same parameters, and the learner sends the neural network parameters to the actors at intervals. Different actors have different epsilons (for epsilon-greedy exploration), which remain unchanged during the whole interaction. This design allows the actors to generate sufficiently varied learning data when interacting with the environment.
In this study, DQN is used as the base algorithm within the Ape-x distributed framework. DQN uses a neural network to fit the state-action value function of Q-Learning. The DQN neural network takes consecutive frames as input and outputs the Q-value for each possible action to evaluate the quality of the actions. Because the action space is discrete, it is difficult to use the objective function to directly optimize the policy. Instead, DQN estimates the state-action value of each action and then selects the appropriate action, so it is well suited to the discrete action spaces of the Atari environment. The architecture of the Ape-x algorithm is presented in Figure 3.
As shown in Figure 3, the Ape-x architecture has multiple actors, each of which interacts with its own independent environment instance. Each actor first calculates the initial priorities in its local experience replay pool and then sends the local experience to the global experience replay pool. There is only one learner, whose training data are sampled from the global experience replay pool; the learner trains the core neural network and updates the data priorities.
The Ape-x algorithm can support many actors to collect data simultaneously, and the learner is only used to train the core neural network. In this way, the training speed and data collection speed can be considerably improved.
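The division of labour in Ape-x can be summarised with two loops, sketched below assuming a Gym-style environment. The replay and network methods (`add`, `sample`, `update_priorities`, `best_action`, `td_error`, `train_step`) are hypothetical placeholders for illustration, not an actual library API.

```python
import random

def actor_loop(env, local_net, global_replay, epsilon, pull_latest_params):
    """One Ape-x actor: interacts with its own environment copy using a fixed
    epsilon, computes an initial priority locally, and ships the transition to
    the global replay."""
    state = env.reset()
    while True:
        if random.random() < epsilon:
            action = env.action_space.sample()      # explore
        else:
            action = local_net.best_action(state)   # exploit the local policy
        next_state, reward, done, _ = env.step(action)
        priority = local_net.td_error(state, action, reward, next_state, done)
        global_replay.add((state, action, reward, next_state, done), priority)
        state = env.reset() if done else next_state
        pull_latest_params(local_net)  # sync with the learner (at intervals in practice)

def learner_loop(core_net, global_replay, batch_size=512):
    """The single Ape-x learner: samples by priority, trains the core network,
    and writes the updated priorities back to the global replay."""
    while True:
        batch, indices = global_replay.sample(batch_size)
        new_priorities = core_net.train_step(batch)
        global_replay.update_priorities(indices, new_priorities)
```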
4.2. Using an Image as State Space
The reinforcement learning toolkit Gym contains many reinforcement learning tasks. Among them, the Atari 2600 game simulator is often used to compare the performance of various algorithms. All these tasks use images as the state space, so neural networks are also built based on images in a large number of reinforcement learning algorithms. As a high-dimensional input, an image can represent an extremely large number of states, so traditional reinforcement learning algorithms, such as Q-Learning, usually cannot be used for training. In this case, a neural network must be added to understand the image state space and fit the state-action value function.
When using images as the state space, the environment returns an RGB image at every moment of interaction. To save storage space and speed up training, images are usually cropped and downsampled to low-resolution images, then converted to grayscale images. The grayscale conversion can be calculated by Equation (8).

$Gray = 0.299R + 0.587G + 0.114B$  (8)
The convention of deep reinforcement learning algorithms is to return one image every four frames to speed up the interaction. After four such images are accumulated, they are stacked into one four-channel image in which each channel is a raw frame. A pixel-wise maximum of the third and fourth frames is taken to prevent image flicker. In addition, at the beginning of each episode, the agent usually acts randomly or stands by for up to 30 frames to increase the randomness of the interaction.
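A minimal sketch of this conventional preprocessing is shown below; OpenCV is assumed for resizing, the crop step is omitted, and the 84 × 84 target size follows the common DQN convention rather than any detail specific to this study.

```python
import numpy as np
import cv2  # OpenCV is used here for resizing; an assumption of this sketch

def preprocess_frame(rgb_frame, size=84):
    """Convert one RGB Atari frame to grayscale (Equation (8)) and downsample
    it to size x size; cropping is omitted for brevity."""
    gray = (0.299 * rgb_frame[..., 0] + 0.587 * rgb_frame[..., 1]
            + 0.114 * rgb_frame[..., 2]).astype(np.uint8)
    return cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)

def stack_frames(last_four_raw_frames):
    """Conventional DQN input: take the pixel-wise maximum of the last two raw
    frames to remove flicker, then stack the four processed frames along the
    channel axis, giving an (84, 84, 4) state."""
    frames = list(last_four_raw_frames)
    frames[-1] = np.maximum(frames[-2], frames[-1])
    processed = [preprocess_frame(f) for f in frames]
    return np.stack(processed, axis=-1)
```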
4.3. Combining the Original Image and Influence Map as the State Space
Influence maps allow Pac-Man to sense danger and make smarter decisions. In Figure 4, Pac-Man moves from point A to point B. Although there are pellets ahead in his movement direction, the dangerous influence of the ghost also spreads to point B, so he chooses to change direction. Moreover, the attractive influence of the pellets below him also spreads to point B, so Pac-Man chooses to move downward.
When the reinforcement learning agent is trained, the agent receives the state returned by the environment and continuously learns from the interaction by understanding the information contained in the state. Therefore, the state returned by the environment directly determines the knowledge learned by the agent. Different state spaces cause the agent to prefer different policies, thereby affecting the degree of intelligence and the decision-making ability. As a high-dimensional expression, an image has an extremely large number of possible states. For example, a grayscale image with a resolution of 100 × 100 has $256^{100 \times 100}$ different states, which contain large amounts of information. The more an agent can learn from such a state space, the stronger its decision-making ability.
To learn as much knowledge as possible, the original image was used as the state space in most previous research. However, an influence map is not actually an image but a matrix. The values in the matrix represent the influence value of a certain location, not the pixel value. If the values in the matrix are scaled to the range of RGB pixel values, such as 0–255, according to certain rules, or if RGB pixel values are represented by floating-point numbers, such as 0–1, the influence map can also be regarded as an image.
Now, the influence map can be superimposed on the original image. However, the number of pixels differs between the original image and the influence map. For the two images to be superimposed, the resolution must be the same. The original image can be changed to a tensor with a shape of (84, 84) by cropping, downsampling, grayscale conversion, etc. The influence map can be upsampled to (84, 84) by bilinear interpolation. In this way, two images with the same resolution can be superimposed on the z-axis to form a dual-channel image with a shape of (84, 84, 2). The two channels generated in this way are shown in Figure 5.
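The superimposition step can be sketched as follows, reusing the `preprocess_frame` helper above. Bilinear upsampling is done with OpenCV here as an assumption, and scaling both channels to roughly the 0-1 range is one possible convention, not necessarily the one used in the study.

```python
import numpy as np
import cv2  # bilinear upsampling of the influence map; an assumption of this sketch

def build_state(rgb_frame, influence_map, size=84):
    """Superimpose one processed frame and the dynamic influence map into a
    dual-channel (84, 84, 2) state.  `influence_map` is a low-resolution 2D
    array of influence values."""
    frame = preprocess_frame(rgb_frame, size) / 255.0            # (84, 84)
    upsampled = cv2.resize(np.asarray(influence_map, dtype=np.float32),
                           (size, size), interpolation=cv2.INTER_LINEAR)
    span = np.abs(upsampled).max()
    if span > 0:
        upsampled = upsampled / span                             # scale to [-1, 1]
    return np.stack([frame, upsampled], axis=-1)                 # (84, 84, 2)
```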
Because the dynamic influence map has different attenuation rates in different directions around each unit, it can express certain motion information within a single image. Compared with the traditional four consecutive frames, the influence map can thus represent motion information while reducing memory usage.
4.4. Hyperparameter Selection for the Influence Map
To obtain the influence map, if the influence of all pixels is calculated, the amount of calculation is inevitably increased, causing the interaction between the reinforcement learning agent and the environment to be much slower. Therefore, in addition to removing some text from the image, an 8 × 8 pixel block is used as a single grid cell to calculate the influence map.
All the objects in the game must be considered when calculating the influence map, and each object must be allowed to spread its influence in the whole maze. The setting of the initial influence of an object directly affects the influence map generated by it, and the global influence map is affected when all influence maps are added up. Several hyperparameter search algorithms, including grid search [44], random search [45], and genetic algorithms [46], are used to find the most suitable hyperparameters of the influence map for Ms. Pac-Man.
To obtain the optimal hyperparameters, a game AI based on the influence map is implemented, and the actions performed at each time step are selected according to the influence values in the four surrounding directions. Adjacent spread and exponential attenuation are used when influence is spreading.
The genetic algorithm is a global search algorithm that imitates the biological evolution mechanism. It selects a relatively suitable hyperparameter combination via evolution, mutation, and survival of the fittest. In the genetic algorithm used in this study, each hyperparameter is represented by 4 bits, i.e., the range of all the hyperparameters of the influence map is divided into 16 ($2^4$) parts. The gene of each individual is a bit array composed of 7 hyperparameters, with a total of 28 bits. The genetic algorithm is implemented for 50 generations of evolution, and there are 100 individuals in each generation.
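The encoding can be illustrated with a short decoding routine: each hyperparameter occupies 4 bits, so a 28-bit gene maps to 7 values on a 16-level grid within its search range. The `bounds` ranges below are placeholders, not the ranges used in the study.

```python
import random

N_PARAMS = 7          # number of influence-map hyperparameters
BITS_PER_PARAM = 4    # each hyperparameter is encoded with 4 bits (16 levels)

def decode(gene, bounds):
    """Decode a 28-bit gene into 7 hyperparameter values.  `bounds` is a list
    of (low, high) search ranges, one per hyperparameter; each 4-bit slice
    selects one of 16 evenly spaced values within its range."""
    params = []
    for i, (low, high) in enumerate(bounds):
        bits = gene[i * BITS_PER_PARAM:(i + 1) * BITS_PER_PARAM]
        level = int("".join(str(b) for b in bits), 2)            # 0..15
        params.append(low + (high - low) * level / (2 ** BITS_PER_PARAM - 1))
    return params

def random_gene():
    """A random 28-bit individual for the initial population."""
    return [random.randint(0, 1) for _ in range(N_PARAMS * BITS_PER_PARAM)]
```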
4.5. Using the Influence Map to Solve the Sparse Reward Problem
4.5.1. Sparse Reward
In deep reinforcement learning tasks, many environments have sparse reward problems. For example, in tasks such as Go and chess, the agent can only receive a reward after the entire game is over, causing the agent to receive no reward most of the time and possibly resulting in the inability to optimize the policy.
There exists a handling technique called curiosity-driven learning [47], the idea of which is that in addition to the external reward provided by the environment, the agent provides an intrinsic reward for itself. The intrinsic reward is generated through an intrinsic curiosity module (ICM). The module predicts what will happen in the next state and compares the prediction with the actual next state. If the predicted result differs too much from the actual result, it means that the next state is encountered less frequently, and a larger intrinsic reward is then generated to encourage exploration of the uncertainty. Through this intrinsic reward, the agent tends to explore more diverse states, which can increase the diversity of the collected data to a certain extent.
Curiosity-driven learning fails in some situations, a failure known as the noisy TV problem: when the next state is randomly generated, that state always gives the agent a high intrinsic reward. The generated reward causes the agent to tend to stay in the current state, even if this state is actually worthless.
To solve the noisy TV problem, OpenAI proposed random network distillation (RND) [48]. The idea behind this technique is similar to curiosity-driven learning, but two neural networks are used to evaluate the current state. The parameters of one neural network are fixed after random initialization and remain frozen, whereas the other neural network continues learning and training. When the two neural networks differ greatly in their evaluation of the same state, the current state needs to be explored, and the agent is given a larger intrinsic reward. If a certain state is experienced many times, the trained neural network gradually reduces its evaluation error, and the intrinsic reward obtained becomes increasingly small. This algorithm is based on a simple idea, but its effect in solving the noisy TV problem is desirable. However, for reinforcement learning algorithms that already have two neural networks, maintaining the inference and training of four neural networks simultaneously is very time-consuming; thus, the training speed with RND is considerably reduced.
4.5.2. Using the Influence Map to Generate an Intrinsic Reward
When Ms. Pac-Man has eaten a large number of the pellets in the maze, she does not try to explore the pellets remaining in the corners of the maze. This is similar to the sparse reward problem, i.e., the agent continuously obtains a reward of zero during exploration, which makes it more difficult to optimize the policy of the agent at this time.
We propose the use of influence maps to solve the sparse reward problem. If the external reward obtained is 0, the value in the influence map is used to generate an intrinsic reward for the agent. This can solve the sparse reward problem to a certain extent. Suppose the agent performs action a in the process of exploring the environment, and the external reward obtained is 0; the intrinsic reward is then calculated by Equation (9).

$r_{int} = \mathrm{clip}\left(I(x_a),\, r_{min},\, r_{max}\right)$  (9)

where "clip" is a function that limits the upper and lower bounds of an array, and $I(x_a)$ indicates the influence value of the location reached after performing action a.
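A minimal sketch of Equation (9): when no external reward is received, the clipped influence value of the newly reached location is returned instead. The clip bounds and the indexing convention are assumptions of the sketch, not the values used in the experiments.

```python
import numpy as np

def shaped_reward(influence_map, next_position, external_reward,
                  r_min=-1.0, r_max=1.0):
    """Equation (9): when the external reward is 0, return the clipped
    influence value of the location reached by the chosen action as an
    intrinsic reward."""
    if external_reward != 0:
        return external_reward              # keep the real reward when one exists
    value = influence_map[next_position[0]][next_position[1]]
    return float(np.clip(value, r_min, r_max))
```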
6. Discussion
Compared with the direct use of four consecutive frames, the influence map cannot be obtained directly from the game environment but must be calculated from the state of the learning environment. The calculation of the influence map requires consideration of the initial influence value, the spread mode, the attenuation mode, and the hyperparameters. These calculations add complexity to the learning process. However, if these influence-related attributes can be obtained from the environment, the use of a dynamic influence map in reinforcement learning can improve the performance of almost all reinforcement learning algorithms, reduce hardware resource usage, and increase the training speed.
In recent years, computing technology, especially machine learning, has been used for many applications in engineering [52,53,54]. An influence map provides additional preconditions for decision making, which can make machine-learning-based decision-making processes more scientific. Therefore, an image-based machine learning model with an influence map can achieve improved engineering performance.
7. Conclusions
In this study, Ms. Pac-Man was used as the learning environment to explore the method of combining an influence map with reinforcement learning. Almost all other deep reinforcement learning algorithms use four consecutive frames of images as the input of the neural network. In this study, a raw frame of an image was superimposed with the influence map. The performance achieved using an influence-map-superimposed image was about 11.8–12.6% higher than that achieved using four consecutive frames. Compared with the use of four consecutive frames as the state space, the training speed of the proposed method was increased by 59%, the video memory usage was reduced by 30%, and the memory used by the experience replay was reduced by 50%. The results prove the feasibility of the use of a dynamic influence map in deep reinforcement learning algorithms. With respect to the sparse reward tasks in deep reinforcement learning, the influence map value was used to generate an intrinsic reward when there was no external reward. Even in the case of Ms. Pac-Man, which is not a completely sparse reward task, the influence map improved the score by about 10%.
At present, the strategy learned by the algorithm does not contain enough long-term goals. To allow the agent to have long-term memory, a recurrent neural network, such as long short-term memory (LSTM), can be added to the Ape-x deep reinforcement learning algorithm to further improve the decision-making intelligence of the agent. In the future, influence maps will be combined with other reinforcement learning algorithms to study the universality of influence maps.