Article

Representing the Information of Multiplayer Online Battle Arena (MOBA) Video Games Using Convolutional Accordion Auto-Encoder (A2E) Enhanced by Attention Mechanisms

by José A. Torres-León, Marco A. Moreno-Armendáriz * and Hiram Calvo
Computational Cognitive Sciences Laboratory, Center for Computing Research, Instituto Politécnico Nacional, Mexico City 07738, Mexico
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2744; https://doi.org/10.3390/math12172744
Submission received: 16 July 2024 / Revised: 23 August 2024 / Accepted: 26 August 2024 / Published: 3 September 2024
(This article belongs to the Special Issue Mathematical Optimization and Control: Methods and Applications)

Abstract

In this paper, we propose a representation of the visual information about Multiplayer Online Battle Arena (MOBA) video games using an adapted unsupervised deep learning architecture called the Convolutional Accordion Auto-Encoder (Conv_A2E). Our study includes a presentation of current representations of MOBA video game information and an explanation of why our proposal offers a novel and useful solution to this task. This approach aims to achieve dimensional reduction and refined feature extraction of the visual data. To enhance the model's performance, we tested several attention mechanisms for computer vision, evaluating algorithms from the channel attention and spatial attention families, as well as their combination. Through experimentation, we found that the best reconstruction of the visual information with the Conv_A2E was achieved when using a spatial attention mechanism, deformable convolution, as its mean squared error (MSE) during testing was the lowest, reaching a value of 0.003893, which means that its dimensional reduction is the most general and representative for this case study. This paper presents one of the first approaches to applying attention mechanisms to the case study of MOBA video games, representing a new horizon of possibilities for research.

1. Introduction

Intelligent agents are one of the main objects of study in artificial intelligence (AI), as they are suited to a broad range of applications. One of these applications is video game player agents, in which an intelligent agent is designed or trained to beat a video game. This is an interesting application for intelligent agents, as video games present a convenient test bed for algorithms, providing cheap and easily tunable environments in which to evaluate them on complex problems.
However, selecting and representing the information that is used by an algorithm to solve a specific task is always an essential decision in computing, and intelligent agents for video games are not the exception to this rule.
Although there are several cases where intuitive information representations can be found, there are video games in which a large amount of data must be taken into account in decision making. In other words, in many video games there are several sources and types of information, and determining the relevance of these data for decision making is crucial to playing the game accurately.
For example, in games like Chess or Go, a matrix that directly maps the game board into a computational data structure can be useful for simple player agents. However, a more complex structure, such as embeddings produced by machine learning models, could be even more representative of the game state, thus serving as a better representation than a simple matrix depicting the current board [1,2].
Some of the representations for complex information in video games are auto-encoder dimensional reduction [3], auto-generated graphs [4], word embeddings [5], and tabular data [6].
In this paper, we look for the best dimensional reduction representation of the information about a complex team video game genre, Multiplayer Online Battle Arena (MOBA) video games. This reduced representation may serve to train intelligent agents to play MOBA video games.

1.1. Hypothesis

For this research, we ask the following questions:
Hypothesis 1 (H1).
Is there any other representation for MOBA video game information apart from the existing ones?
Hypothesis 2 (H2).
Given that representation, what features are the most representative of it?
Hypothesis 3 (H3).
What is the best artificial intelligence architecture to extract those representative features?
Hypothesis 4 (H4).
Do different computer vision attention mechanisms conduct different feature extractions?
Hypothesis 5 (H5).
Of those attention mechanisms, which is the best for the given problem?

1.2. Contribution

This proposal differs from the state of the art. Unlike typical deep learning models, which are fed raw data from specific commercial video games, our approach first creates a spatial representation of the MOBA information in which the metadata of every object are positioned at the same point as the object itself, but in additional channels. Moreover, our aim is to generate a scheme in which any game-specific encoder trained with our trained decoder would be capable of generating frame embeddings that capture representative information of the visual data given to the players. Such behavior is achieved by training the Conv_A2E on a synthetic dataset that illustrates valid MOBA game situations.
In other words, compared with existing methods, our spatial representation relates all the objects and their attributes by positioning them at the same point but in different channels; this is a new way to analyze MOBA information by spatially relating objects and their information. The frame embedding produced by the encoder of the Conv_A2E is, in turn, an abstract representation of this spatially related map of objects; thus, it is the result of a refined feature extraction that may be used for different purposes. Therefore, in comparison with state-of-the-art approaches, our proposal first poses a new analysis perspective by building the spatially related map of objects and, second, generates a refined extraction of the relevant information about the object map by producing the frame embedding. This representation is adaptable to different decision-making mechanisms, as the frame embedding can be used as the input of different algorithms, from non-deep models (decision trees, SVMs, MLPs) to deep neural networks.

1.3. Paper Map

In Section 2, we introduce the main topics of this research: the MOBA video game genre and attention mechanisms. Section 3 reviews related work, covering applications of synthetic data generation and how current works have represented complex MOBA video game information. Section 4 presents our proposal, detailing its design. In Section 5, we present the experimental results of this study, together with a discussion of them. Finally, in Section 6, we give our conclusions.

2. Theoretical Framework

2.1. MOBA Video Games

Multiplayer Online Battle Arena (MOBA) games have emerged as a rigorous testing ground for artificial intelligence (AI) algorithms, exemplifying the complexities of collaborative decision making over both short and long terms. In MOBA games, two opposing teams engage in combat within an arena to annihilate the opponent’s base. To achieve this overarching goal, each team must adeptly coordinate their actions, addressing a plethora of subchallenges along the way to bolster their prospects of victory.
The strategic planning, tactical maneuvering, and real-time adaptability demanded by these games make MOBA games an optimal environment for examining and refining AI algorithms.
Indeed, a primary challenge for AI algorithms in this genre lies in effectively representing the extensive information available to players. This complexity is compounded by the information’s inherently incomplete nature, as the actual state of the game board remains obscured from players. Consequently, players must rely on predictions and inferences based on partial information for their decision-making process.
To effectively represent the visual information presented to the players of MOBA video games, this paper presents an auto-encoder architecture that aims to perform an unsupervised extraction and dimensional reduction of it, generating, as the output of the encoder block, what we call a frame embedding. This vector retains all the available representative information of the frame in a dimensionally reduced form compared with the original data. In the search for the best frame embedding, this paper presents a study wherein various attention mechanisms are employed within a convolutional accordion auto-encoder (Conv_A2E) framework. The objective is to reconstruct synthetic frames from an MOBA video game depicting specific in-game situations and to identify the most effective attention mechanism that enhances the performance of the original Conv_A2E model for this case study.

2.2. Game Situations

One approach to analyzing the complexity of MOBA video game information is by considering how matches are viewed from the players' perspective. As outlined by [7], an unpredictable flow of events characterizes MOBA game matches, as players encounter a diverse set of scenarios known as game situations. These game situations arise dynamically, presenting players with individual challenges to navigate. Encompassing a wide range of scenarios, these situations range from unexpected encounters with opponents to strategic decision-making dilemmas, requiring players to adapt and respond in real time to achieve success in the match.
From the description of MOBA video games as AI algorithm test platforms made in [7], there are 13 different game situations, according to the challenges mentioned for an MOBA match. For this research, we chose one of them, described as follows; the same idea is applicable to any game situation:
Lane push. In this situation, several members of a team attack an enemy structure, aiming to destroy it and reduce the opposing team’s safe zone. The attacker team wins if they manage to destroy the structure, and the defender team wins if they kill all of the attackers of the structure.
An example of this game situation is when two avatars of Team 1 attack a Tower of Team 2 along with their Creeps; eventually, the attackers will destroy the tower.

2.3. Attention Mechanisms

Deep learning (DL) models perform significantly well in computer vision tasks, forming part of systems that create useful unsupervised representations of video games. In recent years, attention mechanisms have emerged as a significant enhancement to those algorithms, refining the feature extraction process by selectively focusing on the most relevant features for a given task within a DL model. Initially developed for natural language processing (NLP) tasks, attention mechanisms have undergone extensive research and have now permeated various domains, including computer vision, tabular data analysis, and generative models.
As noted by [8], attention mechanisms have assumed a pivotal role in computer vision, yielding notable benefits across a spectrum of visual tasks. These include, but are not limited to, object selection, spatial region detection, dynamic time selection mechanisms, image classification, face recognition, and medical image processing. In fact, ref. [8] categorizes attention mechanisms in computer vision according to the data domain.
Note that at the time of that survey, there were still some void spaces (intersections of the taxonomy), meaning that no work had yet been published on the corresponding attention mechanisms.
At their core, attention mechanisms introduce intermediate operations between neural layers to dynamically adjust the weighting of the features extracted by preceding layers. This process refines the information propagated through the subsequent feed-forward steps of a deep neural network.

2.3.1. Channel Attention Mechanisms

Channel attention is a type of attention mechanism commonly used in computer vision tasks. It operates at the channel level of the feature maps produced by convolutional neural networks (CNNs). The primary goal of channel attention is to dynamically recalibrate the importance of the different channels within a feature map. These properties have made channel attention shine in problems like image steganography [9] and video frame interpolation [10]. The two channel attention variants described below were used in this study; a minimal sketch of an ECA-style block follows the list.
1. ECA-Net. The efficient channel attention (ECA) block for deep CNNs was proposed by [11] to avoid dimensionality reduction and capture cross-channel interaction efficiently. An ECA block employs a 1D convolutional operation that allows the model to directly assess the channel relationships without requiring dimensionality reduction. This 1D convolution is applied over the global average pooling of the received feature map. The ECA block, or ECA-Net, can readily be incorporated into various CNNs.
2. FcaNet. Frequency channel attention (FCA) was proposed by [12] based on the mathematical proof that conventional global average pooling (GAP) is a particular case of feature decomposition in the frequency domain. This proof led to the proposal of an attention mechanism that conducts channel-wise analysis by evaluating the importance of each frequency channel for the task at hand. The channel-wise analysis is carried out mainly by this network's multi-spectral channel attention module, a specialized component designed to enhance feature representation and extraction in multi-spectral imagery processing tasks.
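To make the channel recalibration concrete, the following is a minimal PyTorch sketch of an ECA-style block in the spirit of [11]; the kernel size and class name are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA-style channel attention: GAP + 1D cross-channel conv + sigmoid."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> per-channel descriptor (B, C, 1, 1)
        y = self.avg_pool(x)
        # Treat the C channel values as a length-C sequence for the 1D conv.
        y = self.conv(y.squeeze(-1).transpose(-1, -2))        # (B, 1, C)
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))   # (B, C, 1, 1)
        return x * y  # recalibrate each channel of the feature map
```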

2.3.2. Spatial Attention Mechanisms

Spatial attention is another type of attention mechanism commonly used in computer vision tasks. Unlike channel attention, which operates at the channel level of feature maps, spatial attention focuses on enhancing certain spatial regions of the feature map. It allows the model to selectively attend to specific spatial regions of the input data, enabling it to capture fine-grained details and spatial dependencies crucial for tasks such as object localization, image segmentation, and scene understanding. Owing to its strength in identifying relevant spatial relationships in the input data, spatial attention has been used in trajectory prediction for intelligent vehicles [13] and semantic segmentation for medical images [14], among other applications. The spatial attention mechanism variants detailed below were used in this study; minimal sketches of both follow the list.
3. Gated Attention Unit. Also referred to as the attention gate (AG), this attention mechanism was introduced by [15] as a straightforward yet highly efficient method to direct the focus of neural networks to targeted regions while simultaneously dampening feature activations in irrelevant areas. Throughout the training phase of a deep neural network (DNN), AGs systematically suppress feature responses in extraneous background regions. Leveraging attention coefficients, the AG identifies salient regions within images and selectively prunes feature responses, retaining only the activations pertinent to the particular task.
4. Deformable Convolution. To achieve geometric transformation invariance, ref. [16] introduced the concept of deformable convolutional networks. These networks redefine conventional convolutional layers by breaking the operation into two steps. The first step samples features from the input map on a regular grid, which is followed by aggregating these sampled features using a convolutional kernel. What sets deformable convolution apart is its augmentation of the sampling process with learnable offsets. These offsets dynamically adjust the positioning of the kernel grid to align with the most relevant regions of the feature map for the given task.
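A minimal sketch of an additive attention gate in the spirit of [15] is shown below; the channel sizes and the assumption that the gating signal and skip features share a spatial size are ours, intended only to illustrate the gating operation.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: attention coefficients suppress irrelevant regions."""
    def __init__(self, g_channels: int, x_channels: int, inter_channels: int):
        super().__init__()
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)  # gating signal
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)  # skip features
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)           # coefficient map
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x and g are assumed to share spatial size here; [15] resamples otherwise.
        alpha = self.sigmoid(self.psi(self.relu(self.w_g(g) + self.w_x(x))))
        return x * alpha  # retain only the activations in salient regions
```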
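Likewise, the following is a hedged sketch of a deformable convolution layer built on torchvision.ops.DeformConv2d; the 3 × 3 kernel and the offset-prediction layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """Deformable convolution: a regular conv predicts per-position sampling offsets."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # One (dy, dx) offset pair per kernel position, predicted from the input.
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)        # learnable shifts of the kernel grid
        return self.deform_conv(x, offsets)  # convolve over the shifted sampling grid

# Example with the 5-channel 7 x 9 frames used in this paper:
# y = DeformableConvBlock(5, 16)(torch.randn(1, 5, 7, 9))  # -> (1, 16, 7, 9)
```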

2.3.3. Channel and Spatial Mechanisms

Attention mechanisms that combine spatial and channel attention leverage the strengths of each approach to enhance the representation power of feature maps in computer vision tasks. These combined attention mechanisms aim to capture spatial relationships between features and channel-wise correlations within feature maps.
By combining spatial and channel attention mechanisms, these hybrid approaches enable the model to capture both fine-grained spatial details and inter-channel correlations of the feature maps, leading to improved performance in computer vision tasks. This synergy facilitates more effective feature extraction and representation learning, ultimately enhancing the model's understanding and interpretation of complex visual data. This combination enables a rich data analysis, which has been used for cases like hand gesture recognition [17] and image super-resolution [18]. Below, two channel and spatial attention variants are explained, as they formed part of the study presented in this paper; a minimal sketch of the CBAM follows the list.
5. CBAM. The convolutional block attention module (CBAM), introduced by [19], comprises two consecutive sub-modules: the channel attention module (CAM) and the spatial attention module (SAM), applied in that order. The CAM dynamically identifies the most pertinent channels within the feature map, while the SAM learns to localize the significant regions within those channels. This hierarchical approach enables the CBAM to effectively capture both channel-wise and spatial-wise dependencies, enhancing the discriminative power of the feature representation.
6. Triplet Attention. Initially proposed by [20], this attention mechanism relies on three branches, each focusing on distinct aspects of the input data. The first branch computes attention across the channel dimension (C) and the spatial dimension (W), whereas the second branch captures attention across the channel dimension (C) and the spatial dimension (H). The third branch computes spatial dependencies between H and W. The outputs from these branches are then combined, typically through averaging, to generate the final attention map. The primary objective of this mechanism is to capture such cross-dimension interactions between the channel and spatial dimensions, enhancing the network's capacity to discern meaningful representations from the input data and ultimately improving its performance in tasks such as classification or segmentation.
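A minimal sketch of a CBAM-style module follows; the reduction ratio and the 7 × 7 spatial kernel follow commonly reported defaults but are assumptions here, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM-style module: channel attention (CAM) followed by spatial attention (SAM)."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(                         # shared MLP of the CAM
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # SAM conv
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # CAM: pool over space, combine average and max descriptors.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * self.sigmoid(avg + mx)
        # SAM: pool over channels, convolve the stacked average/max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.sigmoid(self.spatial(s))
```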

3. State of the Art

3.1. Synthetic Data Generation for Diverse Purposes

In the context of machine learning (ML), the main factor that determines a project’s success is the data. High-quality and large-quantity data are hard to find, especially when dealing with unsolved or complex problems. Luckily, synthetic data can be generated to complement or, in some cases, even substitute the existing datasets.
As suggested in [21,22,23], synthetic data may find application in various industries, including robotics, autonomous vehicles, and medical simulations, where gathering large volumes of diverse and realistic data is often prohibitively expensive or impractical.
Techniques for synthetic data generation offer valuable tools for improving computer vision systems’ capabilities. In this field, synthetic data refers to artificially generated images or datasets that mimic real-world scenarios. These techniques serve several purposes, including data augmentation, domain adaptation, object detection, object segmentation, and more [24,25,26,27].
One primary application of synthetic data in computer vision is data augmentation. Synthetic data approaches can create existing image variations through rotation, scaling, flipping, and adding noise. This augmented data enriches the training process by exposing models to a broader range of scenarios, thereby improving their robustness and performance [27].
Synthetic data can also be applied in domain adaptation by adding synthetically generated conditions to previously existing datasets. For example, simulating different lighting conditions or camera perspectives may add data to train models that generalize better across diverse real-world scenarios [23,26].
Particularly in video games, the conditions and rules are typically so well defined that video games can be considered a synthetic data source, as suggested by [21], because they can simulate real-world scenarios, like cities. Although video games should not be considered accurate real-world simulators, they can provide valuable data for diverse AI algorithms, from player agents to robots that navigate complex environments.
These techniques help create annotated datasets for object detection and segmentation by generating images with different object configurations, backgrounds, and occlusions. Synthetic data aid in training models to accurately detect and segment objects under diverse conditions.

3.2. MOBA Video Games Visual Information Representation

As discussed in Section 1, MOBA video games are known to be a challenging problem for artificial players, primarily due to the complexity of their information and how it should be represented for the player algorithms. This section reviews other research addressing this problem, pointing out its limitations and areas of opportunity.
One of the first approaches to analyzing an MOBA match is presented by [6], where the information on team composition is represented as tabular labeled data and used to make win predictions for bot matches. Much more closely related to bot behavior, ref. [28] proposed a symbolic representation of the game state, aiming to build a bot for a specific avatar of Dota 2 [29] designed under the Tangled Program Graphs (TPGs) framework to solve the "1-on-1 mid-lane task", managing to obtain the expected behavior from the bot.
Although their reported results are as expected and set the groundwork for the applications and research of AI in MOBA video games, the approaches presented in [6,28] use a data representation that does not take into account one of the primary sources of information in a video game: the visual data.
However, since the research of [30], visual information has been considered a fundamental part of the representation used as input for AI applications in MOBA video games. In this work, Berner et al. create a feature vector that considers more than 600 features, many of which are related to visual information. However, the deep learning algorithm they trained does not directly process images but considers information about the player's visual context, as some of those features describe the map and the mini-map. Since then, practically all of the research focused on player agents, like [31,32,33,34,35], and win prediction [36] includes a module dedicated solely to image processing.
Regarding [34], in this work the authors propose a hierarchical architecture for a player agent. Their algorithm considers visual and non-visual information to create the information representation that serves as the input for a player agent to choose the best strategy. To do so, they incorporated convolutional neural networks into the architecture to extract features from the current visual information on the game screen. Moreover, their architecture incorporates a fully connected layer to process non-visual information from the game data selected by the authors. These two branches are combined and form the input of the player agent. In other words, the deep learning algorithm generates the information representation for every player agent decision during the match.
Another hierarchical player agent proposal was made by [35]. In this work, the agent takes images and vector features as input, carrying visual and global features of the match, respectively. The visual information comprises 85 features, such as the positions and hit points of all units, with the visual features blurred into a 12 × 12 resolution; for the global features, they extract 181 features, such as the roles of heroes, game time, hero ID, heroes' gold and level status, and Kill–Death–Assist statistics. Similar to the authors of [34], the visual information is processed by a CNN and the global features by fully connected layers. However, the resulting information is merged and sent to two separate tasks: one that identifies the game phase and another that selects the best micro-level strategy. In other words, the agent structure is different from that used in [34], but its MOBA information representation is practically the same.
Two more player agent architectures were proposed by Tencent AI Lab researchers, as presented in [31,32]; both architectures are formed by three branches, aiming to efficiently process the visual and non-visual information of the match, using convolutional layers for the visual information, represented by the raw images of the player screen and mini-map, and fully connected layers for the non-visual information described in feature vectors. In the first model, the first branch takes the hero's local-view map image as input, while the other two branches take features of both the local-view map and the mini-map as input. The second model takes more scalar information than the first: the first of its branches receives the in-game statistics, the second branch takes spatial features and processes them through a CNN, and the third branch takes observable units' attributes and invisible opponent information, such as health points, skill cooldowns, gold, level, etc. In both cases, the output of the three branches is combined into a new vector that serves as the input for the player agent. In summary, similar to [34,35], these researchers obtain a representation as the output of a multi-branch deep neural network.
In general, all of the player agents in [31,32,33,34,35] use the player's screen images as the visual input for their algorithms. However, this input is usually complemented with non-visual information, typically in the form of feature vectors.
In contrast to the state of the art, we propose a spatial representation, explained in detail in Section 4.1, where metadata (health points, level, tower key, team affiliation) are represented as additional channels of the same "pixel", resulting in a representation where the objects and their information are spatially related. Furthermore, compared with the state of the art, our representation eliminates the need for a multi-branch DL model, as the information about an object and the object itself are in the same multi-channel matrix, requiring just one branch to analyze this information.

4. Method

4.1. Proposed Solution

Based on the premise that a collaborative artificial team capable of solving each game situation can effectively tackle the entire game, we propose the creation of a synthetic dataset. This dataset will consist of frames depicting variations of a selected game situation: the lane push, given that this is a mechanic that centers around the primary objective of either destroying the enemy base or protecting the allied base. By focusing on this fundamental aspect of the gameplay, we aim to capture the visual patterns of this crucial situation. The original proposal was made by [37], and a partial implementation of that system is studied in this paper.
This synthetic dataset will serve as a valuable resource for training an accordion auto-encoder to create dimensional reductions of the visual information provided to the players, aiming to create a model that understands the complexities of lane push variants. The overall diagram of the proposed solution is shown below in Figure 1.
The objects and their representing characters listed below were used for this research:
  • B: Base; D: Defender avatar; M: Medicine; Wb: Breakable wall
  • T: Tower; J: Jungle avatar; Ad: Adrenaline; K: Tower key
  • A: Attacker avatar; F: Fighter avatar; B: Bush; X: Void space
  • S: Support avatar; C: Creep; W: Wall
The steps of the method for dimensional reduction are explained below:

4.1.1. ➀ Game Situation Definition

As mentioned in Section 2.2, the ‘lane push’ situation can be briefly described as an attempt to destroy an enemy structure. Thus, a scenario where this situation occurs must meet the following rules:
  • The map segment must have a tower or structure;
  • The map segment must contain an avatar from the opposing team of the tower/structure;
  • The map segment must contain a creep that holds a Tower key for the attacker team.
To simulate possible map configurations of this situation, we delimited a 13 × 15 matrix in which the representing characters of the game objects are placed randomly, saving only the configurations that satisfy the previously defined rules.

4.1.2. ➁ Synthetic Data Generation

In the context of MOBA video games, obtaining labeled data for training deep learning models presents significant challenges due to the complexities associated with data collection and annotation. Additionally, we aim to study a general MOBA game state representation, meaning that it may not be directly associated with a specific commercial game of this genre but rather with a representation valid for all of them.
To overcome this challenge, we procedurally generated a set of lane push map segments that fit the definition. The procedure, shown in Algorithm 1, is described below:
  • An empty matrix of 13 × 15 cells is generated.
  • A random number of cells, drawn from a range {min, max}, are filled with random objects of the game.
  • The matrix is revised to fulfill the definition of the game situation.
  • If the matrix fulfills the definition, the process continues to random metadata annotation; else, it returns to step 1.
Algorithm 1: Synthetic Data Generation.
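Since Algorithm 1 is given as a figure, the following is only a hedged Python sketch of the loop it describes; the object symbols follow the list in Section 4.1, while the placement range and the simplified rule check (team and tower-key checks omitted) are illustrative assumptions.

```python
import random

# Representing characters from Section 4.1 (simplified; Bush shares "B" with Base).
OBJECTS = ["B", "T", "A", "S", "D", "J", "F", "C", "M", "Ad", "W", "Wb", "K", "X"]
ROWS, COLS = 13, 15

def fulfills_lane_push(seg):
    """Rules of Section 4.1.1: a tower, an opposing attacker avatar, and a
    creep must be present (team and tower-key checks omitted in this sketch)."""
    flat = [cell for row in seg for cell in row]
    return "T" in flat and "A" in flat and "C" in flat

def generate_map_segment(min_objs=10, max_objs=40):
    while True:
        seg = [["X"] * COLS for _ in range(ROWS)]             # empty 13 x 15 matrix
        for _ in range(random.randint(min_objs, max_objs)):   # random cell fills
            r, c = random.randrange(ROWS), random.randrange(COLS)
            seg[r][c] = random.choice(OBJECTS)
        if fulfills_lane_push(seg):   # keep valid configurations only,
            return seg                # otherwise return to step 1 and retry
```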
Figure 2 shows an example of a generated map segment, where some characters that represent the MOBA video game objects were randomly allocated, fulfilling the lane push definition. Note that around each avatar (S and A) there is a marked area, the visible area of that avatar, meaning that this is the zone from which that player would receive visual information. A total of 50,000 map segments were generated using this procedure.

4.1.3. ➂ Random Metadata Annotation

The map segments are randomly annotated to create a richer dataset, simulating diverse variations of a given game situation by changing the health points, level, tower key (a flag indicating whether an avatar or creep possesses a tower key), and alliance. The tower key is a unique item of our MOBA video game. If a player holds a tower key, then the damage that they and their nearby allies deal to the tower is doubled; if no enemy avatar of a tower holds one, then the damage the tower receives is half the damage those attacks would normally deal. The alliance is a flag for the avatars, structures, and creeps, and it indicates the team to which they belong: team 1, team 2, or neutral (0). This data augmentation procedure is explained in detail in Algorithm 2.
Algorithm 2: Random Metadata Annotation.
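As with Algorithm 1, the following is a hedged sketch of the annotation step; the channel ordering (object, health points, level, tower key, alliance) and the value ranges are illustrative assumptions, loosely based on the ranges mentioned in Section 4.1.5.

```python
import random

# Object classes assumed to carry metadata in this sketch.
ANNOTATED = {"A", "S", "D", "J", "F", "C", "T", "B"}

def annotate(seg, object_ids):
    """Turn a 13 x 15 character matrix into a 5-channel matrix of metadata."""
    rows, cols = len(seg), len(seg[0])
    frame = [[[0] * 5 for _ in range(cols)] for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            obj = seg[r][c]
            frame[r][c][0] = object_ids[obj]                 # object channel
            if obj in ANNOTATED:
                frame[r][c][1] = random.randint(100, 3000)   # health points
                frame[r][c][2] = random.randint(1, 20)       # level
                frame[r][c][3] = random.randint(0, 1)        # holds a tower key?
                frame[r][c][4] = random.choice([0, 1, 2])    # neutral / team 1 / team 2
    return frame
```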
Figure 3 shows two variants of the same lane push situation but with different values in the metadata channels due to the random annotation. Although the elements or game objects are the same in each matrix, there are crucial variations; for example, in the "health points" channel (ii and vii), almost every object has a different value, except for the creep (c), which has the same points in both variants. Another difference worth mentioning is that the attacker avatar (A) and the defender avatar (D) belong to different teams; this can be seen from the colors used in i and vi but also from the values of these objects in the alliance channel (v and x). So, even when using the same elements, important variations of the game situation can be generated just by performing this random metadata annotation. Five randomly annotated variations were created from each of the 50,000 generated map segments, resulting in 250,000 randomly annotated map segments.

4.1.4. ➃ Frames Extraction

Block ➂ generates map segments of the game situation of size 13 × 15. However, the data a player receives are not that full matrix but rather the piece that the player sees during the match. As shown in Figure 2, the avatar perceives a matrix of size 7 × 9, with the avatar located at the center. It is therefore necessary to extract these pieces for the dataset, as they are the information that a player receives. For this purpose, we followed Algorithm 3, which works as follows:
  • The center of the frame matrix is the avatar’s current cell.
  • The top left of the matrix is positioned four columns to the left of the avatar and three rows above it.
  • The bottom right of the matrix is positioned four columns to the right of the avatar and three rows below it.
  • For those limits, if all of the considered positions are in the defined cells of the map segment, then they are taken to create the frame matrix; else, the undefined positions are considered void cells (no game object exists in them).
Figure 4 shows an example of frame extraction from the map segment of Figure 2. From the 250,000 synthetically generated map segments, 1,600,000 frames were extracted, with a mean of about 6 frames per map segment.
Algorithm 3: Frames Extraction.
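A hedged sketch of the extraction window follows; it assumes the annotated map segment is a nested list of 5-channel cells as in the previous sketch, and that out-of-map positions become void cells.

```python
VOID = [0] * 5  # a 5-channel void cell (no game object exists in it)

def extract_frame(annotated_seg, avatar_row, avatar_col):
    """Extract the 7 x 9 window centered on an avatar's cell (Algorithm 3)."""
    rows, cols = len(annotated_seg), len(annotated_seg[0])
    frame = []
    for r in range(avatar_row - 3, avatar_row + 4):      # 3 rows above and below
        row = []
        for c in range(avatar_col - 4, avatar_col + 5):  # 4 columns left and right
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(annotated_seg[r][c] if inside else VOID)
        frame.append(row)
    return frame  # a 7 x 9 x 5 frame matrix
```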

4.1.5. ➄ Data Normalization

Normalization is a common technique in machine learning; it prevents overfitting to a specific channel or feature due to its magnitude rather than its actual relevance. In this case, the health points channel has values in the thousands, whereas the other channels have ranges like [0, 2] or [1, 20]. This discrepancy makes data normalization necessary so that each channel's values fall within the range [0, 1].
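A minimal sketch of this per-channel min-max normalization, assuming the frames are stacked as a tensor of shape (N, 5, 7, 9):

```python
import torch

def normalize_per_channel(frames: torch.Tensor) -> torch.Tensor:
    """Scale each of the 5 channels to [0, 1] over the whole dataset."""
    lo = frames.amin(dim=(0, 2, 3), keepdim=True)  # per-channel minimum
    hi = frames.amax(dim=(0, 2, 3), keepdim=True)  # per-channel maximum
    return (frames - lo) / (hi - lo).clamp(min=1e-8)
```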

4.1.6. ➅ Frames Dataset

Once normalization is performed in step ➄, the dataset is ready. Finally, for its implementation in PyTorch, all data are saved in the format [batch size, number of channels, height, width], resulting in 200 files, each containing 8000 frames. The dataset specifications are shown in Table 1.

4.1.7. ➆ Accordion Auto-Encoder Adapted Architecture

We proposed using an A2E network because of its proven efficacy for image reconstruction. Originally published in [38], this deep learning model was designed to improve the performance of classifier and anomaly detection algorithms by using several lower-dimensional spaces to generate more meaningful generalizations of the data. This effectiveness stems from its refined feature extraction capabilities, facilitated by expansion–reduction processes. These processes enable the network to extract high-quality features by initially expanding the data dimensions and then refining them through dimensionality reduction.
The proposed A2E architecture differs, in the first place, in that our data representation contains strong spatial relations; thus, we make it a convolutional A2E (Conv_A2E). Regarding the number of expansion–reduction processes, we limited our experiment to two expansions and two reductions, as the frame data are not highly dimensional (7 × 9 × 5), and we estimated that, on average, players consider roughly 24 features when making a decision (features like object positions, the available abilities of their avatar, the available buttons on their keyboard/controller, etc.). Nevertheless, we conducted an experimental process, reported in Section 5.1, to search for an effective architecture that generates frame embeddings sufficiently representative to improve image reconstruction.
It is crucial to emphasize that the frame embedding's dimensionality should be lower than that of the original data. Specifically, it should contain fewer features than the original data, which amounts to 7 × 9 × 5 = 315 features in this case. This constraint prevents the frame embedding's latent space from merely replicating the original frame space. Such a replication would contradict the purpose of this architecture, which aims to distill and represent the essential features of the input data in a more compact and informative manner.
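As an illustration only, the following PyTorch sketch shows a Conv_A2E-style encoder and decoder with two expansion–reduction cycles and a 1 × 1 × 48 frame embedding; the layer widths, kernel sizes, and activations are our assumptions, as the exact architecture is given in Figure 5.

```python
import torch
import torch.nn as nn

class ConvA2E(nn.Module):
    """Accordion-style convolutional auto-encoder: expand, reduce, expand, reduce."""
    def __init__(self, in_ch: int = 5, emb_ch: int = 48):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),   # expansion 1
            nn.Conv2d(64, 16, 3, padding=1), nn.ReLU(),      # reduction 1
            nn.Conv2d(16, 128, 3, padding=1), nn.ReLU(),     # expansion 2
            nn.Conv2d(128, emb_ch, (7, 9)),                  # reduction 2 -> (48, 1, 1)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(emb_ch, 128, (7, 9)), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, in_ch, 3, padding=1), nn.Sigmoid(),  # outputs in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)     # frame embedding: 48 features, fewer than 315
        return self.decoder(z)  # reconstructed 5 x 7 x 9 frame

# x = torch.randn(125, 5, 7, 9); ConvA2E()(x).shape  # -> (125, 5, 7, 9)
```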
For each attention family, two representative algorithms were chosen. These algorithms are described in Section 2.3.

4.1.8. Empowering the Conv_A2E Using Attention Mechanisms

The study aims to identify the most suitable attention mechanism for reconstructing MOBA frames. To achieve this goal, the research examines all families of computer vision attention mechanisms applicable to the defined task: channel attention, spatial attention, and their combination, drawn from existing computer vision attention mechanisms.
To enhance the Conv_A2E architecture presented in Section 4.1.7 and illustrated in Figure 5, we added computer vision attention mechanisms as illustrated in Figure 6. With the exception of the deformable convolution, all the attention mechanisms were implemented as shown in that figure. For the deformable convolution, the architecture is closer to the one illustrated in Figure 5, as this convolutional layer already integrates the attention mechanism as part of its normal operation.

4.1.9. A2E Test

To evaluate the models, the metric used for an objective evaluation was the mean squared error (MSE), shown in Equation (1), where $Y_i$ corresponds to the original frame and $\hat{Y}_i$ corresponds to the reconstructed frame. In addition, we used examples of the reconstructed frames to make a deeper analysis of the Conv_A2E outcomes.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \qquad (1)$$

4.1.10. Trained MOBA Encoder and Decoder

Once the model is trained and performs well enough in testing, the encoder and decoder architectures are saved as PyTorch checkpoint files, ready to be implemented.

4.1.11. Evaluation

Our goal is to show that frame reconstruction is performed better with attention mechanisms, especially those centered on spatial attention, as our representation is in fact designed to keep all the information spatially related. For this purpose, we used the MSE and the visualization of the reconstructed frames to evaluate the performance of the different auto-encoders.

5. Experimental Results

In this section, we present three experimental procedures. In the first, we train the proposed Conv_A2E in order to find a good dimensionality for the frame embedding; a better reconstruction by a model with a given embedding dimensionality implies that the features extracted in that frame embedding are more representative of the original frame.
The second experimental section aims to determine how much data should be used to train our proposed models.
Finally, the third experimental procedure aims to determine which attention mechanism for computer vision best fits this case study.

5.1. Conv_A2E Model

From the frames dataset of step ➅, we used a portion of 50 files for training (i.e., 400,000 frames) and four files for testing (i.e., 32,000 frames). All models were trained for 100 epochs with a batch size of 125 frames, using an Adam optimizer with a learning rate of 0.0015.
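A minimal sketch of this training setup follows; the DataLoader construction and function boundaries are our assumptions, while the hyperparameters (100 epochs, batch size 125, Adam with learning rate 0.0015, MSE reconstruction loss) are those reported above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model: nn.Module, frames: torch.Tensor, epochs: int = 100) -> nn.Module:
    """Train an auto-encoder to reconstruct its own input frames."""
    loader = DataLoader(TensorDataset(frames), batch_size=125, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=0.0015)
    criterion = nn.MSELoss()  # Equation (1)
    for _ in range(epochs):
        for (batch,) in loader:
            opt.zero_grad()
            loss = criterion(model(batch), batch)  # reconstruction error
            loss.backward()
            opt.step()
    return model
```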
The first experiments were conducted to establish the optimal frame embedding size that yields the lowest MSE by only changing that feature of the Conv_A2E architecture presented in Figure 5. Those experiments are reported in Table 2. From these results, we decided to use the frame embedding of 1 × 1 × 48 dimensions, as the Conv_A2E_V3 achieved the lowest MSE in both training (0.012779) and testing (0.010314) stages, meaning that the reconstructed frames made by this model were, on average, more similar to the original ones.
The following stage helped determine how much data should be used to train the Conv_A2E. As our first experiments were conducted using 50 files (400,000 frames), we conducted two more experiments, one using half the data and another using twice as much. Those results are shown in Table 3. As a specification, the architecture trained with 800,000 frames (100 files) needed a different batch size of 64 instead of 125. Counterintuitively, increasing the amount of data did not improve the performance of the model in this case, whether during training or testing, suggesting that to benefit from more data, we would also have needed to add more layers to our architecture. Nonetheless, decreasing the amount of data did not enhance the model's performance either, as the lowest MSE remained that of the original configuration trained with 400,000 frames.
The resulting architecture, V3, is illustrated in Figure 5.

5.2. Conv_A2E_V3 with Attention Mechanisms

To include the attention mechanisms in the Conv_A2E_V3, we added the attention layers after every convolutional block, as shown in Figure 6, where the original layers are in gray and the attention layers are in red.
To gain an initial understanding of the performance of the different accordion auto-encoders, Figure 7 shows an example of a frame reconstructed by each architecture. In this image, the first row shows the original frame to be reconstructed, and the remaining rows are the outcomes of the different Conv_A2E variants. From these results, the architecture that mainly draws our attention is the deformable convolution, as it seems to add information in every channel, in contrast with the rest of the models, which add little information in one or two channels. Another interesting fact is that the two combined channel and spatial attention variants (namely, CBAM and triplet attention) show more similar reconstructed frames. It is also notable that Deformable Conv is essentially the only variation of the Conv_A2E that captures information in the bottom row of the frame; only a few others, like FcaNet, seem to capture it in at least one channel. This makes it the best variation of the Conv_A2E_V3, as no other captures this relevant information of the original frame.
For a quantitative evaluation of the Conv_A2E using the selected attention mechanisms, the training results are shown below in Table 4.
From this quantitative evaluation, deformable convolution is the best model, reaching the lowest MSE of them all, meaning that, on average, the reconstructed frames produced by this architecture are more similar to the original ones, thus creating a better frame embedding that contains the relevant information needed to reconstruct the frames at high quality. Even though the improvement in this metric may seem marginal, it is a significant enhancement to the model. Without making major modifications to the architecture, this approach helps improve the image reconstruction process and, by extension, the dimensional reduction achieved by the Conv_A2E.
Regarding both the quantitative and qualitative evaluations, we noted that every attention mechanism had a positive impact on the original Conv_A2E, which is most evident when noting that, in Figure 7, the original architecture mainly focuses on the avatar (the yellow square at the center of the frame), whereas all of the attention mechanisms achieve reconstructions where other information is added. Another interesting insight from Figure 7 is that most models replicate the 'tower key' channel perfectly, as it is the simplest. However, the deformable convolution produces a quite different frame, which, considering the quantitative result, may imply that this is a more general representation of the original input frame.

6. Conclusions

In this section, we first answer the research questions defined in Section 1.1; then, a deeper discussion about the experimental results and ideas for future work are presented.
Hypothesis 1. Is there any other representation for MOBA video game information apart from the existing ones?
Response: A new representation of MOBA video game information can be found. Here, we proposed one that spatially relates the video game objects and their information. However, many others may exist, so future researchers may use or find the one that better fits their research needs.
Hypothesis 2. Given that representation, what features are the most representative of it?
Response: The most representative features of this representation are related to the spatial relationship between the game objects and their metadata. However, we do not explicitly determine those features in this research, as our focus was to extract them automatically.
Hypothesis 3. What is the best artificial intelligence architecture to extract those representative features?
Response: Given this representation, we found that representative features of the original frames can be extracted using the Conv_A2E_V3 and its variations with attention mechanisms.
Hypothesis 4. Do different computer vision attention mechanisms conduct different feature extractions?
Response: As expected, all the attention mechanisms enhance the original Conv_A2E_V3 architecture, resulting in better frame reconstructions than the original model. Nevertheless, only some of them achieve significant improvements in the MSE evaluation. The CBAM and gated attention results are not significantly different from the metrics obtained for the original architecture during training. Even though their testing performance shows a more significant difference compared to the training, these two variations exhibited the worst quantitative evaluation.
Hypothesis 5. Of those attention mechanisms, which is the best for the given problem?
Response: The best alternative found was the Conv_A2E_V3 + deformable convolution. On both the training and testing sets, this was the variation that achieved the lowest MSE, meaning that its reconstructions are, on average, the most similar to the original frames, thus generating the best frame embeddings: those that contain the most representative features of the data.
Although the quantitative evaluation is more relevant, as it provides an assessment across the entire test set, the reconstructed images offer a qualitative perspective that allows us to see the actual outcomes of the models and compare them with the ground truth. In this regard, the best quantitative result may not imply the best reconstruction. The variation of the Conv_A2E using deformable convolutions shows remarkable results, differing from all the other architectures by adding what appears to be random information across all the channels. However, as the quantitative results suggest, the reconstruction (and, by extension, the representation) this model obtains of the frames is the most precise of all the tested models.
This aligns with the hypothesis that spatial attention may be more suitable for our proposed representation. On the other hand, this study also demonstrates the potential of different attention mechanisms to improve information representation for MOBA video games.
For future research, we first suggest that the presented study serve as a guideline for choosing visual information representation techniques. Furthermore, the trained model can be directly used by an intelligent agent solving an MOBA video game that fits the definition of objects and game space used here, as it may form part of the decision-making mechanism of the agent by producing useful frame embeddings describing the visual information. Finally, our proposed spatial representation of the visual information, the one used before creating the frame embeddings with the Conv_A2E, also differs from state-of-the-art representations and can be used in procedures different from the one presented in this study to solve MOBA video games.

Author Contributions

Conceptualization, H.C., J.A.T.-L. and M.A.M.-A.; methodology, H.C., J.A.T.-L. and M.A.M.-A.; software, J.A.T.-L.; validation, M.A.M.-A. and H.C.; formal analysis, M.A.M.-A., J.A.T.-L. and H.C.; investigation, J.A.T.-L., M.A.M.-A. and H.C.; resources, M.A.M.-A.; data curation, J.A.T.-L.; writing—original draft preparation, J.A.T.-L.; writing—review and editing, M.A.M.-A. and H.C.; visualization, J.A.T.-L.; supervision, M.A.M.-A.; project administration, M.A.M.-A.; funding acquisition, M.A.M.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Instituto Politecnico Nacional (IPN) through Secretaria de Investigación y Posgrado (SIP-IPN) research grants SIP-2259, SIP-20240666, SIP-20240610; Comisión de Operación y Fomento de Actividades Académicas del IPN (IPN-COFAA) and Programa de Estímulos al Desempeño de los Investigadores (IPN-EDI) and Consejo Nacional de Humanidades, Ciencias y Tecnologías, Sistema Nacional de Investigadores (CONAHCYT-SNII).

Data Availability Statement

The dataset created to conduct the reported experiments is available at https://github.com/JAlbertoTorres/MOBA_ConvA2E_dataset (accessed on 12 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DL    deep learning
A2E   Accordion Auto-Encoder
MOBA  Multiplayer Online Battle Arena
MSE   mean squared error

References

  1. Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
  2. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359.
  3. Alvernaz, S.; Togelius, J. Autoencoder-augmented neuroevolution for visual Doom playing. In Proceedings of the 2017 IEEE Conference on Computational Intelligence and Games (CIG), New York, NY, USA, 22–25 August 2017; pp. 1–8.
  4. Bagatella, M.; Olšák, M.; Rolínek, M.; Martius, G. Planning from pixels in environments with combinatorially hard search spaces. Adv. Neural Inf. Process. Syst. 2021, 34, 24707–24718.
  5. Sudhakaran, S.; González-Duque, M.; Freiberger, M.; Glanois, C.; Najarro, E.; Risi, S. MarioGPT: Open-ended text2level generation through large language models. Adv. Neural Inf. Process. Syst. 2024, 36, 1–15.
  6. Andono, P.; Kurniawan, N.; Supriyanto, C. Dota 2 bots win prediction using Naive Bayes based on AdaBoost algorithm. In Proceedings of the 3rd International Conference on Communication and Information Processing, Tokyo, Japan, 24–26 November 2017; pp. 180–184.
  7. Nascimento Silva, V.; Chaimowicz, L. MOBA: A new arena for game AI. arXiv 2017, arXiv:1705.10443.
  8. Guo, M.; Xu, T.; Liu, J.; Liu, Z.; Jiang, P.; Mu, T.; Zhang, S.; Martin, R.; Cheng, M.; Hu, S. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
  9. Tan, J.; Liao, X.; Liu, J.; Cao, Y.; Jiang, H. Channel attention image steganography with generative adversarial networks. IEEE Trans. Netw. Sci. Eng. 2022, 9, 888–903.
  10. Choi, M.; Kim, H.; Han, B.; Xu, N.; Lee, K. Channel attention is all you need for video frame interpolation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10663–10671.
  11. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  12. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 783–792.
  13. Yan, J.; Peng, Z.; Yin, H.; Wang, J.; Wang, X.; Shen, Y.; Stechele, W.; Cremers, D. Trajectory prediction for intelligent vehicles using spatial-attention mechanism. IET Intell. Transp. Syst. 2020, 14, 1855–1863.
  14. Cheng, Z.; Qu, A.; He, X. Contour-aware semantic segmentation network with spatial attention mechanism for medical image. Vis. Comput. 2022, 38, 749–762.
  15. Oktay, O.; Schlemper, J.; Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.; Kainz, B.; et al. Attention U-Net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
  16. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
  17. Du, C.; Zhang, L.; Sun, X.; Wang, J.; Sheng, J. Enhanced multi-channel feature synthesis for hand gesture recognition based on CNN with a channel and spatial attention mechanism. IEEE Access 2020, 8, 144610–144620.
  18. Lu, E.; Hu, X. Image super-resolution via channel attention and spatial attention. Appl. Intell. 2022, 52, 2260–2268.
  19. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  20. Misra, D.; Nalamada, T.; Arasanipalai, A.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3139–3148.
  21. Nikolenko, S. Synthetic Data for Deep Learning; Springer: Cham, Switzerland, 2021.
  22. Melo, C.; Torralba, A.; Guibas, L.; DiCarlo, J.; Chellappa, R.; Hodgins, J. Next-generation deep learning based on simulators and synthetic data. Trends Cogn. Sci. 2022, 26, 174–187.
  23. Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 969–977.
  24. Hinterstoisser, S.; Pauly, O.; Heibel, H.; Martina, M.; Bokeloh, M. An annotation saved is an annotation earned: Using fully synthetic training for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
  25. Peng, X.; Usman, B.; Kaushik, N.; Wang, D.; Hoffman, J.; Saenko, K. VisDA: A synthetic-to-real benchmark for visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2021–2026.
  26. Dundar, A.; Liu, M.; Wang, T.; Zedlewski, J.; Kautz, J. Domain stylization: A strong, simple baseline for synthetic to real image domain adaptation. arXiv 2018, arXiv:1807.09384.
  27. Van Dyk, D.; Meng, X. The art of data augmentation. J. Comput. Graph. Stat. 2001, 10, 1–50.
  28. Smith, R.; Heywood, M. Evolving Dota 2 Shadow Fiend bots using genetic programming with external memory. In Proceedings of the Genetic and Evolutionary Computation Conference, Prague, Czech Republic, 13–17 July 2019; pp. 179–187.
  29. IceFrog; Valve Corporation. Dota 2. Released 9 July 2013. Available online: https://www.dota2.com/home (accessed on 12 July 2024).
  30. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680.
  31. Ye, D.; Chen, G.; Zhao, P.; Qiu, F.; Yuan, B.; Zhang, W.; Chen, S.; Sun, M.; Li, X.; Li, S.; et al. Supervised learning achieves human-level performance in MOBA games: A case study of Honor of Kings. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 908–918.
  32. Ye, D.; Chen, G.; Zhang, W.; Chen, S.; Yuan, B.; Liu, B.; Chen, J.; Liu, Z.; Qiu, F.; Yu, H.; et al. Towards playing full MOBA games with deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 621–632.
  33. Gao, Y.; Shi, B.; Du, X.; Wang, L.; Chen, G.; Lian, Z.; Qiu, F.; Han, G.; Wang, W.; Ye, D.; et al. Learning diverse policies in MOBA games via macro-goals. Adv. Neural Inf. Process. Syst. 2021, 34, 16171–16182.
  34. Zhang, Z.; Li, H.; Zhang, L.; Zheng, T.; Zhang, T.; Hao, X.; Chen, X.; Chen, M.; Xiao, F.; Zhou, W. Hierarchical reinforcement learning for multi-agent MOBA game. arXiv 2019, arXiv:1901.08004.
  35. Wu, B. Hierarchical macro strategy model for MOBA game AI. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1206–1213.
  36. Yang, Z.; Pan, Z.; Wang, Y.; Cai, D.; Shi, S.; Huang, S.; Bi, W.; Liu, X. Interpretable real-time win prediction for Honor of Kings—A popular mobile MOBA esport. IEEE Trans. Games 2022, 14, 589–597.
  37. León, J.; Armendáriz, M.; Castro, F. Esquema de aprendizaje híbrido de agentes colaborativos en videojuegos MOBA. Res. Comput. Sci. 2023, 152, 7–20.
  38. Kelso, K.; Lee, B. Accordion AutoEncoders (A2E) for generative classification with low complexity network. In Proceedings of the 2021 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 15–17 December 2021; pp. 502–505.
Figure 1. Method for dimensional reduction of the MOBA video game information using an A2E.
Figure 2. An example of a map segment with a lane push situation. In this segment, the red attacker avatar (A) at (9,8) and the red support (S) at (7,4) are in an excellent position to attack the blue tower (T) at (5,7). Along with them, the creeps (C) from both teams are ready to attack the opposing team. There are other elements, like walls (W) and Medicine (M), that can also contribute to the flow of the situation by healing the avatars (M) or limiting their escape routes (W).
Figure 3. Two examples of randomly annotated frames for the same game situation.
Figure 4. An example of frames extraction from a map segment of a lane push situation.
Figure 5. Adapted Convolutional Accordion Auto-Encoder (Conv_A2E) architecture.
Figure 6. Proposed Convolutional Accordion Auto-Encoder (Conv_A2E) architecture empowered with attention mechanisms.
Figure 7. Comparison of a frame reconstructed by each trained Conv_A2E. Note that the light squares, like the yellow and green ones, denote game objects, and the dark squares, like the purple ones, are void spaces.
Table 1. Specifications of the synthetically generated dataset.

Number of Generated Map Segments | Number of Map Segments after Random Annotation | Number of Extracted Frames | Frames per File | Files Generated
50,000 | 250,000 | 1,600,000 | 8000 | 200
Table 2. Experiments changing the frame embedding dimensions.

ID | Frame Embedding Size | Training Data | Testing Data | MSE (Training) | MSE (Testing)
Conv_A2E_V1 | 2 × 2 × 24 | 400,000 | 32,000 | 0.024535 | 0.021604
Conv_A2E_V2 | 1 × 1 × 24 | 400,000 | 32,000 | 0.025934 | 0.022694
Conv_A2E_V3 | 1 × 1 × 48 | 400,000 | 32,000 | **0.012779** | **0.010314**

Bold font indicates the best values.
Table 3. Experiments changing the amount of data during training.

ID | Frame Embedding Size | Training Data | Testing Data | MSE (Training) | MSE (Testing)
Conv_A2E_V3_3 | 1 × 1 × 48 | 200,000 | 32,000 | 0.019547 | 0.022702
Conv_A2E_V3 | 1 × 1 × 48 | 400,000 | 32,000 | **0.012779** | **0.010314**
Conv_A2E_V3_2 | 1 × 1 × 48 | 800,000 | 32,000 | 0.024075 | 0.022692

Bold font indicates the best values.
Table 4. Training results for every proposed Conv_A2E.

Family | Architecture (Conv_A2E) | MSE (Training) | MSE (Testing)
(baseline) | Conv_A2E_V3 | 0.012779 | 0.010314
Channel Attention | +ECANet | 0.008232 | 0.006071
Channel Attention | +FcaNet | 0.007579 | 0.005735
Spatial Attention | +Gated Attention Unit | 0.010963 | 0.007862
Spatial Attention | +Deformable Conv | **0.006585** | **0.003893**
Channel & Spatial Attention | +CBAM | 0.010563 | 0.007733
Channel & Spatial Attention | +Triplet Attention | 0.007534 | 0.005379

Bold font indicates the best values.