**5. Discussion**

During the experimentation phase, the microRTS game engine environment performed as expected (i.e., without visible or known bugs). Our presumption at the start of the experiment was that all of the Game Features were valid, yet the experiments showed that two of them (GF3 and GF8) were actually invalid. A closer inspection of the GF3 results, specifically its invalidation number, revealed that not all of the playtesting agents caught the invalid game feature, and that some of them invalidated it in only a fraction of tries. Additionally, had the number of scenario repeats been set lower than fifty, it is possible that only the better-performing playtesting agents would have succeeded in finding GF3 to be invalid.

GF3 was invalidated by eight playtesting agents, while GF8 was invalidated by all of them, with the only exception being the basic RandomAI. The difference in the number of playtesting agents that successfully invalidated GF3 and GF8 shows that some game features are more sophisticated and require agents that intelligently explore and exploit the search space in question.

We discovered two important guidelines for validation testing, as the sketch below illustrates: (1) repeat each scenario a sufficient number of times (fifty in our experiments), because some agents invalidate a game feature in only a fraction of tries; and (2) employ a diverse set of playtesting agents, because more sophisticated game features are caught only by agents that intelligently explore and exploit the search space in question.
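
For concreteness, the minimal sketch below illustrates the first guideline. All names are hypothetical placeholders (this is not the actual microRTS or experiment code); the point is that an agent which invalidates a feature in only a small fraction of tries is caught reliably only with enough scenario repeats.

```java
/**
 * Sketch of the repeat guideline (hypothetical names throughout): a game
 * feature counts as valid only if none of the REPEATS runs invalidates it,
 * and the per-agent invalidation count is recorded.
 */
public class RepeatedValidation {

    static final int REPEATS = 50; // fifty repeats, as in our experiments

    /** One playtesting run of a scenario; returns true if the feature held. */
    interface PlaytestingAgent {
        boolean runScenario(String gameFeatureId);
    }

    /** Counts how many of the REPEATS runs invalidated the feature. */
    static int countInvalidations(PlaytestingAgent agent, String gameFeatureId) {
        int invalidations = 0;
        for (int i = 0; i < REPEATS; i++) {
            if (!agent.runScenario(gameFeatureId)) {
                invalidations++;
            }
        }
        return invalidations;
    }

    public static void main(String[] args) {
        // A stand-in agent that, like some agents on GF3, invalidates the
        // feature in only a fraction of tries (~10% here).
        PlaytestingAgent flaky = id -> Math.random() >= 0.1;
        System.out.println("GF3 invalidations out of " + REPEATS + ": "
                + countInvalidations(flaky, "GF3"));
    }
}
```

With few repeats, such an agent frequently reports zero invalidations, which is exactly the failure mode the guideline guards against.
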
Our purpose was not to judge the existing gameplaying agents created by the research community based on the scores they achieved. We did, however, use the invalidation numbers they attained to calculate the metric score for metric testing purposes. The results were encouraging. The state-of-the-art evolutionary and tree-based agents were good performers, not just for gameplaying, but also for playtesting. The line between basic agents (e.g., G1GF3(50, 0)) and advanced ones (e.g., G1GF3(21, 29)) can also be clearly seen. We did not measure the average time for an agent to complete a scenario, but during playtesting, we noticed that agents that were either basic (e.g., RandomAI) or very good performers (e.g., NaiveMCTS) completed the validations fastest. We believe this resulted from decisions, whether bad or good, being made quickly.

At this point, we can also provide answers to the research questions presented in the Introduction.

RQ1: The adaptation of a gameplaying agent as a playtesting agent is straightforward, provided that the game engine follows good software design practices (components, interfaces, etc.). In our estimation, this is very important, because it allows research discoveries in the gameplaying domain to be transferred to the playtesting domain, and probably also leads to higher adoption rates of such discoveries in commercial use. A sketch of such an adaptation is given below.

RQ2: In comparing different playtesting agents, our metric relies on the groups presented in Table 2. The groups belong to different classes (Table 3), each with its own weight. Additional comparative information can be obtained by calibrating these weights. For that purpose, Table A1 in Appendix A shows how the metric score changes in relation to changes of the weights. In this way, we can give importance to a specific set of groups and achieve a greater differentiation between the playtesting agents covering them (see the second sketch below).

RQ3: Playtesting agents are evaluated through game feature definitions using the created metric. The most beneficial Game Feature definitions are the ones that belong to the groups in the high-importance class shown in Table 3.

RQ4: Evolutionary and non-evolutionary approaches in the state-of-the-art segment both performed well, and their playtesting abilities are high. No major differences were detected for the game features and scenarios tested.

RQ5: The validity of a game feature was defined, with the validation condition placed inside the component used for the adaptation of the gameplaying agents.
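
The following minimal sketch illustrates the adaptation described for RQ1. All types here are hypothetical placeholders, not the actual microRTS interfaces: the point is that the gameplaying agent is reused unchanged, and only a validation check is wrapped around the game loop.

```java
/** Hypothetical engine types standing in for the real game engine interfaces. */
interface Action {}

interface GameState {
    boolean isTerminal();
    GameState apply(Action action);
}

/** The existing gameplaying interface, reused without modification. */
interface GameplayingAgent {
    Action selectAction(GameState state);
}

/** The validation condition that defines a game feature. */
interface GameFeature {
    boolean holds(GameState state);
}

/** Wraps any gameplaying agent so that it doubles as a playtesting agent. */
class PlaytestingAdapter {
    private final GameplayingAgent agent;

    PlaytestingAdapter(GameplayingAgent agent) {
        this.agent = agent;
    }

    /** Plays a scenario to completion; returns false if the feature is invalidated. */
    boolean validate(GameState state, GameFeature feature) {
        while (!state.isTerminal()) {
            state = state.apply(agent.selectAction(state));
            if (!feature.holds(state)) {
                return false; // the agent reached a state that invalidates the feature
            }
        }
        return true;
    }
}
```

Because the adapter depends only on the engine's interfaces, any gameplaying agent published by the research community can be plugged in without changes to its decision-making code.
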
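The second sketch concerns the weight calibration discussed for RQ2. It is illustrative only: for simplicity we assume a weighted-sum form in which each group contributes its per-group score scaled by the weight of the class it belongs to; the exact metric is the one defined earlier in the paper, and all names and numbers below are hypothetical.

```java
import java.util.Map;

/**
 * Illustrative sketch (assumed weighted-sum form, hypothetical values):
 * shows how recalibrating class weights (cf. Table A1) changes the metric
 * score and thus the differentiation between playtesting agents.
 */
public class MetricSketch {

    /** score = sum over groups of classWeight(class(group)) * groupScore(group) */
    static double score(Map<String, Double> groupScores,
                        Map<String, String> groupToClass,
                        Map<String, Double> classWeights) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : groupScores.entrySet()) {
            total += classWeights.get(groupToClass.get(e.getKey())) * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("G1", 0.8, "G2", 0.4);     // per-group scores
        Map<String, String> classes = Map.of("G1", "high", "G2", "low");
        // Raising the weight of the high-importance class increases the
        // differentiation between agents that cover high-importance groups.
        System.out.println(score(scores, classes, Map.of("high", 3.0, "low", 1.0)));
        System.out.println(score(scores, classes, Map.of("high", 5.0, "low", 1.0)));
    }
}
```
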
