Automated Game Testing

Combining classic game AI (Artificial Intelligence) and modern machine learning for creation and execution of functional tests.

Video games stand out as a software domain because the driving forces in their creation are artistic vision and interaction. These qualities of game projects are evident even in the rhetoric that game designers use, such that the jargon expression “finding the fun” is deemed a crucial first step. Once the core mechanics of a game are identified and development begins, the role of quality assurance (QA) professionals is to ensure that those mechanics function as intended. Although game mechanics may be altered, added, or removed during development, a large amount of a QA professional’s time is spent verifying that the existing mechanics work as expected after each developmental change. As the game grows in both size and complexity, so does the number of tests, as trivial as they may be, consuming much of the QA professional’s time.

This is a shared issue in the videogame industry, as exemplified by the Automated Testing Roundtable held at GDC since 2018 [1], as well as a growing number of communications on the subject [2][3][4][5]. To tackle this growing issue, the Automated Game Testing (AGT) team introduced a new testing methodology that decouples the validation of game mechanics from the reproduction steps of a functional test by using a simple data structure called the Explicit Plan.


Software Testing

Software tests typically follow three phases: 

  1. Initialize – Set the system to some prior, known state 
  2. Execute – Run the function(s) we want to test 
  3. Validate – Measure whether the new state corresponds to the expectation

Figure 1 – Left: Simple example pseudo-code for a functional test; Right: Another simulated example in the context of a video game.

Let us consider an example functional test for a video game, say the classic Super Mario Bros. We could test that pressing Start from the main menu causes World 1-1 to load. Here, the Execute phase of the test is simple, as it only consists of one step. This simplistic test is illustrated as a function in Figure 1. 
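As a rough illustration of the three phases, the Start-button test could be sketched as follows. The Game class is a hypothetical stand-in written for this example, not an actual engine or emulator API.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """Hypothetical stand-in for the game under test."""
    screen: str = "boot"

    def load(self, screen: str) -> None:
        self.screen = screen

    def press(self, button: str) -> None:
        # Toy logic: pressing Start on the main menu loads the first level.
        if self.screen == "main_menu" and button == "start":
            self.screen = "world_1_1"


def test_start_loads_first_level() -> bool:
    game = Game()
    game.load("main_menu")              # 1. Initialize
    game.press("start")                 # 2. Execute
    return game.screen == "world_1_1"   # 3. Validate


print(test_start_loads_first_level())
```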

To make things more complex, we could test that we can jump on the first Goomba in World 1-1. The three phases of testing (initialize, execute, and validate) could be:

  1. Initialize the Game State to World 1-1. 
  2. Execute a list of Actions (e.g., Right, Right, Jump, …) to move Mario around and jump (hopefully) onto the Goomba.  
  3. Validate that Mario did jump onto the Goomba. 

Figure 2 – Pseudo-code of a hypothetical Jump onto Goomba test, with corresponding visualisation of the three testing phases (Super Mario Bros™ 1999 Nintendo of America Inc.)

This test is still simple enough that we can imagine automating it with a script. Now, what if we wanted to test whether Mario could traverse the entire game and defeat Bowser? We could still initialize the game to the starting level, but the list of all inputs would be huge. It would be entirely impractical to script such a test.

It is also worth noting that the naive method of replaying lists of predetermined inputs is only feasible if the game is completely deterministic. Indeed, if the game features even a little bit of randomness, either intentional (e.g., random enemy spawning) or not (e.g., multiprocessing or network communication), then the list of inputs may not be robust to this variability and the test may fail from compounding errors.  
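The fragility of replayed inputs can be shown with a toy one-dimensional level. Everything here (the tile coordinates, the two-tile jump, the spawn positions) is an assumption made for illustration only.

```python
def replay(inputs, goomba_x):
    """Replay fixed inputs in a toy 1-D level.

    Returns True if a jump lands Mario exactly on the Goomba's tile.
    """
    x = 0
    for step in inputs:
        if step == "right":
            x += 1
        elif step == "jump":
            x += 2  # toy physics: a jump carries Mario two tiles forward
            if x == goomba_x:
                return True
    return False


# Inputs recorded once, when the Goomba happened to stand on tile 5.
recorded = ["right", "right", "right", "jump"]

# The same inputs replayed against three possible spawn positions.
results = {spawn: replay(recorded, spawn) for spawn in (4, 5, 6)}
print(results)
```

Only the run that matches the recording conditions passes; a one-tile shift in the spawn position is enough to make the same inputs miss.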

Testing a game under development brings additional difficulties as the game is in a constant state of change. The high cost of maintaining precise reproduction steps for every test simply does not scale. 

Figure 3 – Pseudo-code of a hypothetical complex Beat the Game test and visualisation of the three testing phases.

The usual solution for tests that are too complex to automate is to get a person, or playtester, to run them (Fig. 4). Humans can adapt to different input settings, have an intuition of how to play games, can write down what they did to reproduce bugs, and potentially even offer solutions to these bugs.

Figure 4 – The 3 testing phases applied to tests carried out by people.


The Explicit Plan

We pointed out that the two main issues with using a list of inputs to automate tests are that 1) even for conceptually simple tests, the list grows quickly and becomes unmanageable, and 2) fixed input steps cannot account for randomness and non-determinism. To mitigate these difficulties, we have devised the explicit plan as a more powerful and flexible structure.

Scaling Reproduction Steps 

Consider a generic game where the player controls a character that can navigate an environment. Functional tests in such an environment often involve traversing map segments and performing actions along the way. We discovered that most controller inputs during functional tests (that is, tests that validate the functionality of the game) pertain to navigation between actors or locations of interest. The remaining inputs are usually actions performed on the actors, or at the locations, once they have been reached. Let us formalize the use of the word action to refer to such gameplay actions other than navigation. Furthermore, let us define a node as a target destination (another actor or a world location) together with the action, if any, performed there. Assuming that navigation can be handled automatically, instead of storing a full list of inputs to reproduce the test, we can store only the target locations and actions in the form of nodes. Figure 5 shows how a gameplay sequence is reduced to such a linked list.

Figure 5 – Modeling a sequence of inputs as actor-action pairs called nodes.
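A node can be sketched as a small record holding a target and an optional action. The field names, the actor names, and the input counts below are assumptions chosen for illustration; the article does not specify the actual data layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

Location = Tuple[float, float, float]


@dataclass(frozen=True)
class Node:
    target: Union[str, Location]   # an actor name or a world location
    action: Optional[str] = None   # gameplay action performed on arrival


# A made-up raw input recording: mostly navigation, with three actions.
raw_inputs = (
    ["right"] * 40 + ["interact"]
    + ["right"] * 25 + ["jump"]
    + ["right"] * 30 + ["interact"]
)

# The same sequence expressed as nodes; navigation is left to the engine.
plan_nodes = [
    Node("switch", "interact"),
    Node((12.0, 0.0, 3.0), "jump"),
    Node("shrine", "interact"),
]

print(len(raw_inputs), "inputs vs", len(plan_nodes), "nodes")
```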

To automate the navigation, we can take advantage of several algorithms already present in modern game engines, including simply teleporting the player or using a navigation mesh. Additionally, since navigating to an actor reference is a common task for in-game AI, the navigation to the next node’s actor is robust even if an actor moves (either by its own AI or from a developer’s modification), allowing us to handle much of the non-determinism caused during a test. 

Although these navigation strategies have their own limitations (which we discuss in more detail below), they are readily available solutions and are sufficient in many situations. 

Handling Unforeseen Events

Other sources of variability may cause a test to unfold differently from one run to the next. Complex videogames feature dynamic behavior, and the player may have to interrupt their pre-set explicit plan to address a sudden threat. 

To account for this, we allow a QA professional to specify conditional behavior in the form of triplets containing:

  • Preconditions
  • Priority
  • Action

We call such triplets rules. When a rule has all its preconditions fulfilled, the ascribed AI behavior is performed. If several rules apply at once, the one with the highest priority takes precedence. An example of a rule could be to perform the attack action whenever an enemy actor is within a certain distance of the player. This would mimic the expected behavior of a human playtester who would be instructed to “Attack any enemy encountered”.
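The rule triplet and the priority-based selection can be sketched as below. The state keys (enemy_distance, pickup_distance), the thresholds, and the collect rule are hypothetical values used only to exercise the selection logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

GameState = Dict[str, float]


@dataclass
class Rule:
    preconditions: List[Callable[[GameState], bool]]
    priority: int
    action: str

    def applies(self, state: GameState) -> bool:
        # A rule applies only when all of its preconditions hold.
        return all(check(state) for check in self.preconditions)


def select_rule(rules: List[Rule], state: GameState) -> Optional[Rule]:
    """Return the applicable rule with the highest priority, if any."""
    applicable = [rule for rule in rules if rule.applies(state)]
    return max(applicable, key=lambda rule: rule.priority, default=None)


rules = [
    Rule([lambda s: s["enemy_distance"] < 5.0], priority=10, action="attack"),
    Rule([lambda s: s["pickup_distance"] < 3.0], priority=1, action="collect"),
]

both_close = {"enemy_distance": 2.0, "pickup_distance": 1.0}
nothing_close = {"enemy_distance": 50.0, "pickup_distance": 50.0}
print(select_rule(rules, both_close).action)
print(select_rule(rules, nothing_close))
```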

We call the explicit plan (EP) of a test the combination of a list of nodes and a set of rules that specify the expected behavior (Figure 6). To follow an explicit plan, an agent must travel to each node in sequential order and, upon arrival at the actor or location specified in the node, perform the associated action if there is one. If at any moment one or more rules have their preconditions fulfilled, the node sequence is temporarily interrupted and the behavior of the rule with the highest priority is performed. When no rule applies any longer, the sequence of nodes is resumed where it was left off. Leveraging the aforementioned navigation strategies, it is possible to create a simple AI agent able to follow such a plan. We refer to such entities as explicit plan agents (EPA).

Figure 6 – A sequence of nodes and a set of rules fully specify an explicit plan.
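A minimal sketch of an agent loop that follows such a plan is given below, assuming navigation is reduced to setting the agent's position (roughly the teleport strategy). The tuple layouts, the toy world in on_action, and the log strings are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Tuple[str, Optional[str]]                  # (target, action)
Rule = Tuple[Callable[[Dict], bool], int, str]    # (precondition, priority, action)


def follow_plan(nodes: List[Node], rules: List[Rule], state: Dict,
                on_action: Callable[[str, Dict], None]) -> List[str]:
    """Visit nodes in order; applicable rules interrupt the sequence."""
    log: List[str] = []
    index = 0
    while index < len(nodes):
        applicable = [rule for rule in rules if rule[0](state)]
        if applicable:
            _, _, action = max(applicable, key=lambda rule: rule[1])
            log.append(f"rule:{action}")
            # The action is expected to eventually falsify the precondition;
            # otherwise this loop would never return to the nodes.
            on_action(action, state)
            continue
        target, action = nodes[index]
        log.append(f"goto:{target}")   # navigation delegated to the engine
        state["position"] = target
        if action is not None:
            log.append(f"node:{action}")
            on_action(action, state)
        index += 1                     # resume with the next node
    return log


def on_action(action: str, state: Dict) -> None:
    # Toy world: toggling the switch reveals one enemy; attacking removes it.
    if action == "interact" and state["position"] == "switch":
        state["enemies_nearby"] = 1
    elif action == "attack":
        state["enemies_nearby"] -= 1


plan = [("switch", "interact"), ("shrine", "interact")]
attack_rule = (lambda s: s["enemies_nearby"] > 0, 10, "attack")
log = follow_plan(plan, [attack_rule], {"enemies_nearby": 0}, on_action)
print(log)
```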


Action Validation

An explicit plan corresponds to the execute phase of a functional test. We now need to account for the two other phases (initialize and validate) in order to conduct full tests. 

With the EPA, we always know when a gameplay action is about to be performed. We can therefore measure the state of the game with a callback at that moment, thus implementing the initialize phase of functional tests. For the final validation phase, we can create another callback and run it either after a delay or after an in-game event is triggered.  

Let’s look at an example putting the three phases together into a complete test. Consider an EPA executing our previous rule example “Attack any enemy encountered”. Imagine the agent does come close to an enemy actor, and interrupts the sequence of nodes to navigate towards it. When within a certain distance of the enemy, it would broadcast a message saying that it is about to trigger the attack action. At this point, before the action is performed, a custom pre-action callback is executed to collect relevant data about the current game state (e.g., health of the enemy, kill count of the player). Suppose that the player’s attack damage is sufficient to kill the enemy and that when an enemy is killed, a corresponding event is broadcast. We can then set up a post-action validation function that is bound to the event triggered by the enemy’s possible destruction. The post-action validation function will measure the new values of the variables observed by the pre-action callback and verify that an expected change happened. Figure 7 illustrates the required callback flows.

With variables collected and the validation function set to run, we can now let the EPA perform the attack action. If our validation function is called, it will check whether the new state of the game meets our expectations. For this example, we might want to check that the player’s number of kills has increased by 1. If that is not the case, or if the post-action validation function was not called after some amount of time, the test fails.

Figure 7 – Callbacks before and after executing an in-game Action make it possible to measure the effect on the game state.
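The callback flow can be sketched with a small helper that snapshots the state before the action, checks it when the event fires, and fails on timeout. The class name, the tick-based timeout, and the kills counter are assumptions; a real engine would bind these to its own event system.

```python
from typing import Callable, Dict, Optional

State = Dict[str, int]


class ActionValidation:
    """Pre-action snapshot plus an event-bound post-action check."""

    def __init__(self, snapshot: Callable[[State], int],
                 check: Callable[[int, int], bool],
                 timeout_ticks: int) -> None:
        self.snapshot = snapshot
        self.check = check
        self.ticks_left = timeout_ticks
        self.before: Optional[int] = None
        self.passed: Optional[bool] = None   # None while unresolved

    def pre_action(self, state: State) -> None:
        self.before = self.snapshot(state)

    def on_event(self, state: State) -> None:
        # Called when the bound in-game event (e.g. enemy killed) fires.
        if self.passed is None:
            self.passed = self.check(self.before, self.snapshot(state))

    def tick(self) -> None:
        # Called once per frame; an unresolved validation fails on timeout.
        self.ticks_left -= 1
        if self.passed is None and self.ticks_left <= 0:
            self.passed = False


def kill_count_validation() -> ActionValidation:
    return ActionValidation(
        snapshot=lambda s: s["kills"],
        check=lambda before, after: after == before + 1,
        timeout_ticks=3,
    )


# Case 1: the attack kills the enemy and the event fires.
state = {"kills": 0}
ok = kill_count_validation()
ok.pre_action(state)
state["kills"] += 1
ok.on_event(state)

# Case 2: the event never fires, so the validation times out.
stuck = kill_count_validation()
stuck.pre_action(state)
for _ in range(3):
    stuck.tick()

print(ok.passed, stuck.passed)
```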

Expanding this idea to other actions, we can define all pre-action and post-action validation functions in one place. That way, it is not a problem if some actions are carried out by the EPA in an unforeseen order (due to rules being triggered). In doing so, we decouple the validation of game mechanics from the execution order of the actions necessary to enact those mechanics. One of the biggest benefits of this decoupling is that we can exploit the simplicity of explicit plans to rapidly create functional tests. As a bonus, we can use the same system for collecting performance metrics about the game while following certain plans.

Figure 8 – The decoupled architecture of AGT allows wrapping the testing code around game logic segments, facilitating the creation and combination of tests.
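One way to picture this decoupling is a registry that maps each action name to its pre-action snapshot and post-action check, independent of when the action runs. This sketch is synchronous and omits events and timeouts; the registry layout and the toy perform function are assumptions.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, int]
Validator = Tuple[Callable[[State], int], Callable[[int, int], bool]]

# One place for all validations, keyed by action name.
VALIDATORS: Dict[str, Validator] = {
    "attack": (lambda s: s["kills"],
               lambda before, after: after == before + 1),
    "interact": (lambda s: s["switch_on"],
                 lambda before, after: after != before),
}


def perform(action: str, state: State) -> None:
    # Toy game logic standing in for the real mechanics.
    if action == "attack":
        state["kills"] += 1
    elif action == "interact":
        state["switch_on"] = 1 - state["switch_on"]


def run_validated(action: str, state: State) -> bool:
    snapshot, check = VALIDATORS[action]
    before = snapshot(state)               # pre-action measurement
    perform(action, state)
    return check(before, snapshot(state))  # post-action validation


state = {"kills": 0, "switch_on": 0}
# The order of actions does not matter to the validations.
outcomes = [run_validated(a, state) for a in ("interact", "attack", "attack", "interact")]
print(outcomes)
```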

Action validation is the smallest unit of testing in AGT. A test composed of many nodes and rules will fail if any action validation yields unexpected results or times out. Video 1 showcases how a test trying to pass a level fails on a single action block, where the interact action performed on the actor switch should have toggled the state of the switch but did not because of a change in placement.


Video 1 – Moving the button that opens a door too high for the agent to reach causes the test to fail.

Explicit Plan Creation

To use the AGT framework, QA professionals must write the steps of their test as explicit plans. This is analogous to writing the reproduction steps a human must follow when playtesting. We have experimented with three methods for plan creation:

  • Manual editing: Setting the data using editor tools.
  • Play-Session Recording: Recording yourself play the game.
  • Text to Plan: Parsing a text description of the steps.

All three approaches share data types and are interoperable. In this section, we demonstrate how any of these methods can be used to complete a simple dungeon level (Figure 9). In particular, we expand on how we leveraged natural language understanding to generate explicit plans.

Figure 9 – A simplistic dungeon level with a locked door and an opening switch, a lava pit, and an activatable end-of-level shrine.

The Manual Method

Plans are designed as a serializable type, making it easy to edit them using game engine editors. Plan assets are created with one node by default, with the possibility of adding as many as needed. As discussed in the previous section, each node specifies a location or actor to navigate to, and optionally an action to be performed at the target. Every node also features a variety of utility parameters, such as tolerance radius, movement mode, waiting time, etc.

In our example game, the available actions are interact, attack, and jump. To complete the simple dungeon, three nodes are needed: the player character must first toggle the switch, then jump over the lava pit, and finally activate the shrine.

Video 2 – Manual creation of an explicit plan within the Unreal Editor.
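To give a feel for what a serializable plan might contain, here is a sketch of the three-node dungeon plan round-tripped through JSON. The field names, default values, and the lava_pit_edge target are assumptions; the actual Unreal asset format is not described in the article.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PlanNode:
    target: str                      # actor or named location to navigate to
    action: Optional[str] = None     # interact, attack, jump, or nothing
    tolerance_radius: float = 1.0    # how close counts as "arrived"
    movement_mode: str = "run"
    wait_time: float = 0.0           # seconds to wait after the action


dungeon_plan = [
    PlanNode("switch", "interact"),
    PlanNode("lava_pit_edge", "jump", tolerance_radius=0.5),
    PlanNode("shrine", "interact", wait_time=1.0),
]

# Serialize to text and load it back, as an editor asset would be.
serialized = json.dumps([asdict(node) for node in dungeon_plan], indent=2)
restored = [PlanNode(**fields) for fields in json.loads(serialized)]
print(restored == dungeon_plan)
```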

This method is the most basic and assures the most flexibility. It is best suited for fine adjustments and precise edits. It is the default go-to for quick experiments, or for modifying existing plans, regardless of their creation method.

Play-Session Recording  

More involved or lengthy tests may require the creation of a lot of nodes. Consider a test that verifies if a long quest can be completed successfully. It is impractical to think about it in terms of how many nodes are required, and what their contents should be. To address similar cases, we let the QA professional control the character and play the game while recording the locations periodically, as well as every action performed. The obtained explicit plan is a good approximation of the playthrough, but it is sensitive to small changes in the game when it is replayed. Nevertheless, play-session recording is particularly useful to quickly lay out many nodes and build a rough skeleton of the plan. Manual editing can then be used to make it more robust by adding or removing nodes, specifying actors for certain nodes, or providing rules.

Video 3 – Creation of an explicit plan using a recorded play session within the Unreal Editor.
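The recording idea (sample the position periodically, and always keep actions) can be sketched as follows. The frame format and the sampling interval are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Position = Tuple[int, int]
Frame = Tuple[Position, Optional[str]]   # (position, action performed this tick)


def record_plan(frames: List[Frame], sample_every: int) -> List[Frame]:
    """Turn a play session into nodes.

    Every performed action becomes a node; in between, the position is
    sampled once every `sample_every` ticks as a navigation-only node.
    """
    nodes: List[Frame] = []
    for tick, (position, action) in enumerate(frames):
        if action is not None:
            nodes.append((position, action))
        elif tick % sample_every == 0:
            nodes.append((position, None))
    return nodes


# Ten ticks of walking right, jumping at tick 4 and interacting at tick 9.
session = [((x, 0), None) for x in range(10)]
session[4] = ((4, 0), "jump")
session[9] = ((9, 0), "interact")

recorded_nodes = record_plan(session, sample_every=3)
print(recorded_nodes)
```

Because the navigation-only nodes are raw coordinates, a plan recorded this way inherits the sensitivity to map edits discussed in the text.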

When using the recording method, the positions of many nodes are world coordinates. This makes the plan sensitive to terrain changes or map edits in general, because the recorded plan samples data from the play session but has no higher-level notion of what the plan is trying to accomplish. The following approach is an attempt at capturing the higher-level idea of plans.

Text to Plan  

Tasks of QA professionals include determining which tests should be conducted, and producing instructions on how to perform them. These instructions may be written as a list of steps. For example, the test we recorded in the previous section may be described in three steps:

  1. Activate the switch.
  2. Jump over the ledge.
  3. Interact with the shrine.

This is enough for a human tester to understand and carry out the test. This format has the advantage of being intuitive, compact, and expressive. What if the text instructions could also be used for the AI agent? 

Leveraging natural language understanding (NLU; spaCy was used for the NLU results presented in this article [6]), the Text to Plan approach can be used to turn text written in natural language into explicit plan nodes. The Text to Plan pipeline can be broken down like so:

  1. Using sentence-level classification, determine whether a sentence represents a node or a rule.
  2. Using token-level classification, extract an action keyword and an actor type keyword.
  3. Match the keywords to valid values using similarity analysis [7].
  4. Find an in-game instance of the matched actor type and create a node with that instance as the target, and the matched action. If no instance is matched, the action will simply be performed at a fallback location.
  5. Repeat for every line of text instruction and concatenate all yielded nodes into a list.

A set of rules is built similarly with sentences classified as rules at the first step of the pipeline. The list of nodes and the set of rules are combined to form the explicit plan. 
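The shape of the pipeline can be sketched with deliberately simple stand-ins for each step. This is not the spaCy-based implementation: the keyword test for rules, the first-word and last-word extraction, the synonym table, and the string similarity from Python's difflib are all substitutes chosen so that the example runs with the standard library alone.

```python
import difflib
from typing import List, Optional, Tuple

ACTIONS = ["interact", "attack", "jump"]
ACTOR_TYPES = ["switch", "shrine", "ledge", "enemy"]
SYNONYMS = {"activate": "interact"}   # hypothetical; stands in for semantic similarity


def closest(word: str, vocabulary: List[str]) -> Optional[str]:
    """Step 3 stand-in: match a keyword to a valid value by string similarity."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


def parse_step(sentence: str) -> Tuple[str, Optional[str], Optional[str]]:
    words = sentence.strip(" .").split()
    # Step 1 stand-in: a sentence quantified with "any" is treated as a rule.
    kind = "rule" if "any" in (w.lower() for w in words) else "node"
    # Step 2 stand-in: first word is the action keyword, last word the actor type.
    verb = words[0].lower()
    action = closest(SYNONYMS.get(verb, verb), ACTIONS)
    actor = closest(words[-1], ACTOR_TYPES)
    return kind, action, actor


instructions = [
    "Activate the switch.",
    "Jump over the ledge.",
    "Interact with the shrine.",
    "Attack any enemies.",
]
parsed = [parse_step(line) for line in instructions]
nodes = [(action, actor) for kind, action, actor in parsed if kind == "node"]
rules = [(action, actor) for kind, action, actor in parsed if kind == "rule"]
print(nodes)
print(rules)
```

Step 4, resolving the actor type to an in-game instance, depends on the engine and is left out here.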

This approach uses a combination of machine learning techniques to construct an explicit plan from a text description. We have found that reproduction steps formulated for humans by QA professionals tend to have a simple grammatical structure which is easy for statistical models to handle.

Video 4 – Creation of an explicit plan using the Text-to-Plan widget within the Unreal Editor.

Using Text to Plan lets the QA professional build plans at a high level of abstraction, encoding the steps in a text form that can be easily understood by people. In contrast to play-session recording, the explicit plans produced here contain very little indication of how to navigate between nodes. They are in that way robust to changes of the terrain but are in turn more reliant on the quality of the navigation strategy.


Reinforcement Learning Integration

Although the EPA can take advantage of existing game AI to follow a plan, corner cases still require some amount of engineering to handle all cases of interest. Being able to give back more of this engineering time to actual game development is always valuable, especially if the solutions developed can be re-used in subsequent productions. This is where reinforcement learning (RL) may be able to help.  

RL is a branch of machine learning that focuses on training an agent to choose actions while interacting with an environment over multiple steps. First, two types of information are obtained from the environment: the current state of the environment, and the score (called reward in the context of RL) the agent has achieved. The state and the reward are given to the agent, which chooses what action to perform next according to this data. Using machine learning methods, the agent is trained to pick the actions that maximize the future expected sum of rewards. Conveniently, much of the research in reinforcement learning has focused on domains where the environment is a video game.

Figure 10 – The reinforcement learning process can be modeled as an iterative loop of interaction between an agent performing an action and an environment that continuously outputs a new state and reward.
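The interaction loop itself is small enough to sketch with a toy environment. The Corridor class, its reward values, and the fixed always_right policy are assumptions for illustration; no learning takes place in this example.

```python
from typing import Callable, Tuple


class Corridor:
    """Toy environment: reach the cell `goal` on a 1-D line."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position

    def step(self, action: int) -> Tuple[int, int, bool]:
        # action is -1 (left) or +1 (right); the wall at 0 blocks movement.
        self.position = max(0, self.position + action)
        done = self.position == self.goal
        reward = 10 if done else -1   # small cost per step, bonus at the goal
        return self.position, reward, done


def run_episode(env: Corridor, policy: Callable[[int], int],
                max_steps: int = 20) -> int:
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                   # agent picks an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state: int) -> int:
    return 1


episode_return = run_episode(Corridor(goal=3), always_right)
print(episode_return)
```

Training would replace the fixed policy with one that is adjusted to maximize the expected sum of rewards.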

However, RL agents often do not exhibit the desired behaviour and are known to generalize poorly to unseen tasks. Often this is because the reward given to the agent does not reinforce the behaviour that the designer intended. For example, OpenAI published a video of a trained RL agent in the boat racing game CoastRunners that was told to maximize its score. Rather than finishing the race, the agent found that it could rack up points by going in circles collecting power-ups. Situations like these make RL agents difficult to debug, and the issues become more obscure as the number of actions increases and the environment becomes more complex.

By keeping the agent models simple, they can be trained faster, they are easier to debug, and they may be more robust to changes in the game if they have fewer dependencies, such as game visual inputs. Video 5 shows how an RL agent that has been trained to navigate towards a target location can be used to follow an explicit plan. The plan also has two rules: “Collect any collectables” and “Attack any enemies”. This agent observes its environment by using a grid sensor as well as the direction towards its current target. The target is specified by either the next node or the current rule of the explicit plan. 

This RL agent was not trained to open doors, attack enemies, or interact with shrines. However, by combining it with an explicit plan and, as in the normal EPA, running the designated actions for the nodes and rules when the agent arrives at the target, the agent can complete the plan and validate those actions. 


Video 5 – An RL agent trained to go to a target location following an explicit plan.
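The target-related part of the observation could be assembled roughly as below. The two-dimensional coordinates and function names are assumptions, and the grid sensor is omitted entirely.

```python
import math
from typing import Optional, Tuple

Vec2 = Tuple[float, float]


def current_target(next_node: Vec2, active_rule: Optional[Vec2]) -> Vec2:
    """An active rule's target overrides the next node of the plan."""
    return active_rule if active_rule is not None else next_node


def direction_to(agent: Vec2, target: Vec2) -> Vec2:
    """Unit vector from the agent towards the target."""
    dx, dy = target[0] - agent[0], target[1] - agent[1]
    distance = math.hypot(dx, dy)
    if distance == 0.0:
        return (0.0, 0.0)   # already at the target
    return (dx / distance, dy / distance)


agent_position = (0.0, 0.0)
shrine = (3.0, 4.0)    # next node of the plan
enemy = (0.0, -2.0)    # target of a triggered "Attack any enemies" rule

to_node = direction_to(agent_position, current_target(shrine, None))
to_rule = direction_to(agent_position, current_target(shrine, enemy))
print(to_node, to_rule)
```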

Going beyond a naive training regime of navigating towards a target, we can take advantage of the explicit plan itself as the basis for defining an RL task. As QA professionals will create a suite of explicit plans for their functional tests, those plans can be reused to define a large set of tasks to train agents on. By training these models using RL or imitation learning to follow suites of explicit plans, we can readily use these agents as a replacement for the classic game AI techniques mentioned earlier.

As simple machine learning models may not be able to follow some complex explicit plans, the accompanying tests must be manually performed by humans. However, by incorporating the evaluation of machine learning models into a game testing framework, we can directly evaluate new models and use the ones with maximal test coverage.
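Selecting a model by test coverage can be sketched as a simple comparison over a suite of plans. The model names, plan names, and pass/fail results are made up for illustration.

```python
from typing import Dict

# Hypothetical results: for each model, whether it completed each plan.
suite_results: Dict[str, Dict[str, bool]] = {
    "navmesh_epa": {"open_door": True, "cross_lava": False, "reach_shrine": True},
    "rl_small": {"open_door": True, "cross_lava": True, "reach_shrine": True},
    "rl_tiny": {"open_door": True, "cross_lava": False, "reach_shrine": False},
}


def coverage(results: Dict[str, bool]) -> float:
    """Fraction of explicit plans the model completed successfully."""
    return sum(results.values()) / len(results)


def best_model(all_results: Dict[str, Dict[str, bool]]) -> str:
    return max(all_results, key=lambda name: coverage(all_results[name]))


# Plans no model can complete still need a human playtester.
uncovered = [
    plan for plan in suite_results["navmesh_epa"]
    if not any(results[plan] for results in suite_results.values())
]
print(best_model(suite_results), uncovered)
```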



Video 6 – Agent following explicit plans by jumping on some houses in a village then running up a mountain to activate a shrine.

Although scripted bots have been used to help QA validate game mechanics in AAA games, the construction of these bots and the validation steps themselves can be tedious and daunting as game development accelerates. To assist QA professionals with this growing problem, the Automated Game Testing team explored event-based validation and explicit plans for representing the reproduction steps of a test. Using NLU, a play-session recorder, and a detailed manual editor to rapidly create explicit plans allows functional test suites to be easily built and maintained. Beyond executing these tests with classic game AI, explicit plans can be used as a task specification for RL agents, allowing RL methods to be integrated into game testing while remaining robust to the numerous changes that occur during video game development, without requiring a massive training architecture.



Eidos Sherbrooke would like to thank Matsuko for their continued collaboration within the Automated Game Testing team. A special thanks to Martin Čertický and Charles Bernardoff-Pearson.



Jaden Travnik 

Jaden Travnik joined Eidos in 2018 as a Machine Learning Specialist. He obtained his MSc in Computing Science in 2018 under the supervision of Dr. Patrick Pilarski where his research activities were focused on reinforcement learning applied to prosthetic devices. Jaden sees how thoughtful applications of Machine Learning can remove obstacles for people while giving them more control over the things that matter to them. Video game development has many such obstacles, and it is exciting to see how Machine Learning can address these challenges. 

Vincent Goulet  

Vincent Goulet joined Eidos as a Machine Learning Specialist in 2018. Originally trained in physics and engineering, he obtained his MSc in Computing Science for which he researched constraint optimisation problems applied to mechanical generation. During his graduate studies, his interest was caught by the rapidly evolving field of natural language processing, in line with a long-standing interest in languages and linguistics. He is now dedicating his work to bringing industrial practices in videogames up to pace with, and beyond, the newest advances of NLP technologies. 


Legal Notices

Super Mario Bros™ 1999 Nintendo of America Inc.