Automated Game Understanding

Mark Riedl
10 min readMay 29, 2017


Updated: link to IJCAI’17 paper

Automated understanding, also known as common sense learning or hypothesis generation, is the problem of taking a sequence of events, building some model that explains the events, and using the model to predict future events or answer questions about the sequence of events.

We introduce a new automated understanding problem: Automated Game Understanding, in which an artificial intelligence system learns knowledge about video game systems for the purposes of game play, design, or critique. We define Automatic Game Understanding as the problem of developing formalized, complete models of the underlying processes in games.

Forms of automated game understanding allow artificial intelligent systems to learn to play games. Recent successes in automated game playing include agents that can play Atari games or Doom at near (and sometimes greater than) human level. Artificial intelligence researchers have long contended that playing games represents fundamental AI research — computer games act as stepping stones to more complicated environments such as the real world.

Automated game playing agents such as the Atari playing agents that use deep neural networks and reinforcement learning only have partial understanding of the games they are playing. These agents use deep convolutional neural networks to learn how to the play Atari games from pixel input. However, they only learn which actions in which states maximize long-term expected reward (i.e., score). These agents are “model free” in that they don’t learn how the game works. When model-based reinforcement is used, it only learns as much of the rules as necessary to play the game optimally. For example, in Super Mario Bros., the agent may learn that jumping on certain enemies increases the score and make the enemies disappear. But the agent will never learn that the clouds move or what Goombas do once they move off the end of the screen.

We demonstrate Automated Game Understanding on Super Mario Bros. We outline our general approach below. Details are to appear in the Proceedings of the 2017 International Conference on Artificial Intelligence (IJCAI’17). The paper is available here.

Game Engine Search

We introduce an approach to Automated Game Understanding called Game Engine Search that learns how a game engine works from gameplay data. A game mechanic is a discrete rule that maps a cause to an effect (e.g., if the player is falling and hits the ground then the player stops falling). We define a game engine as the backend set of a game’s mechanics.

Our Game Engine Search algorithm scans through the output of the game engine, represented as video, and iteratively improves a hypothesized engine through greedy search until it can predict what each subsequent frame of video will look like with minimal error.

Our Game Engine Search algorithm has three main phases:

  1. We scan each input video frame to determine the set of objects present per frame.
  2. We run a greedy matching algorithm across pairs of adjacent frames to determine how objects change between frames.
  3. We parse each frame and run an engine search algorithm when the second frame differs from the predicted next frame by more than some set threshold

Parsing Frames

We use OpenCV, an open-source machine vision toolkit, to determine the number and placement of sprites in each frame of the video. The sysetm is given the set of sprites ahead of time (for Super Mario Bros., these are readily available online). Given a sequence of frames, where each frame is defined as the set of sprites and their locations, we can run a simple greedy matching algorithm to match each sprite to its closest neighbor (both visually and spatially) in each adjacent frame.

We define a set of percept types — hand-authored rules that produce facts about how the frames have changed. These are common to many games:

  • Animation: The animation fact is simply a list of each sprite image according to its original filename, width, and height (e.g. if an image “mario1.png” of size [26px, 26px] was found at position 0,0 that would create the Animation fact ⟨mario1, 26, 26⟩).
  • Spatial: The spatial fact is the sprite’s location in the frame, listed as the sprite’s filename and it’s x and y position in the frame.
  • RelationshipX: The relationship in the x-dimension of each pair of sprites in the initial frame, of the form (sprite1’s filename, sprite1’s closest edge to sprite2, distance in pixels, sprite 2’s filename, sprite 2’s closest edge to sprite1). This condition allows the system to learn collision rules. For example, Mario’s x-velocity changes from positive to zero when Mario’s right side hits some other sprite’s left side.
  • RelationshipY: The exact same as the above condition but for the y-dimension. This allows the system to learn collision rules in the y-dimension. Such as Mario stops falling when his feet (bottom of his sprite) hit the top of some other sprite.
  • VelocityX: This fact captures information about a sprite’s velocity in the x-dimension and is derived according to the greedy matching to the next frame’s sprite. For example if Mario is at position [0,0] in frame 1 and [10,0] in frame 2, then frame 1 will include the fact VelocityX: ⟨mario, 10⟩.
  • VelocityY: This fact type captures the same information as the prior type but for the y-dimension.
  • CameraX: This is a unique fact type that simply stores a value representing how far along the camera is in the level. We included this as original attempts at this technique had the issue that when the camera moved one direction, all stationary objects would appear (according to the other facts) to move in the opposite direction. We derive this fact by looking at the average shift in pixels of each sprite from frame to frame, therefore avoiding the issue of sliding in the opposite direction.

Notably each fact can be linked back to the characteristics of a sprite that it arose from. In other words if the spatial fact ⟨mario1, 0, 0⟩ changes to ⟨mario1, 100, 100⟩ the system can move the sprite mario1 to position [100, 100]. This concept is the primary method by which rules (represented in our system by changes in facts) act upon a frame.

Engine Learning

A game engine is a set of rules where each rule is a single IF-THEN production with the If represented by a set of conditional facts and the Then representing a change as a pair of facts. For example, a rule might change a VelocityX fact from ⟨mario1, 0⟩ to ⟨mario1, 5⟩ for a given frame (THEN) given the set of facts that make up its conditions are present in that frame (IF).

At a high level, the game engine learning approach scans through the sequence of parsed frames and begins a search for a set of rules that explains any sufficient difference between the predicted and actual frame. If a game engine is found that reduces the difference to some threshold, then the scan begins again to ensure that this new engine has not lost the ability to accurately predict prior frames.

Our engine search algorithm can be understood as a greedy search to find a set of rules that creates a predicted frame within some threshold of the actual ground truth frame. The primary means of accomplishing this is in the generation of neighbors for a given engine (set of rules so far). Neighbors are generated via (1) adding rules, (2) modifying a rule’s set of condition facts, (3) modifying a rule to cover an additional sprite, and (4) modifying a rule into a control rule.

Adding a rule to a game engine requires picking out a pair of facts of the same type, one from the current frame and one from the goal frame, which differ in some way (e.g. one VelocityX fact with values ⟨mario1, 0⟩ and one with values ⟨mario1,5⟩). This pair of facts represents the change that the rule handles. All other facts from the current frame make up the initial condition set for this rule, which ensures it’ll be activated on this current frame. This is initially a very specific set of facts, and will include facts that are not truly necessary to activate the rule.

The set of conditions in an initial rule can later be minimized by modifying a rule’s set of condition facts. The intersection of an existing rule’s condition facts and the current set of condition facts is taken and set as the set of a condition facts for a modified rule.

Control rules are those that the player makes decisions upon. For example, the input that controls the player character (moving Mario left, right, and jumping). Note that we never know what controls a player is pressing — we only have the video output of the game being played. If we detect that an If can lead to two different possible frames because the user has say into what a sprite (usually the protagonist avatar) does, we recognize a control rule.


  • While our algorithm is an offline process, meant to only run once, the running time can still be prohibitive.
  • This algorithm requires that each instance of a sequence of input data be represented as a series of facts. This requires authoring to determine the space of possible facts that may only work for a limited set of games or genres of games. This also makes the algorithm less applicable to more complex domains (e.g. video of the real world). We identify the ability to automatically derive the space of possible facts as an important avenue for future work.
  • We note that the current technique only takes as training a single sequence of video, but that it can be extended to addi- tional videos by treating the start of each new video as a new point to start over at when the algorithm has reached the end of one video. Further, we anticipate that this would only im- prove the end model as a single sequence of video cannot be expected to contain all gameplay events.


Given the novelty of the research problem, there are no clear baseline to compare against. Instead we make use of two baselines to demonstrate two important characteristics of the learned engine: (1) its ability to accurately forward simulate and predict upcom- ing states and (2) the quality of those predictions to a game-playing task.

Frame Accuracy Evaluation

We compare our system’s ability to predict the next frame of a video against prior work in predicting frame transitions with convolutional neural networks (CNN) for a baseline. We selected two examples of game-play footage of two distinct players playing Level 1–1 of Super Mario Bros.

We trained a CNN on the same training data as the engine learning approach to demonstrate their relative performance. Better performance by the CNN might be achieved with more data.

The PriorFrame baseline is simply choosing the prior frame as the next frame.

This is what it looks like. The following three images show the ground truth game play from the original engine and predicted frames from our learned engine.

The ground truth game play.
Our learned engine’s prediction of the next frame, given input of a human playthrough.

Gameplay Learning Evaluation

In this evaluation, we seek to address the extent to which the learned game engine represents procedural knowledge valuable to other tasks besides predicting frames. We measure the learned engine’s ability to afford accurate gameplay by comparing its performance in training a game playing agent.

For a baseline we drew on the Infinite Mario engine used for the Mario level playing AI competition and constructed a simple reinforcement learning agent for the engine utilizing the value iteration algorithm. We further constructed two variations on this agent.

  1. Truth: We used the native forward simulation available in the Infinite Mario engine to construct an initial policy rather than relying on the typical approach of a uniformly random initial policy.
  2. Naive: we trained a standard reinforcement learning engine to play Super Mario Bros. in the Infinite Mario engine with no a priori forward model.

In comparison, we constructed an agent that instead made use of forward simulation according to the learned game engine (with numerical values and names converted to the appropriate equivalent in the Infinite Mario engine).

Since each of these agents in made use of slightly different reward functions, we compare the number of iterations required for each agent to reach the goal state. We ran 100 agents of each type over a Infinite Mario version of Super Mario Bros Level 1–1. and compare the iteration where each first reached the goal state.


Our technique has drawbacks, notably we do not represent player death or level transitions, which makes these key types of mechanics impossible to learn. We hope to address these shortcomings in future work. In addition, we are interested in possible applications of this approach to procedural content generation and full game generation.

In this paper we present a novel AI research problem — Automated Game Understanding — and an initial baseline technique to derive a game engine from input gameplay video. Our technique relies on a relatively simple search algorithm that searches through possible sets of rules that can best predict a set of frame transitions. To our knowledge this represents the first technique to learn a game engine, a set of rules capable of simulating a game world.



Mark Riedl

AI for storytelling, games, explainability, safety, ethics. Professor @GeorgiaTech . Associate Director @MLatGT . Time travel expert. Geek. Dad. he/him