Improvisational Computational Storytelling in Open Worlds
(The following is adapted from a paper by Lara Martin and Brent Harrison, Improvisational Computational Storytelling in Open Worlds, presented at the 2016 International Conference on Interactive Digital Storytelling.)
Introduction
Storytelling has been of interest to artificial intelligence researchers since the earliest days of the field. Artificial intelligence research has addressed story understanding, automated story generation, and the creation of real-time interactive narrative experiences. Specifically, interactive narrative is a form of digital interactive experience in which users create or influence a dramatic storyline through actions by assuming the role of a character in a fictional virtual world, issuing commands to computer-controlled characters, or directly manipulating the fictional world state. Interactive narrative requires an artificial agent to respond in real time to the actions of a human user in a way that preserves the context of the story and also lets the user exert his or her intentions and desires on the fictional world.
Prior work on interactive narrative has focused on closed-world domains — a virtual world, game, or simulation environment constrained by the set of characters, objects, places, and the actions that can be legally performed. Such a world can be modeled by finite AI representations, often based on logical formalizations. We propose a grand challenge of creating artificial agents capable of engaging with humans in improvisational, real-time storytelling in open worlds.
Improvisational storytelling involves one or more people constructing a story in real time without advance notice of topic or theme. Improvisational storytelling is often found in improv theatre, where two or more performers receive suggestions for a theme from the audience. Improvisational storytelling can also happen in informal settings such as between a parent and a child or in table-top role-playing games. While improvisational storytelling is related to interactive narrative, it differs in three significant ways.
- Improvisational storytelling occurs in open worlds. That is, the set of possible actions that a character can perform is the space of all possible thoughts that a human can conceptualize and express through natural language.
- Improvisational storytelling relaxes the requirement that actions are strictly logical. Since there is no underlying environment other than human imagination, characters’ actions can violate the laws of causality and physics, or simply skip over boring parts. However, no action proposed by human or agent should be a complete non sequitur.
- Character actions are conveyed through language and gesture.
Background: Interactive Narrative
For an overview of AI approaches to interactive narrative circa 2013, see Interactive Narrative: An Intelligent Systems Approach. The most common form of interactive narrative involves the user taking on the role of the protagonist in an unfolding storyline. The user can also be a disembodied observer (as if watching a movie) who is nonetheless capable of making changes to the world or talking to the characters. A common way to build such systems is to implement a drama manager. A drama manager is an intelligent, omniscient, and disembodied agent that monitors the virtual world and intervenes to drive the narrative forward according to some model of quality of experience. Typically this means generating story content.
There are many AI approaches to drama management. One way is to treat the problem of story generation as a form of search such as planning, adversarial search, reinforcement learning, or case-based reasoning. All of the above approaches assume an a priori known domain model or ontology that defines what actions are available to a character at any given time. Note that most of these techniques are still appropriate for games, since games are closed worlds.
Closed-world systems can sometimes appear open. The interactive drama Façade allows users to interact with virtual characters by freely inputting text. This gives the appearance of open communication between the human player and the virtual world; however, the system limits interactions by assigning the natural language input to dramatic beats that are part of the domain model. Open-world story generation attempts to break the assumption of an a priori known domain model. Notable examples include Scheherazade-IF, which attempts to learn a domain model in order to create new stories and interactive experiences in previously unknown domains. However, once the domain is learned, it limits what actions the human can perform to those within the domain model. Another example, Say Anything, is a textual case-based-reasoning story generator, meaning that it operates in the space of possible natural language sentences. It responds to the human user by finding sentences in blogs that are similar to the human-provided sentences.
Background: Improv Theatre
Humans have the ability to connect seemingly unrelated ideas together. If a computer is working together with a user to create a new story, the AI must be prepared to handle anything the human can think of. Even when given a scenario that appears constrained, people can, and will, produce the unexpected. Magerko et al. conducted a systematic study of human improv theatre performers to ascertain how they are able to create scenes in real time without advance notice of topic or theme (I’m a co-investigator on the study). The primary conclusion of this research is that improv actors work off of a huge set of basic scripts that compile the expectations of what people do in a variety of scenarios. These scripts are derived from common everyday experiences (e.g., going to a restaurant) or familiar popular media tropes (e.g., Old West shoot out).
Open-World Improvisational Storytelling
In order to push the boundaries of AI and computational creativity we argue that it is essential to explore open-world environments because (a) we know that humans are capable of doing so, especially with training (actors, comedians, etc.), and (b) natural language is an intuitive mode of human-computer interaction that is not constrained to finite sets of well-defined actions. Once the mode of interaction between a human and an artificial agent is opened up to natural language, it would be unnatural and ultimately frustrating for the human to restrict their vocabulary to what the agent can understand and respond sensibly to. An intelligent agent trained to work from within a closed world will struggle to come up with appropriate responses to un-modeled actions. On the other hand, limiting the user’s actions and vocabulary also limits the user’s creativity.
We believe there are two general problems that must be addressed to achieve open-world improvisational storytelling.
- An intelligent improvisational agent must have a set of scripts comparable in scope to that held by a human user. We loosely define a script as some expectation over actions. To date, no system has demonstrated that it has learned a comprehensive set of scripts; however, once a comprehensive set of scripts exists, these scripts can be used to anticipate human actions that are consistent with the scripts and generate appropriate responses.
- An intelligent improvisational agent must be able to recognize and respond to off-script actions. This means that the agent will need to generate new, possibly off-script actions of its own in order to respond to the player in a seemingly non-random way. The reasons why a human goes off script can be technical (the human’s script does not exactly match the agent’s script for the same scenario) or creative (the human wishes to express creative impulses or test the boundaries of the system).
Since humans normally tend to work off of some sort of script while improvising, whether it is explicit or not, the AI also needs to relate user utterances to a script through natural language understanding. Keeping track of a script is a matter of comparing the meanings of human utterances (their semantics), which is an open research question. Given language’s nearly infinite possibilities, it is very unlikely that two people would use the exact same words or syntax to express the same idea. It is just as unlikely that a user would phrase a sentence the way the creators of the agent would. Beyond understanding the meaning of individual sentences, there is still the matter of what a sentence means within the context of the entire story (its pragmatics), since context is important to maintaining coherence in a conversation.
In the next sections I will provide two possible approaches to open-world improvisational storytelling.
Approach 1: Plot Graphs
A plot graph is a script representation that compactly describes a space of possible stories (and thus expectations about stories) that can be told about a specific scenario such as going to a restaurant or robbing a bank.
A plot graph indicates what events tend to follow other events. The figure to the left shows a plot graph for the start of a bank robbery. Solid arrows indicate temporal ordering (this must happen before that). Dashed arrows connect events that cannot happen in the same story. A plot graph is not a finite state machine. Instead it compactly represents all possible stories in which the temporal orderings are not violated (and in which mutually exclusive events do not co-occur). Plot graphs can be learned by automatic means.
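To make the two kinds of arrows concrete, here is a minimal sketch of a plot graph as a data structure. It is an illustration only, not the representation used in any existing system: the class name `PlotGraph`, its fields, and the bank robbery events are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PlotGraph:
    """Toy plot graph (hypothetical): events plus two kinds of constraints."""
    events: set
    before: set = field(default_factory=set)     # (a, b): a must precede b (solid arrow)
    exclusive: set = field(default_factory=set)  # frozenset({a, b}): never in one story (dashed arrow)

    def is_valid_story(self, story):
        """Return True if the event sequence violates no constraint."""
        if len(set(story)) != len(story) or not set(story) <= self.events:
            return False
        position = {event: i for i, event in enumerate(story)}
        for a, b in self.before:
            # An ordering only matters when both events occur in the story.
            if a in position and b in position and position[a] > position[b]:
                return False
        # Mutually exclusive events must not co-occur.
        return not any(pair <= set(position) for pair in self.exclusive)


# Invented events for the start of a bank robbery.
bank_robbery = PlotGraph(
    events={"enter bank", "pull out gun", "hand over note", "collect money"},
    before={
        ("enter bank", "pull out gun"),
        ("enter bank", "hand over note"),
        ("pull out gun", "collect money"),
        ("hand over note", "collect money"),
    },
    exclusive={frozenset({"pull out gun", "hand over note"})},
)

story_a = ["enter bank", "pull out gun", "collect money"]
story_b = ["enter bank", "pull out gun", "hand over note", "collect money"]
story_c = ["pull out gun", "enter bank"]
```

Under these assumptions, `story_a` is accepted, `story_b` is rejected because it contains two mutually exclusive events, and `story_c` is rejected because it violates a temporal ordering. Note that the structure stores constraints rather than states and transitions, which is why it is not a finite state machine.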
Given a hypothetical system that had learned a large number of plot graphs, the handling of user inputs could happen as follows:
The user says something to advance the story. The system searches for an appropriate plot graph in its memory. The user’s action elicits one of three response strategies:
- Constituent. The user’s move matches the expectations of the plot graph. The system picks the appropriate move from the plot graph and proceeds. For example, if the AI is a bank teller being robbed at gunpoint, the response might be to press the silent alarm.
- Consistent. The user’s move is not part of the plot graph but doesn’t make it impossible for the plot graph to generate future moves. For example, the player ties his or her horse to a hitching post. In this case, the system can generate a response acknowledging the move (“nice horse!”) and proceed with the plot graph.
- Exceptional. The user’s move makes it impossible for the plot graph to continue executing into the future. For example, the user’s move is to freeze the bank teller with a freeze-ray, making it impossible for him or her to press the silent alarm or do anything else. In this case, the system must abandon the script and find a sequence of moves that either restores the executability of the script or transitions to a new plot graph.
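The three-way distinction above can be sketched as a small classifier. This is a simplified illustration that assumes the hard part, mapping a natural language utterance to a canonical event label, has already been done. The names `Strategy`, `classify_move`, and `blocked_by`, as well as the event labels, are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    CONSTITUENT = "constituent"
    CONSISTENT = "consistent"
    EXCEPTIONAL = "exceptional"


def classify_move(move, remaining_events, blocked_by):
    """Classify a user move against the events the plot graph can still execute.

    move: the user's action as a canonical event label (assumed given)
    remaining_events: set of plot graph events that have not happened yet
    blocked_by: dict mapping an off-script move to the plot events it makes impossible
    """
    if move in remaining_events:
        return Strategy.CONSTITUENT   # the plot graph expected this move
    if blocked_by.get(move, set()) & remaining_events:
        return Strategy.EXCEPTIONAL   # the move breaks a future plot event
    return Strategy.CONSISTENT        # off-script but harmless


remaining = {"pull out gun", "teller presses alarm", "collect money"}
blocked_by = {"freeze teller with freeze-ray": {"teller presses alarm"}}
```

With these invented inputs, pulling out a gun is constituent, tying a horse to a hitching post is consistent, and freezing the teller is exceptional. In a real system the `blocked_by` knowledge would itself have to be inferred, which is part of what makes exceptions hard to handle.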
A few notes about this approach. It relies on a relatively robust ability to understand the semantics of natural language — an open research problem in AI. Handling exceptions requires a relatively sophisticated planning ability in the space of all plans that can be expressed in natural language. This is also an unsolved problem as any technique we know will become overwhelmed by the number of possibilities.
Approach 2: Deep Neural Networks
Plot graphs are not the only way to learn about story plot expectations. An alternative would be to model story expectation using a recurrent neural network (RNN) such as a long short-term memory (LSTM) network. These neural networks can be used to learn to predict successive sentences based on a history of observations of prior sentences. They can also be used to generate successors.
One of the advantages of a neural network approach is that it could, in principle, be trained by feeding in a large corpus of stories about a lot of different types of plots. The network would have to model the entire space of stories. The advantage of having a single model of all possible stories is that one wouldn’t require distinct strategies for handling on-topic and off-topic human moves. To date, it is unclear how well a neural network can generate plausible story responses.
A hypothetical system might look like the figure on the left. A human would generate a sentence that moves the story forward. The sentence would be converted into a representation that is easier for the neural network to work with.
In related work, we find that extracting the verb, the subject of the verb, the direct object of the verb, and a preposition (if any) can increase the predictive power of a neural net generating story responses. We call this an event because it captures the gist of what is changing in the story world.
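As a rough sketch of what such an event could look like as a data structure, the example below builds a four-slot tuple from the output of a dependency parser. The parser itself is not shown, and the relation labels (`nsubj`, `root`, `dobj`, `prep`), the `Event` type, and the `eventify` function are assumptions for illustration rather than the format used in the related work.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    """Four-slot event (hypothetical layout): the gist of one sentence."""
    subject: Optional[str] = None
    verb: Optional[str] = None
    direct_object: Optional[str] = None
    preposition: Optional[str] = None


# Assumed mapping from dependency relation labels to event slots.
SLOTS = {"nsubj": "subject", "root": "verb", "dobj": "direct_object", "prep": "preposition"}


def eventify(dependencies):
    """Build an Event from (relation, word) pairs produced by some parser."""
    fields = {}
    for relation, word in dependencies:
        slot = SLOTS.get(relation)
        if slot is not None and slot not in fields:
            fields[slot] = word.lower()  # keep only the first filler per slot
    return Event(**fields)


# Hand-written parse of "John robbed the bank with a gun."
parsed = [("nsubj", "John"), ("root", "robbed"), ("det", "the"),
          ("dobj", "bank"), ("prep", "with"), ("det", "a"), ("pobj", "gun")]
event = eventify(parsed)
```

Here `event` holds john, robbed, bank, and with. Everything else in the sentence is discarded, which is the intended loss of detail: many differently worded sentences collapse onto the same event.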
Event2Event is a placeholder for a neural network that generates the response to the user’s move. The response is also an event, so it must be translated into natural language by Event2Sentence. This process will likely lose some of the detailed context of the story (like character names), so the Working and Long-Term memory module makes pragmatic fixes before feeding a sentence reflecting the system’s move back to the user.
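The flow through these modules can be sketched as a simple pipeline. The stand-ins below are deliberately trivial: a lookup table plays the role of Event2Event, a sentence template plays the role of Event2Sentence, and a dictionary of names plays the role of memory. None of them reflects how the real components would work, and all function and class names are hypothetical.

```python
class StoryMemory:
    """Toy stand-in for working and long-term memory: restores story-specific names."""

    def __init__(self):
        self.names = {}  # generic phrase -> specific name used earlier in the story

    def remember(self, generic, name):
        self.names[generic] = name

    def restore(self, sentence):
        for generic, name in self.names.items():
            sentence = sentence.replace(generic, name)
        return sentence


def toy_event2event(event):
    """Stand-in for a trained network: look up the next event in a table."""
    transitions = {("robber", "point", "gun"): ("teller", "sound", "alarm")}
    return transitions.get(event, ("narrator", "wait", None))


def toy_event2sentence(event):
    """Stand-in for a trained network: fill a fixed sentence template."""
    subject, verb, obj = event
    words = ["The", subject, verb + "s"]
    if obj is not None:
        words += ["the", obj]
    return " ".join(words) + "."


def system_turn(user_event, memory,
                event2event=toy_event2event, event2sentence=toy_event2sentence):
    """Event in, natural language sentence out, with names patched back in."""
    return memory.restore(event2sentence(event2event(user_event)))


memory = StoryMemory()
memory.remember("The teller", "Sally")
reply = system_turn(("robber", "point", "gun"), memory)
```

In this toy run the generated draft is "The teller sounds the alarm." and the memory step turns it into "Sally sounds the alarm." The point of the sketch is only the division of labor: event prediction and sentence generation work with generic content, and memory reattaches the specifics.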
In terms of our initial proposed architecture, this means that an RNN is well suited to handle constituent actions that the user may take, where constituent would mean the user performed an event that the RNN was expecting to see with high probability. However, consistent and exceptional events (events performed by the user that are not high-probability transitions and may also create logical inconsistencies later) may present challenges to RNNs. As with the prior approach, consistent and exceptional events can be handled by turning improvisation into a planning problem. Looking into possible story futures may help a system determine what moves are best in terms of setting up good future moves. One technique would be to use Markov chain Monte Carlo to sample possible futures. Another technique would be to use deep reinforcement learning. Both techniques require a model (or reward function) of good storytelling practices to guide and constrain the system’s imagination of possible futures. Using a planning-based framework, these approaches could better handle consistent and exceptional actions. If the user takes off-script actions, then the system will still generate events that will maximize its long-term reward. Thus, the system’s behavior is largely dependent on how this reward function is defined. For example, if the reward function prioritized staying on-script then it would strive to return to a state where future events are predicted with high probability.
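The idea of looking into possible futures can be illustrated with plain Monte Carlo rollouts, a simpler technique than either of the two named above and swapped in here only for brevity. The sketch assumes a table of event transition probabilities (standing in for a learned model) and a hand-written reward function. Both are invented, as are the function names.

```python
import random


def sample_future(event, transitions, depth, rng):
    """Sample one possible continuation of the story, up to `depth` events."""
    future = []
    for _ in range(depth):
        options = transitions.get(event)
        if not options:
            break  # no known successor: the imagined story ends here
        events, weights = zip(*options.items())
        event = rng.choices(events, weights=weights)[0]
        future.append(event)
    return future


def choose_move(candidates, transitions, reward, depth=3, samples=200, seed=0):
    """Pick the candidate move whose sampled futures earn the highest average reward."""
    rng = random.Random(seed)

    def expected_reward(move):
        total = 0.0
        for _ in range(samples):
            total += reward([move] + sample_future(move, transitions, depth, rng))
        return total / samples

    return max(candidates, key=expected_reward)


# Invented transition model and reward: the system prefers stories where police arrive.
transitions = {
    "press alarm": {"police arrive": 1.0},
    "hand over money": {"robber escapes": 1.0},
}


def reward(story):
    return 1.0 if "police arrive" in story else 0.0


best = choose_move(["press alarm", "hand over money"], transitions, reward)
```

With this reward, the system chooses to press the alarm because every sampled future of that move contains the rewarded event. Changing the reward function changes the choice, which mirrors the point above that the system’s behavior depends largely on how the reward is defined.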
As before, full semantic understanding of natural language is still an open research challenge. The eventification described above is probably not the ultimate solution. It is also unclear how to produce reward functions for good storytelling practices at a level of specificity necessary to guide a deep reinforcement learning improvisational storytelling system. Finally, the space of possible plans in language space — the space of all possible plans that can be expressed in natural language — is likely to be large to the point that training a deep reinforcement learning system will be very challenging.
Wrapping Up
Improvisational storytelling occurs when one or more people construct a story in real time without advance notice of topic or theme. As human-AI interaction becomes more common, it becomes more important for AIs to be able to engage in open-world improvisational storytelling. This is because it enables AIs to communicate with humans in a natural way without sacrificing the human’s perception of agency. We hope that formalizing the problem and examining the challenges associated with improvisational storytelling will encourage researchers to explore this important area of work to help enable a future where AI systems and humans can seamlessly communicate with one another.