By using a multi-agent simulation as part of the process, we can make use of data points such as a character’s history, their goals and emotions, simulation events, and localities to generate scenes and image assets that are more coherent and more consistently aligned with the IP story world. The IP-based simulation also provides a clear, well-known context to the user, which allows them to judge the generated story more easily.
Moreover, by allowing users to exert behavioral control over agents, observe their actions, and engage in interactive conversations, we shape their expectations and intentions, which we then funnel into a simple prompt to kick off the generation process.
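As a rough sketch of how such simulation data points might be funneled into a kickoff prompt, consider the following; the field names and template text are illustrative assumptions, not our exact implementation:

```python
# Illustrative sketch: turning simulation state into a generation prompt.
# Field names and the template wording are hypothetical placeholders.

def build_kickoff_prompt(character, recent_events, location, user_intent):
    """Compose a single prompt from character state, simulation events and user intent."""
    return (
        f"Character: {character['name']}\n"
        f"History: {character['history']}\n"
        f"Goals: {', '.join(character['goals'])}\n"
        f"Current emotion: {character['emotion']}\n"
        f"Recent simulation events: {'; '.join(recent_events)}\n"
        f"Location: {location}\n"
        f"User intent: {user_intent}\n\n"
        "Write the next scene of the episode, staying consistent with the "
        "story world and the character's history, goals and emotions."
    )
```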
Our simulation is sufficiently complex and non-deterministic to favor a positive disconfirmation. Amplification effects can help mitigate what we consider an undesired “slot machine” effect, which we briefly touch on later. Viewers are used to watching episodes passively, and the timespan between input and the “end of scene/episode” discourages immediate judgment by the user and, as a result, reduces their desire to “retry”. This disproportionality between the user’s minimal input prompt and the resulting high-quality, long-form output in the form of a full episode is a key factor for positive disconfirmation.
Using and prompting a large language model as part of the process can introduce several challenges. Some of them, like hallucinations, which introduce uncertainty or, in more creative terms, “unexpectedness”, can be regarded as creative side effects that influence the expected story outcome in positive ways. As long as the randomness introduced by hallucinations does not lead to implausible plot developments or agent behavior and the system can recover, they act as “happy accidents”, a term often used during the creative process, further enhancing the user experience.
The Issue of ‘The Slot Machine Effect’ in Current Generative AI Tools
The Slot Machine Effect refers to a scenario where the generation of AI-produced content feels more like a random game of chance than a deliberate creative process. This is due to the often unpredictable and instantaneous nature of the generation process.
Current off-the-shelf generative AI systems do not support or encourage multiple creative evaluation steps in the context of a long-term creative goal.
Their interfaces generally feature various settings, such as sliders and input fields, which increase the level of control and variability. The final output, however, is generated almost instantaneously at the press of a button.
This instantaneous generation process results in immediate gratification, providing a dopamine rush to the user. Such a reward mechanism would generally be helpful for sustaining a multi-step creative process over long periods of time, but current interfaces, the frequency of the reward, and a lack of progression (being stuck in an infinite loop) can lead to negative effects such as frustration, the intention-action gap, or a loss of control over the creative process.
The intention-action gap results from a behavioral bias favoring immediate gratification, which can be detrimental to long-term creative goals.
While we do not directly solve these issues through interfaces, the contextualization of the process in a simulation and the above-mentioned disproportionality and timespan between input and output help mitigate them. In addition, we see opportunities in the simulation for in-character discriminators that participate in the creative evaluation process, such as an agent reflecting on the role they were assigned or a scene they should perform in.
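As a minimal illustration of such an in-character discriminator, the sketch below asks an agent to accept or reject a scene assignment in its own voice; the prompt templates and the `chat` helper are hypothetical placeholders, not our actual implementation:

```python
# Hypothetical sketch of an in-character discriminator.
# `chat(messages)` stands in for any chat-completion API call.

def in_character_review(agent, scene_description, chat):
    """Ask an agent to judge, in character, whether a scene fits its persona."""
    system_prompt = (
        f"You are {agent['name']}. Backstory: {agent['history']}. "
        f"Current goals: {agent['goals']}. Current emotion: {agent['emotion']}. "
        "Stay in character at all times."
    )
    user_prompt = (
        "You have been cast in the following scene:\n"
        f"{scene_description}\n\n"
        "Reflect on whether this scene fits your character and goals. "
        "Answer with ACCEPT or REJECT, followed by a one-sentence reason."
    )
    reply = chat([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ])
    return reply.strip().upper().startswith("ACCEPT"), reply
```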
The multi-step “trial and error” process of the generative story system is not presented to the user; it therefore does not allow for intervention or judgment, avoiding the negative effects of immediate gratification through a user’s “accept or reject” decisions. It does not matter to the user experience how often the AI system has to retry different prompt chains, as long as the generation process is not perceived negatively as idle time but is integrated seamlessly with the simulation gameplay.
The user only acts as the discriminator at the very end of the process, after having watched the generated scene or episode. This is also an opportunity to utilize the concept of Reinforcement Learning from Human Feedback (RLHF) to improve the multi-step creative process and, as a result, the automatically generated episode.
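A minimal sketch of how such end-of-episode feedback could be captured for later use follows; the file format and field names are illustrative assumptions rather than a description of our system:

```python
import json
import time

# Hypothetical sketch: record the user's end-of-episode judgment so it can
# later serve as a human-feedback signal (e.g. for reward modeling).

def record_feedback(prompt_chain, episode_script, user_rating, path="feedback.jsonl"):
    """Append one (inputs, output, rating) record to a JSONL log."""
    record = {
        "timestamp": time.time(),
        "prompt_chain": prompt_chain,      # list of prompts used during generation
        "episode_script": episode_script,  # the final generated episode text
        "user_rating": user_rating,        # e.g. a 1-5 rating or thumbs up/down
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```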
Large Language Models
LLMs represent the forefront of natural language processing and machine learning research, demonstrating exceptional capabilities in understanding and generating human-like text. They are typically built on Transformer-based architectures, a class of models that rely on self-attention mechanisms. Transformers allow for efficient use of computational resources, enabling the training of significantly larger language models. GPT-4, for instance, comprises billions of parameters trained on extensive datasets, effectively encoding a substantial quantity of world knowledge in its weights.
Central to the functioning of these LLMs is the concept of vector embeddings: mathematical representations of words or phrases in a high-dimensional space. These embeddings capture the semantic relationships between words, such that words with similar meanings are located close to each other in the embedding space. In the case of LLMs, each word in the model’s vocabulary is initially represented as a dense vector, also known as an embedding. These vectors are adjusted during the training process, and their final values, or “embeddings”, represent the learned relationships between words.
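As a toy illustration of this idea, semantic closeness in an embedding space is commonly measured with cosine similarity; the three-dimensional vectors below are made up for readability, whereas real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

# Made-up toy embedding vectors (3-dimensional for readability).
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.05, 0.10, 0.95]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much smaller
```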
During training, the model learns to predict the next word in a sentence by adjusting the embeddings and other parameters to minimize the difference between the predicted and actual words. The embeddings thus reflect the model’s understanding of words and their context. Moreover, because Transformers can attend to any word in a sentence regardless of its position, the model can form a more comprehensive understanding of the meaning of a sentence. This is a significant advancement over older models that could only consider words in a limited window. The combination of vector embeddings and Transformer-based architectures in LLMs facilitates a deep and nuanced understanding of language, which is why these models can generate such high-quality, human-like text.
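The core of this mechanism is scaled dot-product self-attention. The following minimal sketch, which omits the learned projection matrices, multiple attention heads, and masking of a full Transformer layer, shows how every token’s representation becomes a weighted mix over all tokens in the sequence:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal self-attention: every position attends to every other position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise relevance between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # weighted mix of value vectors

# Four tokens, each represented by an 8-dimensional vector (random toy data).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)           # queries, keys, values from the same tokens
print(out.shape)                                      # (4, 8): one contextualized vector per token
```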
As mentioned previously, transformer-based language models excel at short-term, general tasks. They are regarded as fast thinkers [Kahneman].
Fast thinking pertains to instinctive, automatic, and often heuristic-based decision-making, while slow thinking involves deliberate, analytical, and effortful processes. LLMs generate responses swiftly based on patterns learned from training data, without the capacity for introspection or understanding the underlying logic behind their outputs. However, this also implies that LLMs lack the ability to deliberate, reason deeply, or learn from singular experiences in the way that slow-thinking entities, such as humans, can. While these models have made remarkable strides in text generation tasks, their fast-thinking nature may limit their potential in tasks requiring deep comprehension or flexible reasoning.
More recent approaches that imitate slow-thinking capabilities, such as prompt chaining (see Auto-GPT), have shown promising results. Large language models seem powerful enough to act as their own discriminator in a multi-step process, which can dramatically improve their ability to reason in different contexts.
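A minimal sketch of such a prompt chain, in which the model critiques and then revises its own draft, could look like the following; the `chat` function stands in for any single-turn LLM completion call, and the prompts are illustrative rather than taken from an existing system:

```python
# Hypothetical prompt chain: the model drafts, critiques and revises its own output.
# `chat(prompt)` stands in for any single-turn LLM completion call.

def generate_with_self_critique(task, chat, max_rounds=3):
    """Generate a scene outline and let the model act as its own discriminator."""
    draft = chat(f"Write a short scene outline for the following task:\n{task}")
    for _ in range(max_rounds):
        critique = chat(
            "Act as a strict story editor. Critique the outline below for plot holes "
            "and out-of-character behavior. If it is acceptable, reply only with OK.\n\n"
            f"{draft}"
        )
        if critique.strip().upper() == "OK":
            break  # the model accepted its own draft
        draft = chat(
            "Revise the outline below to address this critique.\n\n"
            f"Outline:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft
```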