Topic 35: What are World Models?

A deep dive into the history and current advancements in world models and why they are an important puzzle piece for the future of AI

Human brains don’t process every tiny detail of the world. Instead, we rely on abstract representations formed from past experience – mental models – to guide our decisions. Even before events unfold, our brains continuously predict outcomes based on these models and our prior actions.

This is precisely the concept behind world models in AI.

Rather than learning directly through trial and error in the real world, an AI agent uses a "world model" – a learned simulation of its environment – to imagine and explore possible sequences of actions. By simulating these actions internally, the AI finds paths leading toward desired outcomes.

This approach has significant advantages. Firstly, world models drastically reduce the resources required by avoiding the physical execution of every possible action. More importantly, they align AI more closely with how the human brain actually functions – predicting, imagining scenarios, and calculating outcomes. Yann LeCun once stated that world models are crucial to achieving human-level AI, though it may take approximately a decade to fully realize their potential.

Today, we're working with early-stage world models. Properly understanding their mechanisms, recognizing the capabilities of the models we currently have, and dissecting their inner workings will be essential for future breakthroughs. Let's begin our grand journey into the fascinating world of world models.

What’s in today’s episode?

  • Historical background of the first world models

  • What’s necessary today to build a world model?

  • Notable world models

    • Google DeepMind’s DreamerV3

    • Google DeepMind’s Genie 2

    • NVIDIA Cosmos World Foundation Models

    • Meta’s Navigation World Model (NWM)

  • Conclusion: Why are world models important?

  • Sources and further reading

Historical background of the first world models

While the term "world model" gained popularity in the last few years, the underlying concept has antecedents in earlier AI research. The idea dates back to 1990 and Richard S. Sutton’s Dyna algorithm, a foundational approach to model-based reinforcement learning (MBRL) that integrates learning a model with planning and reacting. Agents using Dyna can (see the code sketch after this list):

  1. Try actions and see what works (trial and error through RL).

  2. Over time, learn a model of the world that predicts what might happen next (learning).

  3. Use this mental model to try things out in their “head” without having to actually do them in the real world (planning).

  4. When something happens, react immediately based on what they have already learned – no pause to plan every time (quick reaction).
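To make that loop concrete, here is a minimal tabular Dyna-Q sketch in Python. It is illustrative only: the `env` object with `reset()`, `step()`, and `actions()` methods is an assumed interface rather than anything from Sutton’s original code, but the interleaving of real updates (learning) with simulated updates from the learned model (planning) captures the essence of Dyna.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, planning_steps=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: learn from real experience, then plan with the learned model."""
    Q = defaultdict(float)   # action-value estimates: Q[(state, action)]
    model = {}               # learned model: (state, action) -> (reward, next_state)

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Trial and error: epsilon-greedy action selection in the real environment
            actions = env.actions(state)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Learning: one-step Q-learning update from real experience
            best_next = max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            # Model learning: remember what the world did after this (state, action)
            model[(state, action)] = (reward, next_state)

            # Planning: replay random simulated transitions "in the agent's head"
            for _ in range(planning_steps):
                (s, a), (r, s2) = random.choice(list(model.items()))
                best = max(Q[(s2, b)] for b in env.actions(s2))
                Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

            state = next_state
    return Q
```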

Image Credit: Dyna original paper

A later study from 2018, “The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces”, tested Dyna in the Arcade Learning Environment, a collection of Atari 2600 games used to train AI agents from raw pixel images. It showed, for the first time, that a learned model can improve learning efficiency in environments with high-dimensional inputs, like Atari games, and suggested that Dyna is a viable planning method there.

A key milestone was the 2018 paper “World Models” by David Ha and Jürgen Schmidhuber, who built a system that actually works in simple environments. They trained a generative recurrent neural network (RNN) to model popular RL environments, like a car racing game and a 2D first-person shooter-like game, in an unsupervised manner. Their world model learned a compressed spatial representation of the game screen and the temporal dynamics of how the game evolves. More precisely, the system consists of three parts (a simplified sketch follows the list):

  • Vision: A Variational Autoencoder (VAE) compresses high-dimensional observations (pixel images) into a lower-dimensional latent representation.

  • Memory: Mixture-Density Recurrent Network (MDN-RNN) predicts the next latent state given the current latent and the agent’s action.

  • Controller: Takes the latent state and RNN hidden state and outputs actions. In the original implementation, it was a simple linear policy trained with an evolutionary strategy to maximize reward.
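Below is a highly simplified PyTorch sketch of this three-part layout. The layer sizes, module names, and the plain LSTM standing in for the full MDN-RNN are all assumptions made for illustration; the original implementation differs in its details.

```python
import torch
import torch.nn as nn

class Vision(nn.Module):
    """V: a VAE encoder that compresses a pixel frame into a small latent vector z."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.mu = nn.LazyLinear(latent_dim)      # input size inferred on first call
        self.logvar = nn.LazyLinear(latent_dim)

    def forward(self, obs):
        h = self.conv(obs)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick

class Memory(nn.Module):
    """M: an RNN that predicts the next latent state from (z_t, a_t).
    The actual model places a mixture-density (MDN) head on top of the hidden state."""
    def __init__(self, latent_dim=32, action_dim=3, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim + action_dim, hidden_dim, batch_first=True)

    def forward(self, z, action, hidden=None):
        out, hidden = self.rnn(torch.cat([z, action], dim=-1), hidden)
        return out, hidden

class Controller(nn.Module):
    """C: a simple linear policy mapping [z_t, h_t] to an action."""
    def __init__(self, latent_dim=32, hidden_dim=256, action_dim=3):
        super().__init__()
        self.fc = nn.Linear(latent_dim + hidden_dim, action_dim)

    def forward(self, z, h):
        return torch.tanh(self.fc(torch.cat([z, h], dim=-1)))
```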

Image Credit: World Models original paper

Ha and Schmidhuber showed that a policy (the controller) could be trained entirely within the learned model’s “dream” and then successfully transferred to the real game environment. It was a stepping stone toward smarter agents that can dream, plan, and act much like humans, and it sparked interest in model-based approaches.

A lot has changed since then. What do we have today? How do the latest world models work? Do they understand the physical world? Let’s explore.

What’s necessary today to build a world model?
