
Towards Universal Simulation

October 11, 2024

DeepMind’s initial pitch was to “solve intelligence and then use it to solve everything else.” This has been the guiding philosophy of AGI labs for the past decade. I believe that rather than solving intelligence, we should be focusing on solving the “everything else” part first. That is, we should be building systems for universal simulation.

The simple idea of universal simulation is to train models on a far wider variety of data and modalities than we currently do: not only human-generated data, but also observational data from scientific experiments and raw sensor data, all at the same time, with the goal of predicting an increasingly large percentage of all observations in the world. Those models can then be used to run virtual experiments much faster than we ever could in physical reality, and thus accelerate scientific progress.
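To make the shape of this concrete, here is a minimal sketch in PyTorch of a next-observation predictor shared across heterogeneous data streams. Nothing in it describes an existing system: the modality list, feature sizes, and names (`UniversalSimulator`, `MODALITY_DIMS`) are hypothetical, and the point is only the structure, with per-modality encoders feeding one shared backbone trained on a next-step prediction objective.

```python
import torch
import torch.nn as nn

# Hypothetical raw feature sizes for three observation streams (illustrative only).
MODALITY_DIMS = {"text": 512, "sensor": 64, "assay": 128}
D_MODEL = 256

class UniversalSimulator(nn.Module):
    """Toy next-observation predictor shared across modalities."""
    def __init__(self):
        super().__init__()
        # One lightweight projection per modality into a shared token space.
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, D_MODEL) for name, dim in MODALITY_DIMS.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        # A single backbone is shared by every modality, so structure learned from
        # one data source can inform predictions on another.
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(D_MODEL, dim) for name, dim in MODALITY_DIMS.items()}
        )

    def forward(self, x, modality):
        tokens = self.encoders[modality](x)                    # (batch, time, D_MODEL)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(tokens, mask=mask)              # causal attention over time
        return self.heads[modality](hidden)                    # predicted raw observations

model = UniversalSimulator()
x = torch.randn(2, 16, MODALITY_DIMS["sensor"])                # toy batch of sensor streams
pred = model(x, "sensor")
loss = nn.functional.mse_loss(pred[:, :-1], x[:, 1:])          # next-step prediction loss
```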

Large-scale model training so far has had a strong anthropomorphic bias. We’ve only trained at sufficient scale on data generated by, and legible to, humans, such as text, images, videos, and sounds. This constraint makes sense if the aim is building agents primarily for communicative purposes, i.e. to interact with humans the same way humans interact with one another. But for other purposes, it’s far too limiting. Following this anthropomorphic approach, AGI labs often describe a future where their models can solve difficult scientific problems by suggesting wet-lab experiments to run and interpreting the results. This fundamentally misidentifies the main bottleneck, which is not coming up with ideas for experiments but rather running them. The best way to accelerate science is to provide an alternative to real-world experiments, by simulating them instead.

Besides "AI scientists," the other common approach to applying deep learning for scientific discovery is to train highly specialized single-task models. The most well-known example in this category is AlphaFold, which does protein structure prediction. Due to the single-task nature and small data scale, those types of models tend to incorporate very strong inductive biases and, while certainly useful, cannot generalize to other tasks. The reality is simply that for most individual scientific domains, there isn’t enough data to make effective use of deep learning. Aggregating data across domains involves both coordination and modeling challenges, but those challenges are solvable.

There is ample evidence that training on one modality can benefit performance on another.[1] The efforts to train multimodal "foundation models" in science have been limited, but there are some promising attempts, especially in biology. Consider models trained on sequencing data from millions of human cells.[2] Or models trained on a large number of health measurements over time from a group of individuals.[3] Whereas web-scale datasets focus on breadth (small set of outputs from a huge number of individuals), what happens when we aim for depth instead (huge number of data points from a small set of individuals)? Could we use those models, for example, to more accurately predict the effects of new drugs and other health interventions?
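As a rough illustration of the breadth-versus-depth distinction, the toy sketch below uses made-up, deliberately scaled-down shapes (not numbers from the cited papers) to contrast the two data layouts and to show how a depth-oriented record would be sliced into next-measurement prediction pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Breadth: many individuals, a short record each (web-scale style).
breadth = rng.normal(size=(10_000, 10, 8))      # individuals x observations x features

# Depth: few individuals, a long longitudinal record each (e.g. CGM readings plus diet logs).
depth = rng.normal(size=(10, 10_000, 8))        # individuals x time steps x features

def next_step_pairs(record, context_len=1024):
    """Slice one individual's long record into (context, next measurement) training pairs."""
    for t in range(context_len, record.shape[0]):
        yield record[t - context_len:t], record[t]

# Depth-oriented training iterates over long per-individual histories rather than over
# millions of short, independent samples.
context, target = next(next_step_pairs(depth[0]))   # context: (1024, 8), target: (8,)
```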

Taking this further, one can imagine a single model that combines data from different domains (across physics, biology, etc.), operating at different scales of space (from the subatomic scale to the entire observable universe) and time (from particle interactions in femtoseconds to the lifespan of stars in billions of years). Would it be useful to combine modalities that seem so unrelated? It’s definitely prudent to start with training on related modalities. But jointly modeling very different domains and tasks can be beneficial, allowing gradient descent to discover common "programs" useful for all of them.[4]
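The sketch below is one hedged way to picture that joint training: batches from domains with wildly different characteristic time scales are given a log time-scale conditioning channel and pushed through the same shared weights, so whatever structure is common across domains has to live in shared parameters. The domain names, scale values, and the GRU stand-in backbone are illustrative assumptions, not a proposal for the actual architecture.

```python
import math
import torch

# Characteristic time step per domain, in seconds (illustrative values). A log-scale
# conditioning channel lets one model ingest femtosecond-scale and billion-year-scale
# dynamics through the same interface.
DOMAIN_DT = {
    "molecular_dynamics": 1e-15,   # femtoseconds
    "cell_biology": 60.0,          # minutes
    "astrophysics": 3.15e16,       # roughly a billion years
}

shared_model = torch.nn.GRU(input_size=9, hidden_size=64, batch_first=True)  # stand-in backbone
readout = torch.nn.Linear(64, 8)
opt = torch.optim.Adam(list(shared_model.parameters()) + list(readout.parameters()), lr=1e-3)

for step in range(3):                                # toy interleaved training loop
    for domain, dt in DOMAIN_DT.items():
        x = torch.randn(4, 32, 8)                    # (batch, time, features) from this domain
        scale = torch.full((4, 32, 1), math.log10(dt))
        inp = torch.cat([x, scale], dim=-1)          # append the time-scale channel
        hidden, _ = shared_model(inp)
        pred = readout(hidden)
        loss = torch.nn.functional.mse_loss(pred[:, :-1], x[:, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()                                   # one set of weights serves every domain
```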

We’ve made great progress over the past decade in understanding how to scale neural networks. Having worked on pushing the state of the art in deep learning for video generation, I’ve witnessed first-hand its often surprising effectiveness, as well as the world-modeling capabilities that emerge with increasing compute and data scale. It’s now time to direct deep learning towards understanding the universe and solving humanity’s most pressing challenges, by training models to achieve universal simulation.

[2] scGPT is an early example of an approach that can potentially scale much further.
[3] In a recent paper, a group of biologists trained an autoregressive transformer on continuous glucose monitoring (CGM) data and diet data from about 10K individuals, finding that it can predict health outcomes for those individuals four years in advance, and generalize to other datasets.
[4] TimesFM hints at this generalist approach: a time-series model trained on data from a mix of domains (e.g. traffic, weather, page-views, hospital data) is able to perform as well out of the box on unseen tasks as specialist models that have been trained just on those tasks.

— Anastasis