
Towards Universal Simulation

October 11, 2024

DeepMind’s initial pitch was to “solve intelligence and then use it to solve everything else.” This has been the guiding philosophy of AGI labs for the past decade. I believe that rather than solving intelligence, we should be focusing on solving the “everything else” part first. That is, we should be building systems for universal simulation.

The simple idea of universal simulation is to train models on a far wider variety of data and modalities than we currently do: not only human-generated data, but also observational data from scientific experiments and raw sensor data, all at the same time, with the goal of predicting an increasingly large percentage of all observations in the world. Those models can then be used to run virtual experiments much faster than we ever could in physical reality, and thus accelerate scientific progress.
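To make the shape of this concrete, here is a minimal sketch in PyTorch of a next-observation predictor shared across heterogeneous data streams. Nothing in it describes an existing system: the modality list, feature sizes, and names (`UniversalSimulator`, `MODALITY_DIMS`) are hypothetical, and the point is only the structure, with per-modality encoders feeding one shared backbone trained on a next-step prediction objective.

```python
import torch
import torch.nn as nn

# Hypothetical raw feature sizes for three observation streams (illustrative only).
MODALITY_DIMS = {"text": 512, "sensor": 64, "assay": 128}
D_MODEL = 256

class UniversalSimulator(nn.Module):
    """Toy next-observation predictor shared across modalities."""
    def __init__(self):
        super().__init__()
        # One lightweight projection per modality into a shared token space.
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, D_MODEL) for name, dim in MODALITY_DIMS.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        # A single backbone is shared by every modality, so structure learned from
        # one data source can inform predictions on another.
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(D_MODEL, dim) for name, dim in MODALITY_DIMS.items()}
        )

    def forward(self, x, modality):
        tokens = self.encoders[modality](x)                    # (batch, time, D_MODEL)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(tokens, mask=mask)              # causal attention over time
        return self.heads[modality](hidden)                    # predicted raw observations

model = UniversalSimulator()
x = torch.randn(2, 16, MODALITY_DIMS["sensor"])                # toy batch of sensor streams
pred = model(x, "sensor")
loss = nn.functional.mse_loss(pred[:, :-1], x[:, 1:])          # next-step prediction loss
```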

Large-scale model training so far has had a strong anthropomorphic bias. We’ve only trained at sufficient scale on data generated by, and legible to, humans, such as text, images, videos, and sounds. This constraint makes sense if the aim is building agents primarily for communicative purposes, i.e. to interact with humans the same way humans interact with one another. But for other purposes, it’s far too limiting. Following this anthropomorphic approach, AGI labs often describe a future where their models can solve difficult scientific problems by suggesting wet-lab experiments to run and interpreting the results. This fundamentally misidentifies the main bottleneck, which is not coming up with ideas for experiments but rather running them. The best way to accelerate science is to provide an alternative to real-world experiments, by simulating them instead.

Besides "AI scientists," the other common approach to applying deep learning for scientific discovery is to train highly specialized single-task models. The most well-known example in this category is AlphaFold, which does protein structure prediction. Due to the single-task nature and small data scale, those types of models tend to incorporate very strong inductive biases and, while certainly useful, cannot generalize to other tasks. The reality is simply that for most individual scientific domains, there isn’t enough data to make effective use of deep learning. Aggregating data across domains involves both coordination and modeling challenges, but those challenges are solvable.

There is ample evidence that training on one modality can benefit performance on another.[1] The efforts to train multimodal "foundation models" in science have been limited, but there are some promising attempts, especially in biology. Consider models trained on sequencing data from millions of human cells.[2] Or models trained on a large number of health measurements over time from a group of individuals.[3] Whereas web-scale datasets focus on breadth (small set of outputs from a huge number of individuals), what happens when we aim for depth instead (huge number of data points from a small set of individuals)? Could we use those models, for example, to more accurately predict the effects of new drugs and other health interventions?
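As a rough illustration of the breadth-versus-depth distinction, the toy sketch below uses made-up, deliberately scaled-down shapes (not numbers from the cited papers) to contrast the two data layouts and to show how a depth-oriented record would be sliced into next-measurement prediction pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Breadth: many individuals, a short record each (web-scale style).
breadth = rng.normal(size=(10_000, 10, 8))      # individuals x observations x features

# Depth: few individuals, a long longitudinal record each (e.g. CGM readings plus diet logs).
depth = rng.normal(size=(10, 10_000, 8))        # individuals x time steps x features

def next_step_pairs(record, context_len=1024):
    """Slice one individual's long record into (context, next measurement) training pairs."""
    for t in range(context_len, record.shape[0]):
        yield record[t - context_len:t], record[t]

# Depth-oriented training iterates over long per-individual histories rather than over
# millions of short, independent samples.
context, target = next(next_step_pairs(depth[0]))   # context: (1024, 8), target: (8,)
```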

Taking this further, one can imagine a single model that combines data from different domains (across physics, biology, etc.), operating at different scales of space (from the subatomic scale to the entire observable universe) and time (from particle interactions in femtoseconds to the lifespan of stars in billions of years). Would it be useful to combine modalities that seem so unrelated? It’s definitely prudent to start with training on related modalities. But jointly modeling very different domains and tasks can be beneficial, allowing gradient descent to discover common "programs" useful for all of them.[4]
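The sketch below is one hedged way to picture that joint training: batches from domains with wildly different characteristic time scales are given a log time-scale conditioning channel and pushed through the same shared weights, so whatever structure is common across domains has to live in shared parameters. The domain names, scale values, and the GRU stand-in backbone are illustrative assumptions, not a proposal for the actual architecture.

```python
import math
import torch

# Characteristic time step per domain, in seconds (illustrative values). A log-scale
# conditioning channel lets one model ingest femtosecond-scale and billion-year-scale
# dynamics through the same interface.
DOMAIN_DT = {
    "molecular_dynamics": 1e-15,   # femtoseconds
    "cell_biology": 60.0,          # minutes
    "astrophysics": 3.15e16,       # roughly a billion years
}

shared_model = torch.nn.GRU(input_size=9, hidden_size=64, batch_first=True)  # stand-in backbone
readout = torch.nn.Linear(64, 8)
opt = torch.optim.Adam(list(shared_model.parameters()) + list(readout.parameters()), lr=1e-3)

for step in range(3):                                # toy interleaved training loop
    for domain, dt in DOMAIN_DT.items():
        x = torch.randn(4, 32, 8)                    # (batch, time, features) from this domain
        scale = torch.full((4, 32, 1), math.log10(dt))
        inp = torch.cat([x, scale], dim=-1)          # append the time-scale channel
        hidden, _ = shared_model(inp)
        pred = readout(hidden)
        loss = torch.nn.functional.mse_loss(pred[:, :-1], x[:, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()                                   # one set of weights serves every domain
```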

We’ve made great progress over the past decade in understanding how to scale neural networks. Having worked on pushing the state of the art in deep learning for video generation, I’ve witnessed first-hand its often surprising effectiveness, as well as the world-modeling capabilities that emerge with increasing compute and data scale. It’s now time to direct deep learning towards understanding the universe and solving humanity’s most pressing challenges, by training models to achieve universal simulation.

[2] scGPT is an early example of an approach that can potentially scale much further.
[3] In a recent paper, a group of biologists trained an autoregressive transformer on continuous glucose monitoring (CGM) data and diet data from about 10K individuals, finding that it can predict health outcomes for those individuals four years in advance, and generalize to other datasets.
[4] TimesFM hints at this generalist approach: a time-series model trained on data from a mix of domains (e.g. traffic, weather, page-views, hospital data) is able to perform as well out of the box on unseen tasks as specialist models that have been trained just on those tasks.

— Anastasis