https://substack.com/inbox/post/153106976
A World Model generates the next frame of a 3D scene from the previous frame(s) and user input. It is trained on video data and runs in real time.
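A minimal sketch of that autoregressive loop, in Python. The `WorldModel` class and its `predict_next_frame` method are hypothetical stand-ins for a trained network; real systems (diffusion- or transformer-based) differ in the details, but the control flow is the same: condition on recent frames plus the user's action, emit a frame, feed it back in.

```python
from collections import deque

import numpy as np

FRAME_SHAPE = (480, 854, 3)  # H x W x RGB; an arbitrary choice
CONTEXT_LEN = 8              # how many past frames the model conditions on


class WorldModel:
    """Placeholder for a learned next-frame predictor."""

    def predict_next_frame(self, frames: list, action: np.ndarray) -> np.ndarray:
        # A trained network would run here; we return a dummy frame.
        return np.zeros(FRAME_SHAPE, dtype=np.uint8)


def run_realtime_loop(model: WorldModel, first_frame: np.ndarray,
                      get_user_input, render):
    """Generate frames one at a time, feeding each back as context."""
    context = deque([first_frame], maxlen=CONTEXT_LEN)
    while True:
        action = get_user_input()        # e.g. keyboard/controller state
        frame = model.predict_next_frame(list(context), action)
        render(frame)                    # must finish within one frame budget
        context.append(frame)            # the new frame becomes context
```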
The key insight is that by training on video data, these models learn not just how to generate images, but also:
- the physics of our world (objects fall down, water flows, etc.)
- how objects look from different angles (that chair should look the same as you walk around it)
- how things move and interact (a ball bouncing off a wall, a character walking on sand)
- basic spatial understanding (you can’t walk through walls)
Some companies, like World Labs, are taking a hybrid approach: using World Models to generate a static 3D representation once, then rendering it with traditional real-time techniques (in World Labs' case, Gaussian Splatting). This gives you the best of both worlds: the creative power of AI generation with the multiview consistency and performance of traditional rendering.
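A rough sketch of that hybrid pipeline, assuming a Gaussian-splat scene representation. The `generate_scene` function is a hypothetical stand-in for the AI step; the point is that generation happens once, while a conventional rasterizer can then render the static asset from any camera, so every viewpoint sees the same underlying geometry.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GaussianSplatScene:
    """Static 3D scene stored as a set of anisotropic Gaussians."""
    means: np.ndarray      # (N, 3) centers in world space
    scales: np.ndarray     # (N, 3) per-axis extents
    rotations: np.ndarray  # (N, 4) orientation quaternions
    colors: np.ndarray     # (N, 3) RGB
    opacities: np.ndarray  # (N,) alpha in [0, 1]


def generate_scene(prompt: str) -> GaussianSplatScene:
    # Stand-in for a World Model that outputs a static 3D representation;
    # here we just fill the structure with random placeholder splats.
    n = 100_000
    return GaussianSplatScene(
        means=np.random.randn(n, 3),
        scales=np.full((n, 3), 0.01),
        rotations=np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),
        colors=np.random.rand(n, 3),
        opacities=np.ones(n),
    )


# The expensive AI step runs once per scene; rendering it afterwards is a
# cheap, repeatable operation with guaranteed multiview consistency.
scene = generate_scene("a cozy reading room")
```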