
Analysis

Why Fei-Fei Li, Yann LeCun and DeepMind Are All Betting on “World Models” — and How Their Bets Differ

Gaussian splats, SIMA 2, JEPA and Genie 3 — and why “world model” now means three very different things at once.

November 13, 2025

AI has finally reached the “we need to model the whole world” phase.

In the same week, Fei-Fei Li’s World Labs shipped Marble, a “multimodal world model” that turns prompts into walkable 3D scenes in your browser; DeepMind announced SIMA 2, an embodied agent that plays, reasons and learns with you inside virtual 3D worlds; and reports emerged that Meta’s chief AI scientist Yann LeCun is leaving to build a world-model startup of his own. DeepMind, meanwhile, is also calling its interactive video engine Genie 3 a world model.

Same phrase. Three very different bets.

Decades before Gaussian splats and tweetable demos, though, the phrase “world model” belonged to psychology and cognitive science rather than to AI marketing. In 1943, Scottish psychologist Kenneth Craik argued in The Nature of Explanation that the brain builds a “small-scale model of external reality” in order to predict events, reason about consequences and test hypothetical actions. That idea helped set up the post-war cognitive revolution: thinking as simulation, mind as something that runs models of the world rather than just reacting to stimuli. Later work on mental models, predictive processing and internal forward models in motor control all recycled the same basic intuition: intelligence comes from having some internal machinery that can stand in for the outside world long enough to let you think ahead instead of just flinching.

When AI people talk about “world models” today, they’re borrowing that lineage — but in practice the term has become a Rorschach test. For some, it means a neat internal latent state inside a control system; for others, a videogame-like simulator where agents can learn; for still others, any 3D pipeline that outputs something you can walk around in. The startup ecosystem has happily leaned into the ambiguity: if you can show investors a glossy video and say it’s a “world model”, few will press you on whether it actually supports prediction, planning or generalisation. That’s how we end up with three companies, in the same week, all claiming to ship “world models” that share a name, a vibe and almost nothing else under the hood.

The week “world models” went mainstream

World Labs has spent the year rolling out a neat narrative stack: Fei-Fei Li’s manifesto, From Words to Worlds: Spatial Intelligence Is AI’s Next Frontier, argues that language-only systems (LLMs) are a dead end and that the real frontier is “spatial intelligence” and “world models” that understand 3D space, physics and action. On top of that sits the launch of Marble, which promises anyone can now generate editable 3D worlds from text, images, videos or simple layouts.

At almost the same time, outlets like Nasdaq reported that LeCun is preparing to leave Meta and raise money for a company “focused on world models” in the very different sense he’s been sketching since his 2022 paper A Path Towards Autonomous Machine Intelligence (Nasdaq, paper PDF).

On Hacker News, the Marble launch thread is full of arguments about Gaussian splats and game engines (HN). The LeCun thread is full of arguments about whether Meta has chosen “AI slopware” over proper research. Same word, different fights.

To understand why, we have to start with the only thing anyone can actually click.

World Labs’ world model: Gaussian splats for humans

Marble, as shipped today, is a full-stack 3D content pipeline:

  • It takes text prompts, single images, short videos or blocky 3D layouts.
  • It hallucinates a 3D representation of a scene.
  • It lets you walk around that scene in a web or VR viewer and tweak it with an in-browser editor called Chisel.
  • It exports as Gaussian splats, standard meshes (OBJ/FBX) or flat video for downstream tools (Marble docs, RadianceFields explainer).

For people who ship VR apps or game levels, a pipeline that goes “prompt → 3D world → export to Three.js / Unity” is extremely useful. World Labs even ships its own Three.js renderer, Spark, specifically tuned for splats (Spark release).
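
To make “downstream tools” concrete, here is a minimal sketch of consuming such an export in Python. It assumes a hypothetical scene.obj file from Marble and uses the generic open-source trimesh library, which is not a World Labs tool:

```python
# Minimal sketch: consuming a (hypothetical) Marble OBJ export.
# trimesh is a generic mesh library; "scene.obj" is an assumed filename.
import trimesh

scene = trimesh.load("scene.obj")   # parse the exported mesh
print(scene.bounds)                 # axis-aligned bounding box of the scene
scene.export("scene.glb")           # re-export for a web or game-engine viewer
```

The point is simply that a mesh export drops you into twenty years of existing tooling, while a splat export still needs a specialised renderer like Spark.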

But it’s very much a 3D asset story. On Marble’s own blog, “world model” sits in the same sentence as “export Gaussian splats, meshes and videos”; there is no robot in sight.

Hacker News users clocked that immediately. One early top-level comment, contrasting Marble with DeepMind’s video-based Genie, reads:

“Genie delivers on-the-fly generated video that responds to user inputs in real time. Marble renders a static Gaussian Splat asset (like a 3D game engine asset) that you then render in a game engine.”

Another says, with the particular baffled politeness of an ML engineer:

“Isn’t this a Gaussian Splat model? I work in AI and, to this day, I don’t know what they mean by ‘world’ in ‘world model’.”

Reddit is less shy. In an r/StableDiffusion thread about the first demo from the “$230m startup led by Fei-Fei Li”, one commenter sums it up as:

“Taking images and turning them into 3D environments using gaussian splats, depth and inpainting. Cool, but that’s a 3D GS pipeline, not a robot brain.”

(Reddit thread)

That doesn’t make Marble bad. It does make its use of “world model” slightly ambitious. To see how, you need a quick primer in what a Gaussian splat actually is.

If you’re not a 3D person, 2025’s splat discourse can sound like hand-waving. In practice, there are three characters here:

  • Photogrammetry – The old guard. Take hundreds of overlapping photos of a real thing, reconstruct a polygon mesh (a shell made of tiny triangles), and bake textures on top. Great if you want to measure, collide or 3D-print.

  • 3D Gaussian splatting – The new hotness. Represent the scene as millions of fuzzy coloured blobs (“Gaussians”) floating in space, and “splat” them onto the screen so they blend into an image. Excellent at foliage, hair and soft light; runs in real time on gaming GPUs. The canonical paper is Kerbl et al.’s 3D Gaussian Splatting for Real-Time Radiance Field Rendering. (A toy sketch of the idea follows this list.)

  • Renderers – Engines like Three.js, Unity or Unreal that take a mesh or a splat cloud and turn it into pixels.
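
If “millions of fuzzy coloured blobs” still sounds abstract, the sketch below shows the core idea in toy form: project each Gaussian onto the image and alpha-blend front to back. Everything here is simplified for illustration (isotropic blobs, orthographic projection, fixed opacity); the real method in Kerbl et al. uses anisotropic covariances, spherical-harmonic colour and a tile-based GPU rasteriser.

```python
# Toy splatting in NumPy: a scene is just a bag of coloured Gaussian blobs.
# Deliberately simplified; see Kerbl et al. for the real algorithm.
import numpy as np

H, W = 64, 64
rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(200, 3))   # xyz position of each splat
colors  = rng.uniform(0, 1, size=(200, 3))    # rgb colour of each splat
scales  = rng.uniform(0.02, 0.08, size=200)   # isotropic radius

ys, xs = np.mgrid[0:H, 0:W]
px, py = xs / W * 2 - 1, ys / H * 2 - 1       # pixel coords in [-1, 1]
image = np.zeros((H, W, 3))
alpha_acc = np.zeros((H, W))                  # opacity accumulated so far

# Draw front to back (sorted by depth) with "over" compositing.
for i in np.argsort(centers[:, 2]):
    cx, cy, _ = centers[i]                    # orthographic: ignore z for xy
    g = np.exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2 * scales[i] ** 2))
    a = 0.8 * g * (1 - alpha_acc)             # this splat's contribution
    image += a[..., None] * colors[i]
    alpha_acc += a
```

Note what is absent: no geometry, no collisions, no physics. The blobs only know how to look right from a viewpoint, which is exactly why the mesh-versus-splat trade-off exists.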

A photogrammetry practitioner on r/photogrammetry puts the trade-off like this:

“Use photogrammetry if you want to do something with the mesh itself, and Gaussian splatting if you want to skip all the steps and just show the scan like it is. It’s kind of a shortcut to interactive photorealism.”

(explainer thread)

Marble lives squarely in that world: it’s a shortcut to interactive photorealism. It generates splats/meshes and hands them to a renderer. The “world” it models is the part we can see and walk around in. It’s for humans (and game engines), not for machines to think with.

Fei-Fei Li’s essay, however, speaks in a different register.

She writes about “embodied agents”, “commonsense physics” and “robots that can understand and act in the world” — all the things you would want a robot’s internal model to support. Marble is presented as “step one” on that road. The tension and the comic potential come from the fact that step one is currently a very polished 3DGS viewer.

Ironically, Fei-Fei Li’s original manifesto, From Words to Worlds, never once mentions 3D Gaussian Splatting — the very technique at the heart of Marble’s output pipeline.

If Marble were the only “world model” on offer, you could reasonably conclude that the term has been kidnapped by marketing. Unfortunately for your hot take, Yann LeCun exists.

LeCun’s world model: the brain in the middle

LeCun’s use of “world model” comes from control theory and cognitive science rather than from 3D graphics.

In A Path Towards Autonomous Machine Intelligence (PDF), he describes a system in which:

  • A world model ingests streams of sensory data.
  • It learns latent state: compressed internal variables that capture “what’s going on out there”.
  • It learns to predict how that latent state will evolve when the agent (or environment) acts.
  • A separate module uses that machinery to plan and choose actions.

You never see the world model directly. It doesn’t need to output pretty pictures. Its job is to let an agent think a few steps ahead.

JEPA-style models — “Joint Embedding Predictive Architectures” — are early instances of this approach: instead of predicting raw pixels, they predict masked or future embeddings, and they are trained to produce useful representations rather than perfect renderings. LeCun has been giving talks about this since at least 2022 (YouTube).
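
As a rough sketch of what that means mechanically: encode what you saw, predict the embedding of what you didn’t, and keep the target encoder out of the gradient path. The code below is illustrative only (toy MLP encoders, MSE loss, random tensors standing in for video frames); real JEPA systems such as I-JEPA use vision transformers, masking schemes and an exponential-moving-average target encoder.

```python
# Illustrative JEPA-style training step, not Meta's implementation.
# The key point: the loss lives in embedding space, not pixel space.
import copy
import torch
import torch.nn as nn

def make_encoder():
    return nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256),
                         nn.ReLU(), nn.Linear(256, 128))

encoder = make_encoder()             # encodes the visible "context"
target = copy.deepcopy(encoder)      # frozen copy; EMA-updated in practice
predictor = nn.Linear(128, 128)      # predicts the target embedding
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(predictor.parameters()), lr=1e-4)

# Random tensors stand in for a context frame and a future/masked frame.
context = torch.rand(8, 1, 64, 64)
future = torch.rand(8, 1, 64, 64)

opt.zero_grad()
z_pred = predictor(encoder(context))   # predicted latent of the unseen part
with torch.no_grad():
    z_tgt = target(future)             # target latent, no gradients
loss = nn.functional.mse_loss(z_pred, z_tgt)  # compare embeddings, not pixels
loss.backward()
opt.step()
```

Nothing in this loop ever renders an image, which is exactly LeCun’s point: the model is judged on whether its latents predict well, not on whether they look good.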

When Nasdaq and others reported that he’s spinning out to build a world-model startup (Nasdaq), the reaction on HN wasn’t, “ooh, another 3D viewer.” It was:

  • does this mean Meta has given up on this line of research in favour of GPT-ish products?
  • can a JEPA-like architecture ever match LLMs in practical usefulness?
  • is there even a market for a world model that mostly lives in diagrams and robot labs?

Whether you think LeCun is right or wrong, you can’t really accuse him of chasing the same thing as World Labs. One “world model” is essentially a front-end asset generator. The other is a back-end predictive brain.

And then there’s DeepMind, happily occupying the middle.

DeepMind’s world model: worlds as video

DeepMind’s Genie 3 model is introduced, without much modesty, as “a new frontier for world models” (blog). Around the same time, DeepMind also unveiled SIMA 2, “an agent that plays, reasons, and learns with you in virtual 3D worlds”, explicitly positioning it as a generalist, game-like testbed for embodied world models.

From a text prompt, Genie 3 generates an interactive video-like environment at 720p / 24 fps that you (or an agent) can move around in for several minutes. Objects persist across frames, you can “prompt” world events (“it starts raining”), and the whole thing functions as a tiny videogame rendered by a model instead of a traditional engine.
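
Genie 3’s architecture and API are not public, so the following is purely a hypothetical interface sketch of what “a videogame rendered by a model” amounts to: the game loop is autoregressive generation conditioned on the action and on everything generated so far. Every name here is invented for illustration.

```python
# Hypothetical sketch of a learned video world; Genie 3's real interface
# is not public, and every name below is invented for illustration.
from typing import List

class LearnedWorld:
    def __init__(self, prompt: str):
        self.frames: List[bytes] = [self._generate(prompt, action=None)]

    def _generate(self, context, action) -> bytes:
        return b"..."          # placeholder for a generated 720p frame

    def step(self, action: str) -> bytes:
        # Conditioning on the full frame history is what lets objects
        # persist across frames instead of flickering in and out.
        frame = self._generate(self.frames, action)
        self.frames.append(frame)
        return frame

world = LearnedWorld("a rainy warehouse")
for action in ["walk forward", "turn left", "make it rain harder"]:
    frame = world.step(action)   # ideally 24 of these per second
```

The contrast with Marble is structural: there is no exported asset anywhere, just a model answering “what happens next?” one frame at a time.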

The Guardian describes it as a way for AI agents and robots to “train in virtual warehouses and ski slopes” before they ever touch the real world (Guardian). DeepMind is perfectly happy to connect it to the AGI narrative.

Where Marble generates assets and LeCun dreams of latents, Genie 3 produces simulators: online environments where you can act, observe consequences and learn. SIMA 2 sits on top of that simulator story: a policy layer that can explore many such worlds, build a transferable sense of physics and affordances, and follow natural-language instructions inside the generated environments.

On HN, when someone asks “how does Marble compare?”, a typical answer echoes the early comment quoted above:

“Genie is on-the-fly generated video that responds to user inputs in real time. Marble is a static Gaussian splat asset you render in a game engine.”

Again, not an insult — just taxonomy.

One word, three bets

Put all of this together and “world model” now covers at least three distinct ideas (plus one obvious hybrid):

  1. World models as interface
    Marble is a beautiful way to go from words and flat media to 3D environments humans can edit and share. The “world” is whatever your Quest headset needs next.

  2. World models as simulator
    Genie-style models produce continuous, controllable video worlds where agents can try things, fail, and try again. The “world” is whatever keeps the game loop coherent.

    SIMA 2-style agents are built on top of those worlds: they treat them as a curriculum of toys, labs and games in which to acquire general skills — navigating, manipulating, following instructions — that might one day transfer to physical robots.

  3. World models as cognition
    LeCun-style architectures are about internal predictive state. The “world” lives inside an agent as latent variables and transition functions.

Fei-Fei Li’s writing borrows heavily from bucket (3) — embodied agents, intuitive physics — while Marble, so far, mostly occupies bucket (1). LeCun’s plans live squarely in (3), with the hope that someone, someday, builds a good version of (2) on top. Genie lives between (2) and (3), with occasional marketing holidays in all of them. SIMA 2 is DeepMind’s attempt to show what happens when you drop a generalist agent into those simulated worlds and ask it to actually play.

If you only look at Marble’s demo, it’s tempting to say “world model” is just 3DGS with better PR. If you only read LeCun, it’s tempting to believe language models were a historical detour and JEPA will save us all. If you only read DeepMind, it’s simulated ski slopes all the way down.

The truth is they’re all building different parts of the same vague ambition: give machines some structured way to think about the world, beyond next-token prediction. One group starts from the rendering (Marble and its splats), one from the physics and simulator loop (DeepMind’s Genie and SIMA), one from the internal code (LeCun’s JEPA-style architectures and whatever his startup ships next).

Until the jargon catches up, the safest move when you see a “world model” headline is to ask three questions — all variations on Craik’s old claim that a mind is something that can run little models of the world:

  1. Is this a thing for humans to look at, a place for agents to train, or a box inside a diagram — an actual internal model the rest of the system consults?
  2. Does it output static assets, real-time frames, or mostly latent states that drive prediction and control?
  3. If you knock over a virtual vase, does anything in the system remember — and use that memory to update its future expectations — for more than one frame?

If the answers are “for humans”, “static assets” and “not really”, you’re basically looking at a very nice Gaussian splat viewer with a vintage cognitive-science label. If they’re “for agents”, “real-time” and “yes, in latent space”, you’re closer to the kind of world model Craik was gesturing at, LeCun has been sketching, and DeepMind is trying to simulate — the one that, very inconveniently for demo culture, doesn’t fit in a single tweetable GIF.
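
Or, compressed into a deliberately silly triage function, whose buckets are this article’s editorial taxonomy and nobody’s official product category:

```python
# The three questions as a toy classifier. The labels are editorial,
# not an official taxonomy from World Labs, Meta or DeepMind.
def classify_world_model(audience: str, output: str, remembers: bool) -> str:
    if audience == "humans" and output == "static assets" and not remembers:
        return "a very nice Gaussian splat viewer"
    if audience == "agents" and output == "real-time frames":
        return "a learned simulator (Genie-style)"
    if output == "latent states" and remembers:
        return "a predictive internal model (JEPA-style)"
    return "marketing, pending further evidence"

print(classify_world_model("humans", "static assets", False))
```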

It’s also worth separating the rhetoric from the plumbing. Fei-Fei Li and LeCun often sound like card-carrying LLM haters, but a more charitable reading is that they’re objecting to LLMs as the only game in town, not to large neural sequence models per se. In practice, multimodal LLMs with structured outputs — code, programs, API calls, scene graphs — are likely to be one of the main ways we specify tasks, describe environments and glue world models to products. Even if you believe that real intelligence lives in spatial prediction and internal latent state, it’s hard to deny that GPT-style models are what blew the funding window open: without the shock of ChatGPT, there is no Marble launch, no Genie 3 demo tour, and probably no appetite in the capital markets to bankroll a slow, unsexy research programme about agents learning to nudge virtual boxes around.

In the meantime, it’s a safe bet that more robotics labs, AV stacks and “agent platforms” will quietly relabel whatever they already have as a world model. After all, in this business it’s often cheaper to rename the diagram than to redraw it — especially when the money that pays for the ink arrived on the back of the very language models everyone now loves to dismiss.