<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Joe Dong | Physical AI Deep Dives]]></title><description><![CDATA[Deep dives on robotics and physical AI papers, system design, and the commercial implications that matter.]]></description><link>https://newsletter.joedong.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!wK-Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87948048-3349-497d-9bef-88fb4ad4bc2f_1024x1024.png</url><title>Joe Dong | Physical AI Deep Dives</title><link>https://newsletter.joedong.ai</link></image><generator>Substack</generator><lastBuildDate>Tue, 12 May 2026 15:36:44 GMT</lastBuildDate><atom:link href="https://newsletter.joedong.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Joe]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[joedong42@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[joedong42@substack.com]]></itunes:email><itunes:name><![CDATA[Joe Dong]]></itunes:name></itunes:owner><itunes:author><![CDATA[Joe Dong]]></itunes:author><googleplay:owner><![CDATA[joedong42@substack.com]]></googleplay:owner><googleplay:email><![CDATA[joedong42@substack.com]]></googleplay:email><googleplay:author><![CDATA[Joe Dong]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[System Thinking and Model Thinking in Robot Learning]]></title><description><![CDATA[Both schools are betting on the same variable -- DATA.]]></description><link>https://newsletter.joedong.ai/p/system-thinking-and-model-thinking</link><guid isPermaLink="false">https://newsletter.joedong.ai/p/system-thinking-and-model-thinking</guid><dc:creator><![CDATA[Joe Dong]]></dc:creator><pubDate>Mon, 11 May 2026 14:02:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!q2RN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a0c92c-83ed-4ecb-a118-91459ff47f8d_1008x504.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Two things landed on my feed in the same week &#8212; Danfei Xu&#8217;s interview with whynotTV[1] and the latest Genesis release[2] &#8212; and the word that kept showing up in both was the same. Not a new architecture. Not a better loss. 
It&#8217;s the <strong>System.</strong></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-Sne!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feb6394-d3da-4b96-9852-d2099df219ab_1008x481.png" width="1008" height="481" alt=""></figure></div><h2>Two framings</h2><p><strong>Model thinking</strong> treats robot learning as a model problem. What matters is architecture, loss, tokenization, training schedule. Push the model harder; the rest of the stack is given.</p><p><strong>System thinking</strong> treats robot learning as a pipeline problem. Scenes, hardware, capture rigs, training stack, evaluation, deployment &#8212; all in scope. The model is one component; push every part of the stack.</p><p>Most teams do some of both. But each team has a default &#8212; the side they invest in when they have to choose.</p><h2>The model school</h2><p>The cleanest commercial expression is the &#960;-series from Physical Intelligence[3][4][5]. Each release foregrounds a model-side mechanism &#8212; fast token serialization, knowledge insulation, conditioning on mixed-quality data &#8212; with the data engine treated as a (well-invested) input rather than the research target. Real contributions; &#960;0.7 is one of the more thoughtful pieces of model work this year.</p><p>The academic counterpart is the VLA-versus-world-model debate. World models[6] consume internet video and simulation; VLA models inherit from instruction-tuned LLMs. Both are arguments about how best to abstract robot learning into a learnable function.</p><p>In every case, the unit of innovation is the model.
Data and hardware are taken as given.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!4RPl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F732421ba-1b65-470b-8ee2-2df918ce7878_2000x1101.png" width="1456" height="802" alt="Understanding pi0 by Physical Intelligence: A Vision-Language-Action Flow Model for General Robot Control"></figure></div><h2>The system school</h2><p>The clearest example is the UMI line out of Stanford &#8212; UMI[7], HoMMI[8], DexUMI[9]. Each paper introduces a new capture rig as its load-bearing contribution; the policy network is whatever standard architecture fits the data. The point is not &#8220;a better model&#8221; &#8212; it is &#8220;a way to collect data that didn&#8217;t exist before.&#8221;</p><p>The same pattern shows up in Danfei Xu&#8217;s egocentric direction[1], the EgoScale / EgoVerse consortium work[10][11], and commercial programs like Generalist and Sunday.</p><p>These results seed lineages. UMI led to DexUMI, then HoMMI. EgoScale[10] led to EgoVerse[11]. Capture rigs and protocols outlive any single result.
Each becomes someone else&#8217;s starting point.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!NE-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F758e71fc-d43d-4b51-908e-6ca60182871c_1200x494.jpeg" width="1200" height="494" alt="Image"></figure></div><h2>The common thread</h2><p>On the surface the two schools look very different. Model thinking ships architectures; system thinking ships hardware. But ask what each is actually contributing, and the answer lands in the same place.</p><p>UMI&#8217;s gripper is a way to collect manipulation data without a robot in the loop. HoMMI scales whole-body mobile capture by integrating egocentric sensing and mixed-reality interfaces. EgoScale turns egocentric video into a real training signal. Hardware is the visible artifact; data is the contribution.</p><p>Retell the model side the same way. &#960;0.7&#8217;s mixed-quality conditioning is a method for absorbing noisier human-and-robot data without performance collapse[5]. World models use non-robot video as a denser supervision substrate. VLAs absorb internet-scale image-language pretraining into robot policies.</p><p>The variable being optimized is <strong>data</strong>. The system school competes by enlarging the supply. The model school competes by extracting more signal per unit of existing supply.
Two bets on opposite sides of the same <strong>scaling law</strong>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!q2RN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a0c92c-83ed-4ecb-a118-91459ff47f8d_1008x504.png" width="1008" height="504" alt=""></figure></div>
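<p>A quick way to see why these are two ends of one bet: if policy quality roughly tracks the product of data supply and signal extracted per sample, doubling either factor moves the same quantity. The sketch below is a toy illustration with made-up constants and an assumed power-law exponent, not a fitted scaling law.</p><pre><code class="language-python"># Toy illustration only: assumes policy quality follows a saturating power law in
# "effective data" = (signal extracted per sample) * (number of samples).
# Every constant here is made up for the sake of the argument.

ALPHA = 0.3            # assumed diminishing-returns exponent
BASE_SUPPLY = 1e6      # assumed number of demonstrations in the corpus
BASE_SIGNAL = 1.0      # assumed useful signal extracted per demonstration


def toy_quality(supply: float, signal_per_sample: float, alpha: float = ALPHA) -> float:
    """Toy proxy for policy quality as a function of effective data."""
    return (signal_per_sample * supply) ** alpha


baseline = toy_quality(BASE_SUPPLY, BASE_SIGNAL)
system_bet = toy_quality(2 * BASE_SUPPLY, BASE_SIGNAL)   # system school: enlarge the supply
model_bet = toy_quality(BASE_SUPPLY, 2 * BASE_SIGNAL)    # model school: extract more per sample

print(f"baseline : {baseline:.2f}")
print(f"2x supply: {system_bet:.2f}")
print(f"2x signal: {model_bet:.2f}")  # identical to 2x supply -- same lever, opposite ends
</code></pre><p>Under that toy assumption, the two bets differ only in which factor is cheaper to double right now.</p>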
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why the system side has more room</h2><p><strong>Robot data is scarce</strong> by several orders of magnitude relative to internet-scale corpora[12]. When supply is that constrained, the biggest lever is enlarging it. Model-side innovations look smaller in absolute terms not because the work is worse &#8212; but because the supply they operate on hasn&#8217;t been allowed to grow yet.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hYfY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hYfY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 424w, https://substackcdn.com/image/fetch/$s_!hYfY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 848w, https://substackcdn.com/image/fetch/$s_!hYfY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 1272w, https://substackcdn.com/image/fetch/$s_!hYfY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hYfY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png" width="1008" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1008,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115296,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/197169351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hYfY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 424w, https://substackcdn.com/image/fetch/$s_!hYfY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 848w, https://substackcdn.com/image/fetch/$s_!hYfY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 1272w, https://substackcdn.com/image/fetch/$s_!hYfY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14baa244-6b6e-4a99-ae8c-529895178608_1008x412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This will not last. As capture programs, consortium datasets, and simulation pipelines mature, supply expands and the relative size of the two levers shifts. CV and NLP both went through this rhythm: architecture-driven before ImageNet and Common Crawl, extraction-driven after. 
The honest read is not &#8220;system has won.&#8221; It is &#8220;supply is small enough right now that the side enlarging it has the larger lever.&#8221; <strong>The regime will change.</strong></p><h2>Closing</h2><p>Both schools have already conceded, implicitly, that <strong>data is the binding constraint on robot learning</strong>. They disagree only about which side of the data equation to push on. For builders, the question is which side has more room right now. Today, that room is on the system side. The teams that invest in both, and shift weight when the regime changes, are the ones I&#8217;d bet on.</p><h2>References</h2><p>[1] <a href="https://youtu.be/__P5yygfRRQ?si=c02lW283GCer4RM8">Xu, D. Interview on system-centric robot learning. whynotTV, 2026</a>.</p><p>[2] <a href="https://www.genesis.ai/blog/gene-26-5-advancing-robotic-manipulation-to-human-level">Genesis Embodied AI. Latest release and accompanying blog post. 2026.</a></p><p>[3] Black, K., Brown, N., Driess, D., et al. &#960;0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence, 2024. https://www.physicalintelligence.company/blog/pi0</p><p>[4] Physical Intelligence. &#960;0.5: Knowledge Insulation for Robot Foundation Models. 2025.</p><p>[5] Physical Intelligence. &#960;0.7: Conditioning on Mixed-Quality Robot Data. 2026.</p><p>[6] Hafner, D., et al. DreamerV3: Mastering Diverse Domains through World Models. 2023. See also LeCun, Y., A Path Towards Autonomous Machine Intelligence. 2022.</p><p>[7] Chi, C., Xu, Z., Pan, C., et al. Universal Manipulation Interface: In-the-Wild Robot Teaching Without In-the-Wild Robots. RSS 2024. https://umi-gripper.github.io/</p><p>[8] Xu, X., Park, J., Zhang, H., et al. HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations. arXiv:2603.03243, 2026.</p><p>[9] Stanford REAL Lab. DexUMI: Extending UMI to Dexterous Manipulation. 2025.</p><p>[10] EgoScale: Large-Scale Egocentric Pretraining for Manipulation Policies. 2025.</p><p>[11] Punamiya, Kareer, Liu, et al. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World. 2026. https://egoverse.ai/</p><p>[12] Open X-Embodiment Collaboration. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 2024. https://robotics-transformer-x.github.io/</p>
]]></content:encoded></item><item><title><![CDATA[Robot Learning Should Be Goal-Driven and Data-First]]></title><description><![CDATA[A goal-driven view on robot learning, industrial deployment, and the data engine needed to scale dexterous robots.]]></description><link>https://newsletter.joedong.ai/p/robot-learning-should-be-goal-driven</link><guid isPermaLink="false">https://newsletter.joedong.ai/p/robot-learning-should-be-goal-driven</guid><dc:creator><![CDATA[Joe Dong]]></dc:creator><pubDate>Tue, 28 Apr 2026 14:03:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kDtu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e9591e-32b9-4800-b5c7-413d6811cbe8_2992x1494.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Generalist AI recently published a <a href="https://generalistai.com/blog/apr-07-2026-beyond-world-models">blog post</a> that sparked a lot of discussion in robotics. The main takeaway is simple: robotics research should be more goal-driven than idea-driven.</p><p>Instead of starting from a method label &#8212; world model, VLA, end-to-end policy, diffusion policy &#8212; we should start from the actual objective and constraints.</p><p>What are we trying to achieve? What are the constraints today? Given those constraints, what is the most practical path forward?</p><p>This sounds more modest than chasing a single grand idea. But in robotics, I think this mindset is especially important.</p><h2>The Waymo vs.
Tesla analogy</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!c2RI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc5969e-d9c0-44bc-a51c-f37247e933ce_550x300.jpeg" width="550" height="300" alt="Waymo vs Tesla Austin: Robotaxi Battle Intensifies"></figure></div><p>This reminds me of the difference between Waymo and Tesla in autonomous driving.</p><p>Tesla has always been very good at explaining its technical vision publicly: BEV perception, neural planning, end-to-end learning, world models, and so on. It is a very method-forward narrative.</p><p>Waymo feels more goal-driven. The goal is to launch truly driverless service safely and reliably. Given that objective, Waymo combines whatever methods are needed: sensors, maps, SOTA foundation models, simulation, safety validation, and large-scale operations.</p><p>The approach may look less ideologically pure, but it is practical and effective: Waymo keeps launching truly driverless robotaxi service in more and more cities. I think robot learning needs more of this attitude. The goal is not to prove that one model class is the final answer.
The goal is to make robots work under real-world constraints.</p><h2>What are the real constraints in robot learning?</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vyVN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a4a816-0f37-4f21-888f-ce47be044385_1523x856.png" width="1456" height="818" alt="Re-imagining Telepresence With Humanoid Robots And VR Headsets | Hackaday"></figure></div><p>For robot learning today, the biggest constraint is still data.</p><p>Compared with language models, robotics is still in a relatively low-data regime. We do not yet have anything close to internet-scale, high-quality robot interaction data.</p><p>Because of that, highly adaptive and generalized robot policies are still elusive. It is still hard to build one model that can reliably handle many tasks, many objects, and many environments.</p><p>This is why, at Chestnut Robotics, we focus first on <strong>industrial setups.</strong></p><p>Industrial environments are not easy. But they are more structured and controlled than household environments. The workspace is more predictable. The object distribution is narrower. The success criteria are clearer. Given the current state of robot learning, this makes industrial manipulation a more practical place to start.</p><p>To me, this is a goal-driven choice. We are not trying to solve the hardest version of general-purpose robotics on day one.
<em>We are trying to choose a deployment setting where today&#8217;s models, hardware, and data pipelines can compound into real progress.</em></p><h2>Tesla was right about data</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!WZtY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4b5ea3-d7fd-40e2-b21f-17feca3e6079_1536x737.jpeg" width="1456" height="699" alt=""></figure></div><p>That said, Tesla also had a very important insight.</p><p>Even if one debates specific technical choices, Tesla understood early that autonomy would ultimately become a data-driven problem. Its product strategy allowed it to collect a huge amount of real-world driving data through deployed vehicles.</p><p>That data-first view is powerful. I think the same is true for robotics. The long-term winner in robot learning will not only have a better model. It will have a better data engine.</p><p>This is where robotics is harder than autonomous driving. High-quality robot data is expensive. Dexterous manipulation data is even more expensive. You need robots, operators, sensors, resets, maintenance, calibration, and a lot of trial and error.
For robotic hands, the cost is even higher because contact-rich manipulation is fragile and high-dimensional.</p><p>So the key question becomes: how do we scale high-quality dexterous manipulation data?</p><h2><a href="https://chestnut.bot/">Chestnut</a>&#8217;s data strategy</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kDtu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04e9591e-32b9-4800-b5c7-413d6811cbe8_2992x1494.png" width="1456" height="727" alt=""></figure></div><p>At <a href="https://chestnut.bot/">Chestnut</a>, our answer is to rethink the data collection pipeline. Instead of collecting all data directly through robot operation, we use a human-wearable exoskeleton to capture high-quality human dexterous manipulation data.</p><p>But the critical point is not just collecting human data. The critical point is conversion.</p><p>Through our co-designed robotic hand and generative inpainting algorithm, we can eventually convert human hand interaction data into high-quality robot data. This is the key merit of the system: we are not only recording humans; we are building a scalable bridge from human dexterity to robot dexterity.</p><p>This matters because human dexterous data is much easier to scale than robot dexterous data. But the biggest downside of human data is the embodiment gap between robot and human. If we can reliably convert human dexterous data into robot-quality training data, then we can change the data economics of robot learning.</p><p>That is the deeper bet: a better way to create the data that makes better policies possible.</p><h2>A more modest view</h2><p>My current view is that robotics probably will not be solved by one clean architecture choice.</p><p>World models are useful. VLAs are useful. End-to-end policies are useful.
Classical planning, simulation, teleoperation, and structured industrial deployment can all be useful too.</p><p>The real question is how to combine them under today&#8217;s constraints.</p><p>For us, the most important constraints are clear: robot learning needs more high-quality data, especially for dexterous manipulation, and current robot data collection is too expensive to scale.</p><p>So our focus is simple: start from structured industrial environments, build a scalable data engine, and use human dexterity as a path toward robot dexterity.</p><p>This may not sound as grand as claiming one model paradigm will solve robotics. But it feels like a practical path forward.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.joedong.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Joe Dong | Physical AI Deep Dives! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[π0.7: Prompt Engineering Comes to Robot Foundation Models]]></title><description><![CDATA[&#960;0.7&#8217;s key advance is full action context, not just a stronger model&#8212;unlocking compositional generalization.]]></description><link>https://newsletter.joedong.ai/p/07-prompt-engineering-comes-to-robot</link><guid isPermaLink="false">https://newsletter.joedong.ai/p/07-prompt-engineering-comes-to-robot</guid><dc:creator><![CDATA[Joe Dong]]></dc:creator><pubDate>Thu, 23 Apr 2026 15:00:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/863fe5d3-3298-4bc4-a460-49e54eaf9a8d_772x475.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i6kY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i6kY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 424w, https://substackcdn.com/image/fetch/$s_!i6kY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 848w, https://substackcdn.com/image/fetch/$s_!i6kY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 1272w, https://substackcdn.com/image/fetch/$s_!i6kY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i6kY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png" width="1289" height="230" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c259baa-5389-4568-95af-26a13949f28d_1289x230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:230,&quot;width&quot;:1289,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43026,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/195185439?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i6kY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 424w, https://substackcdn.com/image/fetch/$s_!i6kY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 848w, https://substackcdn.com/image/fetch/$s_!i6kY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 1272w, https://substackcdn.com/image/fetch/$s_!i6kY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c259baa-5389-4568-95af-26a13949f28d_1289x230.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><strong>&#960;0.7 expands the input to a robot policy from a single task instruction into a full context for how to act: language subtasks, subgoal images, episode metadata, and control mode. That is why it starts to show something robot learning has long been missing: compositional generalization.</strong></p><div><hr></div><h2>Overview</h2><p>Physical Intelligence recently released <strong>&#960;0.7: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities</strong>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.joedong.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Joe Dong | Physical AI Deep Dives! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><a href="https://www.pi.website/blog/pi07">https://www.pi.website/blog/pi07</a></p><p><a href="https://arxiv.org/pdf/2604.15483">https://arxiv.org/pdf/2604.15483</a></p><p>My main takeaway is simple:</p><blockquote><p><strong>The key step in &#960;0.7 is not just better performance. It is that the model no longer takes only a task instruction. It takes a full prompt about how to do the task. That includes language subtasks, subgoal images, episode metadata, and control mode. Because of that, it starts to show clear signs of compositional generalization.</strong></p></blockquote><p>What matters most in this paper is not one flashy demo.</p><p>What matters is the broader recipe: <strong>absorb all usable data, then use richer prompts and context to explicitly tell the model about strategy differences, quality differences, and goal differences inside that data.</strong></p><p>That way, the model learns not only <em>what</em> to do, but also:</p><ul><li><p>how to do it</p></li><li><p>how fast to do it</p></li><li><p>how well to do it</p></li><li><p>whether mistakes happened</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r1kd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r1kd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 424w, https://substackcdn.com/image/fetch/$s_!r1kd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 848w, https://substackcdn.com/image/fetch/$s_!r1kd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 1272w, https://substackcdn.com/image/fetch/$s_!r1kd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r1kd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png" width="1405" height="855" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bde6a420-052f-439f-b4bd-51e162547155_1405x855.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:855,&quot;width&quot;:1405,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1097437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/195185439?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r1kd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 424w, https://substackcdn.com/image/fetch/$s_!r1kd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 848w, https://substackcdn.com/image/fetch/$s_!r1kd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 1272w, https://substackcdn.com/image/fetch/$s_!r1kd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde6a420-052f-439f-b4bd-51e162547155_1405x855.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Why this matters</h2><h3>1. Robot prompt engineering becomes a core paradigm</h3><p>Many robot VLAs still use a very simple prompt. In practice, it is often just a task description.</p><p>&#960;0.7 goes much further. 
Its prompt can include:</p><ul><li><p>task and subtask language</p></li><li><p>subgoal images</p></li><li><p>episode metadata such as speed, quality, and mistakes</p></li><li><p>control mode such as joint control or end-effector control</p></li></ul><p>This matters because robot data is naturally messy.</p><p>Even for the same task, one operator may be fast, another careful. Some trajectories are expert demonstrations. Some are failed trajectories. Some are autonomous rollouts.</p><p><strong>Without extra context, all of these modes get mixed together. Then the model tends to learn an average behavior that looks like nobody.</strong></p><p>That is the core idea of &#960;0.7:</p><blockquote><p><strong>Use prompting to explicitly separate these modes.</strong></p></blockquote><p>My own read is that <strong>&#960;0.7 is fundamentally a diverse prompting strategy for robot models</strong>. It feels similar to prompt expansion in language models and video models.</p><p>The contribution is not just scaling parameters. It is organizing the conditioning signal in a much richer and more controllable way.</p><h3>2. Mixed-quality data is not just usable. It may be the better route.</h3><p>One strong signal from this paper is that <strong>low-quality data, failed data, autonomous rollout data, and even non-robot data do not have to be treated as noise.</strong></p><p>If the conditioning is done correctly, they can become useful training signals.</p><p>The ablations are especially important here. Without metadata, model performance can drop as the dataset gets larger but lower-quality on average. With metadata, &#960;0.7 keeps improving as more mixed-quality data is added.</p><p>That is a very important result.</p><p>It suggests that the key problem in future robot learning is not simply filtering data harder. It is:</p><blockquote><p><strong>How do we organize messy but real-world data so that the model can actually learn from it?</strong></p></blockquote><h3>3. Compositional generalization finally starts to look real</h3><p>One major weakness of current robot foundation models is that <strong>semantic generalization is often better than compositional generalization</strong>.</p><p>A model may recognize many objects and many task words. But once you change the tool, change the order, or recombine skills in a new way, performance often collapses.</p><p>&#960;0.7 is interesting because it starts to show several more convincing forms of compositional behavior:</p><ul><li><p>it can perform some new short-horizon tasks out of the box</p></li><li><p>it can handle new long-horizon tasks through step-by-step human coaching</p></li><li><p>it can distill those coaching trajectories into a higher-level language policy for autonomous execution</p></li></ul><p>That is much closer to how humans teach real skills.</p><p>We do not always collect thousands of low-level action labels first. Sometimes we explain the task step by step, then refine it into a repeatable skill.</p><div><hr></div><h2>The most impressive results</h2><h3>1. 
One generalist model can match specialists</h3><p>One of the strongest results in the paper is that <strong>&#960;0.7 can match, and in some cases exceed, task-specific specialist policies on several dexterous tasks</strong>.</p><p>Examples include:</p><ul><li><p>espresso making</p></li><li><p>box building</p></li><li><p>laundry folding</p></li></ul><p>This is important because the usual assumption is:</p><ul><li><p>generalist model = broader, but weaker</p></li><li><p>specialist model = stronger, but narrower</p></li></ul><p>&#960;0.7 pushes back on that tradeoff.</p><p>It suggests that with the right data recipe and the right prompting strategy, a generalist model can absorb specialist capability without giving up much performance.</p><h3>2. It is actually listening to language, not just replaying dataset bias</h3><p>The paper includes a very nice test called <strong>Reverse Bussing</strong>.</p><p>In the training data, the normal pattern is:</p><ul><li><p>trash goes into the trash</p></li><li><p>dishes go into the bussing bin</p></li></ul><p>At test time, the instruction is reversed:</p><ul><li><p>trash goes into the bussing bin</p></li><li><p>dishes go into the trash</p></li></ul><p>There is also <strong>Reverse Fridge to Microwave</strong>, which similarly goes against the dominant direction in the dataset.</p><p>&#960;0.7 performs much better than earlier models on these tasks.</p><p>That matters a lot.</p><p>Many robot models seem to follow language until language conflicts with the strongest bias in the training data. Then they fall back to the common pattern.</p><p>&#960;0.7 is still not perfect, but at least it shows that <strong>language is no longer just decoration. It is starting to genuinely steer behavior.</strong></p><h3>3. Cross-embodiment transfer starts to look real</h3><p>The paper also has a very interesting embodiment transfer result.</p><p>They collect shirt-folding and laundry-folding data on a smaller bimanual robot, then transfer the skill zero-shot to a much larger dual-UR5e platform.</p><p>What is interesting is that the new robot does <strong>not</strong> simply replay the same motion pattern.</p><p>Instead, it finds a different strategy that better fits its own geometry and dynamics. For example, the smaller robot may pick up clothes from the side, while the larger UR5e setup prefers a more top-down vertical grasp.</p><p>Even more striking, the zero-shot transferred policy gets very close to expert teleoperators on shirt folding.</p><p>That suggests the model is not just imitating source trajectories. It is learning something more like a higher-level manipulation skill.</p><div><hr></div><h2>Important details</h2><h3>1. The point is not just a bigger model</h3><p>A key detail here is that <strong>&#960;0.7 is not simply a giant video diffusion robot policy</strong>.</p><p>The main VLA is around <strong>5B</strong>:</p><ul><li><p>a Gemma3 4B VLM backbone</p></li><li><p>a MEM-style video history encoder</p></li><li><p>an <strong>860M flow-matching action expert</strong></p></li></ul><p>It also continues PI&#8217;s earlier ingredients:</p><ul><li><p>knowledge insulation</p></li><li><p>FAST tokens</p></li><li><p>flow matching objectives</p></li></ul><p>The larger branch is the one used to generate subgoal images. 
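</p><p>To keep this division of labor straight, here is a rough, hypothetical sketch of how the two branches fit together at inference time. The function names, signatures, and shapes are mine, not the paper&#8217;s.</p><pre><code># Hypothetical two-branch layout (my naming, not PI's API).

def subgoal_branch(observation_images, language_subtask):
    """Larger image-generation branch: turns a language subtask into
    near-future goal images, one per camera view."""
    # In the real system this is a large generative image model.
    return {"base_cam": "goal_base.png", "wrist_cam": "goal_wrist.png"}

def action_branch(observation_images, language_subtask, subgoal_images,
                  metadata, control_mode):
    """~5B VLA: a VLM backbone plus a flow-matching action expert that
    decodes a short action chunk conditioned on everything above."""
    # Placeholder: a chunk of 16 zero actions with 7 made-up dimensions.
    return [[0.0] * 7 for _ in range(16)]

def step(observation_images, language_subtask):
    goals = subgoal_branch(observation_images, language_subtask)
    return action_branch(observation_images, language_subtask, goals,
                         metadata={"quality": "expert"}, control_mode="joint")
</code></pre><p>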
That branch is initialized from <strong>BAGEL 14B</strong> and is responsible for turning language subtasks into near-future multi-view goal images.</p><p>So the important point is not &#8220;just make the model bigger.&#8221;</p><p>The important point is:</p><blockquote><p><strong>Build a system with clear division of labor, then connect language, world goals, actions, and data quality through prompting.</strong></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DBGg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DBGg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 424w, https://substackcdn.com/image/fetch/$s_!DBGg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 848w, https://substackcdn.com/image/fetch/$s_!DBGg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 1272w, https://substackcdn.com/image/fetch/$s_!DBGg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DBGg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png" width="1383" height="669" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:669,&quot;width&quot;:1383,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239330,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/195185439?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DBGg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 424w, https://substackcdn.com/image/fetch/$s_!DBGg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 848w, https://substackcdn.com/image/fetch/$s_!DBGg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DBGg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77fe67b-3f9d-4277-9a9e-d899a97c2a5a_1383x669.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>2. Subgoal images are genuinely useful</h3><p>Goal conditioning is not new. But &#960;0.7 uses it in a much more systematic way.</p><p>It does not rely on a single goal image. It uses <strong>multi-view subgoals</strong>, including base views and wrist views.</p><p>That is important because the goal needs to specify both:</p><ul><li><p>what should happen to the object and the scene</p></li><li><p>what state the robot arms and grippers should reach</p></li></ul><p>The paper also makes two practical engineering choices I like:</p><ul><li><p>randomly dropping different parts of the prompt during training</p></li><li><p>mixing real future images with generated goal images during training</p></li></ul><p>Both are very pragmatic.</p><p>The first makes the model robust to different prompt subsets at test time.</p><p>The second reduces the mismatch between training and deployment, since test-time subgoals may come from the world model rather than from ground-truth future frames.</p><h3>3. The ablation story is unusually strong</h3><p>I also like the ablations in this paper.</p><p>They are not just &#8220;remove one module and report the drop.&#8221; They ask more basic questions:</p><ol><li><p><strong>Can the model really learn from mixed-quality data?</strong></p></li><li><p><strong>Does data diversity actually improve generalization?</strong></p></li></ol><p>One especially good experiment compares:</p><ul><li><p>removing the most diverse 20% of the data</p></li><li><p>removing a random 20%</p></li></ul><p>The result is very clear: removing the most diverse 20% hurts much more.</p><p>That means the important variable is not only data scale. It is <strong>task diversity</strong>.</p><div><hr></div><h2>My take</h2><h3>1. 
Using all data sources, especially egocentric human data, is the 2026 direction</h3><p>My view is pretty clear here.</p><p>Demonstrations, suboptimal data, autonomous rollouts, egocentric human video, and web multimodal data should all move into the same training pipeline.</p><p>The key question is no longer whether to mix them.</p><p>The key question is <strong>how to condition on them correctly</strong>.</p><h3>2. &#8220;Dexterous task&#8221; is still a bit hand-wavy</h3><p>I increasingly think dexterity should be split into at least two levels:</p><ul><li><p><strong>wrist dexterity</strong>: folding laundry, peeling vegetables, handling objects with richer contact</p></li><li><p><strong>finger dexterity</strong>: pressing small buttons, using a mouse, fine in-hand manipulation, true multi-finger regrasp</p></li></ul><p>&#960;0.7 looks very strong on the first category.</p><p>It still looks much less convincing on the second.</p><p>That boundary matters, and I think many robotics papers still blur it too much.</p><h3>3. &#8220;Embodiment&#8221; is also becoming hand-wavy</h3><p>For UMI-style or end-effector-centric approaches, embodiment often mainly means the end-effector interface.</p><p>As long as the hand pose trajectory is aligned, the arm and body can often be abstracted away and handled by IK or other lower-level controllers.</p><p>But for PI-style joint-level policies, the robot arm itself is part of the embodiment.</p><p>That is because the model directly predicts joint-level actions. It must understand how this specific robot moves in order to choose a good strategy.</p><p>The laundry-folding transfer result shows this nicely. A smaller arm and a larger UR5e do not just execute the same motion at different scales. They may prefer different grasping strategies altogether.</p><p>My own view is that if UMI-like approaches are going to scale well, the robot body may need to become more human-like, at least in upper-body geometry and reachable manipulation space. That would make human hand 6D motion easier to reproduce with lower-level controllers.</p><h3>4. Goal conditioning is powerful, but I still have doubts about long-term scalability</h3><p>To be honest, I think the goal-conditioned route is very effective, but also somewhat hacky.</p><p>It makes the learning problem easier. But it also separates the system into two parts:</p><ul><li><p>one module generates the goal or subgoal</p></li><li><p>another module executes the trajectory conditioned on that goal</p></li></ul><p>This is practical, and clearly useful.</p><p>But it can also introduce suboptimality. If the generated goal is not quite right, the downstream policy may still be pulled in the wrong direction.</p><p>So my current view is:</p><p><strong>This is a strong near-term direction. I am just not yet convinced it is the final scalable form of a unified robot foundation model.</strong></p><div><hr></div><h2>My open question</h2><p>The biggest open question for me is this:</p><p><strong>How well can &#960;0.7-style generalization transfer from parallel-jaw grippers to truly dexterous hands?</strong></p><p>I am still skeptical.</p><p>The gap from grippers to dexterous hands may be much larger than the arm-morphology gap shown in this paper.</p><p>It is not just about reach, inertia, or kinematics.</p><p>The action space changes. The contact modes change.
The manipulation regime changes.</p><p>If that gap turns out to be too large, then many current &#8220;cross-embodiment&#8221; conclusions may still mostly hold inside the gripper world.</p><div><hr></div><h2>Final take</h2><p>If I had to summarize &#960;0.7 in one sentence, I would say this:</p><blockquote><p><strong>It is one of the clearest demonstrations so far that the next step in robot foundation models is not just larger models, and not just cleaner data, but richer prompts, messier but more realistic data, and stronger steerability.</strong></p></blockquote><p>What excites me most is not one benchmark result.</p><p>It is that &#960;0.7 pushes robot learning away from the old interface of &#8220;collect trajectories and train a policy,&#8221; and toward a new interface of <strong>prompt, coach, steer, and compose</strong>.</p><p>That feels much closer to the path toward truly general robot models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.joedong.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Joe Dong | Physical AI Deep Dives! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[NVIDIA EgoScale: Pretraining Dexterous Manipulation with 20,000 Hours of Egocentric Human Data]]></title><description><![CDATA[A simple but important idea: pretrain on large-scale human egocentric data first, then use a small amount of aligned human&#8211;robot data and robot demonstrations to make it work on real dexterous hands.]]></description><link>https://newsletter.joedong.ai/p/nvidia-egoscale-pretraining-dexterous</link><guid isPermaLink="false">https://newsletter.joedong.ai/p/nvidia-egoscale-pretraining-dexterous</guid><dc:creator><![CDATA[Joe Dong]]></dc:creator><pubDate>Tue, 14 Apr 2026 16:58:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8HzR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jfh0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jfh0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 424w, https://substackcdn.com/image/fetch/$s_!Jfh0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 848w, 
https://substackcdn.com/image/fetch/$s_!Jfh0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 1272w, https://substackcdn.com/image/fetch/$s_!Jfh0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jfh0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png" width="1372" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1372,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/194207153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jfh0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 424w, https://substackcdn.com/image/fetch/$s_!Jfh0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 848w, https://substackcdn.com/image/fetch/$s_!Jfh0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 1272w, https://substackcdn.com/image/fetch/$s_!Jfh0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd0b5f8-1ae2-4871-a827-311fea541932_1372x586.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Overview</h2><p>Recently NVIDIA GEAR released EgoScale. My one-line summary is:</p><blockquote><p>Use egocentric human video plus wrist and hand motion as a scalable supervision signal to pretrain a VLA. Then use a small amount of aligned human&#8211;robot data plus robot data to make it work on real dexterous manipulation.</p></blockquote><p>That is the core recipe.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.joedong.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Joe Dong | Physical AI Deep Dives! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The most important point is that this paper does not just show transfer. It shows a scaling law. EgoScale pretrains on 20,854 hours of egocentric human video. In the scaling study, average downstream task completion rises from 0.30 at 1k hours to 0.71 at 20k hours. The human-action validation loss follows a near log-linear trend, and that offline loss tracks real robot performance closely.</p><p>So my main takeaway is simple:</p><p><strong>This paper argues that dexterous manipulation can be pretrained from large-scale human data first, then aligned and improved with a much smaller amount of robot data.</strong></p><h2>Why this matters</h2><h3>1. It systematically shows that human data can scale for dexterous manipulation</h3><p>Earlier human-to-robot work already showed that transfer is possible. EgoScale is the scale-up version. The paper explicitly argues that prior work was limited by smaller human datasets and by lower-DoF hands. EgoScale pushes pretraining to more than 20,854 hours, which is over 20&#215; larger than prior efforts, and it tests transfer on a 22-DoF dexterous hand rather than a simple gripper.</p><p>So the bigger question here is:</p><blockquote><p>Can dexterous manipulation be pretrained from large-scale human data?</p></blockquote><p>EgoScale&#8217;s answer is yes. That is why I think this paper matters. It suggests that dexterous foundation models may move from a robot-data-first route to a human-data-first route.</p><p>This is also why I see EgoScale as a natural extension of work like EgoMimic, EgoVLA, DexWild, and related human-to-robot transfer papers. The difference is not only better engineering. The difference is scale.</p><h3>2. It shows a strong link from offline metric to real robot performance</h3><p>This is the part I care about most. The paper shows that human-action validation loss is not just an offline number. It correlates strongly with downstream real-robot task completion. That makes pretraining much more predictable. 
If the validation loss keeps improving at scale, downstream robot performance is likely to improve too.</p><p>That is a big deal. In many robot papers, offline metrics and real-world performance are only loosely connected. Here the paper is saying something stronger: the offline pretraining metric is actually useful.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8HzR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8HzR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 424w, https://substackcdn.com/image/fetch/$s_!8HzR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 848w, https://substackcdn.com/image/fetch/$s_!8HzR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!8HzR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8HzR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png" width="1456" height="807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2234931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/194207153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8HzR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 424w, https://substackcdn.com/image/fetch/$s_!8HzR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 848w, https://substackcdn.com/image/fetch/$s_!8HzR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8HzR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a4cc52-c25f-4d90-a038-bb5490803741_2082x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Important details</h2><h3>1. Action representation</h3><p>The paper makes a very clear action-representation choice.</p><ul><li><p><strong>For the arm / wrist:</strong> use relative wrist motion between consecutive timesteps. This avoids relying on a fixed global coordinate frame.</p></li><li><p><strong>For the fingers:</strong> retarget human hand motion into a 22-DoF dexterous hand joint space.</p></li><li><p><strong>Why this matters:</strong> the wrist-only version performs badly. The fingertip-based version is better, but still inconsistent. The retargeted joint-space hand action is the most stable choice across tasks.</p></li></ul><p>My read is simple:</p><ul><li><p><strong>Wrist motion + finger motion is becoming the main representation.</strong></p></li><li><p>The robot&#8217;s native joint format matters less.</p></li><li><p>The most transferable interface is how the wrist moves and how the fingers move.</p></li></ul><p>That is not the paper&#8217;s exact wording. But it is the clearest design message I take from it.</p><h3>2. The data pyramid</h3><p>The training recipe is the paper.</p><ul><li><p><strong>Stage I: very large human pretraining.</strong><br> The model is pretrained on 20,854 hours of egocentric human data. The main corpus spans 9,869 scenes, 6,015 tasks, and 43,237 objects. Most of it is in-the-wild first-person video. The supervision is noisy but scalable. The mixture is also complemented by 829 hours of EgoDex with more accurate wrist and hand tracking. </p></li><li><p><strong>Stage II: small aligned human&#8211;robot mid-training.</strong><br> This dataset contains 344 tabletop tasks. Each task has about 30 human trajectories and 5 robot trajectories, totaling about 50 hours of human data and 4 hours of robot data. 
The setup is carefully aligned: matched viewpoints, calibrated intrinsics, Vive trackers for wrist pose, and Manus gloves for in-hand motion. </p></li><li><p><strong>Stage III: task post-training.</strong><br> The policy is then fine-tuned on five dexterous tasks on the Galaxea R1 Pro. Most tasks use 100 teleoperated robot demonstrations. Shirt rolling uses 20. The five tasks are shirt rolling, card sorting, tong-based fruit transfer, bottle-cap unscrewing, and syringe liquid transfer. </p></li></ul><p>My own read of this recipe is:</p><ul><li><p><strong>Pretraining</strong> is mainly for learning broad manipulation priors: real-world dynamics, semantic grounding, robustness, and scene generalization.</p></li><li><p><strong>Mid-training</strong> is for alignment. It grounds the pretrained representation in the robot&#8217;s sensing and control space.</p></li><li><p><strong>Post-training</strong> is for task performance. It turns the prior into reliable task completion.</p></li></ul><p>So I do <strong>not</strong> read EgoScale as &#8220;robot data is no longer needed.&#8221; The paper&#8217;s recipe is more precise than that:</p><ul><li><p>Use huge, cheap human data for scale.</p></li><li><p>Use small aligned data for grounding.</p></li><li><p>Use a limited amount of robot task data for success rate.</p></li></ul><p>There is another important implication here. For Stage I, the paper suggests that <strong>scale and diversity matter more than perfect precision</strong>. The data is noisy and not sensor-aligned, but it still beats the small aligned-only baseline across most tasks. That is a very foundation-model-like result.</p><h3>3. One-shot adaptation is a strong result</h3><p>The paper also evaluates adaptation to unseen skills under very limited robot supervision.</p><ul><li><p>Start from the human-pretrained model.</p></li><li><p>Do aligned mid-training.</p></li><li><p>Post-train on a new task with only 1 robot demonstration plus 100 aligned human demonstrations.</p></li></ul><p>In that setting, the model reaches 0.88 success on fold-shirt and 0.55 success on unscrewing water bottles. Models that remove either large-scale human pretraining or aligned mid-training fail in this setting. This is a strong sign that the pretrained representation is doing real work.</p><h3>4. Wrist camera is very important</h3><p>This is another point I strongly agree with.</p><p>The robot uses three RGB cameras:</p><ul><li><p>one head-mounted egocentric camera</p></li><li><p>two wrist cameras facing the palm</p></li></ul><p>This is not elegant. But it is practical. For dexterous tasks, the wrist view is extremely useful. It sees contact details that the head camera often misses. That matters for cards, tongs, bottle caps, and syringe operations, where small contact errors cause the whole task to fail.</p><p>There is also an interesting gap here. The large-scale Stage I human pretraining is mostly ordinary egocentric video, while the robot and aligned mid-training setup use the richer head-plus-wrist camera configuration. 
So one obvious question is:</p><p><strong>If we also add wrist-view human data at scale, would transfer improve further?</strong></p><h2>Cross-embodiment transfer</h2><p>The paper also tests transfer to a different embodiment: the Unitree G1 with a 7-DoF tri-finger hand.</p><ul><li><p>The G1 has a shorter arm and different kinematics.</p></li><li><p>The tasks are Pen in Bin and Dish in Rack.</p></li><li><p>Human pretraining plus embodiment-specific mid-training improves performance substantially over using the G1 data alone.</p></li></ul><p>The paper&#8217;s intended message is clear: large-scale human motion learns a reusable motor prior. It is not completely tied to one dexterous hand. I think that claim is directionally right. But my own view is still narrower: this kind of pretraining should work best when the downstream robot also has reasonably dexterous hands. Human data transfers most naturally when the robot can express similar finger behaviors.</p><h2>My 2026 take</h2><p>My takeaway is straightforward.</p><ul><li><p><strong>Very low-cost human video will become a major pretraining source for embodied models.</strong></p></li><li><p><strong>Dexterous hands will become the most important end effector for transferring human manipulation data well.</strong></p></li><li><p><strong>Finger actions and wrist 6D motion will become the key representation layer.</strong></p></li><li><p><strong>The main bottleneck will move to collecting high-quality robot data cheaply for alignment and post-training.</strong></p></li></ul><p>EgoScale does not prove every one of those points by itself. But it strongly points in that direction. The paper already shows that large, noisy human data can supply strong priors, that aligned mid-training makes those priors executable, and that small amounts of robot data then go much farther than they would from scratch.</p><p>This is why I see EgoScale as a paradigm-shift paper. The pretraining data can be low precision, cheap, and massive. The later-stage data can be smaller, more expensive, and more precise. That looks much closer to the foundation-model playbook than to the old robot imitation-learning playbook.</p><h2>My questions</h2><p>I still have a few open questions.</p><ol><li><p><strong>What if large-scale pretraining also includes wrist cameras?</strong><br> The current Stage I data is mostly ordinary first-person video, while the robot setup depends heavily on wrist views. A human data collection pipeline that includes wrist-view video might reduce this perceptual gap. </p></li><li><p><strong>Can the mid-training stage be made much lighter?</strong><br> Right now it still requires matched viewpoints, Vive trackers, and Manus gloves, with humans acting in a robot-compatible workspace. That is much heavier than the Stage I recipe. It clearly works, but it still looks expensive and unnatural for large-scale collection. </p></li><li><p><strong>If high-quality robot trajectory collection becomes cheap, do we still need mid-training?</strong><br>Today, the paper shows that mid-training is very useful. It is what enables one-shot transfer in combination with human pretraining. But if robot data collection becomes dramatically cheaper and higher quality, it is worth asking whether some of the alignment role of mid-training could be replaced by more robot data directly.</p></li></ol><p>A related technical question is whether better scalable data systems could improve Stage II and Stage III at the same time. 
The paper already shows that alignment data and teleoperated robot data matter. So the next bottleneck may simply be building a cheaper and higher-quality way to collect them.</p><h2>Final take</h2><p>I think the main message of EgoScale is very simple.</p><p><strong>Dexterous manipulation may be moving from &#8220;robot data first&#8221; to &#8220;human data first.&#8221;</strong></p><p>The &#8220;ego&#8221; in EgoScale is not just the camera viewpoint. It is the scalable supervision source. Use egocentric human video to pretrain broad motor priors. Use a small amount of aligned data to connect that prior to the robot. Then use a limited amount of robot data to drive task success.</p><p>That is the recipe. And this paper shows that the recipe actually scales.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.joedong.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Joe Dong | Physical AI Deep Dives! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Nvidia DreamZero0: Why World Models May Become Policies]]></title><description><![CDATA[DreamZero0 is not just a stronger robotics benchmark.]]></description><link>https://newsletter.joedong.ai/p/nvidia-dreamzero0-why-world-models</link><guid isPermaLink="false">https://newsletter.joedong.ai/p/nvidia-dreamzero0-why-world-models</guid><dc:creator><![CDATA[Joe Dong]]></dc:creator><pubDate>Wed, 08 Apr 2026 23:57:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y408!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F974be8fb-bdbf-4a30-993c-81ceb411bae8_1091x854.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>DreamZero0 is not just a stronger robotics benchmark. 
It is a serious argument for a new robotics stack.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!luSq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!luSq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 424w, https://substackcdn.com/image/fetch/$s_!luSq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 848w, https://substackcdn.com/image/fetch/$s_!luSq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 1272w, https://substackcdn.com/image/fetch/$s_!luSq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!luSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png" width="1137" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1137,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94164,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.joedong.ai/i/193638173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!luSq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 424w, https://substackcdn.com/image/fetch/$s_!luSq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 848w, https://substackcdn.com/image/fetch/$s_!luSq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 1272w, https://substackcdn.com/image/fetch/$s_!luSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7234789-ca39-414c-a722-2ea57ad49c82_1137x529.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2><strong>Key points</strong></h2><p>1. <strong>DreamZero changes the learning objective.</strong> It predicts future video and actions together. That is the core idea behind its <strong>World Action Model</strong> (WAM).</p><p>2. <strong>The bigger contribution is the data thesis.</strong> Broad, messy, non-repetitive robot data may be better for generalization than repeated demos of a narrow task set.</p><p>3. <strong>The transfer result matters.</strong> DreamZero improves from short video-only demonstrations from humans or another robot. Action labels may no longer be the only scarce resource.</p><p>4. <strong>This has direct startup implications.</strong> The next robotics moat may come less from narrow task tuning and more from world-model quality, diverse operational data, sensor standardization, and inference systems.</p><p>5. <strong>The paper is impressive, but not magic.</strong> The model is still short-horizon. It is expensive to serve. High-precision manipulation is still open.</p><h2><strong>Why this paper matters</strong></h2><p>I wanted to start this Substack with DreamZero because it makes a claim that feels bigger than a benchmark win.</p><p>Many recent robotics foundation models still follow the same recipe: images and language in, actions out. Those models can improve semantic understanding and object coverage. But they often still fail when the required motion is new. They fall back to the dominant action prior in the dataset: grasp, move, place.</p><p>DreamZero argues for a different path.</p><p>Its core claim is simple: a robot policy may generalize better if it is trained to model <strong>how the world should change</strong>, not only <strong>which action to emit next</strong>.</p><p>That shift matters because robotics is not only a semantics problem. It is also a dynamics problem. 
A robot must understand not only <em>what</em> an object is, but <em>how</em> the scene evolves under contact, motion, error, and recovery.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!y408!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F974be8fb-bdbf-4a30-993c-81ceb411bae8_1091x854.png" width="1091" height="854" alt="" /></figure></div><h2><strong>From VLA to WAM</strong></h2><p>The paper introduces DreamZero as a <strong>World Action Model</strong> rather than a standard <strong>Vision-Language-Action</strong> model.</p><p>A standard VLA is trained in a direct way:</p><p><strong>observation + language -&gt; action</strong></p><p>DreamZero changes this to something closer to:</p><p><strong>observation + language -&gt; future world + action</strong></p><p>The model imagines the next visual states of the world while also producing the actions that would make those states happen.</p><p>This matters for one reason: <strong>future video is dense supervision</strong>.</p><p>Every future frame constrains geometry, object motion, contact, temporal consistency, and scene evolution. That is a much richer signal than action labels alone. The action head is not learning in isolation. It is learning inside a model that must maintain a plausible visual future.</p><p>This is the strongest idea in the paper. It makes the old phrase &#8220;world models can become policies&#8221; feel like a practical design choice.</p><p>Technically, the system is large. DreamZero uses a 14B robot foundation model built on a pretrained Wan video-diffusion backbone, with robot-specific state and action modules added on top. The key point is the design principle: <strong>start from a strong video model, keep its spatiotemporal prior, and align robot action with predicted world evolution</strong>.</p>
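<p>To make that design principle concrete, here is a minimal sketch of what a joint world-plus-action training step could look like. This is my own illustration, not DreamZero&#8217;s implementation: the function name, batch keys, and plain MSE losses are hypothetical stand-ins, and the real model is a 14B diffusion backbone trained with denoising objectives. The only point is the shape of the supervision: a dense video-prediction term sitting next to the usual action term.</p><pre><code class="language-python">
# Minimal sketch of a joint world + action training step (illustrative only).
# Assumes a model that returns predicted future frames and a predicted action
# chunk from one forward pass. All names and batch keys are hypothetical.
import torch
import torch.nn.functional as F

def wam_training_step(model, batch, optimizer, lambda_action=1.0):
    # batch["obs"]:     past frames,           shape (B, T_past, C, H, W)
    # batch["lang"]:    tokenized instruction, shape (B, L)
    # batch["future"]:  ground-truth frames,   shape (B, T_future, C, H, W)
    # batch["actions"]: ground-truth actions,  shape (B, T_future, action_dim)
    pred_frames, pred_actions = model(batch["obs"], batch["lang"])

    # Dense supervision: every predicted pixel must agree with how the
    # scene actually evolved.
    video_loss = F.mse_loss(pred_frames, batch["future"])

    # Sparse supervision: the action chunk that should produce that future.
    action_loss = F.mse_loss(pred_actions, batch["actions"])

    loss = video_loss + lambda_action * action_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return {"video_loss": video_loss.item(), "action_loss": action_loss.item()}
</code></pre><p>A standard VLA, in this picture, keeps only the action term. The world-modeling term is the extra, dense supervision the paper is arguing for.</p>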
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>From VLA to WAM</strong></h2><p>The paper introduces DreamZero as a <strong>World Action Model</strong> rather than a standard <strong>Vision-Language-Action</strong> model.</p><p>A standard VLA is trained in a direct way:</p><p><strong>observation + language -&gt; action</strong></p><p>DreamZero changes this to something closer to:</p><p><strong>observation + language -&gt; future world + action</strong></p><p>The model imagines the next visual states of the world while also producing the actions that would make those states happen.</p><p>This matters for one reason: <strong>future video is dense supervision</strong>.</p><p>Every future frame constrains geometry, object motion, contact, temporal consistency, and scene evolution. That is a much richer signal than action labels alone. The action head is not learning in isolation. It is learning inside a model that must maintain a plausible visual future.</p><p>This is the strongest idea in the paper. It makes the old phrase &#8220;world models can become policies&#8221; feel like a practical design choice.</p><p>Technically, the system is large. DreamZero uses a 14B robot foundation model built on a pretrained Wan video-diffusion backbone, with robot-specific state and action modules added on top. The key point is the design principle: <strong>start from a strong video model, keep its spatiotemporal prior, and align robot action with predicted world evolution</strong>.</p><h2><strong>The result that matters most: unseen motion generalization</strong></h2><p>The headline numbers are strong. But the value is in what they mean.</p><p>DreamZero performs well not only on seen tasks in new environments, but also on tasks that were absent from training. That is the hard case. Many models look competent when the object changes or the wording changes. Fewer models keep working when the motion pattern itself changes.</p><p>The paper reports that on AgiBot G1, DreamZero reaches <strong>62.2% average task progress</strong> in zero-shot evaluation on seen tasks with unseen environments and objects. A strong pretrained VLA baseline reaches <strong>27.4%</strong>. 
On <strong>10 tasks not present in training</strong>, DreamZero reaches <strong>39.5% average task progress</strong>, while VLA baselines drop sharply and often revert to generic pick-and-place behavior.</p><p>This is the right failure mode to focus on.</p><p>If a model hears a new instruction and responds with a familiar manipulation template, that is not robust generalization. It is action prior collapse. DreamZero appears to reduce that problem.</p><h2><strong>The deeper thesis is about data, not architecture</strong></h2><p>The most important part of the paper may be the data recipe.</p><p>DreamZero is pretrained on about <strong>500 hours of teleoperation data</strong> across <strong>22 real environments</strong>. The key point is not only the scale. It is the distribution.</p><p>The data is intentionally <strong>non-repetitive</strong>.</p><p>Instead of collecting many clean repetitions of the same task, the team collects long episodes with many coarse tasks and many transitions between them. Tasks are rotated out after enough collection, which pushes the dataset toward breadth and long-tail coverage.</p><p>This is an important claim.</p><p>A lot of robot data collection still assumes that better policies come from cleaner labels and more repeated demonstrations. DreamZero suggests that for world-model-style policies, repetition may be less valuable than diversity. The model benefits from seeing many kinds of scene changes, many kinds of interaction failures, and many kinds of partial task structure.</p><p>The paper includes a clean ablation. With the same amount of data, DreamZero trained on <strong>diverse data</strong> outperforms DreamZero trained on <strong>repetitive data</strong>, improving task progress from <strong>33% to 50%</strong> on the same evaluation.</p><p>If this result holds more broadly, it changes how we should think about data moats.</p><p>The most valuable robotics dataset may not be the cleanest benchmark dataset. It may be the operational dataset with the widest support over real work: tidying, carrying, folding, wiping, sorting, recovering, and switching tasks inside the same episode.</p><h2><strong>What this means for startups and companies</strong></h2><p>DreamZero changes the answer to a basic question:</p><p><strong>What should a robotics company scale?</strong></p><p>Under the standard VLA view, the answer is often: collect more action-labeled demos across more tasks.</p><p>Under the DreamZero view, the answer shifts. A company should try to scale:</p><ul><li><p>broad multi-view interaction data,</p></li><li><p>enough action data to ground control,</p></li><li><p>strong world-model pretraining,</p></li><li><p>and a deployment stack that can serve a large generative policy in closed loop.</p></li></ul><p>That has three clear implications.</p><p>First, <strong>data strategy changes</strong>. A company in logistics, retail, hospitality, or home robotics may gain more by collecting diverse workflow fragments than by over-optimizing one benchmark skill.</p><p>Second, <strong>video becomes more valuable</strong>. If video-only demonstrations can improve performance, then human demonstration video and cross-fleet recordings become strategic assets.</p><p>Third, <strong>systems engineering becomes part of the moat</strong>. If the policy is also a large generative video model, then inference speed, GPU scheduling, quantization, caching, and asynchronous execution all become first-class product concerns.</p>
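<p>That last point is easy to underestimate, so here is a deliberately simplified sketch of the serving pattern it implies. Every name in it is hypothetical and this is not DreamZero&#8217;s actual stack: one thread keeps generating the next action chunk from the latest observation, while the control loop replays the current chunk at control rate and swaps in a fresher one whenever it lands.</p><pre><code class="language-python">
# Illustrative sketch of asynchronous chunked execution (hypothetical names,
# not DreamZero's serving stack). A worker thread generates the next action
# chunk while the control loop replays the current one at control rate.
import queue
import threading
import time

def inference_worker(policy, get_obs, chunk_queue):
    while True:
        obs = get_obs()                    # latest frames + robot state
        chunk = policy.predict_chunk(obs)  # slow: a full generative rollout
        chunk_queue.put(chunk)             # hand the fresh chunk to control

def control_loop(robot, chunk_queue, control_hz=30):
    chunk = list(chunk_queue.get())        # block until the first chunk arrives
    while True:
        if not chunk:
            # Inference fell behind: nothing left to execute, so the robot waits.
            chunk = list(chunk_queue.get())
        robot.apply(chunk.pop(0))
        time.sleep(1.0 / control_hz)
        try:
            # Prefer the newest prediction as soon as one is available.
            chunk = list(chunk_queue.get_nowait())
        except queue.Empty:
            pass

# Usage sketch (policy, get_obs, and robot are stand-ins for real objects):
# q = queue.Queue(maxsize=1)
# threading.Thread(target=inference_worker, args=(policy, get_obs, q), daemon=True).start()
# control_loop(robot, q)
</code></pre><p>Whether the robot ever stalls waiting on the queue is decided entirely by inference latency. That is why quantization, caching, and GPU scheduling stop being implementation details and start being product behavior.</p>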
<p>This is why DreamZero matters for investors as well as researchers. It points to a different stack, and therefore a different kind of company.</p><h2><strong>The transfer story matters</strong></h2><p>The cross-embodiment section is the most strategic part of the paper.</p><p>DreamZero improves unseen-task performance using short <strong>video-only demonstrations</strong> from another robot and from humans. Those demonstrations do not include action labels. They only provide visual evidence of task dynamics.</p><p>If a world model already captures much of the task structure, then video can teach the model what successful interaction should look like even when the action interface changes. In the paper, just <strong>20 minutes</strong> of video from another robot or <strong>12 minutes</strong> of human video substantially improves performance on unseen tasks.</p><p>The paper also shows few-shot adaptation to a new bimanual robot with around <strong>30 minutes of play data</strong>.</p><p>I would not overclaim here. This is not proof that transfer across radically different robots is solved. But it is strong evidence that the world prior may transfer faster than the action interface.</p><h2><strong>Two caveats that matter</strong></h2><p>The first caveat is <strong>inference cost</strong>.</p><p>DreamZero is a large video-diffusion policy. That is inherently expensive. The paper reaches about <strong>7 Hz</strong> closed-loop control through a long stack of optimizations. That is impressive. It is also a reminder that WAMs are not just a modeling choice. They are an infrastructure choice.</p><p>The second caveat is <strong>task horizon and precision</strong>.</p><p>The paper is fairly explicit that DreamZero is still closer to a short-horizon &#8220;System 1&#8221; model than a long-horizon planner. It also does not solve fine, high-precision manipulation. Tasks like tight insertion or very delicate contact remain difficult.</p><p>So I do not read this paper as &#8220;VLAs are obsolete.&#8221; I read it as something more useful: <strong>the motion-generalization frontier has moved, and direct action regression is no longer the only serious path</strong>.</p><h2><strong>Final take</strong></h2><p>My main takeaway is direct.</p><p>DreamZero matters because it proposes a better scaling story for robotics.</p><p>The paper suggests that <strong>predicting the world may be a better route to robust action than predicting actions alone</strong>. It also suggests that broad and messy real-world data may be more valuable than narrow repeated demos. And it hints that video-only transfer may become a real lever for embodiment adaptation.</p><p>If those claims continue to hold, the implications are large.</p><ul><li><p>The value of diverse, precise robot data goes up.</p></li><li><p>The value of human video goes up.</p></li><li><p>The value of sensor standardization goes up.</p></li><li><p>The value of inference engineering goes up.</p></li><li><p>And the value of narrow demo farms may go down.</p></li></ul><p>That is why I think DreamZero is one of the most important robotics papers in this cycle.</p><p>Not because it solves general robotics.</p><p>But because it changes what a scalable solution might look like.</p>]]></content:encoded></item></channel></rss>