Saturday, April 11, 2026
NVIDIA trains robots on 44,000 hours of human video
April 11 · 1 video
NVIDIA DreamDojo is here.
It skips the simulation bottleneck.
44,000 hours of human video trained the model.
That is 4 billion frames of raw reality.
The robot learns physics from pixels.
“The main reason is that simulations are often just not good enough. They often mimic reality, but they are not a substitute for reality.”
NVIDIA’s New AI Shouldn’t Work…But It Does
Károly Zsolnai-Fehér · Two Minute Papers · 9 min
Watch on YouTube →

Károly Zsolnai-Fehér explains NVIDIA's DreamDojo research, which uses massive unlabeled video datasets to teach robots real-world physics. This approach bypasses the limitations of traditional 3D simulations by learning directly from human actions.
- The model trained on 44,000 hours of human video containing 4 billion frames and 1 quadrillion pixels.
- DreamDojo uses autonomous storytelling to generate labels for actions within the unlabeled video data.
- Relative action mapping replaces global coordinates to help the AI understand movement across different physical structures.
- An anti-cheating mechanism feeds actions in 4-frame blocks to ensure the model learns cause and effect.
- Knowledge distillation accelerated the model by 4x to reach 10 frames per second for real-time interaction.
- The research is being released as open source, providing a freely available model for local robotics deployment.
- The AI successfully predicts complex physics like paper crumpling that simulations often struggle to replicate.
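The relative action mapping mentioned above can be illustrated with a minimal sketch: instead of absolute world coordinates, each action is expressed as the change since the previous frame, which transfers more easily across different bodies and workspaces. The function name and the delta-of-positions formulation here are assumptions for illustration, not the actual DreamDojo implementation.

```python
import numpy as np

def to_relative_actions(poses):
    """Convert a trajectory of absolute positions into per-step deltas.

    A simplified stand-in for a relative action space: each action is the
    change since the previous frame, independent of the global frame.
    """
    poses = np.asarray(poses, dtype=float)
    return poses[1:] - poses[:-1]  # action[t] = pose[t+1] - pose[t]

# A short absolute trajectory of 3D positions.
trajectory = [[0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0],
              [0.2, 0.1, 0.0]]
deltas = to_relative_actions(trajectory)
# Each row is now a movement, not a location, so the same action
# sequence can be replayed from any starting pose.
```

Shifting the trajectory by a constant offset leaves the deltas unchanged, which is exactly why a relative space helps the model generalize across different physical structures.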
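The anti-cheating mechanism can be sketched as simple chunking: actions are grouped into fixed-size blocks so the model must predict several frames ahead from each block, rather than trivially copying the most recent frame. The function below is a hypothetical illustration of that 4-frame grouping, not the actual training code.

```python
def chunk_actions(actions, block=4):
    """Group an action sequence into fixed-size blocks.

    Hypothetical sketch of the 4-frame scheme: feeding actions in blocks
    forces the model to commit to multi-frame predictions, so it has to
    learn cause and effect instead of copying the previous frame.
    """
    return [actions[i:i + block] for i in range(0, len(actions), block)]

chunks = chunk_actions(list(range(10)), block=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```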
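Knowledge distillation, credited above with the 4x speedup, in its simplest form trains a small student model to match a large teacher's outputs. The mean-squared-error objective below is the textbook version, shown only as a sketch; the specific recipe behind DreamDojo's speedup is not detailed in the video summary.

```python
import numpy as np

def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher predictions --
    the simplest distillation objective: the small, fast student is
    trained to reproduce the big teacher's outputs."""
    s = np.asarray(student_out, dtype=float)
    t = np.asarray(teacher_out, dtype=float)
    return float(np.mean((s - t) ** 2))

# The closer the student matches the teacher, the smaller the loss.
print(round(distillation_loss([0.9, 0.1], [1.0, 0.0]), 6))  # 0.01
```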
References
People: Károly Zsolnai-Fehér (https://cg.tuwien.ac.at/~zsolnai/)
Tools: DreamDojo · Action-Conditioned Frame Prediction · Relative Action Space