τ0-WM: The Largest Open-Source Embodied World Model Unveiled
A research team from Shanghai Academy of Smart Systems and Agibot has introduced τ0-WM, which it describes as the world's largest open-source embodied world model, with 5 billion parameters and approximately 30,000 hours of pre-training data, including 17,800 hours of real-world teleoperation data. Using test-time computation, robots can evaluate and refine multiple action candidates before execution.
It has been about two years since embodied intelligence (EI) began attracting significant attention. A new result from China now challenges the industry assumption that “real-world data is too expensive for pre-training.” The research team led by Jianlan Luo, Associate Professor at the Shanghai Academy of Smart Systems and Chief Scientist at Agibot, has released τ0-World Model (τ0-WM), which it describes as the world’s largest open-source pre-trained embodied world model.
The standout feature of τ0-WM is the scale of its data. The model has 5 billion (5B) parameters, and its pre-training data totals approximately 30,000 hours, of which 17,800 hours are real-world teleoperation data. That is roughly equivalent to a single robot being teleoperated by humans around the clock for two years. Many teams have traditionally considered real-world data too hard to scale and used it only in the final fine-tuning stages. τ0-WM breaks with this convention by making real-world data the backbone of pre-training, which gives it the largest dataset among currently available pre-trained embodied world models.
Like other world models, τ0-WM predicts future video and generates actions. It also integrates test-time computation: the robot internally simulates multiple action candidates before execution, chooses the best one, and, if necessary, refines the action through a simulator before proceeding. With this approach, τ0-WM outperformed the baselines π0.5 and Fast-WAM on four long-horizon, precision tasks: Toolbox, School Bag, Badminton, and Faucet. The result builds on the research team’s sustained investment in post-training.
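The “two years around the clock” comparison can be checked with a quick back-of-the-envelope conversion. This is only a sanity check of the reported figure, assuming 24 hours of operation per day and a 365-day year:

```python
# Convert the reported real-world data hours into years of non-stop operation.
REAL_WORLD_HOURS = 17_800  # figure reported in the article
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: non-leap year

days = REAL_WORLD_HOURS / HOURS_PER_DAY
years = days / DAYS_PER_YEAR
print(f"{days:.0f} days, about {years:.2f} years")  # prints "742 days, about 2.03 years"
```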
17,800 Hours of Real-World Data
The pre-training data for τ0-WM spans approximately 30,000 hours and falls into three main categories:
1. Real-World Teleoperation Data (17,800 Hours): Collected using dual-arm robots and multi-view cameras, this dataset matches the action space and deployment environment of real-world applications. Because real-world data collection is resource-intensive, there was no precedent for using this much real-world data in pre-training. This dataset provides the critical action supervision signals that underpin τ0-WM’s large-scale pre-training.
2. Universal Manipulation Interface (UMI) Data (6,500 Hours): UMI is a data collection method that is independent of any specific robot platform. Compared to real-world teleoperation, it covers a much broader range of objects and scenarios. However, because its action space does not perfectly align with real-world deployment, UMI data primarily enhances “behavior diversity.” Though its accuracy is slightly lower, it exposes the model to varied manipulation methods, objects, and rare scenarios.
3. Ego-Centric Human Data (3,000 Hours): This is the least expensive and most wide-ranging dataset. It includes many rare interaction behaviors and real-world scenes that are hard for robots to capture on their own. However, this data lacks action labels, so it is used exclusively to train the video branch and does not contribute directly to action prediction. It helps the model learn object movements, human-environment interactions, and changes in scene state.
By integrating these three distinct data types into a single system through modality-specific supervision masks, τ0-WM balances data quality against data quantity. Data with action labels is used to train both the video and action branches, while unlabeled data trains only the video branch.
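The masking idea can be illustrated with a minimal sketch. This is not the team's training code: the names (`SUPERVISION_MASKS`, `Sample`, `combined_loss`), the equal weighting of the two terms, and the use of per-sample scalar losses are all hypothetical, and treating UMI data as action-labeled is an inference from the description above.

```python
from dataclasses import dataclass

# Which branches each data source supervises (1.0 = supervised, 0.0 = masked out).
# Hypothetical table; UMI is assumed to carry action labels.
SUPERVISION_MASKS = {
    "teleop": {"video": 1.0, "action": 1.0},
    "umi": {"video": 1.0, "action": 1.0},
    "egocentric_human": {"video": 1.0, "action": 0.0},  # no action labels
}

@dataclass
class Sample:
    source: str
    video_loss: float
    action_loss: float  # ignored when the source has no action labels

def combined_loss(batch):
    """Average the masked video and action losses over a mixed-source batch."""
    video_sum = action_sum = 0.0
    video_n = action_n = 0.0
    for s in batch:
        mask = SUPERVISION_MASKS[s.source]
        video_sum += mask["video"] * s.video_loss
        action_sum += mask["action"] * s.action_loss
        video_n += mask["video"]
        action_n += mask["action"]
    # Each term is normalized by the number of samples that supervise it.
    video_term = video_sum / video_n if video_n else 0.0
    action_term = action_sum / action_n if action_n else 0.0
    return video_term + action_term

batch = [
    Sample("teleop", video_loss=0.25, action_loss=0.5),
    Sample("egocentric_human", video_loss=0.75, action_loss=99.0),  # action loss is masked
]
print(combined_loss(batch))  # prints 1.0
```

In this sketch the ego-centric sample still shapes the video term, but its placeholder action loss never reaches the total, which is the effect the supervision mask is meant to have.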
Optimizing Robot Actions with “Slow Thinking”
In recent years, the dominant paradigm for robotic perception and control has been a reactive, end-to-end strategy. Neural networks process visual inputs and immediately output actions, akin to human reflexes. While this approach has succeeded on standard tasks such as grasping or simple setups, it often fails in complex scenarios that require prolonged, precise operations or involve significant occlusions. A single misstep can cause a cascade of failures.
τ0-WM instead has the robot “imagine” its actions before executing them. It predicts how the environment would change if a specific action were performed, and uses test-time computation to simulate multiple potential actions simultaneously in an internal “virtual sandbox.” The robot then compares and corrects the candidates repeatedly before executing one. This approach can be seen as teaching robots “slow thinking.” Instead of responding reflexively to visual input, the robot deliberates on the most reliable course of action before committing to it, much like a human would.
A Three-Stage Pipeline: Propose, Simulate, Evaluate
The online inference process of τ0-WM is divided into three steps:
1. Proposal: The Video Action Model (VAM) samples multiple candidate actions based on the current multi-view observations, the language instruction, and the robot’s state. At the same time, it generates coarse, blurry future visuals corresponding to these actions, akin to the robot quickly envisioning several possible approaches.
2. Simulation: The action-conditioned video simulator generates detailed multi-view future visuals for each candidate action. Because robotic operations often involve occluded views, the model fills in future states from side or overhead perspectives through “imaginative completion” so that the outcomes of actions can be assessed accurately.
3. Evaluation and Refinement: The system assigns a Re-Denoising Consistency Score (RCS) to each action by adding noise to the candidates, re-denoising them through the model, and measuring the reconstruction error. Smaller errors indicate higher reliability. If even the best action’s score is insufficient, a secondary mechanism, Low-Quality Action Refinement (LAR), activates. The system sends all candidate actions to the video simulator, predicts the corresponding future states and task progress, selects the best future outcome, and instructs VAM to regenerate actions based on this “optimal future.”
The model outputs the best action that emerges from this three-stage pipeline. Unlike many world models that skip future prediction during deployment to prioritize inference speed, τ0-WM maintains explicit future imagination at inference time. These future visuals are actively used for scoring, selection, and refinement, making “future imagination” an integral part of the robot’s decision-making process.
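The scoring step can be sketched roughly as follows. This is a toy illustration of the noise-and-re-denoise idea, not τ0-WM's implementation: `toy_denoise` stands in for the real model, `toy_refine` stands in for LAR, and the noise level, trial count, and threshold are arbitrary assumptions.

```python
import random

MODE = [1.0, 0.0]  # hypothetical action the toy "model" considers plausible

def toy_denoise(noisy_action):
    """Stand-in denoiser: pulls any input halfway toward MODE."""
    return [0.5 * x + 0.5 * m for x, m in zip(noisy_action, MODE)]

def toy_refine(candidates):
    """Stand-in for LAR: the real system would regenerate actions via the simulator."""
    return list(MODE)

def rcs_error(action, denoise, noise_std=0.1, n_trials=8, rng=None):
    """Mean squared reconstruction error after noising and re-denoising an action."""
    rng = rng or random.Random(0)  # same seed per call, so candidates see identical noise
    total = 0.0
    for _ in range(n_trials):
        noisy = [a + rng.gauss(0.0, noise_std) for a in action]
        recon = denoise(noisy)
        total += sum((r - a) ** 2 for r, a in zip(recon, action)) / len(action)
    return total / n_trials

def select_action(candidates, denoise, threshold, refine):
    """Pick the lowest-error candidate; fall back to refinement if it is still too high."""
    errors = [rcs_error(c, denoise) for c in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    if errors[best] > threshold:
        return refine(candidates), True
    return candidates[best], False

candidates = [[0.0, 1.0], [1.0, 0.0]]
chosen, used_fallback = select_action(candidates, toy_denoise, threshold=0.05, refine=toy_refine)
print(chosen, used_fallback)  # prints "[1.0, 0.0] False"
```

A candidate the model considers plausible is reconstructed almost unchanged, so its error stays small, while an implausible one is pulled away from where it started and scores a large error. The threshold decides when even the best candidate is not trusted and refinement takes over.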
Two Shared Video Diffusion Backbones
Underpinning this three-stage pipeline are two shared video diffusion backbone components: VAM, responsible for action proposals, and the action-conditioned video simulator, responsible for sandbox simulations. VAM is based on the Wan2.2-5B video generation model and outputs both future video latent representations and action chunks. The video simulator specializes in evaluating future states and task progress. During training, these components process data from the three sources. Data with action labels trains both the video and action branches, while label-free data trains only the video branch. This flexible design lets the model draw on all of the diverse data types.
The Integration of Pre-Training and Post-Training
The Luo team’s continuous investment in post-training has paid off, both in accumulating sufficient real-world data and in building experience with using such data for large-scale pre-training. The result brings together the previously separate paths of pre-training and post-training, and it broadens the potential for data scaling in embodied intelligence research. By centering pre-training on real-world data, traditionally seen as too costly and scarce, the team reports substantially improved generalization and reliability.
τ0-WM is open-source, giving researchers and developers worldwide access to the largest pre-trained embodied world model. It could become a new reference point in robotics and AI research, and its approach of using test-time computation to simulate and refine multiple action candidates has the potential to reshape the paradigm of robotic control.
Frequently Asked Questions
- What exactly is τ0-WM?
- τ0-WM is described by its developers, the Shanghai Academy of Smart Systems and Agibot, as the world's largest open-source embodied world model. It has 5 billion parameters and is pre-trained on approximately 30,000 hours of data, including 17,800 hours of real-world teleoperation data. Its hallmark feature is test-time computation, which lets robots simulate and select the best actions before execution.
- How does τ0-WM differ from traditional robot control methods?
- Traditional methods rely on reactive, end-to-end strategies, where actions are output immediately after processing visual input. In contrast, τ0-WM employs "slow thinking," simulating and evaluating multiple action candidates in a virtual sandbox before execution. According to the team, this improves success rates on complex, long-horizon tasks that require precision.
- Where can τ0-WM be accessed?
- According to the research team, τ0-WM is released as open source. Details on accessing it can be found in coverage from Chinese tech media outlets such as QbitAI or in the research team's announcements. The model weights and code may be available on platforms like GitHub.