VAST's Project Eden Brings "Save Data" Functionality to World Models

Project Eden, unveiled by VAST, introduces state persistence and multi-agent concurrency to world models, creating truly interactive environments beyond mere video generation.

June 1, 2026 · 9 min read · Reviewed & edited by the SINGULISM Editorial Team

Photo by Elijah Mears on Unsplash

Over the past year, “world models” have become one of the hottest keywords in the AI industry. An increasing number of organizations are claiming their models can simulate entire worlds. By entering a single sentence, users can prompt these models to generate continuous videos, where people, scenes, and objects within the frame respond dynamically to given actions or camera angles. To many, this appears as though AI has acquired a form of world-creating capability.

However, upon closer inspection, is generating seemingly continuous videos truly equivalent to constructing a world?

Limitations and Challenges of World Models

Currently, many so-called “world models” are, in essence, still closer to video predictors. They excel at predicting the next frame based on previous ones and generating short visual sequences in response to input actions. However, the state of the world itself is not maintained independently. In other words, the model only perceives a sequence of pixels, not a “world” that exists persistently, where multiple users can simultaneously interact and cause continuous changes.

This fundamental limitation raises several questions: If an object moves out of the frame, does it still exist within the model? When a user looks away and then back again, will the scene remain consistent? If multiple players enter the same space from different perspectives, are they truly perceiving the same world? Until these issues are resolved, these so-called world models are merely “videos that seem like worlds,” rather than true worlds themselves.

The technical approaches currently labeled as “world models” can generally be categorized into two groups. The first is action-conditioned video generation. Such models typically generate continuous videos based on text, images, action commands, or camera trajectories. Their advantage lies in intuitive visual effects, user-friendly results, and the ability to quickly demonstrate a sense of interaction. However, the problem is that these models fundamentally rely on predicting 2D pixel trajectories. Information about what is happening in the world, where objects are located, or how states evolve is often implicitly compressed into the recent frames of the video. When an object moves out of the camera’s view, the model doesn’t retain an independent “world state” for it. When the camera returns, the model has no choice but to regenerate or “reimagine” the object based on past context. This explains why many video generation models appear continuous over a short period but fail to maintain object persistence, structural integrity, or logical consistency when interactions become more complex or perspectives change.
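The forgetting behavior described above can be illustrated with a toy model whose only memory is a fixed window of recent frames. This is a deliberately simplified sketch: `FrameContextModel` and its methods are hypothetical names, and real video models compress context in learned, far less literal ways.

```python
from collections import deque


class FrameContextModel:
    """Toy model whose only memory is the last `window` frames (hypothetical)."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) silently drops the oldest frame when full.
        self.frames: deque[frozenset[str]] = deque(maxlen=window)

    def observe(self, visible: set[str]) -> None:
        """Record which objects appear in the newest frame."""
        self.frames.append(frozenset(visible))

    def remembers(self, obj: str) -> bool:
        """An object is 'known' only if some frame in the window shows it."""
        return any(obj in frame for frame in self.frames)


model = FrameContextModel(window=3)
model.observe({"tree", "house"})   # the tree is in view once
for _ in range(3):
    model.observe({"road"})        # the camera then looks elsewhere

tree_known = model.remembers("tree")
road_known = model.remembers("road")
```

Once the frame showing the tree scrolls out of the window, nothing in the model refers to it, so any later view of that spot has to be reimagined from scratch.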

The second category is static 3D scene generation. Such models can generate navigable three-dimensional spaces. Compared to monocular video generation, they are indeed closer to the concept of “space” itself. However, without the dimensions of time, physical logic, or state transition mechanisms, they can hardly qualify as true world models. A truly useful world is not just visible but must also be modifiable, persistently operational, and capable of simultaneous interaction by multiple users or agents.

The Unique Architecture of Project Eden

In response, VAST has taken a clear position on world models: a satisfactory general-purpose world model must solve at least two fundamental problems at once. First, what is the current objective state of the world? Second, how does that state continuously evolve through actions, time, and interactions? Only by addressing both can a world model advance from the “content generation phase” to the “interactive environment generation phase.”

Project Eden’s Architecture: Three-Tier Separation

The most important architectural choice in Project Eden is the native separation between foundational state inference and visual representation. In traditional video generation models, the state and the screen are tightly intertwined. The model sees pixels and predicts pixels. Information about what exists in the world, how objects change, and what each action triggers is implicitly embedded in the sequence of video frames. Project Eden takes a different approach. Instead of cramming space, events, perspectives, and visual appearances into pixel history, it separates “the world itself” from “how the world looks.”

The first layer is the structured state layer, which serves as the true foundation of the system. This layer forms a global structured representation that persists across time, is updatable through actions, and can be queried from any camera position. Instead of an unmanageably large 4D point cloud, this is a compact and implicit representation that balances efficiency with semantic richness. This layer answers the question “What exists in the world, and what has happened?” It is the objective foundation of the world, existing independently of any observer’s perspective.
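As a rough illustration of the three properties named above (persistent across time, updatable through actions, queryable from any camera position), consider the following sketch. It uses an explicit dictionary of entities as a stand-in; the article describes Eden's actual representation as compact and implicit and does not specify its form, so `WorldState`, `Entity`, `apply_move`, and `query` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    kind: str
    x: float
    y: float


@dataclass
class WorldState:
    """Hypothetical, explicit stand-in for the structured state layer."""
    tick: int = 0
    entities: dict[str, Entity] = field(default_factory=dict)
    events: list[str] = field(default_factory=list)

    def apply_move(self, name: str, dx: float, dy: float) -> None:
        """Update the state through an action and log what happened."""
        e = self.entities[name]
        e.x += dx
        e.y += dy
        self.tick += 1
        self.events.append(f"t={self.tick}: {name} moved to ({e.x}, {e.y})")

    def query(self, cam_x: float, cam_y: float, radius: float) -> list[str]:
        """Names of entities within `radius` of an arbitrary camera position."""
        return sorted(
            name for name, e in self.entities.items()
            if (e.x - cam_x) ** 2 + (e.y - cam_y) ** 2 <= radius ** 2
        )


world = WorldState()
world.entities["tree"] = Entity("tree", 0.0, 0.0)
world.entities["cart"] = Entity("cart", 1.0, 0.0)
world.apply_move("cart", 9.0, 0.0)

near_origin = world.query(0.0, 0.0, radius=5.0)
near_cart = world.query(10.0, 0.0, radius=5.0)
```

The event log answers “what has happened?” and the entity table answers “what exists?”, and neither depends on where any camera is pointing.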

The second layer is the conditional interface layer, which acts as a transformation hub between the state and rendering layers. This layer’s function is to convert the global world state into task-appropriate local conditional constraints (including semantic information, geometric cues, and local event changes) based on specific camera positions and observational perspectives. Rendering for all perspectives extracts conditions from the same foundational state, ensuring consistency among multiple viewpoints. Different players see not independent pixel histories but different windows into the same world.
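One way to picture this conversion is a function that takes the same global state and a camera pose and returns only camera-relative cues for what falls inside the field of view. The sketch below is an assumption-laden toy in 2D: `Camera`, `local_conditions`, and the choice of distance and bearing as the “geometric cues” are illustrative, not taken from VAST.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    x: float
    y: float
    heading_deg: float      # direction the camera faces
    fov_deg: float = 90.0   # full horizontal field of view


def local_conditions(
    state: dict[str, tuple[float, float]], cam: Camera
) -> dict[str, dict[str, float]]:
    """Turn global positions into camera-relative distance and bearing."""
    out: dict[str, dict[str, float]] = {}
    for name, (x, y) in state.items():
        dx, dy = x - cam.x, y - cam.y
        bearing = math.degrees(math.atan2(dy, dx)) - cam.heading_deg
        bearing = (bearing + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
        if abs(bearing) <= cam.fov_deg / 2:
            out[name] = {"distance": math.hypot(dx, dy), "bearing_deg": bearing}
    return out


# One global state, two viewpoints standing at the same spot.
state = {"tree": (5.0, 0.0), "rock": (-5.0, 0.0)}
east = local_conditions(state, Camera(0.0, 0.0, heading_deg=0.0))
west = local_conditions(state, Camera(0.0, 0.0, heading_deg=180.0))
```

Both calls read the same `state`, which is what keeps the two views consistent: they differ only in which part of the world each camera happens to frame.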

The third layer is the generative rendering layer, which is responsible for producing detailed visual screens guided by the dual constraints of the foundational state and intermediate conditions. The rendering model at this top layer no longer needs to infer the structure of the scene, as that information is provided by the foundational state. Instead, the renderer can focus on its core strength: refining textures, lighting, materials, and high-frequency dynamic details to create a high-fidelity visual output.

This three-tier architecture fundamentally changes how world models are organized. The state exists independently as a stable, queryable, and evolvable foundation, rather than depending on the screen. Rendering, meanwhile, does not bear the full responsibility of logical reasoning but instead generates screens as needed based on the current state, perspective, and action conditions. As a result, Project Eden shifts the problem from predicting the next video frame to first inferring the next moment’s world state and then generating the screen that the user would see at that moment. The former resembles video sequence generation, but the latter is closer to true world simulation.
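The “state first, screen second” ordering can be sketched as a two-step loop. Everything here is a placeholder: `step_world` and `render` are hypothetical functions, a small dictionary stands in for the world state, and a string stands in for a generated frame.

```python
State = dict[str, int]


def step_world(state: State, action: str) -> State:
    """Placeholder transition: returns the next state, never touches pixels."""
    new_state = dict(state)
    if action == "open_door":
        new_state["door_open"] = 1
    new_state["tick"] = state["tick"] + 1
    return new_state


def render(state: State, viewpoint: str) -> str:
    """Placeholder renderer: a string stands in for a generated frame."""
    door = "open" if state["door_open"] else "closed"
    return f"[{viewpoint}] tick={state['tick']} door={door}"


def simulate(state: State, actions: list[str], viewpoint: str) -> tuple[State, list[str]]:
    frames = []
    for action in actions:
        state = step_world(state, action)         # 1. infer the next world state
        frames.append(render(state, viewpoint))   # 2. render what the user sees
    return state, frames


final_state, frames = simulate(
    {"tick": 0, "door_open": 0}, ["wait", "open_door"], "player_1"
)
```

Note that `render` never feeds back into `step_world`: frames are an output of the state, not its storage.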

Three Revolutionary Capabilities

The architectural differences ultimately manifest as fundamental distinctions in system-level capabilities. Project Eden’s three-tier architecture naturally enables a set of capabilities that were unattainable with traditional video-generation approaches.

Long-Term Persistence of Environments

One of the most intuitive capabilities is long-term persistence of environments. In Project Eden, an object that moves out of the camera’s field of view does not disappear from the world. It continues to exist in the foundational state and to operate according to the world’s logic. When a user turns away and later returns, the system queries the same foundational world state rather than regenerating a similar scene from past video frames. For example, if a player walks away from a tree and then looks back, the tree will still be there. This means the world can have true long-term memory, and users are no longer merely watching one-off generated videos but are immersed in a persistent environment.
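A minimal sketch of what “continues to exist and operate while out of view” could mean in code: the state advances without consulting any camera, and visibility is only a query made afterwards. The `Ball` type, `advance`, and `in_view` are hypothetical, and the one-dimensional constant-velocity motion is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    x: float   # position along a line
    vx: float  # distance moved per tick


def advance(ball: Ball, ticks: int) -> None:
    """Evolve the state; no camera is consulted."""
    ball.x += ball.vx * ticks


def in_view(ball: Ball, left: float, right: float) -> bool:
    """Visibility is a read-only query against the state."""
    return left <= ball.x <= right


ball = Ball(x=0.0, vx=2.0)
seen_at_start = in_view(ball, -1.0, 1.0)

advance(ball, ticks=5)                        # the user is looking elsewhere
seen_in_old_spot = in_view(ball, -1.0, 1.0)   # the ball has rolled away
seen_on_return = in_view(ball, 8.0, 12.0)     # and is found where physics put it
```

When the user looks back, the ball is located by reading its current position, not by guessing from the last frame in which it appeared.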

Free Scene Reuse and Deterministic Control

The second core capability is the free reuse of scenes and deterministic control. Traditional video generation operates on a one-off timeline: once generated, the history is fixed, and there is no way to rewind or branch off. In contrast, the decoupled architecture makes the foundational state readable, writable, and modifiable. Changes made by a user in a scene, whether destruction, construction, or alteration, are written into the foundational state. Other users entering the same scene will see a world state that fully reflects those changes. Generated content thus shifts from one-off videos to reusable, editable, and persistently evolving interactive spaces.

For example, if a user destroys a specific object, moves a building, or changes the state of an area within the scene, those changes will remain in the world. When another user enters the same scene later, they will see consistent results.
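Because the state is ordinary readable and writable data, “save data” style operations follow naturally. The sketch below shows writes that later readers observe, plus saving, rewinding, and branching. The `SharedWorld` class and its `save`, `load`, and `branch` methods are hypothetical; the article does not describe an actual Eden API.

```python
import copy


class SharedWorld:
    """Hypothetical read/write world state with save slots."""

    def __init__(self) -> None:
        self.objects: dict[str, str] = {"bridge": "intact", "tower": "intact"}
        self._saves: dict[str, dict[str, str]] = {}

    def write(self, name: str, status: str) -> None:
        self.objects[name] = status

    def read(self, name: str) -> str:
        return self.objects[name]

    def save(self, slot: str) -> None:
        """Snapshot the current state under a slot name."""
        self._saves[slot] = copy.deepcopy(self.objects)

    def load(self, slot: str) -> None:
        """Rewind the world to a saved snapshot."""
        self.objects = copy.deepcopy(self._saves[slot])

    def branch(self) -> "SharedWorld":
        """Fork an independent copy that evolves separately."""
        fork = SharedWorld()
        fork.objects = copy.deepcopy(self.objects)
        return fork


world = SharedWorld()
world.save("before")
world.write("bridge", "destroyed")         # user A alters the scene
seen_by_user_b = world.read("bridge")      # user B arrives later

fork = world.branch()                      # keep the altered timeline
world.load("before")                       # rewind the original
after_rewind = world.read("bridge")
in_fork = fork.read("bridge")
```

Rewinding and branching are exactly what a fixed, one-off video timeline cannot offer, since there the only record of the world is the frames already generated.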

Native Multi-Agent Concurrency

The third capability is native multiplayer and multi-agent concurrent interaction. For traditional video-based world models, handling multiple players is a significant challenge. Each player has a unique perspective, set of actions, and screen history, so the system would have to generate an independent video context for each player. That duplicates the full generation cost per player and leaves nothing to keep the separate contexts consistent with one another.

With the decoupled architecture, only one foundational state exists, shared among all agents. The rendering layer generates screens independently based on each agent’s position and perspective. If N players are online simultaneously, the system only has to maintain one foundational state and N renderings, instead of N completely independent generation systems, so the world simulation is paid for once rather than once per player. This is not just a performance optimization; it is a prerequisite for large-scale commercial deployment.
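The “one state, N renderings” accounting can be made concrete with a small counter. This is a toy under stated assumptions: agents live on a line, `update_state` and `render_view` are hypothetical names, and a dictionary of relative positions stands in for a rendered frame.

```python
from collections import Counter

cost: Counter[str] = Counter()


def update_state(state: dict[str, int], actions: dict[str, int]) -> None:
    """Apply every agent's action to the single shared state."""
    cost["state_updates"] += 1
    for agent, step in actions.items():
        state[agent] = state.get(agent, 0) + step


def render_view(state: dict[str, int], agent: str) -> dict[str, int]:
    """Stand-in renderer: everyone's position relative to the viewer."""
    cost["renders"] += 1
    origin = state[agent]
    return {other: pos - origin for other, pos in state.items()}


state: dict[str, int] = {}
actions = {"alice": 2, "bob": -1, "carol": 5}

update_state(state, actions)                                      # once per tick
views = {agent: render_view(state, agent) for agent in actions}   # once per agent
```

With three agents, one tick costs one state update and three renders. The views also agree by construction: Alice sees Bob three units behind her, and Bob sees Alice three units ahead, because both are read from the same state.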

Strategic Importance of Data Construction

The data construction logic behind Project Eden is equally worth exploring. VAST is said to have proposed a unique layered data pipeline. Training world models requires not just video data but high-quality 3D data that includes spatial structures, physical interactions, and state changes. The pipeline extracts structured world-state representations from raw video data and preprocesses them in a form suited to the three-tier architecture, which strengthens the model’s ability to learn and maintain the objective state of the world.

Traditional video datasets primarily focus on capturing temporal continuity at the pixel level. However, to construct a persistent and interactive world, as Project Eden aims to do, deeper semantic information is needed in the data, such as relationships between objects, the physical laws of the environment, and the outcomes of actions. VAST’s approach provides an infrastructure for efficiently extracting and supplying this complex information to the model, potentially accelerating the practical application of world models.
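To make the contrast with pixel-only datasets concrete, a training record for such a pipeline might pair a span of frames with the state before an action, the action itself, and the state after it. The schema below is purely illustrative: `TransitionSample` and its fields are assumptions, since the article does not describe VAST's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionSample:
    """Hypothetical training record: frames plus the state change they show."""
    frame_ids: tuple[int, ...]
    state_before: dict[str, str]
    action: str
    state_after: dict[str, str]

    def changed_keys(self) -> list[str]:
        """Which parts of the world the action actually altered."""
        keys = set(self.state_before) | set(self.state_after)
        return sorted(
            k for k in keys
            if self.state_before.get(k) != self.state_after.get(k)
        )


sample = TransitionSample(
    frame_ids=(120, 121, 122),
    state_before={"door": "closed", "lamp": "on"},
    action="open_door",
    state_after={"door": "open", "lamp": "on"},
)
changed = sample.changed_keys()
```

A record like this labels the outcome of an action explicitly, which is the kind of semantic signal that pixel-level continuity alone does not provide.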

Summary and Outlook

Project Eden, as announced by VAST, represents a significant evolution in the concept of world models. By addressing the fundamental limitations of traditional video generation models (the lack of state persistence and the difficulty of supporting multi-user environments) through its three-tier architecture, it aims to transform worlds from merely “visible” videos into interactive spaces that “exist, evolve, and can be shared.”

As this technology matures, it could find applications in a wide range of fields, including gaming, simulation, education, and digital twins. For example, games in which multiple players interact in a shared virtual world over an extended period, or simulation platforms that faithfully replicate real-world environments in digital form, become conceivable. However, computational cost, data requirements, and real-time performance remain hurdles to overcome.

Project Eden marks an important step in the evolution of world models from “generative AI” to “simulation AI,” and its future developments will be closely watched.

Frequently Asked Questions

What is the three-tier architecture of Project Eden?
Project Eden separates its architecture into three layers: state inference, conditional interface, and generative rendering. This enables the world's state to exist independently of the screen, facilitating persistent environments and multi-agent interactions.

How is this different from traditional video-generation models?
Traditional models are more like "video predictors" that lose track of objects when they leave the camera’s view. Project Eden manages the world state independently, ensuring objects persist even when out of view and allowing multiple users to share the same world.

What are some potential applications of this technology?
Potential applications include long-term interactive games, simulations, collaborative virtual environments, and accurate digital twins of real-world settings. However, further research is needed for practical implementation.

Source: 虎嗅网 (Huxiu)

Written by SINGULISM AI Editorial Team AI-assisted

Edited & reviewed by SINGULISM Editor-in-Chief

This article was drafted by AI (large language models) and reviewed by a human editor for fact-checking, structure, and style before publication.

Last updated: June 1, 2026

If you find any factual errors or inaccuracies, we will promptly publish a correction. Please contact us via the contact form to request a correction.
