GalaxyBot Integrates VLA and World Models, Unveils New Foundational Model "LDA"
GalaxyBot has announced "LDA," a foundational model that integrates VLA (Vision-Language-Action) models and world models within a shared latent space. Because it can train on diverse datasets in a unified way, the model is drawing attention as a technology that could accelerate progress in embodied intelligence.
Former Microsoft Researcher Discusses the Future of Embodied Intelligence
“Since joining GalaxyBot, I’ve been working non-stop, except for meals and sleep,” says Dr. Zhang Zhizheng, co-founder and Chief Model Officer of GalaxyBot. Formerly a senior researcher at Microsoft Research, where he contributed to the development of large-scale models such as Copilot, Dr. Zhang joined GalaxyBot in 2023 to lead the development of embodied intelligence—a field focused on enabling AI to interact and operate within the physical world.
When he made the decision to leave Microsoft, the director of Microsoft Research Asia asked him, “Have you really thought this through? You’re joining a startup that might not even be able to pay your salary.” Dr. Zhang responded, “If something is meaningful enough and I bear a significant responsibility for it, there’s no need to worry.” True to his words, GalaxyBot has since grown into a leading player in the field of embodied intelligence.
Bridging the Gap Between VLA and World Models
Dr. Zhang and his team have unveiled a new foundational model called “LDA” (Latent Dynamics Action). The model integrates VLA (Vision-Language-Action) models and world models, traditionally treated as separate technical approaches, within a unified latent space.
Previously, VLA models focused on direct policy learning (decision-making strategies), while world models specialized in predicting environmental state transitions. LDA, however, aims to harmonize these approaches by enabling joint learning of “what actions to take” and “how the environment will change” within a unified latent space, enhancing both processes through mutual feedback.
“The industry can leverage LDA as the foundational framework for large-scale embodied intelligence models, combining various types of data to scale up in stages,” said Dr. Zhang. The research received top honors at Robotics: Science and Systems (RSS), a leading robotics conference, and has been released as open source.
A Key Difference from Non-Embodied World Models
How does LDA differ from other world model research, such as that by Fei-Fei Li or Yann LeCun? According to Dr. Zhang, their world models primarily focus on fundamental questions like “how to represent, predict, and generate the world.” Policy learning, in these cases, is considered a downstream task.
On the other hand, LDA goes a step further by tailoring the problem for embodied intelligence. It focuses on “how changes in the environment can directly inform action generation” and leverages the full spectrum of embodied data for large-scale learning. For instance, in a tennis scenario, instead of indiscriminately predicting all environmental changes, LDA selectively anticipates information relevant to the task, such as the ball’s trajectory and the opponent’s position, based on the goal (hitting the ball).
A Unified Framework for Learning Four Tasks
The core of LDA lies in its ability to integrate four tasks—forward dynamics (predicting the next state based on a given action), inverse dynamics (estimating actions required to reach a target state), policy learning, and visual prediction—into a single learning framework within a unified latent space.
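To make the idea of four objectives sharing one latent space concrete, here is a minimal toy sketch in plain Python. It is not GalaxyBot's implementation: the linear "networks," the dimensions, the parameter names, and the equal loss weights are all assumptions chosen for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

OBS, LATENT, ACT = 6, 4, 2  # hypothetical sizes
rng = random.Random(0)

# Toy linear stand-ins for the learned components (all hypothetical).
params = {
    "encoder": rand_matrix(LATENT, OBS, rng),           # observation -> latent
    "forward": rand_matrix(LATENT, LATENT + ACT, rng),  # (z_t, a_t) -> z_next
    "inverse": rand_matrix(ACT, 2 * LATENT, rng),       # (z_t, z_next) -> a_t
    "policy": rand_matrix(ACT, LATENT, rng),            # z_t -> a_t
    "decoder": rand_matrix(OBS, LATENT, rng),           # z_next -> next observation
}

def joint_losses(params, obs_t, action_t, obs_next):
    """Compute the four per-task losses from one shared latent encoding."""
    z_t = matvec(params["encoder"], obs_t)
    z_next = matvec(params["encoder"], obs_next)
    z_pred = matvec(params["forward"], z_t + list(action_t))
    return {
        "forward_dynamics": mse(z_pred, z_next),
        "inverse_dynamics": mse(matvec(params["inverse"], z_t + z_next), action_t),
        "policy": mse(matvec(params["policy"], z_t), action_t),
        "visual_prediction": mse(matvec(params["decoder"], z_pred), obs_next),
    }

obs_t = [rng.random() for _ in range(OBS)]
action_t = [rng.random() for _ in range(ACT)]
obs_next = [rng.random() for _ in range(OBS)]

losses = joint_losses(params, obs_t, action_t, obs_next)
total_loss = sum(losses.values())  # equal weights, purely for illustration
```

The point of the sketch is that all four losses are computed from the same encoder output, so a gradient step on `total_loss` would shape one latent representation for every task at once.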
“Integrating these four tasks doesn’t simply quadruple the computational cost,” Dr. Zhang explains. While the initial stages of training may increase costs due to more objectives, once the model’s foundational capabilities are established, the marginal cost of learning new skills drops significantly. “By investing more in the early stages, we enable faster and cheaper large-scale skill learning in later stages,” he adds, emphasizing the core logic behind their foundational model research and development.
The Key to Overcoming the “Data Wall”
What makes LDA important for embodied intelligence is its ability to use heterogeneous datasets in a unified manner. It incorporates simulated and real-world data (sim-real fusion), data from human-robot collaboration, data of varying quality, and data with or without action labels, all of which have traditionally been handled separately.
This approach offers a promising solution to the “data wall” challenge in embodied intelligence: how to gather sufficient high-quality real-world data. As foundational models like LDA mature, robots may become capable of learning new skills with only a small amount of task-specific data, greatly expanding their potential applications.
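One simple way to picture training on data with and without action labels is to let each sample supervise only the objectives it has labels for. The sketch below is an assumption made for illustration, not a description of how LDA handles this; the sample fields, source names, and the rule itself are hypothetical.

```python
def active_objectives(sample):
    """Return the objectives a sample can supervise (hypothetical rule)."""
    objectives = ["visual_prediction"]  # needs only consecutive frames
    if sample.get("action") is not None:
        # Action-labeled data can also supervise the action-dependent tasks.
        objectives += ["forward_dynamics", "inverse_dynamics", "policy"]
    return objectives

def objective_counts(batch):
    """Count how many samples in a mixed batch contribute to each objective."""
    counts = {}
    for sample in batch:
        for name in active_objectives(sample):
            counts[name] = counts.get(name, 0) + 1
    return counts

# A mixed batch: robot and simulation data carry actions, human video does not.
batch = [
    {"source": "real_robot", "action": [0.1, -0.2]},
    {"source": "simulation", "action": [0.0, 0.3]},
    {"source": "human_video", "action": None},
]

counts = objective_counts(batch)
```

Under this rule, unlabeled video still contributes to visual prediction, while the action-dependent objectives draw only on the labeled samples.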
Dr. Zhang Zhizheng concludes, “AI will never truly be AGI (Artificial General Intelligence) if it remains confined to computers.” LDA could be a pivotal step in liberating AI from the digital realm and enabling it to operate effectively in the physical world.
Frequently Asked Questions
- What is LDA?
- LDA (Latent Dynamics Action) is a new foundational model developed by Chinese robotics company GalaxyBot. It integrates VLA (Vision-Language-Action) models and world models within a unified latent space so that "what actions to take" and "how the environment will change" are learned jointly. Its key feature is the ability to use diverse datasets within a single framework.
- How is LDA different from other world model research?
- The primary difference lies in LDA's focus on "embodied intelligence." While researchers like Fei-Fei Li and Yann LeCun focus on representing and predicting the world, LDA emphasizes how predicted environmental changes directly inform action generation, making it more practical for robots operating in the physical world.
- What kind of company is GalaxyBot?
- GalaxyBot is a Chinese tech startup specializing in humanoid robots and embodied intelligence. It was founded by Wang He and Dr. Zhang Zhizheng, the latter of whom is a former senior researcher at Microsoft Research. The company is dedicated to advancing AI technologies that enable physical-world interaction. Dr. Zhang was also named a "Beijing Model Worker" in 2025.