HiDream-O1-Image-1.5 Ranks Second Globally on Image Generation Leaderboard

The image generation model HiDream-O1-Image-1.5, released by HiDream.ai, has secured second place globally on the Artificial Analysis Text to Image Leaderboard, trailing only OpenAI and surpassing models from Google and NVIDIA.

June 11, 2026 | Reviewed and edited by the SINGULISM Editorial Team

Chinese AI company HiDream.ai announced its commercial image generation model “HiDream-O1-Image-1.5” in early June 2026. The model has achieved the second-highest score worldwide on the Text to Image Leaderboard run by the independent AI model evaluation platform Artificial Analysis, behind only OpenAI. This makes it the highest-ranked Chinese image generation model on the leaderboard.

Ranking Second Worldwide

The Artificial Analysis leaderboard combines anonymous pairwise comparisons, user voting, and a dynamic Elo ranking mechanism, a design intended to reduce brand bias in evaluation. HiDream-O1-Image-1.5 achieved an Elo score of 1265 based on more than 4,000 sample comparisons. It outperformed major Chinese and international models, including Google’s Nano Banana 2 (Gemini 3.1 Flash Image Preview), NVIDIA’s Cosmos3-Super-Text2Image, and ByteDance’s Seedream 4.0.
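For readers unfamiliar with Elo-style rankings, the following is a minimal sketch of how a single anonymous vote could move two models' ratings. It uses the textbook Elo formula with a 400-point scale and a K-factor of 32. Both values are assumptions, since the article does not describe the exact formula or parameters Artificial Analysis uses. The function names and the rival rating of 1200 are hypothetical and chosen only for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the
    textbook Elo model (400-point scale is an assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote.

    k is the K-factor; 32 is a common default, not a value taken
    from the leaderboard itself.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Whatever one model gains, the other loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # 1265 is the score reported in the article; 1200 is a hypothetical rival.
    p = expected_score(1265.0, 1200.0)
    print(f"Expected preference rate for the 1265-rated model: {p:.2f}")
    print(update_ratings(1265.0, 1200.0, a_won=True))
```

Under these assumptions, a 65-point lead corresponds to the higher-rated model being preferred in roughly 59% of head-to-head votes, which helps explain why thousands of comparisons are needed before rankings stabilize.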

Just two weeks earlier, the open-source version in the same series, “HiDream-O1-Image-Dev-2604,” had claimed the top spot in the open-source category. The two results in quick succession have drawn significant attention.

High General-Purpose Image Generation Capabilities

HiDream-O1-Image-1.5 demonstrates stable quality across a wide range of generation scenarios, from photorealistic portrait generation and dynamic animal expressions to spatial hierarchy control in natural landscapes and adaptation to diverse artistic styles. What sets it apart is its exceptional text rendering and layout control capabilities.

For e-commerce poster generation, it can integrate product subjects, layout structures, and mixed Chinese, Japanese, and English text in a single image. In multi-layered text rendering tasks, it embeds text naturally into posters, pitch decks, exploded views, and dashboards while balancing readability and layout stability. In IP character design, it supports multi-view generation while keeping characters consistent.

For multi-grid and storyboard generation, it understands continuous narratives and produces logically coherent multi-frame outputs. These capabilities make it more practical for commercial scenarios such as advertising and marketing, brand design, e-commerce visuals, game content, video storyboarding, and IP creation.

Advantages of the Proprietary Architecture

Underpinning HiDream-O1-Image-1.5 is “Unified Transformer (UiT),” which the company describes as the industry’s first native full-modal architecture. The company has moved UiT from “technical validation” to “production validation”: it first demonstrated the architecture’s effectiveness to the community through the open-source version, then developed it into a production tool with the commercial version.

The HiDream-O1 series shows a clear progression in capability across the 8B-parameter open-source version, the Pro version, and the 1.5 commercial version, reflecting both architectural innovation and rapid iteration. Complex layouts and multilingual text rendering have long been difficult for traditional text-to-image models, and the series’ results in these areas are presented as evidence that the UiT architecture is effective.

Editorial Opinion

In the short term, the arrival of a Chinese-developed image generation model in second place on a global leaderboard adds competitive pressure to an image generation market that OpenAI has led. Outperforming models from tech giants such as Google and NVIDIA on this leaderboard could, over time, contribute to shifts in market share.

From a long-term perspective, competition between base models offered by hyperscalers and independent vendors, including Chinese companies, is expected to intensify. HiDream.ai’s strategy of deploying its proprietary architecture in both open-source and commercial forms may serve as a powerful means to engage both the developer community and enterprise customers.

The key question going forward is how closely the Text to Image Leaderboard evaluations align with the requirements of real-world commercial workflows, such as consistency, text quality, and layout control. It is also worth monitoring how Chinese AI companies sustain model development and operations under U.S. export restrictions.

References

量子位 (QbitAI), “China’s First, World’s Second! HiDream-O1-Image-1.5 Tops the Text-to-Image Leaderboard,” published 2026-06-10. https://www.qbitai.com/2026/06/434196.html

Frequently Asked Questions

What architecture does HiDream-O1-Image-1.5 adopt?

It adopts the native full-modal architecture "Unified Transformer (UiT)." This enables complex layout control and multilingual text rendering, which traditional text-to-image generation models struggle with.

Can this model be used commercially in practice?

Yes. It is released as a commercial version and is designed for commercial scenarios such as advertising and marketing, e-commerce visuals, game content, and IP creation. An open-source version (Dev-2604) is also available for developers to try for free.

In what aspects did it surpass models from Google and NVIDIA?

On the Artificial Analysis Text to Image Leaderboard, it achieved an Elo score of 1265, higher than the listed models from Google and NVIDIA. It was rated particularly highly for text rendering, complex image composition, multi-object control, and the quality of mixed-language text.

Source: 量子位

Written by SINGULISM Editorial Team

Edited & reviewed by Kenichiro Yamamoto

Last updated: June 10, 2026
