DocLang: Standardizing a Document Format for AI Applications Begins
The DocLang Working Group, whose members include IBM and NVIDIA, has launched to standardize an AI-native document format as an alternative to PDF and Markdown in AI pipelines. The format is based on a constrained XML specification and aims for token efficiency and structural retention.
The LF AI & Data Foundation, part of the Linux Foundation, has established a working group to standardize “DocLang,” a new file format designed for AI models to process documents more efficiently, as reported by The Register.
DocLang’s founding members include IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis. The initiative arises from the recognition that existing document formats were designed for human readability and are not well-suited for AI models to accurately interpret structure and meaning.
Issues with Existing Formats
DocLang’s developers criticize widely used formats like PDF, Markdown, HTML, and LaTeX for their fundamental flaws. PDF retains layout information but often loses semantic structures, such as headings, paragraphs, and list hierarchies. Markdown has limited expressiveness and struggles with complex tables or equations. HTML is verbose, adding unnecessary noise when tokenized by AI models. LaTeX, while highly flexible, often leads to ambiguities during parsing.
These formats were all created with “human-readable rendering” as their primary purpose. Consequently, when AI models convert documents into token sequences, semantic information, structural relationships, and geometric context are often lost. This limitation has become a bottleneck, especially in enterprise settings, where precise structural understanding is critical for contracts, technical documents, regulatory materials, and more.
The Design Philosophy of DocLang
To address these challenges, DocLang is designed as a markup language optimized for LLM tokenizers. Its key technical innovation is establishing a one-to-one mapping between DocLang elements and LLM tokens, enabling models to directly recognize document structures without unnecessary interpretation.
The specification relies on a constrained XML vocabulary, supporting common graphical elements such as tables, equations, charts, and multimodal content. It also enables lossless transformations, ensuring no information is removed when rendering for human readability. Thus, DocLang documents remain easily readable by AI while fully preserving the original information.
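As a rough illustration of the idea described above, the following Python sketch treats each structural tag in a small XML document as a single reserved token. The tag names (doc, heading, para, table, row, cell), the ALLOWED_TAGS set, and the tokenize function are all hypothetical and invented for this example; the actual DocLang vocabulary has not been published in detail, and real LLM tokenizers split text into subword units rather than on whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary for illustration only: NOT the real DocLang element set.
ALLOWED_TAGS = {"doc", "heading", "para", "table", "row", "cell"}

SAMPLE = (
    "<doc>"
    "<heading>Payment terms</heading>"
    "<para>Invoices are due in 30 days.</para>"
    "<table><row><cell>Fee</cell><cell>100 EUR</cell></row></table>"
    "</doc>"
)


def tokenize(element):
    """Flatten an element tree into tokens, emitting one token per tag boundary."""
    if element.tag not in ALLOWED_TAGS:
        raise ValueError(f"tag not in constrained vocabulary: {element.tag}")
    tokens = [f"<{element.tag}>"]
    # Text directly inside the element (whitespace split stands in for a real tokenizer).
    if element.text and element.text.strip():
        tokens.extend(element.text.split())
    for child in element:
        tokens.extend(tokenize(child))
        # Text that follows a child but still belongs to this element.
        if child.tail and child.tail.strip():
            tokens.extend(child.tail.split())
    tokens.append(f"</{element.tag}>")
    return tokens


tokens = tokenize(ET.fromstring(SAMPLE))
structural = [t for t in tokens if t.startswith("<")]
print(len(tokens), "tokens,", len(structural), "of them structural")
```

Because the vocabulary is closed, every structural marker can be reserved as a single entry in a tokenizer's vocabulary, and any tag outside the set is rejected rather than silently passed through. That is the property this sketch is meant to show, not a claim about how DocLang itself is implemented.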
Maxime Vermeir, VP of AI Strategy at ABBYY and a working group member, stated, “DocLang is designed to solve fundamental issues in enterprise AI. It provides a minimal, standardized, AI-native representation of document structure, layout, meaning, and governance, creating a more deterministic foundation for modern AI systems.”
DocLang builds on the open-source toolkit “Docling,” developed by IBM and released in late 2024. Docling converts various file formats, such as PDFs and images, into structured data, much like Microsoft’s MarkItDown and the Marker project. DocLang standardizes Docling’s output, enabling interoperability across different systems.
Contribution to Cost Reduction
One notable advantage of DocLang is its potential to reduce AI inference costs. According to estimates from AI Cost Check cited by The Register, running OCR on a PDF with an AI model consumes approximately 1,200 input tokens and 150 output tokens per document. While small for a single document, this token consumption becomes a significant cost factor when enterprises process thousands or tens of thousands of documents.
With DocLang, redundant rendering information is eliminated, and text is expressed in a structure directly understandable by AI models. This allows the same content to be processed with fewer tokens. Since token pricing varies significantly among providers, enterprises face the risk of unforeseen costs, which DocLang aims to mitigate by reducing uncertainty.
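To make the arithmetic concrete, the sketch below scales the per-document figures cited above to a batch of documents. The token counts come from the estimate quoted in this article; the per-million-token prices, the batch size, and the 30% input reduction are assumptions chosen purely for illustration, since no measured savings for DocLang have been published.

```python
def batch_cost(docs, input_tokens, output_tokens, input_price, output_price):
    """Return the USD cost of a batch; prices are per million tokens."""
    input_cost = docs * input_tokens * input_price / 1_000_000
    output_cost = docs * output_tokens * output_price / 1_000_000
    return input_cost + output_cost


# Per-document figures cited in the article.
INPUT_TOKENS = 1200
OUTPUT_TOKENS = 150

# Assumptions for illustration only: not real provider prices or measured savings.
INPUT_PRICE = 3.00              # USD per million input tokens (hypothetical)
OUTPUT_PRICE = 15.00            # USD per million output tokens (hypothetical)
ASSUMED_INPUT_REDUCTION = 0.30  # hypothetical share of input tokens saved
DOCS = 10_000

baseline = batch_cost(DOCS, INPUT_TOKENS, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)
reduced_input = round(INPUT_TOKENS * (1 - ASSUMED_INPUT_REDUCTION))
reduced = batch_cost(DOCS, reduced_input, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)
print(f"baseline: ${baseline:.2f}, with assumed reduction: ${reduced:.2f}")
```

Under these assumed numbers, 10,000 documents cost about $58.50 at baseline and about $47.70 with the hypothetical reduction. The absolute figures matter less than the shape of the calculation: cost scales linearly with document count, so any per-document saving is multiplied across the whole batch.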
Ecosystem and Future Development
DocLang is being developed as an open standard. By forming a working group under the LF AI & Data Foundation, the goal is to establish vendor-neutral specifications and encourage broad industry participation.
Founding members include platform vendors like IBM, NVIDIA, and Red Hat, alongside companies specializing in AI document processing, such as ABBYY, HumanSignal (the company behind Label Studio), and Forgis (a consulting firm specializing in document analysis). This collaboration is expected to ensure interoperability across the entire toolchain.
The roadmap includes the detailed publication of specifications, the provision of reference implementations, and integration with existing document processing pipelines. Beyond format conversion supported by Docling, tools capable of outputting DocLang and wrappers for feeding DocLang into LLMs are likely to follow.
Editorial Opinion
In the short term, the establishment of the DocLang Working Group sends a clear signal to the enterprise AI industry. The move to standardize an “AI-native” format to replace PDF and Markdown could significantly improve efficiency in applications like retrieval-augmented generation (RAG) and agent systems. Over the next three to six months, we can expect the emergence of conversion tools and backend integrations compatible with DocLang, paving the way for broader evaluations.
From a long-term perspective, the key question is whether this effort can become the de facto standard for “AI-oriented formatting.” While formats like HTML or Markdown have been used both by humans and machines to some extent, DocLang is specifically optimized for machine readability. However, this specialization means additional costs for creating a parallel ecosystem to handle rendering for human users. Over the next one to three years, the widespread adoption of DocLang will depend on whether major document management systems and CMS platforms start supporting its input and output natively.
From the editorial team’s point of view, DocLang is particularly worth considering for enterprises with large-scale document processing needs, such as law firms, financial institutions, and pharmaceutical companies. Its potential for token cost reduction and quality improvement is significant. Nevertheless, the working group has only just been established, and actual performance and community support remain to be seen. The crucial question is whether the industry will truly embrace this standard or deem improvements to existing tools sufficient.
References
- A modest proposal: Reformat everything to make documents more palatable to AI — The Register — Published on 2026-06-16
- Docling — IBM — Open-source document conversion toolkit
- Microsoft MarkItDown LLM Markdown Conversion Tool — Previously covered by our site
- LF AI & Data Foundation — DocLang — Information on the working group (planned)
Frequently Asked Questions
- Does DocLang completely replace existing formats like PDF or Markdown?
- DocLang is not designed as a rendering format for direct human reading but as an intermediary format for AI models to process documents efficiently. It is intended to be used as part of an AI processing pipeline and can be converted back into traditional formats like PDF or HTML for human-readable purposes. It is not a complete replacement for existing formats.
- What kind of XML structure does DocLang use?
- Few details of the specification have been disclosed so far, but DocLang employs a constrained XML vocabulary that supports tables, equations, charts, and multimodal content. Each element is designed for a one-to-one correspondence with LLM tokens, eliminating redundancy.