Microsoft Research Highlights Challenges of AI Models in Long-Duration Task Processing
Researchers at Microsoft Research reveal that even the latest AI models encounter errors in long-duration workflows. Among 52 domains tested, only Python programming met the benchmark.
Microsoft Research Evaluates AI’s Ability to Handle Long-Duration Tasks
Microsoft Research, the research arm of Microsoft, has found that large language models (LLMs) and AI agents are prone to significant errors when managing long-duration, multi-step workflows. The researchers stress that companies must exercise extreme caution before integrating AI into automated workflows.
Testing 52 Domains with the DELEGATE-52 Benchmark
To evaluate how LLMs handle extended knowledge work tasks, the research team developed a benchmark called “DELEGATE-52.” This benchmark simulates multi-step workflows across 52 specialized professional domains, including programming, crystallography, and music notation.
For instance, in the accounting domain, an LLM was tasked with categorizing and reorganizing the accounting ledger of a non-profit organization based on an initial set of documents. This task required far more sophisticated processing capabilities than simple spreadsheet management.
Frontier Models Show Average 25% Content Loss
The results showed that current LLMs introduce significant errors during document editing tasks. Even frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content after 20 delegation interactions; across all tested models, the average degradation reached 50%.
The research team set a threshold for considering a model "ready" for a given task: 98% accuracy or higher after 20 interactions. Only one of the 52 tested domains, Python programming, met this standard; in every other domain, the LLMs fell well short.
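The gap between the observed 25% loss and the 98% readiness bar is easier to grasp with a simple compounding model. The following sketch is an illustration, not the paper's methodology: it assumes each delegation interaction independently retains a fixed fraction of the document, so retention after n interactions compounds multiplicatively.

```python
# Illustrative only: assumes a constant per-interaction retention rate,
# which is a simplification the paper itself may not make.

def retained_after(per_step_retention: float, interactions: int) -> float:
    """Fraction of original content surviving the given interactions."""
    return per_step_retention ** interactions

def per_step_for_target(target_retention: float, interactions: int) -> float:
    """Per-interaction retention needed to end at target_retention."""
    return target_retention ** (1 / interactions)

# 25% loss after 20 interactions implies roughly 98.6% retention per step:
observed = per_step_for_target(0.75, 20)   # about 0.9857
# The 98% readiness bar would require roughly 99.9% retention per step:
required = per_step_for_target(0.98, 20)   # about 0.9990
```

Under this toy model, closing the gap means cutting the per-interaction error rate by an order of magnitude, which helps explain why only one domain cleared the bar.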
Significant Performance Decline in Natural Language Tasks
While LLMs showed relatively strong performance in programming tasks, their capabilities declined notably in tasks involving natural language. The research paper, titled “LLMs Corrupt Your Documents When You Delegate,” has been published as a preprint.
These findings challenge prevailing expectations that AI agents can autonomously handle complex tasks. They stand in stark contrast to vendor claims: Anthropic has advertised that its Claude Cowork can "complete tasks by interacting with computers, local files, and applications when given a goal," and Microsoft promotes Microsoft 365 Copilot as able to "handle complex, multi-step research across work data and the web."
Implications for the Industry and Future Challenges
These findings serve as a critical warning to companies looking to adopt AI for automated workflows. The study highlights the significant performance degradation of AI models in long-duration delegated tasks, underlining the need for human oversight and intervention.
Moving forward, the key challenges include improving LLMs’ long-term memory and context retention capabilities, as well as enhancing error detection and recovery mechanisms. Overcoming these technical hurdles will be essential for AI agents to become more reliable in practical environments.
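One practical form such error detection and human oversight could take is a checkpointed delegation loop: an integrity check runs after each agent step, and the workflow pauses for human review as soon as content loss exceeds a tolerance. The sketch below is a hypothetical illustration; `run_agent_step` and `measure_retention` are stand-in callables, not a real API.

```python
# Hedged sketch of checkpointed delegation; the halting policy is an
# assumption, not a mechanism described in the study.

def delegate_with_checkpoints(document, steps, run_agent_step,
                              measure_retention, min_retention=0.98):
    """Run delegation steps, halting for human review when retention
    against the original document drops below min_retention."""
    original = document
    for i in range(steps):
        candidate = run_agent_step(document)
        if measure_retention(original, candidate) < min_retention:
            # Checkpoint: surface the last acceptable draft for human
            # intervention instead of silently compounding errors.
            return document, f"paused for review at step {i + 1}"
        document = candidate
    return document, "completed"
```

A phased rollout could start with a tight tolerance and relax it only in domains where a model demonstrably holds up over many interactions.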
Frequently Asked Questions
- Which AI models were tested in this research?
- The research tested frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4. Despite being state-of-the-art, these models experienced an average content loss of 25% during long-duration workflow processing.
- Why did only Python programming meet the benchmark?
- Python programming involves structured code generation tasks, which are an area of strength for LLMs. In contrast, other domains require handling ambiguity and complex contextual understanding, leading to performance declines.
- How might these findings impact the practical use of AI?
- The study suggests the necessity of human oversight when using AI agents for long-duration autonomous tasks. Companies should consider phased implementations and establish checkpoints instead of fully automating processes.