Dev

Tool-Calling Issues in Anthropic’s Latest Models: Concerns Over RL Overfitting

Strange anomalies in tool-calling functionality have surfaced in Anthropic's latest Opus 4.8 and Sonnet 5 models, raising concerns of RL overfitting.

5 min read Reviewed & edited by the SINGULISM Editorial Team

Tool-Calling Issues in Anthropic’s Latest Models: Concerns Over RL Overfitting
Photo by Brecht Corbeel on Unsplash

According to a report on his blog by Armin Ronacher, known for developing the Flask framework, severe quality regressions in tool-calling functionality have been observed in Anthropic’s latest models “Claude Opus 4.8” and “Sonnet 5.” Issues emerged in Ronacher’s project “Pi,” where tool calls that worked flawlessly with older models failed in the latest versions. These models introduced extraneous, non-schema fields into the tool-calling process, causing errors that were not present in previous iterations.

Details of the Issue

The problem, as experienced by Ronacher, pertains to tool calls issued by the model to the Pi edit tool. The tool is designed to edit files using two parameters: path and edits. The edits parameter is an array of objects, each containing the fields oldText and newText. However, Opus 4.8 introduced additional keys into the edits array that were not defined in the schema. Although the content of the edits was often correct, the arguments did not align with the schema, prompting Pi to reject the tool calls and request retries from the model. This led to unnecessary round trips, reducing efficiency.

Ronacher emphasized that this issue is qualitatively distinct from the known phenomenon of models occasionally generating invalid tool calls. What makes this case particularly notable is that the issue is prominent in the latest state-of-the-art models, Opus 4.8 and Sonnet 5, and was not observed in older versions. In effect, the most advanced models in the Anthropic family have shown inferior performance with certain tool schemas compared to their predecessors. Ronacher refrained from testing the Fable model, expressing concerns that a classifier within Fable might silently downgrade requests to Opus.

How Tool Calling Works

To understand the root of the issue, it is essential to grasp how tool-calling works in large language models (LLMs). As Ronacher explains, tool-calling is not “magic” but relies on fairly rudimentary in-band signaling. The model takes in conversation history, a system prompt, and a list of available tools. The server converts these inputs into a large prompt interspersed with special marker tokens. During training and reinforcement learning (RL), the model learns specific format patterns and generates token sequences that signify “call this tool with these arguments” during text generation.

In Anthropic’s models, this format is represented by XML-like markers such as <antml:function_calls>. However, this is not true XML but a custom format optimized for tokenization and training. Notably, while top-level string parameters are expressed inline, arrays of objects are embedded via JSON serialization. This “embedded JSON” structure appears to be a significant factor behind the reported issue.

The Possibility of RL Overfitting

Ronacher speculates that the observed behavior might result from overtraining through reinforcement learning on Anthropic’s proprietary “Claude Code” harness. This closed-source system may differ subtly from the tool declarations available to external developers. The models, optimized for the Claude Code harness, might be generating descriptions that are “close but slightly off” when interacting with general-purpose tool schemas. In other words, patterns learned for the internal Claude Code schema may manifest as extraneous fields when applied to other tools.

If this hypothesis holds true, it suggests that Anthropic evaluated tool-calling quality within the closed Claude Code environment and optimized their evaluation metrics accordingly, potentially compromising the models’ generalizability. This could be considered a form of RL overfitting.

Implications for Developers

This issue has significant implications for developers leveraging LLMs for tool integration. Firstly, it emphasizes the need for automated testing to ensure tool-call compatibility whenever a model is upgraded. The assumption that “newer models are smarter” may no longer hold true, as this case demonstrates the risk of regression in specific tasks.

Secondly, the issue highlights the dependency of tool-calling formats on the RL processes within the model. Developers must consider how closely their tool definitions align with the model’s training data and RL processes. Particularly, when model providers optimize for their proprietary ecosystems, the resultant benefits may not be evenly distributed to external users.

Ronacher’s insights go beyond a mere bug report. They underscore that advancements in LLMs do not always translate into linear improvements in general-purpose performance. This serves as a cautionary tale about relying solely on specific benchmarks or use cases to evaluate model quality.

Editorial Opinion

This case serves as a critical reminder that advancements in AI models do not always manifest as improvements in practical tool integration. In the short term, it is likely that Anthropic will acknowledge this issue and revise its reinforcement learning strategies and evaluation methods. If the gap between the Claude Code harness and general API tool declarations is indeed the root cause, Anthropic should incorporate general tool test cases into its training pipeline. Additionally, the developer community is expected to adopt practices like incorporating regression testing for tool calls into continuous integration processes to verify model behavior for every version update.

From a long-term perspective, there is a growing need for standardized benchmarks to assess the quality of tool-calling in LLMs. Current benchmarks primarily focus on tasks like question answering and coding capabilities but lack metrics to evaluate strict adherence to tool definitions and schema requirements. Model providers should prioritize general-purpose tool-calling quality over ecosystem-specific optimizations. Failure to do so may result in further ecosystem lock-in, eroding external developer trust.

References

Frequently Asked Questions

Is this issue present in all of Anthropic’s models?
As of now, Ronacher has identified the issue in Opus 4.8 and Sonnet 5. Older models, such as Opus 4.5, have not exhibited similar problems. The Fable model was not tested.
How should developers address this problem?
Developers are advised to implement strict validation for tool-calling schemas and create error-handling mechanisms to manage extraneous fields generated by the model. Additionally, fixing model versions during deployment and conducting regression tests before transitioning to newer versions is recommended.
Source: Lobsters

Comments

← Back to Home