CRITICTOOL: Evaluating Self-Critique Capabilities
of Large Language Models in Tool-Calling Error Scenarios

Shiting Huang1*, Zhen Fang1,3*, Zehui Chen1, Siyu Yuan2, Junjie Ye2,
Yu Zeng1, Lin Chen1, Qi Mao3, Feng Zhao1†
1 University of Science and Technology of China 2 Fudan University 3 Communication University of China

Abstract

The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CriticTool, a comprehensive critique evaluation benchmark specialized for tool learning. Built upon a novel evolutionary strategy for dataset construction, CriticTool contains diverse tool-use errors of varying complexity, better reflecting real-world scenarios. We conduct extensive experiments on CriticTool and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability of various LLMs, offering a new perspective on the field of tool learning in LLMs.


Figure 1. Overview of CriticTool.

Dataset

Tool calling is a challenging long-horizon task for LLMs. It requires robust planning ability (Fig. 1(a)) to develop a tool-calling strategy that guides subsequent actions. Along the way, models may make mistakes and may also correct them. We regard this recovery from error, i.e., the ability of an LLM to successfully handle an error at a given step, as the model's self-critique capability.
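As a concrete illustration, the following is a minimal Python sketch of a multi-step tool-calling loop with a self-critique branch. The llm object, its plan_next_call and reflect_and_revise methods, and the tools mapping are hypothetical stand-ins for this page only, not CriticTool's actual interface.

# A minimal, hypothetical tool-calling loop. The llm object, its
# plan_next_call / reflect_and_revise methods, and the tools dict are
# illustrative stand-ins, not CriticTool's actual interface.
import json


def run_task(llm, tools, query, max_steps=10):
    """Run a multi-step tool-calling task with a simple self-critique branch."""
    history = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        step = llm.plan_next_call(history, tools)   # choose a tool and its arguments
        if step.get("finish"):                      # planner decides the task is done
            return history
        result = tools[step["name"]](**step["arguments"])
        history.append({"role": "tool", "content": json.dumps(result)})
        if isinstance(result, dict) and result.get("error"):
            # Self-critique: reflect on the failed step, then correct, retry, or skip
            revision = llm.reflect_and_revise(history, tools)
            history.append({"role": "assistant", "content": json.dumps(revision)})
    return history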

Through extensive experiments, we divide the errors LLMs make during tool invocation into two categories: Internal Model-Driven Errors and External Environment Errors. Internal Model-Driven Errors can be further divided into Selection Errors, Tool Hallucination Errors, Parameter Key Errors, and Parameter Value Errors (Fig. 2).

Figure 2. Examples of errors in multi-step tool-call tasks.
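To make the taxonomy concrete, the sketch below models the two categories and the four internal error types as Python enums, together with a rough helper that flags the cases detectable from the tool schema alone. The dictionary layout of a tool call and the tool_schemas mapping are assumptions for illustration, not CriticTool's data format.

# A sketch of the error taxonomy above as Python enums, plus a rough
# categorization helper. The dict layout of a tool call and the tool_schemas
# mapping are assumptions for illustration only.
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    INTERNAL_MODEL_DRIVEN = "internal_model_driven"    # caused by the model itself
    EXTERNAL_ENVIRONMENT = "external_environment"      # e.g. timeouts, API outages


class InternalErrorType(Enum):
    SELECTION = "selection"                    # a real but wrong tool is chosen
    TOOL_HALLUCINATION = "tool_hallucination"  # tool name not in the tool list
    PARAMETER_KEY = "parameter_key"            # wrong or missing argument name
    PARAMETER_VALUE = "parameter_value"        # valid argument name, wrong value


def detect_internal_error(call: dict, tool_schemas: dict) -> Optional[InternalErrorType]:
    """Flag the internal errors detectable from the schema alone.

    Selection and parameter-value errors generally require ground truth or
    environment feedback, so this helper only covers the first two checks.
    """
    if call["name"] not in tool_schemas:
        return InternalErrorType.TOOL_HALLUCINATION
    if set(call["arguments"]) - set(tool_schemas[call["name"]]):
        return InternalErrorType.PARAMETER_KEY
    return None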

The data construction of CRITICTOOL is systematically structured in four phases to mirror real-world tool-use complexities. First, high-quality tool-use trajectories are collected from benchmarks such as BFCL and T-Eval, with manual filtering to ensure data reliability and standardization of tool-call formats. Second, error diversification generates internal model-driven errors via an error simulator and external environment errors (e.g., connection timeouts) through repeated API calls, or through an API simulator for inaccessible APIs. Third, tool response handling employs cache retrieval, actual API execution, or GPT-4o simulation to provide context-specific feedback for error scenarios. Finally, the Scalable and Robust Mixed Self-Evolution (SRM) strategy enhances dataset realism by introducing long contexts, extra tools, noisy queries, and obfuscated API documentation. This pipeline, sketched below, yields a dataset of 1,490 base and 1,250 evolved examples for comprehensive evaluation of LLM self-critique.
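The sketch below mirrors these four phases as a hypothetical Python driver. The benchmarks, error_simulator, tool_executor, and evolver objects and their methods are illustrative placeholders that follow the prose above; the released pipeline may differ.

# Hypothetical driver mirroring the four construction phases described above.
# The benchmarks, error_simulator, tool_executor, and evolver objects (and
# their methods) are illustrative, not the released pipeline.

def build_dataset(benchmarks, error_simulator, tool_executor, evolver):
    dataset = []
    # Phase 1: collect and manually filter tool-use trajectories (e.g. BFCL, T-Eval)
    for traj in benchmarks.collect_trajectories():
        # Phase 2: error diversification -- internal errors via the error simulator,
        # external errors (e.g. connection timeouts) via repeated or simulated API calls
        error_step = error_simulator.inject(traj)
        # Phase 3: tool response handling via cache retrieval, live API execution,
        # or GPT-4o simulation for inaccessible APIs
        response = tool_executor.respond(error_step)
        base_sample = {
            "trajectory": traj,
            "error_step": error_step,
            "tool_response": response,
        }
        dataset.append(base_sample)
        # Phase 4: SRM self-evolution adds long contexts, extra tools,
        # noisy queries, and obfuscated API documentation
        dataset.append(evolver.evolve(base_sample))
    return dataset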

In summary, CRITICTOOL is the first benchmark for evaluating LLMs' self-critique capabilities in tool-calling error scenarios, built through data collection, error diversification, tool response handling, and a self-evolution strategy.

Evaluation


CRITICTOOL's evaluation metrics assess LLM self-critique capabilities along several fine-grained dimensions (a small scoring sketch follows this list):

  • Reflect: Evaluates the model's ability to detect errors and identify their categories (e.g., tool selection, parameter key errors).
  • Correct: Measures the capacity to generate valid tool calls that resolve internal model-driven errors.
  • Retry: For external environment errors, measures whether the model retries the failed call within a reasonable limit.
  • Skip/Finish: Assesses the ability to terminate or pivot to subsequent tasks after unsuccessful retries.
  • Overall score: Synthesizes performance to reflect holistic error management capabilities in tool-using tasks.
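As a rough illustration of how these metrics could be combined per sample, the following hypothetical scoring function uses assumed field names and an equal-weight average for the overall score; it is a sketch under those assumptions, not CriticTool's official scoring formula.

# Hypothetical per-sample scoring sketch; field names and the equal weighting
# in the overall score are assumptions, not CriticTool's official formula.

def score_sample(prediction: dict, ground_truth: dict) -> dict:
    scores = {}
    # Reflect: error detected and its category correctly identified
    scores["reflect"] = float(
        prediction.get("detected_error", False)
        and prediction.get("error_type") == ground_truth["error_type"]
    )
    if ground_truth["category"] == "internal":
        # Correct: the regenerated tool call resolves the model-driven error
        scores["correct"] = float(
            prediction.get("tool_call") == ground_truth["tool_call"]
        )
    else:
        # Retry: the failed call is retried, but within the allowed budget
        scores["retry"] = float(
            0 < prediction.get("retries", 0) <= ground_truth["max_retries"]
        )
        # Skip/Finish: after unsuccessful retries, the model moves on or stops
        scores["skip_finish"] = float(
            prediction.get("final_action") in {"skip", "finish"}
        )
    # Overall: simple average as a stand-in for the benchmark's aggregate score
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores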

Result


Figure 3. The main result of CriticTool. Bold indicates the best performance across all models, while underline denotes the best performance within the same group and scale of models.



Citation


@article{huang2025critictool,
  title={CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios},
  author={Huang, Shiting and Fang, Zhen and Chen, Zehui and Yuan, Siyu and Ye, Junjie and Zeng, Yu and Chen, Lin and Mao, Qi and Zhao, Feng},
  journal={arXiv preprint arXiv:2506.13977},
  year={2025}
}