MCP-Bench is a comprehensive evaluation framework for assessing the tool-use capabilities of Large Language Models (LLMs) through the Model Context Protocol (MCP). The benchmark provides an end-to-end pipeline for evaluating how effectively different LLMs discover, select, and use tools to solve real-world tasks.
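To make the evaluated loop concrete, here is a minimal sketch of one MCP tool-use round trip using the official MCP Python SDK: connect to a server over stdio, list its tools, and invoke one. This is illustrative only and is not MCP-Bench's harness; the server command, script name, tool name, and arguments are placeholders.

```python
# Minimal sketch of the discover -> select -> call loop that MCP-Bench evaluates.
# Illustrative only; not MCP-Bench's harness. The server command, script,
# tool name, and arguments below are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="python",            # placeholder: launch an MCP server over stdio
    args=["example_server.py"],  # placeholder server script
)

async def run() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discovery: the agent sees every tool the server exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Selection + invocation: the agent picks a tool and supplies
            # arguments that must satisfy the tool's input schema.
            result = await session.call_tool("example_tool", arguments={"query": "hello"})
            print(result.content)

if __name__ == "__main__":
    asyncio.run(run())
```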
| Model | Overall Score | Valid Tool Name Rate | Schema Compliance | Execution Success | Task Fulfillment | Information Grounding | Tool Appropriateness | Parameter Accuracy | Dependency Awareness | Parallelism and Efficiency |
|---|---|---|---|---|---|---|---|---|---|---|
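As a rough illustration of what two of the leaderboard columns could measure, the sketch below computes a valid-tool-name rate and a schema-compliance rate from a list of recorded tool calls. The trace format, function names, and metric definitions are assumptions for illustration, not MCP-Bench's actual scoring code.

```python
# Hypothetical sketch of two per-call metrics; the trace format and function
# names are assumptions, not MCP-Bench's actual scorer.
from typing import Any

from jsonschema import ValidationError, validate


def valid_tool_name_rate(calls: list[dict[str, Any]], tool_schemas: dict[str, dict]) -> float:
    """Fraction of tool calls whose tool name exists among the discovered tools."""
    if not calls:
        return 0.0
    valid = sum(1 for call in calls if call["tool"] in tool_schemas)
    return valid / len(calls)


def schema_compliance_rate(calls: list[dict[str, Any]], tool_schemas: dict[str, dict]) -> float:
    """Fraction of tool calls whose arguments validate against the tool's JSON Schema."""
    if not calls:
        return 0.0
    compliant = 0
    for call in calls:
        schema = tool_schemas.get(call["tool"])
        if schema is None:
            continue  # unknown tool: already penalized by valid_tool_name_rate
        try:
            validate(instance=call["arguments"], schema=schema)
            compliant += 1
        except ValidationError:
            pass
    return compliant / len(calls)
```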
```bibtex
@article{wang2025mcpbench,
  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
  author={Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene},
  journal={arXiv preprint arXiv:2508.20453},
  year={2025}
}
```