Progressive MCP Bench Results


Generated: 17/12/2025, 11:31:28 am

Total Runs: 78
Models: 11
Strategies: 6

1. Baseline Tool Calling Ability

Each model's performance on the minimal-tools strategy, where only the exact tools needed for each task are provided. This measures the model's fundamental ability to understand and correctly invoke tools without distractions or discovery overhead.
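As a rough illustration of the baseline comparison, the sketch below ranks the eleven main models by their minimal-tools score. The scores are copied from the raw data table on this page; the dictionary and the `rank_baseline` helper are hypothetical names introduced only for this example and are not part of the benchmark's code.

```python
# Minimal-tools scores copied from the raw data table (200 samples per model).
# The names below are illustrative; they are not the benchmark's own code.
MINIMAL_TOOLS_SCORES = {
    "GPT OSS 20B": 0.945,
    "Emberglow Large": 0.955,
    "GPT OSS 120B": 0.680,
    "Minimax M2": 0.565,
    "Kimi K2": 0.650,
    "GPT-5 Nano": 0.835,
    "GPT-5.1": 0.850,
    "Claude 4.5 Haiku": 0.890,
    "Claude 4.5 Sonnet": 0.905,
    "Gemini 2.5 Flash": 0.910,
    "Gemini 3 Pro": 0.360,
}


def rank_baseline(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank_baseline(MINIMAL_TOOLS_SCORES)
for model, score in ranking:
    print(f"{model:<20} {score:.3f}")
```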


2. Strategy Impact on Accuracy

How accuracy degrades as we introduce more tools and require discovery. From left to right: minimal-tools (best case), minimal-servers (some extra tools), distraction-64 (64 irrelevant tools), distraction-128 (128 irrelevant tools).

Reading order (left to right per model): Minimal Tools → Minimal Servers → Distraction 64 → Distraction 128 → Directory → Copilot
● Circle = baseline strategies
▲ Triangle = Directory
★ Star = Copilot
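One way to quantify the effect along the baseline spectrum is the signed score change from Minimal Tools to Distraction 128. The sketch below does this for four models, using scores copied from the raw data table; `SCORES` and `score_change` are hypothetical names for illustration. A negative value means accuracy dropped, while a positive value means the model scored higher with 128 distractor tools than with the minimal set.

```python
# Scores per model in baseline order, copied from the raw data table.
BASELINE_ORDER = ["Minimal Tools", "Minimal Servers", "Distraction 64", "Distraction 128"]

SCORES = {
    "GPT OSS 20B": [0.945, 0.905, 0.910, 0.890],
    "GPT OSS 120B": [0.680, 0.615, 0.490, 0.445],
    "Kimi K2": [0.650, 0.680, 0.700, 0.705],
    "Claude 4.5 Haiku": [0.890, 0.900, 0.950, 0.945],
}


def score_change(series):
    """Signed change from the first entry (Minimal Tools) to the last (Distraction 128)."""
    return round(series[-1] - series[0], 3)


changes = {model: score_change(series) for model, series in SCORES.items()}
for model, delta in changes.items():
    print(f"{model:<18} {delta:+.3f}")
```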

3. Advanced Discovery Strategies

How do intelligent discovery mechanisms (copilot routing, directory lookup) compare to the baseline spectrum? The four baseline strategies show the range from best to worst case, while copilot and directory show where smarter discovery lands within that range.

● Circle = baseline strategies
▲ Triangle = Directory
★ Star = Copilot
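To make "where smarter discovery lands within that range" concrete, the sketch below classifies a discovery score as above, within, or below the span of the four baseline scores for the same model. The helper `position_in_range` is a hypothetical name, and the numbers are copied from the raw data table.

```python
def position_in_range(baseline_scores, score):
    """Classify a score relative to the min/max of the baseline scores."""
    low, high = min(baseline_scores), max(baseline_scores)
    if score > high:
        return "above"
    if score < low:
        return "below"
    return "within"


# Baseline scores in order: Minimal Tools, Minimal Servers, Distraction 64, Distraction 128.
gpt_oss_120b_baseline = [0.680, 0.615, 0.490, 0.445]
kimi_k2_baseline = [0.650, 0.680, 0.700, 0.705]

results = {
    ("GPT OSS 120B", "Directory"): position_in_range(gpt_oss_120b_baseline, 0.760),
    ("GPT OSS 120B", "Copilot"): position_in_range(gpt_oss_120b_baseline, 0.550),
    ("Kimi K2", "Directory"): position_in_range(kimi_k2_baseline, 0.525),
    ("Kimi K2", "Copilot"): position_in_range(kimi_k2_baseline, 0.100),
}
for (model, strategy), position in results.items():
    print(f"{model} / {strategy}: {position} baseline range")
```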

4. Token Efficiency

Accuracy versus total token usage for the practical strategies. The ideal position is the upper left (high accuracy, low token usage). The Directory strategy should show significantly lower token usage while maintaining competitive accuracy. Minimal Servers is shown at half opacity because it requires prior knowledge of which servers are needed.

Copilot — Dynamic tool discovery via copilot routing

Minimal Servers — Only required servers loaded (half opacity, not practical without prior knowledge)

Directory — Tool discovery via directory lookup
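The "upper-left is best" reading can be checked numerically by asking which strategies are not dominated, meaning no other strategy has both fewer (or equal) tokens and a higher (or equal) score. The sketch below is a minimal illustration using the Claude 4.5 Haiku rows from the raw data table; `parse_tokens` and `pareto_front` are hypothetical helpers, and the token strings are assumed to use only the `K` and `M` suffixes seen on this page.

```python
def parse_tokens(text):
    """Convert a token string such as '2.18M' or '895.4K' to an integer count."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1]
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


def pareto_front(points):
    """Return names whose (tokens, score) point is not dominated by any other point."""
    front = []
    for name, (tokens, score) in points.items():
        dominated = any(
            other_tokens <= tokens
            and other_score >= score
            and (other_tokens < tokens or other_score > score)
            for other, (other_tokens, other_score) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Claude 4.5 Haiku rows, copied from the raw data table.
haiku = {
    "Copilot": (parse_tokens("2.18M"), 0.920),
    "Directory": (parse_tokens("3.47M"), 0.895),
    "Minimal Servers": (parse_tokens("1.70M"), 0.900),
}

front = pareto_front(haiku)
print(front)
```

For this model, Directory is dominated by Copilot (fewer tokens and a higher score), while Copilot and Minimal Servers trade tokens against accuracy.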

Raw Data & Detailed Charts
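| Model | Strategy | Run | Samples | Score | Time (mean) | Total Tokens | Model Calls | Tool Calls | Cache Hit |
|---|---|---|---|---|---|---|---|---|---|
| GPT OSS 20B | Copilot | 1w ago | 200 | 0.695 | 5.3s | 1.35M | 722 | 522 | 53.9% |
| GPT OSS 20B | Directory | 1w ago | 200 | 0.695 | 3.8s | 2.45M | 1388 | 1188 | 68.7% |
| GPT OSS 20B | Minimal Tools | yesterday | 200 | 0.945 | 1.3s | 552.0K | 496 | 296 | 46.1% |
| GPT OSS 20B | Minimal Servers | 1w ago | 200 | 0.905 | 1.3s | 1.01M | 522 | 322 | 59.8% |
| GPT OSS 20B | Distraction 64 | 1w ago | 200 | 0.910 | 1.8s | 4.21M | 503 | 303 | 68.3% |
| GPT OSS 20B | Distraction 128 | 1w ago | 200 | 0.890 | 2.3s | 6.55M | 501 | 301 | 69.0% |
| Emberglow Large | Copilot | 6d ago | 200 | 0.970 | 6.9s | 1.21M | 745 | 545 | 78.6% |
| Emberglow Large | Directory | 6d ago | 200 | 0.005 | 58.5s | 88.27M | 17045 | 16879 | 96.0% |
| Emberglow Large | Minimal Tools | yesterday | 200 | 0.955 | 2.2s | 1.38M | 554 | 354 | 86.7% |
| Emberglow Large | Minimal Servers | 6d ago | 200 | 0.950 | 2.5s | 2.72M | 629 | 429 | 88.2% |
| Emberglow Large | Distraction 64 | 6d ago | 200 | 0.945 | 3.0s | 8.60M | 613 | 413 | 95.1% |
| Emberglow Large | Distraction 128 | 6d ago | 200 | 0.955 | 3.2s | 10.93M | 590 | 390 | 95.4% |
| GPT OSS 120B | Copilot | 1w ago | 200 | 0.550 | 8.9s | 1.02M | 755 | 555 | 71.0% |
| GPT OSS 120B | Directory | 1w ago | 200 | 0.760 | 5.4s | 2.73M | 1564 | 1364 | 76.6% |
| GPT OSS 120B | Minimal Tools | yesterday | 200 | 0.680 | 1.5s | 468.9K | 536 | 336 | 65.5% |
| GPT OSS 120B | Minimal Servers | 1w ago | 200 | 0.615 | 1.7s | 913.8K | 584 | 384 | 72.0% |
| GPT OSS 120B | Distraction 64 | 1w ago | 200 | 0.490 | 2.1s | 3.73M | 532 | 332 | 79.1% |
| GPT OSS 120B | Distraction 128 | 1w ago | 200 | 0.445 | 2.6s | 5.68M | 531 | 331 | 78.6% |
| Minimax M2 | Copilot | 1w ago | 200 | 0.260 | 7.9s | 3.31M | 1381 | 1183 | 0.0% |
| Minimax M2 | Directory | 1w ago | 200 | 0.790 | 9.0s | 3.80M | 1455 | 1280 | 0.0% |
| Minimax M2 | Minimal Tools | yesterday | 200 | 0.565 | 3.7s | 1.40M | 491 | 328 | 80.3% |
| Minimax M2 | Minimal Servers | 1w ago | 200 | 0.490 | 4.1s | 2.04M | 513 | 341 | 0.0% |
| Minimax M2 | Distraction 64 | 1w ago | 200 | 0.485 | 12.6s | 7.18M | 592 | 432 | 0.0% |
| Minimax M2 | Distraction 128 | 1w ago | 200 | 0.470 | 19.1s | 11.33M | 611 | 486 | 0.0% |
| Kimi K2 | Copilot | 1w ago | 200 | 0.100 | 3.0s | 343.3K | 334 | 134 | 80.2% |
| Kimi K2 | Directory | 1w ago | 200 | 0.525 | 4.2s | 1.63M | 879 | 681 | 67.9% |
| Kimi K2 | Minimal Tools | yesterday | 200 | 0.650 | 1.1s | 350.9K | 380 | 212 | 72.5% |
| Kimi K2 | Minimal Servers | 1w ago | 200 | 0.680 | 1.5s | 1.08M | 460 | 283 | 79.3% |
| Kimi K2 | Distraction 64 | 1w ago | 200 | 0.700 | 2.6s | 6.51M | 514 | 324 | 84.9% |
| Kimi K2 | Distraction 128 | 1w ago | 200 | 0.705 | 5.8s | 10.47M | 516 | 335 | 86.3% |
| GPT-5 Nano | Copilot | 1w ago | 200 | 0.670 | 37.1s | 1.78M | 728 | 586 | |
| GPT-5 Nano | Directory | 1w ago | 200 | 0.700 | 39.1s | 2.91M | 1230 | 1074 | |
| GPT-5 Nano | Minimal Tools | yesterday | 200 | 0.835 | 25.1s | 844.0K | 485 | 367 | |
| GPT-5 Nano | Minimal Servers | 1w ago | 200 | 0.855 | 24.7s | 1.48M | 573 | 447 | |
| GPT-5 Nano | Distraction 64 | 1w ago | 200 | 0.865 | 25.1s | 4.31M | 523 | 404 | |
| GPT-5 Nano | Distraction 128 | 1w ago | 200 | 0.875 | 26.7s | 6.80M | 548 | 444 | |
| GPT-5.1 | Copilot | 1w ago | 200 | 0.755 | 9.1s | 891.9K | 666 | 515 | |
| GPT-5.1 | Directory | 1w ago | 200 | 0.820 | 9.5s | 1.60M | 1049 | 923 | |
| GPT-5.1 | Minimal Tools | yesterday | 200 | 0.850 | 3.8s | 439.2K | 476 | 334 | |
| GPT-5.1 | Minimal Servers | 1w ago | 200 | 0.840 | 4.4s | 790.9K | 478 | 326 | |
| GPT-5.1 | Distraction 64 | 1w ago | 200 | 0.840 | 7.1s | 3.74M | 497 | 347 | |
| GPT-5.1 | Distraction 128 | 1w ago | 200 | 0.840 | 8.8s | 5.76M | 493 | 365 | |
| Claude 4.5 Haiku | Copilot | 1w ago | 200 | 0.920 | 13.7s | 2.18M | 861 | 734 | |
| Claude 4.5 Haiku | Directory | 1w ago | 200 | 0.895 | 14.7s | 3.47M | 1266 | 1142 | |
| Claude 4.5 Haiku | Minimal Tools | yesterday | 200 | 0.890 | 5.1s | 895.4K | 492 | 363 | |
| Claude 4.5 Haiku | Minimal Servers | 1w ago | 200 | 0.900 | 6.0s | 1.70M | 542 | 436 | |
| Claude 4.5 Haiku | Distraction 64 | 1w ago | 200 | 0.950 | 6.2s | 7.65M | 526 | 414 | |
| Claude 4.5 Haiku | Distraction 128 | 1w ago | 200 | 0.945 | 6.3s | 12.02M | 525 | 411 | |
| Claude 4.5 Sonnet | Copilot | 1w ago | 200 | 0.945 | 24.8s | 2.16M | 839 | 712 | |
| Claude 4.5 Sonnet | Directory | 1w ago | 200 | 0.890 | 28.3s | 3.40M | 1232 | 1098 | |
| Claude 4.5 Sonnet | Minimal Tools | yesterday | 200 | 0.905 | 10.4s | 939.9K | 508 | 366 | |
| Claude 4.5 Sonnet | Minimal Servers | 1w ago | 200 | 0.890 | 12.2s | 1.65M | 505 | 409 | |
| Claude 4.5 Sonnet | Distraction 64 | 1w ago | 200 | 0.885 | 13.4s | 7.80M | 536 | 426 | |
| Claude 4.5 Sonnet | Distraction 128 | 1w ago | 200 | 0.895 | 13.8s | 12.17M | 531 | 435 | |
| Gemini 2.5 Flash | Copilot | 1w ago | 200 | 0.775 | 20.9s | 1.28M | 747 | 553 | |
| Gemini 2.5 Flash | Directory | 1w ago | 200 | 0.870 | 26.7s | 2.31M | 1237 | 1055 | |
| Gemini 2.5 Flash | Minimal Tools | yesterday | 200 | 0.910 | 11.7s | 641.0K | 493 | 326 | |
| Gemini 2.5 Flash | Minimal Servers | 1w ago | 200 | 0.825 | 12.5s | 1.10M | 479 | 305 | |
| Gemini 2.5 Flash | Distraction 64 | 1w ago | 200 | 0.840 | 15.0s | 5.30M | 552 | 336 | |
| Gemini 2.5 Flash | Distraction 128 | 1w ago | 200 | 0.825 | 22.9s | 8.15M | 701 | 327 | |
| Gemini 3 Pro | Copilot | 6d ago | 200 | 0.120 | 1m 1s | 447.7K | 1135 | 225 | |
| Gemini 3 Pro | Directory | 6d ago | 180 | 0.000 | 1m | 0 | 900 | 0 | |
| Gemini 3 Pro | Minimal Tools | yesterday | 200 | 0.360 | 53.6s | 332.5K | 999 | 189 | |
| Gemini 3 Pro | Minimal Servers | 1w ago | 200 | 0.000 | 1m | 0 | 1000 | 0 | |
| Gemini 3 Pro | Distraction 64 | 1w ago | 200 | 0.000 | 1m | 0 | 1000 | 0 | |
| Gemini 3 Pro | Distraction 128 | 1w ago | 200 | 0.000 | 1m | 0 | 1000 | 0 | |
| groq/emberglow/small | Copilot | 6d ago | 30 | 0.000 | 3.4s | 0 | 30 | 0 | |
| google/gemini-3.0-pro-preview | Directory | 3w ago | 50 | 0.000 | 3.3s | 0 | 50 | 0 | |
| groq/emberglow/small | Directory | 1w ago | 200 | 0.000 | 2.2s | 0 | 200 | 0 | |
| groq/llama-3.3-70b-versatile | Minimal Tools | 2w ago | 5 | 0.800 | 1.3s | 11.4K | 10100.0% (fields fused in source) | | |
| groq/emberglow/small | Minimal Tools | 1w ago | 200 | 0.000 | 184ms | 0 | 200 | 0 | |
| groq/emberglow/small | Minimal Servers | 1w ago | 200 | 0.000 | 206ms | 0 | 200 | 0 | |
| groq-responses/openai/gpt-oss-20b | Minimal Servers | 2d ago | 200 | 0.880 | 2.1s | 295.9K | 0 | 0 | |
| groq-responses/emberglow/small | Minimal Servers | 2d ago | 59 | 0.000 | 546ms | 0 | 0 | 0 | |
| groq-responses/openai/gpt-oss-120b | Minimal Servers | 2d ago | 200 | 0.915 | 3.6s | 295.6K | 0 | 0 | |
| groq-responses/minimaxai/minimax-m2 | Minimal Servers | 2d ago | 200 | 0.510 | 3.7s | 194.6K | 0 | 0 | |
| groq/emberglow/small | Distraction 64 | 1w ago | 200 | 0.000 | 385ms | 0 | 200 | 0 | |
| groq/emberglow/small | Distraction 128 | 1w ago | 200 | 0.000 | 517ms | 0 | 200 | 0 | |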