Each model's performance on the minimal-tools strategy, where only the exact tools needed for each task are provided. This measures the model's fundamental ability to understand and correctly invoke tools without distractions or discovery overhead.
How accuracy degrades as we introduce more tools and require discovery. From left to right: minimal-tools (best case), minimal-servers (some extra tools), distraction-64 (64 irrelevant tools), distraction-128 (128 irrelevant tools).
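To make the degradation concrete, the sketch below computes each baseline strategy's score change relative to minimal-tools. The scores are copied from the GPT OSS 120B rows of the results table; the names `BASELINE_ORDER` and `degradation` are illustrative and not part of the benchmark harness.

```python
# Scores for GPT OSS 120B, copied from the results table in this report.
BASELINE_ORDER = ["minimal-tools", "minimal-servers", "distraction-64", "distraction-128"]
scores = {
    "minimal-tools": 0.680,
    "minimal-servers": 0.615,
    "distraction-64": 0.490,
    "distraction-128": 0.445,
}


def degradation(scores, order=BASELINE_ORDER):
    """Return each strategy's score change relative to the first (best-case) strategy."""
    best = scores[order[0]]
    return {name: round(scores[name] - best, 3) for name in order}


for name, delta in degradation(scores).items():
    print(f"{name:16s} {scores[name]:.3f} ({delta:+.3f})")
```

Degradation is not uniform across models: applying the same calculation to the Kimi K2 rows (0.650, 0.680, 0.700, 0.705) gives positive deltas.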
How do intelligent discovery mechanisms (copilot routing, directory lookup) compare to the baseline spectrum? The four baseline strategies show the range from best to worst case, while copilot and directory show where smarter discovery lands within that range.
Accuracy versus total token usage for the practical strategies. The ideal position is upper-left (high accuracy, low tokens). The directory strategy is intended to lower token usage while keeping accuracy competitive; in the results table, however, its total tokens exceed Copilot's for every model whose Directory run reports a nonzero token count. Minimal-servers is shown at half opacity because it requires prior knowledge of which servers are needed.
● Copilot — Dynamic tool discovery via copilot routing
◆ Minimal Servers — Only required servers loaded (half opacity, not practical without prior knowledge)
▲ Directory — Tool discovery via directory lookup
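One way to collapse the two chart axes into a single number is tokens spent per correct sample. The sketch below is a minimal illustration, assuming that Score is the fraction of samples answered correctly (the report does not define it explicitly). The helper names `parse_tokens` and `tokens_per_correct` are hypothetical; the values are the GPT-5.1 rows copied from the results table.

```python
def parse_tokens(text):
    """Convert a token count as printed in the table ('891.9K', '1.60M', '0') to a float."""
    multipliers = {"K": 1e3, "M": 1e6}
    text = text.strip()
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


def tokens_per_correct(score, total_tokens, samples=200):
    """Total tokens divided by the number of correct samples (assumed to be score * samples)."""
    correct = score * samples
    return float("inf") if correct == 0 else total_tokens / correct


# (strategy, score, total tokens) for GPT-5.1, copied from the results table.
runs = [
    ("Copilot", 0.755, "891.9K"),
    ("Directory", 0.820, "1.60M"),
    ("Minimal Servers", 0.840, "790.9K"),
]

for strategy, score, tokens in runs:
    cost = tokens_per_correct(score, parse_tokens(tokens))
    print(f"{strategy:16s} score={score:.3f} tokens/correct={cost:,.0f}")
```

Lower is better on this metric. It is only one possible summary: it ignores latency, and it treats every token as equally costly regardless of cache hits.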
| Model | Strategy | Run | Samples | Score | Time (mean) | Total Tokens | Model Calls | Tool Calls | Cache Hit | |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT OSS 20B | Copilot | 1w ago | 200 | 0.695 | 5.3s | 1.35M | 722 | 522 | 53.9% | |
| GPT OSS 20B | Directory | 1w ago | 200 | 0.695 | 3.8s | 2.45M | 1388 | 1188 | 68.7% | |
| GPT OSS 20B | Minimal Tools | yesterday | 200 | 0.945 | 1.3s | 552.0K | 496 | 296 | 46.1% | |
| GPT OSS 20B | Minimal Servers | 1w ago | 200 | 0.905 | 1.3s | 1.01M | 522 | 322 | 59.8% | |
| GPT OSS 20B | Distraction 64 | 1w ago | 200 | 0.910 | 1.8s | 4.21M | 503 | 303 | 68.3% | |
| GPT OSS 20B | Distraction 128 | 1w ago | 200 | 0.890 | 2.3s | 6.55M | 501 | 301 | 69.0% | |
| Emberglow Large | Copilot | 6d ago | 200 | 0.970 | 6.9s | 1.21M | 745 | 545 | 78.6% | |
| Emberglow Large | Directory | 6d ago | 200 | 0.005 | 58.5s | 88.27M | 17045 | 16879 | 96.0% | |
| Emberglow Large | Minimal Tools | yesterday | 200 | 0.955 | 2.2s | 1.38M | 554 | 354 | 86.7% | |
| Emberglow Large | Minimal Servers | 6d ago | 200 | 0.950 | 2.5s | 2.72M | 629 | 429 | 88.2% | |
| Emberglow Large | Distraction 64 | 6d ago | 200 | 0.945 | 3.0s | 8.60M | 613 | 413 | 95.1% | |
| Emberglow Large | Distraction 128 | 6d ago | 200 | 0.955 | 3.2s | 10.93M | 590 | 390 | 95.4% | |
| GPT OSS 120B | Copilot | 1w ago | 200 | 0.550 | 8.9s | 1.02M | 755 | 555 | 71.0% | |
| GPT OSS 120B | Directory | 1w ago | 200 | 0.760 | 5.4s | 2.73M | 1564 | 1364 | 76.6% | |
| GPT OSS 120B | Minimal Tools | yesterday | 200 | 0.680 | 1.5s | 468.9K | 536 | 336 | 65.5% | |
| GPT OSS 120B | Minimal Servers | 1w ago | 200 | 0.615 | 1.7s | 913.8K | 584 | 384 | 72.0% | |
| GPT OSS 120B | Distraction 64 | 1w ago | 200 | 0.490 | 2.1s | 3.73M | 532 | 332 | 79.1% | |
| GPT OSS 120B | Distraction 128 | 1w ago | 200 | 0.445 | 2.6s | 5.68M | 531 | 331 | 78.6% | |
| Minimax M2 | Copilot | 1w ago | 200 | 0.260 | 7.9s | 3.31M | 1381 | 1183 | 0.0% | |
| Minimax M2 | Directory | 1w ago | 200 | 0.790 | 9.0s | 3.80M | 1455 | 1280 | 0.0% | |
| Minimax M2 | Minimal Tools | yesterday | 200 | 0.565 | 3.7s | 1.40M | 491 | 328 | 80.3% | |
| Minimax M2 | Minimal Servers | 1w ago | 200 | 0.490 | 4.1s | 2.04M | 513 | 341 | 0.0% | |
| Minimax M2 | Distraction 64 | 1w ago | 200 | 0.485 | 12.6s | 7.18M | 592 | 432 | 0.0% | |
| Minimax M2 | Distraction 128 | 1w ago | 200 | 0.470 | 19.1s | 11.33M | 611 | 486 | 0.0% | |
| Kimi K2 | Copilot | 1w ago | 200 | 0.100 | 3.0s | 343.3K | 334 | 134 | 80.2% | |
| Kimi K2 | Directory | 1w ago | 200 | 0.525 | 4.2s | 1.63M | 879 | 681 | 67.9% | |
| Kimi K2 | Minimal Tools | yesterday | 200 | 0.650 | 1.1s | 350.9K | 380 | 212 | 72.5% | |
| Kimi K2 | Minimal Servers | 1w ago | 200 | 0.680 | 1.5s | 1.08M | 460 | 283 | 79.3% | |
| Kimi K2 | Distraction 64 | 1w ago | 200 | 0.700 | 2.6s | 6.51M | 514 | 324 | 84.9% | |
| Kimi K2 | Distraction 128 | 1w ago | 200 | 0.705 | 5.8s | 10.47M | 516 | 335 | 86.3% | |
| GPT-5 Nano | Copilot | 1w ago | 200 | 0.670 | 37.1s | 1.78M | 728 | 586 | — | |
| GPT-5 Nano | Directory | 1w ago | 200 | 0.700 | 39.1s | 2.91M | 1230 | 1074 | — | |
| GPT-5 Nano | Minimal Tools | yesterday | 200 | 0.835 | 25.1s | 844.0K | 485 | 367 | — | |
| GPT-5 Nano | Minimal Servers | 1w ago | 200 | 0.855 | 24.7s | 1.48M | 573 | 447 | — | |
| GPT-5 Nano | Distraction 64 | 1w ago | 200 | 0.865 | 25.1s | 4.31M | 523 | 404 | — | |
| GPT-5 Nano | Distraction 128 | 1w ago | 200 | 0.875 | 26.7s | 6.80M | 548 | 444 | — | |
| GPT-5.1 | Copilot | 1w ago | 200 | 0.755 | 9.1s | 891.9K | 666 | 515 | — | |
| GPT-5.1 | Directory | 1w ago | 200 | 0.820 | 9.5s | 1.60M | 1049 | 923 | — | |
| GPT-5.1 | Minimal Tools | yesterday | 200 | 0.850 | 3.8s | 439.2K | 476 | 334 | — | |
| GPT-5.1 | Minimal Servers | 1w ago | 200 | 0.840 | 4.4s | 790.9K | 478 | 326 | — | |
| GPT-5.1 | Distraction 64 | 1w ago | 200 | 0.840 | 7.1s | 3.74M | 497 | 347 | — | |
| GPT-5.1 | Distraction 128 | 1w ago | 200 | 0.840 | 8.8s | 5.76M | 493 | 365 | — | |
| Claude 4.5 Haiku | Copilot | 1w ago | 200 | 0.920 | 13.7s | 2.18M | 861 | 734 | — | |
| Claude 4.5 Haiku | Directory | 1w ago | 200 | 0.895 | 14.7s | 3.47M | 1266 | 1142 | — | |
| Claude 4.5 Haiku | Minimal Tools | yesterday | 200 | 0.890 | 5.1s | 895.4K | 492 | 363 | — | |
| Claude 4.5 Haiku | Minimal Servers | 1w ago | 200 | 0.900 | 6.0s | 1.70M | 542 | 436 | — | |
| Claude 4.5 Haiku | Distraction 64 | 1w ago | 200 | 0.950 | 6.2s | 7.65M | 526 | 414 | — | |
| Claude 4.5 Haiku | Distraction 128 | 1w ago | 200 | 0.945 | 6.3s | 12.02M | 525 | 411 | — | |
| Claude 4.5 Sonnet | Copilot | 1w ago | 200 | 0.945 | 24.8s | 2.16M | 839 | 712 | — | |
| Claude 4.5 Sonnet | Directory | 1w ago | 200 | 0.890 | 28.3s | 3.40M | 1232 | 1098 | — | |
| Claude 4.5 Sonnet | Minimal Tools | yesterday | 200 | 0.905 | 10.4s | 939.9K | 508 | 366 | — | |
| Claude 4.5 Sonnet | Minimal Servers | 1w ago | 200 | 0.890 | 12.2s | 1.65M | 505 | 409 | — | |
| Claude 4.5 Sonnet | Distraction 64 | 1w ago | 200 | 0.885 | 13.4s | 7.80M | 536 | 426 | — | |
| Claude 4.5 Sonnet | Distraction 128 | 1w ago | 200 | 0.895 | 13.8s | 12.17M | 531 | 435 | — | |
| Gemini 2.5 Flash | Copilot | 1w ago | 200 | 0.775 | 20.9s | 1.28M | 747 | 553 | — | |
| Gemini 2.5 Flash | Directory | 1w ago | 200 | 0.870 | 26.7s | 2.31M | 1237 | 1055 | — | |
| Gemini 2.5 Flash | Minimal Tools | yesterday | 200 | 0.910 | 11.7s | 641.0K | 493 | 326 | — | |
| Gemini 2.5 Flash | Minimal Servers | 1w ago | 200 | 0.825 | 12.5s | 1.10M | 479 | 305 | — | |
| Gemini 2.5 Flash | Distraction 64 | 1w ago | 200 | 0.840 | 15.0s | 5.30M | 552 | 336 | — | |
| Gemini 2.5 Flash | Distraction 128 | 1w ago | 200 | 0.825 | 22.9s | 8.15M | 701 | 327 | — | |
| Gemini 3 Pro | Copilot | 6d ago | 200 | 0.120 | 1m 1s | 447.7K | 1135 | 225 | — | |
| Gemini 3 Pro | Directory | 6d ago | 180 | 0.000 | 1m | 0 | 900 | 0 | — | |
| Gemini 3 Pro | Minimal Tools | yesterday | 200 | 0.360 | 53.6s | 332.5K | 999 | 189 | — | |
| Gemini 3 Pro | Minimal Servers | 1w ago | 200 | 0.000 | 1m | 0 | 1000 | 0 | — | |
| Gemini 3 Pro | Distraction 64 | 1w ago | 200 | 0.000 | 1m | 0 | 1000 | 0 | — | |
| Gemini 3 Pro | Distraction 128 | 1w ago | 200 | 0.000 | 1m | 0 | 1000 | 0 | — | |
| groq/emberglow/small | Copilot | 6d ago | 30 | 0.000 | 3.4s | 0 | 30 | 0 | — | |
| google/gemini-3.0-pro-preview | Directory | 3w ago | 50 | 0.000 | 3.3s | 0 | 50 | 0 | — | |
| groq/emberglow/small | Directory | 1w ago | 200 | 0.000 | 2.2s | 0 | 200 | 0 | — | |
| groq/llama-3.3-70b-versatile | Minimal Tools | 2w ago | 5 | 0.800 | 1.3s | 11.4K | 10 | 10 | 0.0% | |
| groq/emberglow/small | Minimal Tools | 1w ago | 200 | 0.000 | 184ms | 0 | 200 | 0 | — | |
| groq/emberglow/small | Minimal Servers | 1w ago | 200 | 0.000 | 206ms | 0 | 200 | 0 | — | |
| groq-responses/openai/gpt-oss-20b | Minimal Servers | 2d ago | 200 | 0.880 | 2.1s | 295.9K | 0 | 0 | — | |
| groq-responses/emberglow/small | Minimal Servers | 2d ago | 59 | 0.000 | 546ms | 0 | 0 | 0 | — | |
| groq-responses/openai/gpt-oss-120b | Minimal Servers | 2d ago | 200 | 0.915 | 3.6s | 295.6K | 0 | 0 | — | |
| groq-responses/minimaxai/minimax-m2 | Minimal Servers | 2d ago | 200 | 0.510 | 3.7s | 194.6K | 0 | 0 | — | |
| groq/emberglow/small | Distraction 64 | 1w ago | 200 | 0.000 | 385ms | 0 | 200 | 0 | — | |
| groq/emberglow/small | Distraction 128 | 1w ago | 200 | 0.000 | 517ms | 0 | 200 | 0 | — |
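Several rows above report a score of 0.000 together with zero total tokens and zero tool calls. This may mean the run did not execute (for example, because requests failed) rather than that the model answered every task incorrectly; the table itself does not say which. The sketch below shows one way to flag such rows before comparing strategies. The `Run` type, the `is_suspect` heuristic, and the 200-sample expectation are assumptions made for illustration, not part of the benchmark tooling.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    strategy: str
    samples: int
    score: float
    total_tokens: str  # as printed in the table, e.g. "1.35M" or "0"


def is_suspect(run, expected_samples=200):
    """Heuristic: flag runs with zero reported tokens or fewer samples than expected."""
    return run.total_tokens == "0" or run.samples < expected_samples


# A few rows copied from the results table.
runs = [
    Run("GPT OSS 20B", "Copilot", 200, 0.695, "1.35M"),
    Run("Gemini 3 Pro", "Directory", 180, 0.000, "0"),
    Run("groq/emberglow/small", "Minimal Tools", 200, 0.000, "0"),
    Run("groq/llama-3.3-70b-versatile", "Minimal Tools", 5, 0.800, "11.4K"),
]

suspect = [r for r in runs if is_suspect(r)]
for r in suspect:
    print(f"{r.model} / {r.strategy}: samples={r.samples}, tokens={r.total_tokens}")
```

Flagged rows are candidates for a rerun or for exclusion from the charts, not confirmed failures.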