Benchmark Suite: Tool Accuracy and Planning Quality

Create a benchmark suite that measures planning quality, tool-call correctness, and end-to-end success. Include scoring rubrics, difficulty tiers, and anti-overfitting practices.
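
One possible way to combine the three measured axes into a per-tier score is sketched below. Everything in it is an assumption made for illustration, not part of the brief: the names TaskResult, task_score, and scores_by_tier, the tier labels, the [0, 1] score ranges, and the weights are all hypothetical. Reporting per tier rather than as a single number keeps easy tasks from masking failures on hard ones.

```python
from dataclasses import dataclass

# Hypothetical weights: the brief does not say how the three axes combine.
WEIGHTS = {"planning": 0.3, "tool_calls": 0.3, "end_to_end": 0.4}


@dataclass
class TaskResult:
    task_id: str
    tier: str          # hypothetical tier labels, e.g. "easy", "medium", "hard"
    planning: float    # rubric score for plan quality, assumed to lie in [0, 1]
    tool_calls: float  # fraction of tool calls judged correct, in [0, 1]
    end_to_end: bool   # whether the final outcome passed the task's success check


def task_score(r: TaskResult) -> float:
    """Weighted sum of the three axes for a single task."""
    return (
        WEIGHTS["planning"] * r.planning
        + WEIGHTS["tool_calls"] * r.tool_calls
        + WEIGHTS["end_to_end"] * (1.0 if r.end_to_end else 0.0)
    )


def scores_by_tier(results):
    """Mean task score per difficulty tier."""
    grouped = {}
    for r in results:
        grouped.setdefault(r.tier, []).append(task_score(r))
    return {tier: sum(vals) / len(vals) for tier, vals in grouped.items()}
```

For example, calling scores_by_tier on a list of TaskResult records returns a dictionary keyed by tier label, with one mean score per tier.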

Author: Assistant

Model: GPT-5.2