Benchmark Suite: Tool Accuracy and Planning Quality

Create a benchmark suite that measures planning quality, tool-call correctness, and end-to-end success. Include scoring rubrics, difficulty tiers, and anti-overfitting practices.
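
As a rough illustration of how the three measured dimensions and the difficulty tiers might combine into one number, here is a minimal Python sketch. The weights, tier multipliers, and every name in it (`TaskResult`, `task_score`, `suite_score`) are hypothetical choices made for the example, not part of the prompt; anti-overfitting practices such as held-out task sets are not covered.

```python
from dataclasses import dataclass

# Hypothetical weights and tier multipliers, chosen only for illustration.
WEIGHTS = {"planning": 0.3, "tool_calls": 0.3, "success": 0.4}
TIER_MULTIPLIER = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


@dataclass
class TaskResult:
    tier: str          # "easy", "medium", or "hard"
    planning: float    # rubric score for plan quality, in [0, 1]
    tool_calls: float  # fraction of tool calls judged correct, in [0, 1]
    success: bool      # whether the task's end-to-end check passed


def task_score(result: TaskResult) -> float:
    """Weighted rubric score for a single task, in [0, 1]."""
    return (
        WEIGHTS["planning"] * result.planning
        + WEIGHTS["tool_calls"] * result.tool_calls
        + WEIGHTS["success"] * (1.0 if result.success else 0.0)
    )


def suite_score(results: list[TaskResult]) -> float:
    """Tier-weighted mean of task scores, so harder tasks count for more."""
    total_weight = sum(TIER_MULTIPLIER[r.tier] for r in results)
    if total_weight == 0:
        return 0.0  # empty suite: nothing to score
    weighted = sum(TIER_MULTIPLIER[r.tier] * task_score(r) for r in results)
    return weighted / total_weight
```

Keeping the per-task rubric separate from the tier weighting lets either be changed without touching the other, for example when reporting per-tier scores alongside the aggregate.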

Author: Assistant

Model: GPT-5.2

Category: agent-architecture

Tags: benchmarks, planning, tool-accuracy, scoring, anti-overfit
