Self-Improving Agent Evals: Add New Tests From Failures

Create a feedback loop in which production failures and near-misses become new eval tests. For each incident, the agent should propose a test addition that includes a minimal reproduction and explicit acceptance criteria.
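
One way to picture the loop is the minimal Python sketch below. Everything in it is an assumption made for illustration: the `FailureReport` and `ProposedTest` schemas, the `still_fails` callback (which re-runs the agent and reports whether the failure reproduces), and the line-dropping reduction in `minimize` are hypothetical, not part of any existing eval framework.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class FailureReport:
    """A production failure or near-miss (hypothetical schema)."""
    agent_input: str
    observed_output: str
    expected_behavior: str
    severity: str = "failure"  # or "near_miss"


@dataclass(frozen=True)
class ProposedTest:
    """A test addition awaiting human review (hypothetical schema)."""
    test_id: str
    minimal_repro: str
    acceptance_criteria: Tuple[str, ...]
    source_severity: str


def minimize(agent_input: str, still_fails: Callable[[str], bool]) -> str:
    """Greedily drop input lines for as long as the failure still reproduces."""
    lines = agent_input.splitlines()
    i = 0
    while i < len(lines):
        candidate = lines[:i] + lines[i + 1:]
        if candidate and still_fails("\n".join(candidate)):
            lines = candidate  # line i was not needed; retry the same index
        else:
            i += 1  # line i is needed to reproduce the failure; keep it
    return "\n".join(lines)


def propose_test(report: FailureReport,
                 still_fails: Callable[[str], bool]) -> ProposedTest:
    """Turn one failure report into a proposed eval test."""
    repro = minimize(report.agent_input, still_fails)
    # Derive the ID from the reduced input so repeat incidents collide.
    digest = hashlib.sha256(repro.encode("utf-8")).hexdigest()[:12]
    return ProposedTest(
        test_id=f"regress_{digest}",
        minimal_repro=repro,
        acceptance_criteria=(report.expected_behavior,),
        source_severity=report.severity,
    )


def add_to_suite(suite: Dict[str, ProposedTest], test: ProposedTest) -> bool:
    """Add the test unless an identical reproduction is already covered."""
    if test.test_id in suite:
        return False
    suite[test.test_id] = test
    return True


if __name__ == "__main__":
    report = FailureReport(
        agent_input="hello\nDROP TABLE users\nbye",
        observed_output="executed",
        expected_behavior="Agent refuses destructive SQL",
        severity="near_miss",
    )
    proposal = propose_test(report, lambda text: "DROP TABLE" in text)
    suite: Dict[str, ProposedTest] = {}
    print(proposal.minimal_repro, add_to_suite(suite, proposal))
```

Hashing the reduced reproduction gives a stable test ID, so a repeat of the same incident does not add a duplicate test. In practice `still_fails` would invoke the real agent, and a reviewer would approve each proposal before it joins the suite.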

Author: Assistant

Model: gpt-5.2