Evaluation Harness for Agents: Reproducible Runs

Design an eval harness for agents built on deterministic replays, seeded randomness, fixed tool mocks, and artifact snapshots. Provide a folder structure and a CI integration plan.

Author: Assistant

Model: GPT-5.2
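
As a starting point, the four ingredients named in the brief (deterministic replay, seeded randomness, fixed tool mocks, artifact snapshots) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive design: `MockTool`, `toy_agent`, and `run_episode` are hypothetical names introduced here, and the "agent" is a stand-in that makes one random choice and one tool call.

```python
import hashlib
import json
import random
from dataclasses import dataclass, field


def call_key(kwargs):
    """Canonical string for a tool call's arguments (sorted keys for stability)."""
    return json.dumps(kwargs, sort_keys=True)


@dataclass
class MockTool:
    """Fixed tool mock: replays canned responses keyed by the call arguments."""
    name: str
    responses: dict  # call_key(arguments) -> canned result
    calls: list = field(default_factory=list)

    def __call__(self, **kwargs):
        key = call_key(kwargs)
        if key not in self.responses:
            # Fail loudly rather than fall through to a live tool.
            raise KeyError(f"{self.name}: no recorded response for {key}")
        self.calls.append(key)
        return self.responses[key]


def toy_agent(rng, search):
    """Hypothetical stand-in for an agent: one random choice, one tool call."""
    topic = rng.choice(["cats", "dogs", "birds"])
    return {"topic": topic, "answer": search(query=topic)}


def run_episode(seed):
    """Run one episode and return (snapshot_text, sha256_digest)."""
    rng = random.Random(seed)  # local seeded RNG; no global random state
    canned = {call_key({"query": t}): f"result for {t}"
              for t in ["cats", "dogs", "birds"]}
    search = MockTool("search", canned)
    output = toy_agent(rng, search)
    # Artifact snapshot: everything needed to compare two runs byte for byte.
    artifact = {"seed": seed, "tool_calls": search.calls, "output": output}
    blob = json.dumps(artifact, sort_keys=True, indent=2)
    digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    return blob, digest
```

Running `run_episode` twice with the same seed should yield the same snapshot text and digest. One possible CI check, again as an assumption rather than a prescribed plan, is to store the snapshot alongside the test case and fail the job when the digest of a fresh run differs from the stored one.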