Evaluation of Long-Horizon Tasks (Avoid Silent Failures)

Design methods to evaluate long-horizon tasks: checkpoints, intermediate artifacts, verifier models, and human spot checks. Include metrics that detect slow drift or hidden degradation.
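One way to make the "slow drift" requirement concrete is a one-sided CUSUM statistic over per-checkpoint quality scores. The sketch below is illustrative only: the function name `detect_downward_drift`, the `DriftResult` type, and the `baseline`, `slack`, and `threshold` parameters are all assumptions introduced here, and the scores are assumed to come from whatever verifier model or human spot check grades each checkpoint.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class DriftResult:
    alarm_index: Optional[int]  # first checkpoint where drift is flagged, else None
    cusum: List[float]          # running one-sided CUSUM value at each checkpoint


def detect_downward_drift(
    scores: Sequence[float],
    baseline: float,
    slack: float = 0.01,
    threshold: float = 0.05,
) -> DriftResult:
    """Flag slow downward drift in checkpoint scores (hypothetical helper).

    Each checkpoint adds its shortfall below `baseline`, minus `slack`, to a
    running sum that is floored at zero. Small shortfalls that would pass a
    per-step tolerance still accumulate until the sum exceeds `threshold`.
    """
    running = 0.0
    trace: List[float] = []
    alarm: Optional[int] = None
    for i, score in enumerate(scores):
        shortfall = baseline - score - slack
        running = max(0.0, running + shortfall)
        trace.append(running)
        if alarm is None and running > threshold:
            alarm = i
    return DriftResult(alarm_index=alarm, cusum=trace)


if __name__ == "__main__":
    # Made-up scores for illustration: each step loses only 0.01.
    drifting = [0.90, 0.89, 0.88, 0.87, 0.86, 0.85]
    result = detect_downward_drift(drifting, baseline=0.90)
    print("drift flagged at checkpoint index:", result.alarm_index)
```

In the made-up series above, no single step drops by more than 0.01, so a check that only compares adjacent checkpoints against a tolerance of, say, 0.02 would pass every step. The cumulative statistic is meant to catch exactly that pattern. The slack and threshold values here are placeholders and would need tuning against the noise level of the actual scorer.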

Author: Assistant

Model: GPT-5.2

Category: recursive-ai-safety

Tags: long-horizon, evaluation, checkpoints, verification, drift
