Evaluation of Long-Horizon Tasks (Avoid Silent Failures)
Design methods to evaluate long-horizon tasks so that failures surface during the run rather than only at the end: checkpoints, inspection of intermediate artifacts, verifier models, and human spot checks. Include metrics that detect slow drift or hidden degradation across steps.
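One way to make "slow drift" measurable is a one-sided CUSUM over per-checkpoint scores. The sketch below is a minimal illustration under stated assumptions, not a prescribed method. It assumes each checkpoint has already been scored in [0, 1], for example by a verifier model, and that a baseline score from a known-good run is available. The names `detect_downward_drift`, `sample_for_human_review`, `slack`, and `threshold` are hypothetical and chosen only for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class DriftReport:
    first_alarm_step: Optional[int]  # index of the first checkpoint that raised an alarm
    cusum: List[float]               # running CUSUM statistic, one value per checkpoint


def detect_downward_drift(
    scores: Sequence[float],
    baseline: float,
    slack: float = 0.02,      # assumed tolerance: shortfalls smaller than this are ignored
    threshold: float = 0.3,   # assumed alarm level for the accumulated shortfall
) -> DriftReport:
    """One-sided CUSUM: accumulate how far each score falls below the baseline."""
    stat = 0.0
    trace: List[float] = []
    alarm: Optional[int] = None
    for step, score in enumerate(scores):
        # Shortfalls beyond the slack add up; good checkpoints pull the statistic toward 0.
        stat = max(0.0, stat + (baseline - score - slack))
        trace.append(stat)
        if alarm is None and stat > threshold:
            alarm = step
    return DriftReport(first_alarm_step=alarm, cusum=trace)


def sample_for_human_review(n_checkpoints: int, rate: float = 0.1, seed: int = 0) -> List[int]:
    """Pick a reproducible random subset of checkpoint indices for human spot checks."""
    rng = random.Random(seed)
    k = max(1, round(n_checkpoints * rate))
    return sorted(rng.sample(range(n_checkpoints), k))


if __name__ == "__main__":
    # Synthetic data: 20 stable checkpoints, then quality slips by 0.01 per checkpoint.
    stable = [0.9] * 20
    drifting = [0.9 - 0.01 * i for i in range(1, 31)]
    report = detect_downward_drift(stable + drifting, baseline=0.9)
    print("first alarm at checkpoint:", report.first_alarm_step)
    print("checkpoints for human review:", sample_for_human_review(50))
```

The accumulated statistic reacts to many small shortfalls that a fixed per-checkpoint pass/fail floor would let through individually, which is the failure mode "hidden degradation" describes. The slack and threshold values here are placeholders and would need tuning against real runs.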