Self-Improving Safety Filters: Measure False Positives/Negatives
Design a process to adjust safety filters based on measured false positive and false negative rates. Require evaluation sets, human review, and rollback if the risk of harm rises.
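As one possible starting point, the sketch below measures false positive and false negative rates on a labeled evaluation set and gates a proposed filter change on them. Everything in it is an illustrative assumption rather than a prescribed design: the filter is modeled as a score with a blocking threshold, and the names `Rates`, `measure_rates`, `decide`, and `max_fnr_increase` are hypothetical. A rise in the false negative rate is used here as the proxy for "harm risk rises".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    false_positive_rate: float  # benign items blocked / all benign items
    false_negative_rate: float  # harmful items allowed / all harmful items


def measure_rates(scores, labels, threshold):
    """Measure FP/FN rates on a labeled evaluation set.

    scores: filter scores, higher means more likely harmful (assumed model).
    labels: True if the item is harmful according to human-reviewed labels.
    An item is blocked when its score is >= threshold.
    """
    fp = fn = benign = harmful = 0
    for score, is_harmful in zip(scores, labels):
        blocked = score >= threshold
        if is_harmful:
            harmful += 1
            if not blocked:
                fn += 1
        else:
            benign += 1
            if blocked:
                fp += 1
    if benign == 0 or harmful == 0:
        raise ValueError("evaluation set needs both benign and harmful examples")
    return Rates(fp / benign, fn / harmful)


def decide(baseline, candidate, max_fnr_increase=0.0):
    """Gate a candidate filter setting against the current baseline.

    Returns "rollback" if the false negative rate rises beyond the tolerance,
    "human_review" if the candidate lowers false positives without that rise
    (a person still signs off before rollout), and "keep_baseline" otherwise.
    """
    if candidate.false_negative_rate > baseline.false_negative_rate + max_fnr_increase:
        return "rollback"
    if candidate.false_positive_rate < baseline.false_positive_rate:
        return "human_review"
    return "keep_baseline"


# Toy evaluation set: three harmful and three benign items.
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]

baseline = measure_rates(scores, labels, threshold=0.5)
print(decide(baseline, measure_rates(scores, labels, threshold=0.7)))   # human_review
print(decide(baseline, measure_rates(scores, labels, threshold=0.85)))  # rollback
```

In this sketch no change is ever accepted automatically: the best outcome for a candidate is a handoff to human review, and any measured increase in missed harmful items beyond the tolerance triggers rollback.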