Create a traceability matrix linking requirements/specs to tests and monitors. Provide a template and a worked example for a tool-using assistant system.
Describe a high-level approach to align reward signals with safe behavior: preference data guidelines, reward hacking risks, and validation. Keep it conceptual and focused on safety.
Create an annotation guide: definitions, examples, severity levels, and how to handle ambiguity. Include training exercises and a QA process for reviewer consistency.
Help me design a domain-specific safety benchmark: representative tasks, policy-sensitive cases, and adversarial cases. Include labeling guidelines and inter-annotator agreement checks.
Design drift detection covering changes in user queries, outcome distributions, error types, and model behavior. Include thresholds and a playbook for when drift is detected.
Create a stress test plan: malformed inputs, long-context traps, conflicting instructions, and toxic content probes. Explain how to automate the tests and score robustness over time.
Differential Privacy and Minimization Options (Conceptual)
Explain privacy-preserving options for feedback loops: minimization, aggregation, differential privacy (conceptually), and retention policies. Provide a practical selection guide.
Design safe A/B testing for AI changes: guardrails, user segmentation, sensitive cohorts, and safe metrics. Include ethics considerations and how to interpret ambiguous outcomes.
Design cost controls: budget caps, queue prioritization, cache policy, and abort rules for expensive runs. Include a method to estimate the ROI of improvements before running them.
Generate a model/system card template: intended use, limitations, safety mitigations, eval results, and known failure modes. Include a changelog section for each iteration.
Create a post-mortem template tailored to AI regressions: data/prompt/model diffs, evaluation gaps, monitoring misses, and remediation tasks. Include a ‘lessons to tests’ section.
Run a pre-mortem for a recursive improvement project: list plausible failures, early warning signals, and prevention steps. Output prioritized mitigations and ‘watch items’.