Design a hybrid policy system: deterministic rules for hard constraints, ML classifiers for soft signals, and escalation logic for cases neither resolves. Provide an architecture, failure modes, and a monitoring plan.
Create a prompting style guide for internal prompts: structure, tool usage rules, refusal patterns, and safety reminders. Include examples and a checklist reviewers can apply.
Truthfulness & Citation Policy for Research Outputs
Create a truthfulness policy: source requirements, citation rules, and how to label speculation. Provide a checklist for editors/analysts and an automated linting concept.
Uncertainty Calibration: When to Say ‘I’m Not Sure’
Design a calibration approach: confidence estimation, abstention policies, escalation to a human reviewer, and how to test calibration. Include UI patterns that communicate uncertainty responsibly.
Metrics That Matter: Safety + Utility Balanced Scorecard
Create a balanced scorecard: utility metrics (task success), safety metrics (policy adherence), reliability (latency, uptime), and user trust (complaints). Include leading indicators and dashboards.
Eval Design: Avoiding Overfitting to the Test Suite
Design an evaluation strategy that avoids overfitting: holdouts, rotating test sets, adversarial sets, and blind evaluation. Include rules for when to refresh benchmarks.
Model Update Policy: When to Retrain vs Prompt-Tune
Create decision criteria for retraining vs prompt tuning vs retrieval updates. Include risk analysis, expected impact, validation requirements, and rollback strategies per approach.
Create a safe-defaults checklist for the tool layer: deny-by-default, explicit allowlists, strict parameter validation, output filtering, and timeouts. Include common failure modes.
Create a least-privilege permission model for tools: scopes, rate limits, time bounds, and audit logs. Provide an authorization matrix and guidelines for granting elevated access.
Design defenses against prompt injection for tool-using agents: content provenance, allowlists, tool policy, and sandboxing. Include a suite of adversarial prompts for regression testing.
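Illustrative sketch for the three tool-layer briefs above (safe defaults, least privilege, prompt-injection defenses). It shows one way a deny-by-default gate could combine an explicit allowlist, a required scope, per-parameter validation, and a content-provenance flag. All names here (ToolRule, ToolPolicy, authorize, the "fs:read" scope, the /workspace/ path prefix) are hypothetical and chosen only for illustration; rate limits, time bounds, audit logging, and sandboxing are out of scope for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set, Tuple


@dataclass(frozen=True)
class ToolRule:
    """Hypothetical allowlist entry for a single tool."""
    required_scope: str
    # Every parameter must have a validator; unknown parameters are rejected.
    validators: Dict[str, Callable[[Any], bool]]
    # Whether the call may be triggered by untrusted content (e.g. a web page).
    allow_untrusted_origin: bool = False


@dataclass
class ToolPolicy:
    """Deny-by-default gate: anything not explicitly allowed is refused."""
    rules: Dict[str, ToolRule] = field(default_factory=dict)

    def authorize(
        self,
        tool: str,
        params: Dict[str, Any],
        granted_scopes: Set[str],
        origin_trusted: bool,
    ) -> Tuple[bool, str]:
        rule = self.rules.get(tool)
        if rule is None:
            return False, "tool not on allowlist"
        if rule.required_scope not in granted_scopes:
            return False, "missing scope: " + rule.required_scope
        if not origin_trusted and not rule.allow_untrusted_origin:
            return False, "request originates from untrusted content"
        for name in params:
            if name not in rule.validators:
                return False, "unexpected parameter: " + name
        for name, check in rule.validators.items():
            if name not in params or not check(params[name]):
                return False, "invalid parameter: " + name
        return True, "ok"


def _safe_workspace_path(value: Any) -> bool:
    # Assumption for this sketch: reads are confined to /workspace/.
    return (
        isinstance(value, str)
        and value.startswith("/workspace/")
        and ".." not in value
    )


policy = ToolPolicy(
    rules={
        "read_file": ToolRule(
            required_scope="fs:read",
            validators={"path": _safe_workspace_path},
        )
    }
)
```

A regression suite of adversarial prompts could then assert on the gate's decisions, for example that a tool call triggered by text inside a retrieved document is refused even when the scope is present: policy.authorize("read_file", {"path": "/workspace/notes.txt"}, {"fs:read"}, origin_trusted=False) returns a refusal with the reason "request originates from untrusted content".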