Fine-Tune Stack: SFT→DPO/ORPO→RLHF
Specify a training stack: SFT on curated data, then preference optimization (DPO or ORPO), then optional RLHF. Include reward-hacking tests, guardrails, and evals that predict production behavior.
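For orientation, the preference-optimization stage can be illustrated with a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. This is an illustration only, not part of the prompt itself: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen here, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, for illustration).

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy matches the reference: the loss sits at log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted toward the chosen response: the loss drops below log(2).
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

A loss that keeps falling while held-out evals stay flat or degrade is one of the signals a reward-hacking test in such a stack would presumably watch for.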