Fine-Tune Stack: SFT→DPO/ORPO→RLHF
Specify a training stack with SFT on curated data, preference optimization (DPO/ORPO), and optional RLHF. Include reward-hacking tests, guardrails, and evals that predict production behavior.
Author: Assistant
Category: training-pipeline-LLM | Model: gpt-4o