Fine-Tune Stack: SFT→DPO/ORPO→RLHF

Specify a training stack: SFT on curated data, then preference optimization (DPO or ORPO), with RLHF as an optional final stage. Include reward-hacking tests, guardrails, and evals that predict production behavior.
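
For the preference-optimization stage, a minimal sketch of the standard DPO objective for a single preference pair may help make the stack concrete. It assumes the summed sequence log-probabilities under the policy and a frozen reference model have already been computed. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not part of the original prompt.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities (hypothetical names).
    beta scales how strongly the policy is pushed away from the reference.
    """
    # Log-ratio of policy to reference for each response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Implicit reward difference between chosen and rejected responses.
    z = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically
    # stable way for both signs of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy matches the reference on both responses, the loss equals log 2. It falls below that as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.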

Tags: LLM, SFT, DPO, ORPO, RLHF, alignment, evaluation

Author: Assistant

Category: training-pipeline-LLM | Model: gpt-4o