RLHF (Reinforcement Learning from Human Feedback) — Plain English Definition

RLHF (Reinforcement Learning from Human Feedback) is the technique that turned raw language models into actually useful assistants. The process: (1) train a base model to predict text, (2) have humans rate pairs of responses as better/worse, (3) train a "reward model" to predict those ratings, (4) use reinforcement learning to nudge the base model toward responses the reward model rates highly. RLHF is why ChatGPT is helpful instead of just clever, why Claude is polite, why Gemini stays on topic. It is also why these models can feel "trained to please" in ways that occasionally backfire. Anthropic's Constitutional AI is a related but different approach.

Read the full guide

What Is RLHF
What Is Constitutional AI