Techniques & Methods
RLHF (Reinforcement Learning from Human Feedback)
How AI labs teach models to be helpful, harmless, and honest — using human ratings.
Also known as: RLHF
RLHF (Reinforcement Learning from Human Feedback) is the technique that turned raw language models into actually useful assistants. The process: (1) train a base model to predict text, (2) have humans rate pairs of responses as better/worse, (3) train a "reward model" to predict those ratings, (4) use reinforcement learning to nudge the base model toward responses the reward model rates highly. RLHF is why ChatGPT is helpful instead of just clever, why Claude is polite, why Gemini stays on topic. It is also why these models can feel "trained to please" in ways that occasionally backfire. Anthropic's Constitutional AI is a related but different approach.