Techniques & Methods
AI Eval
Systematic tests that measure whether an AI model or app produces good output.
Also known as: eval, AI evaluation, LLM eval
An AI eval is a systematic test that measures whether a model or AI application produces good output for a given task. Evals are to AI what unit tests are to software, except that the "correct" output is often subjective, so evals rely on LLM-as-judge scoring, human raters, golden datasets, or rule-based checks. They come in three flavours: offline evals (run on a fixed dataset before deployment), online evals (sample live traffic in production), and adversarial evals (red-team the model with edge cases). Eval engineering is increasingly treated as its own discipline; tools such as LangSmith, Vellum, Braintrust, and Helicone are built around it. The slogan "evals are all you need", a riff on the paper title "Attention Is All You Need", captures the mood: without evals you can't safely iterate on prompts, models, or fine-tunes.
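To make the offline flavour concrete, here is a minimal sketch of a rule-based eval run against a small golden dataset. Everything in it is hypothetical and for illustration only: the names `EvalCase`, `exact_match`, `run_offline_eval`, and `toy_model` are not taken from any of the tools mentioned above, and the stand-in model is a lookup table rather than a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    """One golden-dataset entry: an input and the answer we expect (hypothetical type)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Rule-based check: case- and whitespace-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_offline_eval(
    model: Callable[[str], str],
    cases: Sequence[EvalCase],
    check: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run every case through the model and return the fraction that pass."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(check(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client here)."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


GOLDEN = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

score = run_offline_eval(toy_model, GOLDEN)
print(f"pass rate: {score:.2f}")  # prints "pass rate: 0.67"
```

Because `check` is a parameter, the same loop could in principle take an LLM-as-judge or human-rating function in place of `exact_match`; exact matching is used here only because it is the simplest rule-based check to demonstrate.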

