Techniques & Methods

AI Eval

Systematic tests that measure whether an AI model or app produces good output.

Also known as: eval, AI evaluation, LLM eval

An AI eval is a systematic test that measures whether a model or AI application produces good output for a given task. Evals are to AI what unit tests are to software, except that the "correct" output is often subjective, so evals rely on LLM-as-judge scoring, human raters, golden datasets, or rule-based checks. They come in three flavours: offline evals (run on a fixed dataset before deployment), online evals (sample live traffic in production), and adversarial evals (red-team the model with edge cases). Eval engineering is now its own discipline; tools like LangSmith, Vellum, Braintrust, and Helicone are built around it. The slogan "evals are all you need" captures the mood: without evals you can't safely iterate on prompts, models, or fine-tunes.
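As a rough illustration of the offline flavour, here is a minimal sketch in Python: a tiny golden dataset, a rule-based check, and a pass rate. Everything in it is hypothetical and for illustration only (`Case`, `contains_expected`, `run_offline_eval`, `stub_model`, and the two sample cases); it does not reflect the API of any tool named above, and a real eval would call an actual model and typically use richer scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One golden-dataset entry (hypothetical schema)."""
    prompt: str
    expected: str  # substring the model output must contain


def contains_expected(output: str, case: Case) -> bool:
    # Rule-based check: case-insensitive substring match.
    return case.expected.lower() in output.lower()


def run_offline_eval(model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the model and return the pass rate (0.0 to 1.0)."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(contains_expected(model(c.prompt), c) for c in cases)
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; deliberately gets one answer wrong.
    answers = {
        "Capital of France?": "The capital is Paris.",
        "2 + 2?": "5",
    }
    return answers.get(prompt, "")


GOLDEN = [
    Case("Capital of France?", "Paris"),
    Case("2 + 2?", "4"),
]

score = run_offline_eval(stub_model, GOLDEN)
print(f"pass rate: {score:.0%}")
```

Swapping `contains_expected` for a function that asks a second model to grade the output would turn the same loop into an LLM-as-judge eval; the fixed dataset and the aggregate score stay the same.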

