Back to Blog

LLM Evals: How to Actually Test AI Systems in Production

B
Bharath Asokan

Every LLM application looks impressive in the demo. A non-trivial fraction of them silently regress the week after deployment, and no one notices until a customer complains. The reason is boring and preventable: there are no evals.

Traditional software tests a function by asking “given this input, does the output equal this value?” LLM outputs are stochastic, phrased differently every time, and “correct” lives on a fuzzy gradient rather than a pass/fail boundary. Unit tests cannot catch a prompt change that made the model 8% less helpful. Evals can. This post is how to build them.

Why Traditional Testing Fails

Three reasons LLM systems break the assumptions of traditional testing:

  • Non-determinism: Same input, different output. Even at temperature 0, model updates, tool variance, and retrieval changes make exact-match assertions useless.
  • Fuzzy correctness: A response can be factually correct but unhelpful, or helpful but slightly off. Binary tests cannot express that.
  • Silent degradation: A prompt tweak that improves one failure mode can regress three others. You do not notice until production traffic does.

The implication: your test suite has to operate on distributions, not assertions. You need hundreds of cases, a scoring function that captures what “good” means, and a baseline to regress against.

The Four Types of Evals

1. Reference-based Evals

You have a known-good output. The scorer compares model output to reference. Useful for classification, extraction, structured output, anything with a ground truth.

  • Strengths: Objective, reproducible, fast to run.
  • Weaknesses: Requires labeled data. Does not capture open-ended quality.
  • Use for: Categorization, entity extraction, SQL generation, code generation with test cases.

2. Reference-free Evals

No reference output. Scorer checks properties of the output on its own — toxicity, PII leakage, JSON schema compliance, length, language detection. Essentially assertions that do not need a ground truth.

  • Strengths: Scale cheaply, run on production traffic.
  • Weaknesses: Only catch property violations, not quality.
  • Use for: Guardrails, format compliance, safety checks.

3. LLM-as-Judge

A second LLM scores the output of the first. Given a rubric (“rate helpfulness 1–5,” “does this answer contradict the source document?”), the judge produces a score. When done carefully, judge-model correlation with human labels reaches 80–90%.

  • Strengths: Handles open-ended quality. Scales without humans.
  • Weaknesses: Biased by the judge model's own quirks. Needs careful rubric design.
  • Use for: Summarization quality, response helpfulness, tone adherence, factual grounding.

4. Human-in-the-loop

Actual humans scoring outputs. The gold standard, and the bottleneck. Nobody labels enough of it, so you use it sparingly — to validate automated scorers, to spot-check production traffic, to calibrate rubrics.

  • Strengths: Most trustworthy signal.
  • Weaknesses: Slow, expensive, does not scale.
  • Use for: Ground truth for other scorers, red-teaming, quarterly quality audits.
A good eval stack is a pyramid: many reference-free and reference-based checks, fewer LLM-as-judge evals, a small amount of human review at the top to keep the lower tiers honest.

Building an Eval Dataset

Start with 100–300 cases. Teams fail into two camps — those who say “we need 10,000 cases before we can start” and never start, and those who skip it entirely. Both are wrong. 200 representative cases catches the vast majority of regressions.

Where cases come from:

  • Production logs: Real user inputs. Rank by frequency and by risk. Sample both.
  • Domain experts: The team lead who knows the tricky cases nobody remembers to test.
  • Failure postmortems: Every production failure becomes a regression test.
  • Synthetic adversarial: Prompt injection attempts, edge inputs, boundary conditions.

Balance the set. Not just happy path. Include the gnarly cases at 20–30% of the dataset, otherwise your scores will hide the failures that actually matter.

Evals in CI/CD

The eval suite runs on every meaningful change — prompt edit, model swap, retrieval config change. In the CI job, compare current run scores to the main-branch baseline. Fail the build if any metric regresses by more than a defined threshold (typically 2–5%).

What to wire up:

  • Versioning: Prompts, model names, retrieval configs are all versioned artifacts. You should be able to reproduce any eval run.
  • Baseline: A reference score on main. Every PR compares against it.
  • Diff reporting: A per-case breakdown of which cases improved, which regressed, and by how much.
  • Threshold gates: Automatic failure when a metric drops past a boundary. Not “looks fine, merge it.”

Production Monitoring

The eval set is your pre-deploy safety net. Production monitoring is the other half.

What to monitor live:

  • Sampling: Log 1–10% of production conversations with full context. Run LLM-as-judge over them nightly.
  • Drift detection: Track scoring distributions over time. A drop in average helpfulness score is an early warning.
  • User feedback: Thumbs up/down. Freeform comments. Escalation rates. Feed these back into the eval dataset.
  • Cost and latency: Not quality, but often correlated. A sudden latency spike usually means something changed upstream.
  • Guardrail trip rates: If your PII redactor starts firing 5x more often, that's a signal worth investigating.

The Tooling Landscape

As of 2026, the ecosystem is finally usable. A quick taxonomy:

  • Braintrust: Strong eval SDK, good CI integration, solid UI for comparing runs.
  • LangSmith: Deep LangChain integration, strong on tracing, evals bolted on top.
  • Arize, Helicone, Langfuse: More observability-first, with eval features added over time.
  • Roll your own: A JSON file of test cases, a Python scorer, a CI job. Perfectly fine for teams of 1–5 with narrow workloads.

Tool choice matters less than habit. A clunky in-house eval that runs every PR beats a polished platform that nobody uses.

Common Failure Patterns to Catch

The regressions we see most often, and therefore the ones your evals should cover:

  • Format drift: Model starts returning prose when it should return JSON. Schema-compliance checks catch this in seconds.
  • Hallucinated citations: RAG system cites documents that do not exist. Scorer verifies citations against retrieved chunks.
  • Tone regression: Prompt change made responses more formal, or more casual, or more hedge-y. LLM-as-judge against a tone rubric catches it.
  • Over-refusal: Model starts declining legitimate requests. A set of “should-be-answered” cases catches this.
  • Context loss: Multi-turn conversations forget earlier turns. Conversation-level evals catch it; single-turn evals do not.
  • Tool-call regression: Agent stops calling a tool it used to call, or calls it with wrong arguments. Trace-level assertions catch it.

The Discipline

Evals are not a one-time project. They are a practice. Every production incident becomes a regression case. Every prompt change goes through the gate. Every model upgrade is validated against the baseline before rollout.

Teams that treat evals as infrastructure ship faster, not slower, because they stop being afraid of prompt changes. Teams that treat evals as something to do later spend their time debugging quality in production — which is the most expensive place to debug.

Need to Build an Eval Stack?

At t3c.ai, we've built eval pipelines for RAG, agents, and fine-tuned models in production. If your AI system is shipping without a safety net, let's fix that.

Get In Touch →