AI Systems Architecture — Mastery · 5 / 9

Evaluation Pipelines as Infrastructure

In AI systems, evaluation is not QA you do at the end — it's infrastructure you build first. Without it, every change is a prayer.

Published May 13, 2026 · 1 min read · Haythem Rehouma · Claude Mastery

In normal software, tests are pass/fail and you write them as you go. In AI systems, "correct" is fuzzy and outputs vary — so evaluation stops being QA and becomes infrastructure you stand up before optimizing anything.

Offline: the eval set

The eval set is a curated collection of representative inputs, each paired with a reference answer or a rubric. Run it on every prompt change, model swap, or retrieval tweak and you get a score that tells you whether the change helped or hurt. Include hard and out-of-scope cases, not just the happy path.
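A minimal sketch of what such a run could look like, assuming the system under test is a plain function from input string to output string. The names `EvalCase`, `exact_match`, and `run_eval` are illustrative, and exact-match scoring is a stand-in for whatever reference check or rubric fits your task.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    """One entry in the offline eval set (field names are illustrative)."""
    input: str
    reference: str
    tag: str = "happy_path"  # e.g. "hard" or "out_of_scope"

def exact_match(output: str, reference: str) -> float:
    """Toy scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

def run_eval(system: Callable[[str], str], cases, scorer=exact_match) -> dict:
    """Run every case through `system`; return the mean score overall and per tag."""
    scores = {}
    for case in cases:
        score = scorer(system(case.input), case.reference)
        scores.setdefault("overall", []).append(score)
        scores.setdefault(case.tag, []).append(score)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Hypothetical eval set: two happy-path cases and one out-of-scope case.
CASES = [
    EvalCase("2 + 2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("what is my bank PIN?", "REFUSE", tag="out_of_scope"),
]

def toy_system(prompt: str) -> str:
    """Stand-in for the real pipeline (prompt + model + retrieval)."""
    answers = {"2 + 2": "4", "capital of France": "paris"}
    return answers.get(prompt, "I don't know")

report = run_eval(toy_system, CASES)
# happy_path is 1.0, out_of_scope is 0.0, overall is about 0.67
```

Reporting per tag as well as overall keeps a regression on hard or out-of-scope cases from hiding behind a healthy average.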

Online: production metrics

Offline can't catch everything. Track online signals — thumbs up/down, task completion, escalation rate, regeneration rate — and feed surprising production cases back into the offline set. The eval set is a living asset.
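As a rough sketch, these signals can be turned into per-session rates, with inputs from unhappy sessions queued as candidates for the offline set. The event schema (`session_id`, `signal`, `input`) and the signal names are assumptions for illustration; a real system would use its own logging format.

```python
from collections import Counter

# Hypothetical signal names; adapt to your own event schema.
SIGNALS = ("thumbs_up", "thumbs_down", "task_completed", "escalated", "regenerated")

def online_rates(events: list) -> dict:
    """Fraction of sessions in which each signal occurred at least once."""
    sessions = {e["session_id"] for e in events}
    if not sessions:
        return {signal: 0.0 for signal in SIGNALS}
    # Deduplicate so three regenerations in one session count once.
    seen = {(e["session_id"], e["signal"]) for e in events}
    counts = Counter(signal for _, signal in seen)
    return {signal: counts[signal] / len(sessions) for signal in SIGNALS}

def cases_to_review(events: list,
                    negative=("thumbs_down", "escalated", "regenerated")) -> list:
    """Unique inputs that drew a negative signal: candidates for the eval set."""
    flagged = []
    for e in events:
        if e["signal"] in negative and e["input"] not in flagged:
            flagged.append(e["input"])
    return flagged

EVENTS = [
    {"session_id": "a", "signal": "thumbs_up", "input": "reset my password"},
    {"session_id": "a", "signal": "task_completed", "input": "reset my password"},
    {"session_id": "b", "signal": "regenerated", "input": "cancel orders 17 and 18"},
    {"session_id": "b", "signal": "regenerated", "input": "cancel orders 17 and 18"},
    {"session_id": "b", "signal": "escalated", "input": "cancel orders 17 and 18"},
]

rates = online_rates(EVENTS)
review_queue = cases_to_review(EVENTS)
```

The review queue is only a candidate list: a person still decides which cases deserve a reference answer and a place in the eval set.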

LLM-as-judge, with guardrails

A strong model can grade quality at scale, but:

Give it a strict rubric, not "is this good?"
Calibrate against human labels on a sample.
Where bias matters, grade with a different model (or a different lens) than the one being graded, since a judge can favor output that resembles its own.
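The guardrails above can be sketched as follows. The judge model is passed in as a plain callable (`call_model`), so no particular vendor API is assumed; the rubric wording, the 1 to 5 scale, and the function names are all illustrative.

```python
from typing import Callable, Optional

# A strict rubric: explicit anchors per score, and a constrained reply format.
RUBRIC = """Score the ANSWER to the QUESTION from 1 to 5.
5: correct and complete.
3: partially correct or missing a key step.
1: wrong, off-topic, or fabricated.
Reply with the integer only."""

def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"

def parse_score(raw: str) -> Optional[int]:
    """Accept only a bare integer 1..5; anything else counts as a judge failure."""
    text = raw.strip()
    return int(text) if text in {"1", "2", "3", "4", "5"} else None

def judge(call_model: Callable[[str], str], question: str, answer: str) -> Optional[int]:
    """`call_model` wraps whichever judge model you use (ideally not the graded one)."""
    return parse_score(call_model(build_judge_prompt(question, answer)))

def agreement(judge_scores, human_scores, tolerance: int = 0) -> float:
    """Share of sampled items where the judge is within `tolerance` of the human label.

    Unparseable judge replies (None) count as disagreements.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not human_scores:
        return 0.0
    hits = sum(
        1 for j, h in zip(judge_scores, human_scores)
        if j is not None and abs(j - h) <= tolerance
    )
    return hits / len(human_scores)

# Calibration on a small human-labeled sample (made-up numbers).
exact = agreement([5, 3, None, 1], [5, 4, 2, 1])               # 0.5
close = agreement([5, 3, None, 1], [5, 4, 2, 1], tolerance=1)  # 0.75
```

If agreement with human labels is low, tighten the rubric before trusting the judge's scores at scale.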

Gate changes in CI
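One possible shape for such a gate, sketched under assumptions: CI runs the eval set, compares each metric against a stored baseline, and fails the build when any metric drops by more than a tolerance. The names `gate` and `exit_code` and the 0.02 tolerance are illustrative choices, not prescribed values.

```python
def gate(current: dict, baseline: dict, max_drop: float = 0.02) -> list:
    """Return one message per metric that is missing or regressed beyond max_drop."""
    failures = []
    for name, base in baseline.items():
        score = current.get(name)
        if score is None:
            failures.append(f"{name}: missing from current run")
        elif score < base - max_drop:
            failures.append(f"{name}: {score:.3f} vs baseline {base:.3f}")
    return failures

def exit_code(current: dict, baseline: dict, max_drop: float = 0.02) -> int:
    """0 lets the change through; 1 fails the CI job."""
    failures = gate(current, baseline, max_drop)
    for message in failures:
        print("EVAL REGRESSION", message)
    return 1 if failures else 0

# Made-up scores: overall improved slightly, out-of-scope handling regressed.
BASELINE = {"overall": 0.80, "out_of_scope": 0.90}
CURRENT = {"overall": 0.81, "out_of_scope": 0.70}

failures = gate(CURRENT, BASELINE)
```

A CI step would pass the result of `exit_code(...)` to `sys.exit`, so a regression blocks the merge the same way a failing unit test does.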

You can now measure. Next: making the system affordable — cost engineering.

