Your evals have a Rotten Tomatoes problem

You push a change to a prompt and your eval score drops from 0.91 to 0.84. Something got worse, but the score doesn't tell you what. So you start re-running the pipeline, tweaking your inputs, poring over outputs, trying to figure out which part of the system regressed and why. An hour later you're no closer to understanding and you're only getting more frustrated. The problem isn't that evals are the wrong tool; the problem is how you've designed them.
Think Movies, Not Math Tests
Rotten Tomatoes will tell you a movie is 94% fresh. Great! But…what does that actually mean? It's the percentage of critics who gave a positive review. That's it. A movie that every critic thought was pretty good and a movie that most critics thought was a masterpiece look identical at 94%. The number is real, but it flattens out everything interesting–including whether you'll actually like it.
Many teams write evals that work exactly like this. One question, one score, everything interesting about the output compressed into a single number.
A more useful approach is to evaluate the movie along specific dimensions, each scored on a simple scale:
Romance
Humor
Character development
Action
Realism
Scary scenes
Adult content
Each of these is easier to score consistently on its own. You don't need a deep philosophical framework to answer "how much romance is in this movie?" the way you do to answer "is this movie good?" The questions are more decomposable, and with even a lightweight rubric, reasonable people will mostly agree on the answers.
And now you can build on this. Once you have these individual scores, you can compose them into more sophisticated judgments. If you believe that romance and jump scares don't mix well, you might rate a movie lower even if it scores high on both individually. If you're recommending a movie for a family movie night, you care about a very different profile than if you're recommending one for a date night. The dimensional scores are the raw material; the judgment about "good" is a function you build on top of them, tuned to the context.
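Here's a minimal sketch of that idea in Python. The dimension names, weights, and thresholds are all invented for illustration; the point is just that "good" becomes a function over the dimensional scores, tuned to the context.

```python
# Hypothetical dimensional scores for one movie, each on a 0-1 scale.
movie = {
    "romance": 0.8,
    "humor": 0.6,
    "character_development": 0.7,
    "action": 0.2,
    "realism": 0.5,
    "scary_scenes": 0.9,
    "adult_content": 0.3,
}

def date_night_score(dims):
    """One judgment built on top of the dimensions, tuned for date night."""
    score = 0.5 * dims["romance"] + 0.3 * dims["humor"] + 0.2 * dims["character_development"]
    # Encoded belief: romance and jump scares don't mix, so penalize the combination.
    if dims["romance"] > 0.6 and dims["scary_scenes"] > 0.6:
        score *= 0.7
    return score

def family_night_score(dims):
    """Same raw material, very different profile."""
    # Hard constraints rather than weights: some content disqualifies outright.
    if dims["adult_content"] > 0.2 or dims["scary_scenes"] > 0.5:
        return 0.0
    return 0.4 * dims["humor"] + 0.3 * dims["action"] + 0.3 * dims["character_development"]

print(f"date night: {date_night_score(movie):.2f}")     # penalized for mixing romance and scares
print(f"family night: {family_night_score(movie):.2f}")  # disqualified by the scary_scenes constraint
```

The dimensional scores don't change between the two functions; only the judgment built on top of them does.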
But there's a less obvious advantage that might be the most important one: the process of designing these dimensional evals forces you to deeply understand the problem you're solving. You can't write an eval for "realism" until you've decided what realism means in the context of this particular movie. You can't score "character development" until you've defined what that looks like. Suddenly you're having conversations about criteria you hadn't fully thought through before. The decomposition work that goes into building good evals doubles as problem discovery. You come out the other side with sharper requirements, better-defined ground truth, and a shared vocabulary with your team about what you're actually trying to assess.
Bring This Back to LLMs
The same principle applies directly to evaluating LLM outputs. Instead of asking "is this response correct?", identify the specific qualities that matter for your use case and evaluate each one separately.
Depending on what you're building, those dimensions might include things like: schema compliance, factual grounding against source documents, tone and register, completeness of the response, conciseness, whether the model hallucinated entities that don't appear in the input, or whether it followed specific formatting instructions.
Here's a concrete example. Say you have an LLM call that takes a customer support ticket and produces a structured summary with a category, a priority level, and a proposed response. Instead of writing one eval that asks "is this summary good?", write three (sketched in code after the list):
Category accuracy: Does the assigned category match a known-correct label? Assuming you're building labeled data as you go (and you should be), this is deterministic. No LLM judge needed. Score: 0 or 1.
Priority calibration: Is the priority level within one step of the expected value? Also deterministic. Score: 0, 0.5, or 1.
Response relevance: Does the proposed response address the actual issue in the ticket? This one is harder and might require an LLM judge, but it's a focused, well-scoped question. Even here, you can define a simple rubric: does it address the stated issue, does it propose a correct next step, does it avoid inventing policy that doesn't exist?
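Here's roughly what those three checks could look like. This is a sketch, not a finished harness: the field names, the priority scale, the judge prompt, and the call_llm hook are all assumptions you'd swap for your own.

```python
import re

PRIORITY_ORDER = ["low", "medium", "high", "urgent"]  # assumed priority scale

def eval_category_accuracy(output: dict, expected: dict) -> float:
    # Deterministic: exact match against the known-correct label. Score: 0 or 1.
    return 1.0 if output["category"] == expected["category"] else 0.0

def eval_priority_calibration(output: dict, expected: dict) -> float:
    # Deterministic: full credit for an exact match, half credit within one step.
    distance = abs(PRIORITY_ORDER.index(output["priority"])
                   - PRIORITY_ORDER.index(expected["priority"]))
    return {0: 1.0, 1: 0.5}.get(distance, 0.0)

JUDGE_PROMPT = """You are grading a proposed support response.
Ticket: {ticket}
Proposed response: {response}
Answer YES or NO on three lines:
1. Does it address the stated issue?
2. Does it propose a correct next step?
3. Does it avoid inventing policy that doesn't exist?"""

def eval_response_relevance(ticket: str, response: str, call_llm) -> float:
    # LLM-as-judge, but scoped to a narrow rubric and scored mechanically.
    verdict = call_llm(JUDGE_PROMPT.format(ticket=ticket, response=response))
    answers = re.findall(r"\b(YES|NO)\b", verdict.upper())
    return answers.count("YES") / 3 if len(answers) == 3 else 0.0
```

Two of the three checks never touch a model at all, and the one that does is answering three narrow yes/no questions rather than one vague one.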
Now when something goes wrong, you know what went wrong. Say your category accuracy dropped this week. That's a very different problem than your response relevance dropping, and it points you toward a very different fix. A single "correctness" score would have told you nothing about where to look.
And just like with movies, the process of designing these evals is where a lot of the real work happens. You can't write an eval for "response relevance" until you've defined what a relevant response looks like for each ticket category. This means sitting down with your client and asking questions they probably haven't considered yet. What counts as relevant for a billing dispute versus a technical outage? How complete does the response need to be? The act of decomposing the problem into evaluable dimensions forces you to develop sharper requirements and targeted ground truth, and it builds a shared understanding of the problem that pays off long after the evals are written.
Shrink-Wrap Your LLM Calls
There's a related principle that compounds the value of dimensional evaluation: get your evals as close to the individual LLM call as possible.
Most LLM-powered systems aren't a single call–they're a pipeline. You might have one call that extracts information, another that reasons over it, and a third that generates a final output. When you write an eval that only checks the final output, you've created the equivalent of an integration test with no unit tests. When it fails, you're left guessing about which stage broke.
If you think of your evals like shrink wrap around each LLM call, you gain the ability to notice and localize problems that would otherwise be invisible. The most dangerous issues in LLM systems aren't the dramatic failures–those are obvious and get fixed quickly. The dangerous ones are slow degradations: a model update that makes your extraction step slightly less precise, a prompt change that subtly shifts tone in a way that compounds downstream. Without granular, dimensional evals close to each call, these small regressions slip through. You notice the system getting worse over time but can't pinpoint why, or worse, the degradation is small enough that it never feels worth investigating. Until it is. Tight evals make these trends visible early, and because they're scoped to a specific call and a specific dimension, they tell you exactly where to look.
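As a sketch of what that shrink wrap can look like in practice: the three stage functions below are trivial stand-ins for real LLM calls, and the checks are deliberately simple, but each call gets scored on its own dimensions before the next stage runs.

```python
scores: dict[str, float] = {}

def extract(ticket: str) -> dict:
    return {"issue": ticket.split(".")[0], "entities": []}            # stand-in for LLM call 1

def reason(extracted: dict) -> dict:
    return {"category": "billing", "priority": "high", **extracted}   # stand-in for LLM call 2

def generate(analysis: dict) -> str:
    return f"Re: {analysis['issue']} - we'll investigate and follow up."  # stand-in for LLM call 3

def run_pipeline(ticket: str) -> str:
    extracted = extract(ticket)
    scores["extract.schema_compliance"] = float({"issue", "entities"} <= extracted.keys())

    analysis = reason(extracted)
    scores["reason.valid_category"] = float(analysis["category"] in {"billing", "outage", "account"})

    answer = generate(analysis)
    scores["generate.nonempty"] = float(bool(answer.strip()))

    # Pipeline-level check kept as a high-level signal, not the primary one.
    scores["pipeline.mentions_issue"] = float(extracted["issue"].lower() in answer.lower())
    return answer

run_pipeline("Charged twice for my subscription. Please refund one.")
print(scores)
```

Because every score is keyed by stage and dimension, a drop in reason.valid_category points at one call and one property, not at the pipeline as a whole.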
This doesn't mean pipeline-level evals are useless. Some properties are only visible end-to-end. A multi-step reasoning chain might produce correct intermediate outputs that still combine into a wrong final answer. Pipeline-level evals catch that, and they have value as a high-level signal, especially early in a project when you're still getting oriented. But they shouldn't be your primary feedback mechanism.
Anti-Patterns Worth Avoiding
With this mental model in place, a few common mistakes become easier to spot.
Evaluating too much of the pipeline at once
When your eval spans multiple LLM calls and application logic, a failing score tells you something is wrong but not where. You end up re-running the whole pipeline with different inputs, squinting at intermediate outputs, trying to localize the issue manually. This is exactly the work your evals should be doing for you.
Reaching for LLM-as-judge too quickly
It's tempting to throw an LLM at evaluation because it feels easy: "Hey, is this result good?" But LLM-as-judge is slow, expensive, and non-deterministic. Worse, its apparent ease tempts you into asking a single eval to assess too many things at once, collapsing everything into a single axis of correctness. Start with deterministic checks: schema validation, exact match, regex, set membership. Reserve LLM-as-judge for dimensions that genuinely resist deterministic evaluation.
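A handful of deterministic checks like these covers a lot of ground before any judge call is needed. The field names, label set, and ID pattern are assumptions for illustration; exact match was already sketched in the ticket example above.

```python
import re

ALLOWED_CATEGORIES = {"billing", "outage", "account", "other"}  # assumed label set

def check_schema(output: dict) -> float:
    # Schema validation: required keys present with the right types.
    required = {"category": str, "priority": str, "response": str}
    return float(all(isinstance(output.get(key), typ) for key, typ in required.items()))

def check_category_membership(output: dict) -> float:
    # Set membership: the label must come from the allowed set.
    return float(output.get("category") in ALLOWED_CATEGORIES)

def check_ticket_id_format(output: dict) -> float:
    # Regex: any referenced ticket ID must match the expected pattern.
    return float(bool(re.fullmatch(r"TICKET-\d{6}", output.get("ticket_id", ""))))
```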
Conflating evaluation with validation
Evals aren't a pass/fail gate. They're an instrument panel. The goal isn't to get a green checkmark, the goal is to build a profile of your system's behavior that you can reason about, track over time, and use to make targeted improvements.
The Payoff
Consider the difference. The old approach: one "correctness" eval on your pipeline's final output. It scores 0.87. It drops to 0.81 next week. You don't know why.
The new approach: you have extraction precision at 0.94, citation grounding at 0.88, policy compliance at 0.97, and tone at 0.91. Next week, citation grounding drops to 0.72 while everything else holds. You know exactly what changed, you know which call to look at, and you probably already have a hypothesis about why.
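Mechanically, spotting that kind of shift is trivial once the per-dimension scores exist. Here's a sketch using the numbers above; the stage prefixes and the tolerance are arbitrary choices, not a prescription.

```python
last_week = {"extract.precision": 0.94, "reason.citation_grounding": 0.88,
             "generate.policy_compliance": 0.97, "generate.tone": 0.91}
this_week = {"extract.precision": 0.94, "reason.citation_grounding": 0.72,
             "generate.policy_compliance": 0.97, "generate.tone": 0.91}

TOLERANCE = 0.05  # how much week-over-week drift we accept before flagging

for dimension, baseline in last_week.items():
    drop = baseline - this_week[dimension]
    if drop > TOLERANCE:
        print(f"REGRESSION {dimension}: {baseline:.2f} -> {this_week[dimension]:.2f}")
# -> REGRESSION reason.citation_grounding: 0.88 -> 0.72
```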
Your evals get easier to write because each one is asking a focused, often deterministic question. They get easier to maintain because changes to one part of the system don't cascade failures across unrelated checks. And when something degrades, you know exactly where to look and what changed.
You stop asking "is the answer good?" and start asking "what does the answer look like?", and this, it turns out, is a much more useful question.