Insisting On Known Knowns: Using Evaluators to Drive Reliability

LLMs are heuristic and opaque. Without the ability to selectively measure elements of correctness, you can’t be sure your system is behaving as intended.

Every GenAI project starts with a phase of looking at individual LLM responses and tweaking prompts, data formats, and architectures. This is an invaluable step for building intuitions. But an ‘anecdata’ approach lacks repeatability, and if you carry it too far, juggling intuitions, strategies, and models all at once, you can be left searching for the particular combination of inputs, parameters, temperature, and wind direction that got your best results.

The solution is rigor, enforced by evaluators. By establishing clear criteria ahead of time, you can avoid wishful thinking and put boundaries around the uncertainty introduced by LLMs.

In the project I’ll use as an example, we were mapping short text excerpts from a marketing system into a hierarchical taxonomy. Careful definitions of correctness allowed us to reach good results, and along the way smaller, purpose-specific evaluators drove investigation and served as guardrails against regression. This blog post will show one example of each, and a later post will show more.

At Fractional, one of the tools we use is Braintrust, a platform for (among other things) evaluating permutations of LLM performance against ground truth. It provides dashboards for arbitrary scoring, diffing of experiment runs, and correlation analysis for evaluators, letting us focus on the work. The overall concept could be implemented anywhere.
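
For readers who haven’t used it, a Braintrust eval in Python boils down to three things: a dataset of ground-truth examples, a task to run over them, and a list of scorers. Here’s a minimal sketch – the project name and data are placeholders, and Levenshtein is one of the off-the-shelf scorers from Braintrust’s companion autoevals library:

from autoevals import Levenshtein
from braintrust import Eval

Eval(
    "placeholder-project",  # experiments are grouped under a project name
    # Ground truth: each row pairs an input with the expected output.
    data=lambda: [{"input": "What colour is the sky?", "expected": "blue"}],
    # The task is whatever pipeline is being tested; a stand-in lambda here.
    task=lambda input: "blue",
    # Scorers run against every row and feed the Braintrust dashboards.
    scores=[Levenshtein],
)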

The Project

In this project, the input data was messy and inconsistent, and the set of output categories was enormous – a 10,000 member taxonomy about eight levels deep. For example:

  1. Given notes typed by an online marketer, like 8451storename > DIRECT > LapsedBaconBuyer Custom Segment

  2. …we needed to come up with fb-2-12-2, which is a hierarchy code that corresponds to Food, Beverages & Tobacco > Food Items > Meat, Seafood & Eggs > Meat in the Shopify taxonomy of products and services as the most likely category this marketer was targeting.

The data was sparse and inconsistent because it came from many sources and was intended for other purposes. To see what we could glean from it we tried many strategies, from entity extraction to classifications, each with many permutations so we could begin identifying factors that helped drive up correctness.

But to compare results meaningfully, we needed to define correctness. We had some tagged data. So what makes an output of the system ‘correct’?

The naive version is simply equality: does the generated output exactly match the tagged ground truth? This is better than eyeballing examples sporadically and tweaking until behavior is ‘good enough’, but it’s far too brute-force for something like a hierarchy.
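
For reference, the naive check is about as short as a scorer gets. This is just a sketch – Braintrust scorers can return either a bare number between 0 and 1 or a Score object, as in the later examples:

def exact_match(output, expected):
    # All-or-nothing: 1 only if the generated code matches the tagged
    # ground truth exactly, 0 for any difference at all.
    return 1 if output == expected else 0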

In our case, we wanted to be able to express the idea that getting the initial level of classification wrong is much ‘more incorrect’ than getting a finer detail further down the tree wrong.

To continue with our DIRECT > LapsedBaconBuyer example, which should end up in fb-2-12-2 (“Meat”), we needed a way to capture the idea that stopping one level too high at fb-2-12 (“Meat, Seafood & Eggs”) was a great deal ‘less wrong’ than something entirely unrelated, like rc-3-2 (“Religious & Ceremonial > Wedding Ceremony Supplies > Flower Girl Baskets”, though I support your right to have whatever sort of wedding you wish).

Using Braintrust, we wrote a custom evaluator that compares the two hierarchy codes level by level and scores them based on where they first diverge. Since there were up to eight levels, each level had its own penalty weight, with mismatches deeper in the tree penalized less*.

import itertools

# Score is the scorer result type from Braintrust's companion autoevals
# library (adjust the import path to match your SDK version).
from autoevals import Score


def full_compare(output, expected):
    # Penalty weights by level: a miss at the root costs the most,
    # a miss eight levels deep costs the least.
    weights = [1.0, 0.7, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
    output_parts = output.split("-")
    expected_parts = expected.split("-")

    if output_parts == expected_parts:
        return Score('hierarchical', 1.0)

    # zip_longest pads the shorter code with "" so that stopping a level
    # too early (or continuing a level too far) counts as a mismatch there.
    score = 0.0
    for weight, output_part, expected_part in itertools.zip_longest(
            weights, output_parts, expected_parts, fillvalue=""):
        if output_part != expected_part:
            score = 1 - weight
            break

    return Score('hierarchical', score)
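
To make the weighting concrete, here is the partial credit a few hypothetical outputs would earn against the expected code fb-2-12-2:

# Hypothetical outputs scored against expected = "fb-2-12-2":
full_compare("fb-2-12-2", "fb-2-12-2").score  # exact match           -> 1.0
full_compare("fb-2-12",   "fb-2-12-2").score  # stops one level early -> 1 - 0.4 = 0.6
full_compare("fb-2-1-2",  "fb-2-12-2").score  # wrong at level three  -> 1 - 0.5 = 0.5
full_compare("rc-3-2",    "fb-2-12-2").score  # wrong at the root     -> 1 - 1.0 = 0.0

A miss at the root scores zero, while a near miss deep in the tree keeps most of its credit.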

This was enough to start focusing our inspection on the errors that mattered most. (Full disclosure: all of the credit for the clever hierarchical evaluators in this project goes to my colleague, Dan Girellini!)

We found that while most of our errors were in the finer details, there was a class of failures where even the root level was wildly wrong. We tossed in another evaluator that scored only the root category to see if we could determine the cause.

def root_correct(output, expected):
    # Only the top-level category matters here, e.g. the 'fb' in 'fb-2-12-2'.
    output_parts = output.split("-")
    expected_parts = expected.split("-")

    if output_parts[0] == expected_parts[0]:
        return Score('root_correct', 1)
    else:
        return Score('root_correct', 0)
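
Both evaluators then get attached to the same experiment, and each shows up as its own score column in Braintrust, which is what makes the filtering described next possible. The project name, data loader, and task function below are hypothetical stand-ins for the real pipeline:

from braintrust import Eval

def load_tagged_examples():
    # Hypothetical loader for the tagged ground-truth rows.
    return [{"input": "8451storename > DIRECT > LapsedBaconBuyer Custom Segment",
             "expected": "fb-2-12-2"}]

def categorize_excerpt(input):
    # Stand-in for the real task, which prompts the model and parses out a code.
    return "fb-2-12"

Eval(
    "marketing-taxonomy-mapping",         # hypothetical project name
    data=load_tagged_examples,
    task=categorize_excerpt,
    scores=[full_compare, root_correct],  # both custom evaluators run on every row
)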

Filtering our failures on this evaluator helped us isolate the problem: a significant portion of the input data was of an entirely different type than the rest. These notes were not describing product categories at all, not even badly! Instead, they were phrased in a way that made sense only to the original user (e.g., “Purchase Prediction Segment(Product Category) - Shopping (90 days)”). Our poor model was desperately trying to find meaning where there was none to be found.

After verifying with the client that this sort of data was going to continue occurring, we were able to modify our architecture to detect this class of input data and triage it with an ‘uncategorizable’ tag rather than try to categorize it.

These are just two examples. By the time we were done we had used perhaps a dozen evaluators, both to measure the impact of hyperparameters and to expose and diagnose specific details about the data. In particular, it was invaluable to use evals to verify that the model correctly reported low confidence in the cases where its answer turned out to be incorrect. In a subsequent blog post, we will dig into some of the other eval techniques we used.

By the end of this project, because we had numbers, we were able to quantify the impact of various decisions and discuss tradeoffs with the client as they tuned the system for their needs. In the absence of methodically created, consistent evaluators we would have had nothing to point at – and, to be honest, nothing to steer by – but our intuitions.

* - In fact, evaluation of hierarchical taxonomies is its own field of research, and there’s no single correct answer. The terminally curious can investigate the term “hierarchical multiclass classification” for all the details they could ask for.