Classification w/Confidence Scores Using Logprobs

Fractional AI works on a lot of projects that have some sort of classification task at their core: problems like "does this post violate our policies?" or "which catalog category does this product fit into?"

LLMs are a powerful tool for this type of task, but there's a lot of practical wrangling to be done. LLM APIs simply don't behave like classifiers — they're designed to output text rather than categories. OpenAI's structured output mode helps force the model to pick from a list of predetermined answers, but that's not always enough.

One thing that's particularly tough is getting confidence scores attached to your classification. You can ask an LLM how sure it is and you'll get something that looks like an answer, but whether that number means anything is a different story. You can add an "Unknown" category for the LLM to choose in hopes of filtering out low-confidence classifications, but LLMs are notoriously bad at saying "I don't know".

To help solve this, I wrote some code that combines OpenAI's structured output capabilities with the lesser-known logprobs field (more info here) to measure the likelihood of different responses and produce a confidence score per category. The code itself is here, and it can be used like this:

from enum import Enum

# Define your categories as an enum
class ArticleType(str, Enum):
    SPORTS = "Sports"
    POLITICS = "Politics"
    BUSINESS = "Business"
    TECHNOLOGY = "Technology"
    ENTERTAINMENT = "Entertainment"
    HEALTH = "Health"
    SCIENCE = "Science"
    # ... etc

classifications = classify_with_confidence(
    # News headline that spans multiple categories
    # Confidence should be low.
    "Scientific breakthrough improves football performance",
    ArticleType,
    openai_client
)

And the output looks like this:

  {
    ArticleType.SCIENCE: 0.5311,
    ArticleType.SPORTS: 0.4687,
  }

Under the hood, we're pinging the OpenAI API like this:

response = client.beta.chat.completions.parse(
    model=model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": input},
    ],

    # Each response must be one of our categories (json_schema mode)
    response_format=classification_type,

    # Return info about the log probability of each output token
    logprobs=True,

    # Produce multiple different responses
    n=max_categories,
)

Although we're requesting multiple completions (n=max_categories, e.g. 5), the responses aren't actually guaranteed to be different: the more confident the LLM is, the more repeats we'll see among them. We dedupe these responses and then use the token logprobs to measure the probability of each alternative category.
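The dedupe-and-score step above can be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: `aggregate_confidences` and its input shape are made up for this example. In practice the category would come from each choice's parsed structured output, and the token logprobs from the logprobs data the API returns alongside each choice.

```python
import math
from collections import defaultdict

def aggregate_confidences(samples):
    """Turn n sampled (category, token_logprobs) pairs into a
    confidence score per distinct category.

    Hypothetical sketch: `samples` holds one entry per returned choice,
    pairing the parsed category with the logprobs of its output tokens.
    """
    per_category = defaultdict(list)
    for category, token_logprobs in samples:
        # A completion's probability is the product of its per-token
        # probabilities, i.e. exp of the summed logprobs.
        per_category[category].append(math.exp(sum(token_logprobs)))

    # Average the probability across repeated samples of the same
    # category, then normalize so the reported scores sum to 1.
    averaged = {cat: sum(ps) / len(ps) for cat, ps in per_category.items()}
    total = sum(averaged.values())
    return {cat: p / total for cat, p in averaged.items()}
```

For example, if "Science" is sampled twice with sequence probabilities 0.50 and 0.55 and "Sports" once with 0.45, the normalized output favors Science at roughly 0.54 to 0.46, mirroring the two-category result shown earlier.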