OpenAI Structured Output + Pydantic: Adding Support for Default Values
OpenAI recently released strict structured output as a part of their API. We have made heavy use of function calling to extract structured output, but having top-level support for generating output conforming to a JSON schema both cleans up our interfaces and improves output performance.
We were super excited about this, but ran into a few snags using it out of the box.
Problem #1: no defaults allowed!
Many of our existing Pydantic models have default values, which are not supported by OpenAI. For example, this code:
from openai import OpenAI
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    author: str | None = None  # default value -- this is what trips up the API
    text: str
article_text = """
Hello world!
By:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a malesuada ex. Praesent efficitur, justo at suscipit efficitur, ex tortor blandit diam, consectetur malesuada lectus risus sed nisl.
"""
client = OpenAI()
client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Extract the provided article."},
        {"role": "user", "content": article_text},
    ],
    response_format=Article,
)
yields the following error response from the OpenAI API:
openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid schema for response_format 'Article': In context=('properties', 'author'), 'default' is not permitted", 'type': 'invalid_request_error', 'param': 'response_format', 'code': None}}
The issue is that the author field has a default value. While it's easy to just remove the default in this toy example, we have code that relies on the default being set. We decided it'd be better for the overall quality of our code to write some lightweight adapter code.
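You can see exactly what the API is objecting to by dumping the schema Pydantic generates (a quick check, assuming Pydantic v2): the author property carries a "default" entry, which is the key the error message points at.

from pydantic import BaseModel

class Article(BaseModel):
    title: str
    author: str | None = None
    text: str

# Pydantic emits a "default" entry for the author field -- exactly the key
# the API error complains about ("'default' is not permitted").
print(Article.model_json_schema()["properties"]["author"])
# -> something like: {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Author'}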
Problem #2: Unions with shared fields
OpenAI doesn't support unions whose member types share an identical first field. This isn't very common, since it depends on the order of fields in the union member types.
We couldn't find supporting documentation, but we suspect it's because the context-free-grammar-based next-token constraint model that OpenAI uses for this feature needs to identify which sub-type of a union you're using based on the first field.
It's worth noting that we probably wouldn't have seen this at all, except that our solution to Problem #1 consistently introduces this scenario when discriminators have default values.
Consider these types, for example:
from typing import Literal

from pydantic import BaseModel

class Article(BaseModel):
    type: Literal["article"] = "article"
    title: str
    author: str = "DEFAULT AUTHOR"
    text: str

class Tweet(BaseModel):
    type: Literal["tweet"] = "tweet"
    content: str
    author: str = "@sama"

class Content(BaseModel):
    content: Article | Tweet
If we ask OpenAI to respond following the schema defined by the Content type, we get this error:
openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid schema: Objects provided via 'anyOf' must not share identical first keys. Consider adding a discriminator key or rearranging the properties to ensure the first key is unique.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
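Dumping the schema for the Content model from above makes the complaint concrete (again assuming Pydantic v2): both union members list type as their first property, so the anyOf branches share an identical first key.

import json

# Both the Article and Tweet sub-schemas start with the same "type" property,
# which is the "identical first keys" the error message refers to.
schema = Content.model_json_schema()
print(json.dumps(schema["$defs"]["Article"]["properties"], indent=2))
print(json.dumps(schema["$defs"]["Tweet"]["properties"], indent=2))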
Solution: adapter interface
We wrote some small helper methods to create a patched version of a Pydantic model with defaults removed. When fields do have a default, it allows an "unknown" placeholder value to indicate that a default should be filled in. A second helper translates an instance of the patched model to its original. The interface looks like this:
We wrote an adapter method called make_openai_compatible which takes in a Pydantic model and returns a patched model with the compatibility issues fixed:

- Fields have their default values removed. Fields of the form field: Type = DEFAULT are translated to field: Union[Type, Literal[UNKNOWN_PLACEHOLDER]], giving OpenAI a way to indicate it doesn't know how to fill out the field.
- Fields with Union types are translated into a structure with guaranteed compatibility, regardless of the structure of their members.
A second method called patch_openai_value converts an instance of the patched model into an instance of the original model.
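To give a sense of how little machinery this takes, here's a simplified sketch of the two helpers, assuming Pydantic v2 and ignoring the nested-model and union handling that the real code covers:

from typing import Literal, Union

from pydantic import BaseModel, create_model
from pydantic_core import PydanticUndefined

UNKNOWN_PLACEHOLDER = "__UNKNOWN"

def make_openai_compatible(model: type[BaseModel]) -> type[BaseModel]:
    """Return a copy of `model` with every field required.

    Fields that had a default instead accept the placeholder literal, so the
    model has a way to say "I don't know" without violating the schema.
    """
    fields = {}
    for name, info in model.model_fields.items():
        annotation = info.annotation
        if info.default is not PydanticUndefined or info.default_factory is not None:
            # The field had a default: drop it and allow the placeholder instead.
            annotation = Union[annotation, Literal["__UNKNOWN"]]
        fields[name] = (annotation, ...)  # "..." marks the field as required
    return create_model(f"OpenAICompatible{model.__name__}", **fields)

def patch_openai_value(patched_instance: BaseModel, original_model: type[BaseModel]) -> BaseModel:
    """Convert an instance of the patched model back into the original model.

    Placeholder values are dropped so the original defaults apply again.
    """
    data = {
        name: value
        for name, value in patched_instance.model_dump().items()
        if value != UNKNOWN_PLACEHOLDER
    }
    return original_model(**data)

The full version linked at the bottom of this post also rewrites union members so they never share an identical first key, which is what resolves Problem #2.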
Usage looks like this:
class Article(BaseModel):
    title: str
    author: str = "DEFAULT AUTHOR"
    text: str

# Creates a patched model with default values removed
patched_model = make_openai_compatible(Article)

article = patched_model(
    title="Title",
    author=UNKNOWN_PLACEHOLDER,
    text="Text",
)
print(article)
# title='Title' author='__UNKNOWN' text='Text'

print(patch_openai_value(article, Article))
# title='Title' author='DEFAULT AUTHOR' text='Text'
Integrating with OpenAI client
To make this as easy to use as possible, we wrote a simple wrapper around the OpenAI client that handles the translation for us. We just call a single method and everything is taken care of:
structured_ask(
    client,
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Extract the provided article.",
        },
        {"role": "user", "content": article_text},
    ],
    response_format=Article,
)
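Under the hood, a wrapper like this just chains the two helpers around the parse call. The sketch below is illustrative rather than the repo's actual implementation (the signature and return value are assumptions):

def structured_ask(client, *, response_format, **kwargs):
    """Patch the model, ask for structured output, translate the result back."""
    patched_model = make_openai_compatible(response_format)
    completion = client.beta.chat.completions.parse(
        response_format=patched_model,
        **kwargs,  # model, messages, etc.
    )
    # `.parsed` holds an instance of the patched model; convert it back so the
    # caller gets the original model with its defaults applied.
    return patch_openai_value(completion.choices[0].message.parsed, response_format)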
Code
We put the full code (complete with working examples!) up on GitHub.
https://github.com/fractional-ai/cookbook/tree/main/structured_output