Your Evals Have a Rotten Tomatoes Problem
You push a change to a prompt and your eval score drops from 0.91 to 0.84. Something got worse, but the score doesn’t tell you what. So you start re-running the pipeline, tweaking your inputs, pouring

Articles tagged with #evals

Today, the latest LLMs have context windows of up to ~1 million tokens. There are many occasions when a larger context window is useful: Context engineering: injecting rich system/user conte

Since the launch of Airbyte 1.0 with AI Assist, hundreds of new Airbyte connectors have been built. We recently joined our partners at Airbyte for the Data Bytes meetup where we answered questions from an audience of data engineers, AI enthusiasts, a...

How thousands of pull requests let Fractional automate annotating a ground truth set for a new product.
