Your evals have a Rotten Tomatoes problem
You push a change to a prompt and your eval score drops from 0.91 to 0.84. Something got worse, but the score doesn’t tell you what. So you start re-running the pipeline, tweaking your inputs, pouring
Feb 12, 2026 · 8 min read


