The Prompt That Writes Itself

Prompt engineering is iteration. You read what failed, tweak the prompt, run it again, repeat. It's a tedious but unavoidable part of building with LLMs. GEPA (Genetic-Pareto) takes that loop off your hands.

What GEPA does

You give it an initial prompt (a rough first draft works fine), your labeled dataset, and your scoring function. GEPA evolves that initial prompt into one that works, while you focus on other tasks.
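
Concretely, the three inputs might look like this. This is an illustrative sketch, not GEPA's actual API; the classification task and the `seed_prompt`, `examples`, and `score` names are made up for the example:

```python
# A minimal sketch of the three inputs: a seed prompt, labeled examples,
# and a scoring function. Names and task are illustrative only.

# 1. A rough first-draft prompt.
seed_prompt = "Classify the support ticket as 'billing', 'bug', or 'other'."

# 2. A labeled dataset: inputs paired with ground-truth answers.
examples = [
    {"input": "I was charged twice this month.", "label": "billing"},
    {"input": "The export button crashes the app.", "label": "bug"},
    # ...
]

# 3. A scoring function: how good was one output for one example?
def score(output: str, example: dict) -> float:
    return 1.0 if output.strip().lower() == example["label"] else 0.0
```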

The loop:

  1. GEPA runs your prompt on a batch of examples from your dataset and scores each output.

  2. It sends the prompt, the examples where the model scored poorly, and the ground-truth answers to a second LLM called the reflector.

  3. The reflector proposes a new version of the prompt based on that feedback.

  4. GEPA runs the proposed prompt on the same batch. If it doesn't score better than the original prompt, the candidate is thrown out and the loop starts over.

  5. Otherwise, GEPA runs the candidate against every example in the test set, scores each response, and records the cases where the candidate matched or beat the current top score.
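
Put together, one round of the loop might look roughly like the sketch below. It is a simplified illustration of the steps above, not GEPA's implementation: the `task_lm`, `reflector_lm`, and `score` callables are assumed to be supplied by you, and a single dataset stands in for the train/test split.

```python
import random

def gepa_round(prompt, dataset, pareto_best, task_lm, reflector_lm, score,
               batch_size=8):
    """One round of the loop. task_lm(prompt, example) returns the model's
    output, reflector_lm(prompt, failures) returns a revised prompt, and
    score(output, example) returns a float."""
    # 1. Run the current prompt on a batch and score each output.
    batch = random.sample(dataset, min(batch_size, len(dataset)))
    scored = [(ex, score(task_lm(prompt, ex), ex)) for ex in batch]

    # 2. Collect the examples the prompt got wrong, with their ground truth.
    failures = [ex for ex, s in scored if s < 1.0]

    # 3. The reflector proposes a new prompt from that feedback.
    candidate = reflector_lm(prompt, failures)

    # 4. Re-run the candidate on the same batch; discard it if it isn't better.
    old_total = sum(s for _, s in scored)
    new_total = sum(score(task_lm(candidate, ex), ex) for ex in batch)
    if new_total <= old_total:
        return None

    # 5. Evaluate the survivor on every example and record the cases where it
    #    matched or beat the best score seen so far.
    full_scores = [score(task_lm(candidate, ex), ex) for ex in dataset]
    wins = [i for i, s in enumerate(full_scores) if s >= pareto_best[i]]
    return candidate, full_scores, wins
```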

How this is different from just looping an LLM

If GEPA were just an LLM rewriting a prompt in a loop, it would behave like a greedy optimizer. It would fix the latest failure, forget past variations that worked well, and often break something else in the process. The system would thrash between fixes and regressions instead of accumulating strengths over time. GEPA avoids this by preserving a Pareto front of prompts that each excel on different cases.

Each time a proposed prompt survives the batch check, GEPA evaluates it on every case in the test set. If the prompt achieves the best score on a case, it earns a place on the Pareto front for that case. A single prompt can hold the top score for many cases at once. Over time, this produces a library of specialized prompts instead of one mediocre all-rounder.

On the next round, GEPA picks a parent from the Pareto front and mutates it. A prompt that excels on one tricky case can get selected, mutated, and eventually grow into something that excels across the board. That's the evolution piece: a population of specialists the reflector keeps drawing from.
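
A rough sketch of that bookkeeping, assuming the per-case best scores and the front are kept as plain lists (illustrative, not GEPA's actual data structures):

```python
import random

def update_front(pareto_best, front, candidate, candidate_scores):
    """Give the candidate a slot on the front for every case it now wins.
    front[i] holds the prompts that share the top score on case i."""
    for i, s in enumerate(candidate_scores):
        if s > pareto_best[i]:
            pareto_best[i] = s
            front[i] = [candidate]        # new sole winner on case i
        elif s == pareto_best[i]:
            front[i].append(candidate)    # ties share the slot

def pick_parent(front):
    """Pick the next prompt to mutate. Prompts that win on more cases appear
    in more slots, so they're drawn more often, but every specialist keeps
    a nonzero chance."""
    winners = [p for case_winners in front for p in case_winners]
    return random.choice(winners)
```

Sampling parents in proportion to the number of cases they win is one simple way to balance exploiting strong prompts against keeping niche specialists alive.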

A few warnings

GEPA isn't magic. Its failure modes all trace to the same root: GEPA learns from your dataset, your reflector prompt, and your scoring function, and nothing else.

Your dataset is your ceiling. If your 100 training examples don't cover a case, GEPA won't learn to handle it. On small or narrow datasets, this bites fast. The optimized prompt aces your scoring runs and falls apart the moment a real input includes something your test set missed.

GEPA overgeneralizes from sparse examples. If your training set doesn't cover an edge case, GEPA might write a rule that fits the surface pattern but breaks on the exception. The rule looks reasonable until a real input hits the exception and the prompt gets it wrong. The fix isn't to yell at the reflector in its instructions: add an example of the exception to your dataset, or adjust your score so that specific failure costs something.
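
For example, rather than rewriting the reflector's instructions, you can make the specific failure expensive in the metric itself. A hypothetical tweak to the toy `score` function from earlier:

```python
def score(output: str, example: dict) -> float:
    label = example["label"]
    predicted = output.strip().lower()
    if predicted == label:
        return 1.0
    # Make the one confusion you care about cost more than a generic miss,
    # so the reflector treats it as the failure worth fixing. The
    # billing-vs-bug pair is just an illustration.
    if label == "billing" and predicted == "bug":
        return -1.0
    return 0.0
```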

Score ceilings often mean contradictory data. If the score climbs and then plateaus well below perfect, your dataset may be disagreeing with itself: two near-identical examples carry opposite labels because one labeler applied the rule and another missed it. GEPA can't learn a clean rule from data that doesn't have one. Cleaning the dataset unblocks the ceiling. This limitation isn't unique to GEPA; any optimizer is only as good as the labels and eval it's optimizing against.
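
A quick way to spot the crudest version of that problem is to look for identical inputs carrying different labels. A small sketch, assuming the `examples` format from earlier:

```python
from collections import defaultdict

def find_label_conflicts(examples):
    """Report any input text that appears in the dataset with more than
    one label."""
    labels_by_input = defaultdict(set)
    for ex in examples:
        labels_by_input[ex["input"].strip().lower()].add(ex["label"])
    return {text: labels for text, labels in labels_by_input.items()
            if len(labels) > 1}
```

Near-duplicates that aren't textually identical need fuzzier matching (embedding similarity or a manual pass), but exact conflicts are the cheapest ones to catch first.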
