My name means daffodil. When I started fine-scaping, planting daffodils felt personal, especially because they always bloom around my birthday, and we had a lot of them growing up in Central Asia. I bought a big bag, followed the instructions, and waited. What came up was half the size of what I dreamed about having in my garden — still fragrant, still beautiful, but tiny. I spent a few weeks fussing with water and fertilizer trying to fix flowers that were not broken. They were dwarf daffodils. I had not known what I was buying. I had been measuring against the wrong flower the whole time.

Not Knowing What You Have

When I was a child in Central Asia, I used to look out our balcony window at the Buldanesh shrub in my neighbour's garden — the Snowball Bush, with huge round ball-like flowers. In winter the shrub would be covered in snow, or have icicles hanging from the branches with the dried flowers. I loved it. I did not know the name Snowball Bush then.

Then one day I went to Home Depot and saw an Annabelle hydrangea, which resembled the Buldanesh I recalled from childhood. Same round white blooms. Same shape from across the aisle. I bought it. The leaves should have told me something was off (Buldanesh has lobed, maple-like foliage, nothing like Annabelle's broad heart-shaped leaves), but I was looking at the flower. Out of curiosity, I looked up the plant and discovered that the two are not related. Annabelle is a hydrangea cultivar that looks nearly identical to Buldanesh at a glance, but it blooms on new wood and needs to be cut back hard in late winter, almost to the ground. Skip the pruning, and you get long, messy branches with small blooms.

Most raters do not look up scientific names before pruning. Who would? You trust what you recognize, and most of the time, it works.

Eval design starts earlier than most people think. Before calibration. Before rubrics. Before any of that, you have to know what kind of plant you are looking at — who the user is, what they are actually trying to do, what context they are in. Skip that, and every rule you apply after is a pruning rule applied to an unidentified plant. The damage is invisible until the season ends, and then the season is over.

Las Vegas, Can Can, and the Dwarf

Three different daffodil varieties side by side — Las Vegas, Can Can, and dwarf
Las Vegas, Can Can, and the dwarf — same genus, three different jobs.

This spring, the bouquet on my desk has three daffodil varieties in it. Las Vegas is tall, with oversized white petals and a yellow trumpet — the one that stops you mid-step. Can Can is shorter, and always leaning forward because its center is so heavy it pulls the stem. The dwarf is small and asks for almost nothing — least water, least fertilizer, least attention — and shows up every spring whether I did anything or not. Three plants. One genus. Now, how do you rate them?

Is this a daffodil? The easy one. All three pass, and so would a dozen others. This is the question most eval rubrics actually answer, even when they think they are doing more — a basic identity check dressed up as a quality bar. It tells you the output belongs to the category. It tells you nothing about whether it is any good.

Is this a good daffodil? Better, but it depends on a standard nobody wrote down. Ask it and a rater measures all three against the average yellow trumpet — the most common daffodil, the one everyone pictures. Las Vegas gets flagged for being too large. Can Can gets flagged for leaning. The dwarf gets penalized for size. Every rating is wrong in a different direction, because every rating is measured against a reference no one agreed to. The raters are not being careless. They are doing exactly what the rubric implied.

Is this a good daffodil for the job? The only question that produces a useful answer. If the task is cut bouquets, I want Can Can: the heavy center is what makes it beautiful in a vase, and the lean is structural, not a weakness. Heavy-centered flowers make the best arrangements, and nobody tells you that. If the task is a border, most gardeners reach for the dwarf, but I would pick Las Vegas every time, because the scale is what makes a border worth looking at. If you want a tiny bouquet for a tiny vase, or better, a pot, the dwarf is your guy. Same three plants, three different answers, depending on what you were actually asking for.

Which means the rater has to know the job before they can judge the bloom. What they cannot see from the output alone is whether the user is a domain expert or a first-timer, whether they are debugging something urgent or just poking around. You have to know which one of those you are dealing with before you can say whether the answer was any good.
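One way to picture the three questions in code is a rough sketch in which the criterion is looked up by job before any bloom is judged. Everything here is hypothetical: the `Bloom` fields, the job names, and the height numbers are made up for illustration, not measurements from my garden.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bloom:
    name: str
    height_cm: int       # made-up numbers, only the ordering matters
    heavy_center: bool


# Hypothetical criteria: each job carries its own definition of "good".
CRITERIA = {
    "cut_bouquet": lambda b: b.heavy_center,
    "border": lambda b: b.height_cm >= 40,
    "small_pot": lambda b: b.height_cm <= 20,
}

BLOOMS = [
    Bloom("Las Vegas", 45, False),
    Bloom("Can Can", 35, True),
    Bloom("dwarf", 15, False),
]


def rate(bloom: Bloom, job: str) -> bool:
    """Judge a bloom against the criterion for a named job."""
    if job not in CRITERIA:
        # Refuse to rate when the job is unknown, rather than falling
        # back to an unwritten "average daffodil" standard.
        raise ValueError(f"no criterion defined for job: {job!r}")
    return CRITERIA[job](bloom)


def best_for(job: str) -> list[str]:
    """Names of the blooms that meet the criterion for this job."""
    return [b.name for b in BLOOMS if rate(b, job)]
```

The design choice that matters is the `ValueError`: a rater with no stated job should stop and ask, not quietly score against whatever flower they happen to picture.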

"So you specify the criterion before you show anyone an example. Not 'here is a good response' — instead, 'here is a response that does the thing the user was actually asking for.'" — Nargis Sakhibova

The example becomes an illustration, not the rule itself. This matters more than it looks. Examples without criteria feel concrete, but they break the moment something new shows up. The rater has seen nothing like it, and the examples give them nothing to reason from. A criterion travels to cases you did not show. An example only covers the one in front of you.

The Tulip Problem

Last spring I tracked down a Dutch nursery out of Pennsylvania and ordered five bags of Parrot tulips in different colors. Ruffled, wildly colorful, the kind of bloom you order imagining what your garden will look like in April. When Parrot tulips bloom right, the petals look like they were made by hand — slightly imperfect, twisted at the edges, more like a painter's idea of a flower than a flower. I had been picturing it all winter.

I put them in on schedule and waited. By late winter the shoots were up and the buds were forming. Then in early March the temperature in California jumped — ten days of heat, coming in fast, before the blooms were ready to open.

A wilted, spent daffodil lying in garden mulch among other plants — a bloom that came too soon
A bloom that came before it was ready. / Illustrated by NS

The heat did not just burn the buds, it rushed them. Flowers that were not ready opened anyway, on stems so short I could not even cut them for a vase, so I left most of them in the bed and watched them finish in place. Leaves scorched at the edges. The ones that did make it open came out wrong — color faded, petals deformed, the whole bloom crammed into a fraction of the time it should have had. One morning I cut what I could save and put it in a vase. A total of three tulips. I am still a little mad about it. I had waited all winter.

The heat was external. I did not cause it. The same thing happens in eval pipelines, except the pressure does not come from the weather. It comes from people.

Everyone wants a faster eval turnaround. The release is close, the model has improved, the eval cycle is the bottleneck. The ask is not unreasonable. You agree to accelerate — push raters harder, bring on new ones to cover volume, hit the timeline. Raters start skimming, the new ones are not calibrated yet, and the numbers coming back look like signal but are not.

Nobody ships on those numbers. They get stared at, and everything freezes. You pull up the evals, sit with raters, call a code red. The raters become overly cautious, so they start hedging. Borderline outputs get flagged. Anything ambiguous gets rejected just to be safe. The pipeline is slower than before you accelerated, and noisier.

This is the trap. Acceleration at the wrong moment forces output that is not ready — shorter, deformed — and once it has started, there is no slowing it down. There are four ways out, ordered from immediate to structural:

Four ways out of the acceleration trap

1. Slow down and red-team the results. A small group of expert raters, separate from the rating pool, who check a sample before anything gets reported. Not to re-rate everything, but to catch systematic errors before they reach a decision, and give raters room to recalibrate without feeling like hedging is the safe answer.

2. Name what happened. Write up the cycle while it is still fresh — what triggered the acceleration, where the signal broke, how long it took to recover. The next time someone asks for a faster turnaround, the case is already written.

3. Make the cost visible. A faster eval that produces the wrong signal does not save time. Model improvements get attributed to prompt changes that had nothing to do with them. Surface those costs in the same conversation as the speed ask, not after.

4. Make quality a launch criterion. Defined in advance, with the same precision as latency or impact thresholds. A team that negotiates quality under pressure will always negotiate it down.
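The first way out, a small expert check on a sample, could look something like the sketch below. It assumes ratings are stored as a simple mapping from item ID to label and that an expert label can be fetched per item. The function name, the data shapes, and the sample size are all assumptions for illustration.

```python
import random
from typing import Callable, Hashable


def audit_sample(
    pool_labels: dict[Hashable, str],
    expert_label: Callable[[Hashable], str],
    sample_size: int,
    seed: int = 0,
) -> tuple[float, list]:
    """Compare a random sample of pool ratings with expert ratings.

    Returns the disagreement rate on the sample and the item IDs where
    the expert disagreed with the rating pool.
    """
    rng = random.Random(seed)            # seeded so the audit is repeatable
    item_ids = sorted(pool_labels)       # stable order before sampling
    sample = rng.sample(item_ids, min(sample_size, len(item_ids)))
    disagreements = sorted(
        item for item in sample if expert_label(item) != pool_labels[item]
    )
    rate = len(disagreements) / len(sample) if sample else 0.0
    return rate, disagreements
```

The returned IDs matter as much as the rate: if the disagreements cluster around one input type or one rater cohort, that is a systematic error, not noise.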

What Grows When You Are Not Watching

Dense nemesia in purple and pink blooming across a garden bed, crowding out the space around it
Nemesia looks healthy in isolation. In the garden, it needed to be tamed. / Photo by NS

This spring I lost my lupine plants to nemesia. Nemesia grows through the winter, quietly and steadily, while the lupines and the rest of the garden are dormant. At first it was welcome growth — something green and alive when everything else was still waiting. By spring the nemesia had spread into the surrounding beds and taken the space the lupines needed to come back. When the time came, the lupines did not come back. Instead of the vibrant mix of colors and shapes I wanted, I got overgrown nemesia beds.

On its own, the nemesia looked fine. In the garden, it needed to be tamed.

Criteria drift works the same way. You do not decide to change your eval standard. It grows into the available space. Raters see enough outputs that they start to accept certain errors as normal — the model hedges in a specific way, or always botches a particular kind of edge case. What started as a flag turns into a shrug, and then the shrug becomes the baseline. The rubric is the same as it always was. The tolerance has crept up. The raters have started rating the model against itself, instead of against what the user actually needed — and the user's actual need, like the lupines, does not disappear loudly. It just stops coming back.

The fix is a fixed exemplar set: examples collected and scored before the model's outputs began shaping what raters consider normal, then re-scored by the raters at the start of each rating batch. If the raters' new scores on those examples have moved away from the original baseline scores, you know drift has set in.
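A minimal sketch of that check, assuming scores on a numeric scale where higher means better and assuming a tolerance you would have to pick for your own rubric (the `0.5` default here is arbitrary):

```python
from statistics import mean


def drift_report(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.5,   # arbitrary default, tune per rubric
) -> dict:
    """Compare re-scored exemplars against their original scores."""
    shared = baseline.keys() & current.keys()
    if not shared:
        raise ValueError("no exemplars in common between the two batches")
    # Positive shift: raters now score the same exemplar higher than before.
    shifts = {k: current[k] - baseline[k] for k in shared}
    mean_shift = mean(shifts.values())
    return {
        "mean_shift": mean_shift,
        "drifted": abs(mean_shift) > tolerance,
        "per_exemplar": shifts,
    }
```

A signed mean is used on purpose. Creeping tolerance shows up as a consistent shift in one direction, whereas random rater noise tends to cancel out.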

Once that golden set exists and you trust it — the taxonomy, the rubric, the input classification, the exemplars — LLM-as-a-rater becomes a serious option. Running an LLM alongside human raters surfaces drift and rater errors faster than any manual review can. The model is not going to anchor to the last thing it saw, or start hedging when it is not sure.

But the operative phrase is once you trust it. Teams that reach for LLM-as-a-rater before the golden set is solid are grading their own homework and calling it an audit. The model will agree with whatever the rubric says, and it will do it fast. If the rubric is measuring the wrong thing, you now have a very efficient pipeline producing the wrong answer at scale.
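One hedged way to test "once you trust it" is to measure how well the LLM rater agrees with human labels on the golden set before letting it rate anything else. The sketch below uses Cohen's kappa, which is my choice of statistic for illustration and not something the essay prescribes. It corrects raw agreement for the agreement two raters would reach by chance.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with
    # their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

High agreement here only says the LLM follows the rubric the way the humans do. It says nothing about whether the rubric measures the right thing, which is the point of the paragraph above.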

The Category and the Bloom

The stakes compound when models are trained on signal generated from miscalibrated evals. The problem is not just that the measurement was wrong — it is that the next model learned from it. You pruned the wrong plant, and then planted ten thousand of whatever grew back.

An eval pipeline that holds up over time is not really a measurement tool. It is how you know whether the product is actually getting better. The first version of any eval will always measure something close to what you need, but not quite it. An eval pipeline that ages well is not the one that is most precise at launch. It is the one that knows what it does not know, and keeps asking.

Nobody gets evals right on the first attempt. Raters need to be calibrated — not just trained on a rubric, but trained on a grounded, realistic definition of what good looks like for this product, these users, this use case. The goal is not an idealized response measured against the best demo you ever saw. It is a genuinely useful response for the person who actually sent that input.

The daffodils in my garden were not wrong. The image in my head was.



Nargis Sakhibova

Product Manager at Google. Economist. Writer. Five languages, one calendar. Previously Adobe, Analysis Group and international development organizations.

This essay is an expanded section from The Non-Deterministic Garden.